Azure SNAT
This post was contributed by Pedro Perez.
Azure’s network infrastructure is quite different from your usual on-premises network, as there are several layers of software abstraction working behind the scenes. Today I would like to talk about one of these layers and why you may want to take it into account when troubleshooting a network issue with your application.
Possibly the biggest challenge for us and for our customers in the cloud is to build to scale. Our engineers have designed Azure to be able to work at hyper-scale and simultaneously to be able to accommodate our most technically demanding customers. As you can imagine, this adds complexity to the system. For example, there are multiple IPs associated with each virtual machine. In a non-Azure Resource Manager (ARM) scenario, when you deploy a VM inside a Cloud Service, the VM gets assigned an IP address known as the dynamic IP (DIP), which is not routable outside Azure. The Cloud Service gets assigned an IP address known as the virtual IP (VIP), which is a routable public IP address. Every VM inside the Cloud Service will hide behind the VIP when sending outgoing traffic and can only be accessed through the creation of an endpoint on the VIP that maps to that specific VM.
By analogy with traditional on-premises networks, a VIP could be described as a NAT IP address. Its most distinctive feature is that it is shared among all the VMs in the same Cloud Service. You can easily control which ports redirect traffic to which VM by using endpoints on the Cloud Service, but how does the translation work for outgoing traffic?
Source NAT
This is where Source NAT (SNAT from now on) comes into play. Any traffic leaving the Cloud Service (i.e. from a VM inside the Cloud Service and not going to another VM in the same Cloud Service) will go through a NAT layer, where it will get SNAT applied. As the name implies, SNAT only changes source information: Source IP address and source port.
Source IP address translation
The source IP address changes from the original DIP to the Cloud Service’s VIP, so traffic can be easily routed. This is a many-to-one mapping in which all the VMs inside the Cloud Service are translated to the Cloud Service’s VIP. At this point we already have a challenge. What would happen if two VMs inside the same Cloud Service created outgoing connections to the same destination, also using the same source port? Remember, a system differentiates between TCP (or UDP) connections by looking at the 4-tuple of source IP, source port, destination IP, and destination port.
Changing any of the destination information will effectively break the connection, and we can’t change the source IP address as we only have one (the VIP). Therefore, we have to change the source port.
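To make the collision concrete, here is a minimal Python sketch. The IP addresses are hypothetical values from the documentation ranges, and `translate_ip_only` is an illustrative helper, not anything Azure exposes. It shows that rewriting only the source IP makes two distinct connections indistinguishable.

```python
from typing import NamedTuple


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


VIP = "203.0.113.10"  # hypothetical VIP (documentation address range)


def translate_ip_only(conn: FourTuple) -> FourTuple:
    """Rewrite only the source IP to the VIP, leaving the source port alone."""
    return conn._replace(src_ip=VIP)


# Two VMs in the same Cloud Service happen to pick the same source port.
vm_a = FourTuple("10.0.0.4", 12345, "198.51.100.7", 80)
vm_b = FourTuple("10.0.0.5", 12345, "198.51.100.7", 80)

# After IP-only translation both connections share one 4-tuple.
collision = translate_ip_only(vm_a) == translate_ip_only(vm_b)
```

Here `collision` is true even though `vm_a` and `vm_b` differ, which is why the source port has to be translated as well.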
Source port translation
Azure pre-allocates 160 source ports on the VIP for the VMs’ connections. This pre-allocation is done to speed up establishing new connections, and it is limited to 160 ports to save resources. The initial port is chosen at random from the high port range, and it is pre-allocated together with the next 159. We’ve found that these settings work for everyone as long as we follow some best practices when developing applications that talk over the network.
Azure will translate the source port of an outgoing connection to the first one available from those 160 pre-allocated ports.
The first question you may have is: what happens if I use all 160 ports simultaneously? If none of them has been freed yet, the system will assign more ports on a best-effort basis, but as soon as one of the pre-allocated ports becomes free it will be available for use again.
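The allocation behaviour described above can be sketched roughly as follows. This is an illustration only: the class name `PortBlock`, the 49152-65535 range used for "high ports", and the `None` return on exhaustion are assumptions, since the post does not specify the exact range or how the best-effort fallback is implemented.

```python
import random

BLOCK_SIZE = 160          # pre-allocated ports per block, as stated in the post
HIGH_PORT_MIN = 49152     # assumed lower bound of the "high ports" range
HIGH_PORT_MAX = 65535


class PortBlock:
    """A block of consecutive pre-allocated source ports on the VIP."""

    def __init__(self, first_port=None):
        if first_port is None:
            # Pick a random start so that the whole block fits in the range.
            first_port = random.randint(HIGH_PORT_MIN, HIGH_PORT_MAX - BLOCK_SIZE + 1)
        self.ports = list(range(first_port, first_port + BLOCK_SIZE))
        self.in_use = set()

    def acquire(self):
        """Return the first free port in the block, or None if all are in use."""
        for port in self.ports:
            if port not in self.in_use:
                self.in_use.add(port)
                return port
        return None  # exhausted; the real system then allocates best-effort

    def release(self, port):
        """Mark a port as free so the next acquire() can reuse it."""
        self.in_use.discard(port)
```

Because `acquire` always scans from the start of the block, a port that has just been released is the first candidate for reuse, which matters for the scenario discussed later in the post.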
SNAT Table
All these translations must be stored somewhere, so we can keep track of them as packets flow to and from our VM. The place where they are stored is called the SNAT table, and it is the same concept as the NAT table you would find in other networking products such as firewalls or routers.
The system saves the original 5-tuple (source IP, source port, destination IP, destination port, protocol: TCP or UDP) and the translated 5-tuple, in which the source IP and source port have been replaced by the VIP and one of the pre-allocated ports.
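As a rough illustration of such a table, the Python sketch below stores the original and translated 5-tuples and uses them to rewrite a reply packet on its way back to the VM. The names `FiveTuple` and `SnatTable` and all addresses are hypothetical; the real data structure inside Azure is not described in the post.

```python
from typing import NamedTuple, Optional


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "tcp" or "udp"


class SnatTable:
    def __init__(self, vip: str):
        self.vip = vip
        self.outbound = {}  # original 5-tuple -> translated 5-tuple
        self.inbound = {}   # 5-tuple of the expected reply -> original 5-tuple

    def add(self, original: FiveTuple, translated_port: int) -> FiveTuple:
        """Record a translation of (DIP, port) to (VIP, translated_port)."""
        translated = original._replace(src_ip=self.vip, src_port=translated_port)
        self.outbound[original] = translated
        # Replies arrive with source and destination swapped.
        reply_key = FiveTuple(translated.dst_ip, translated.dst_port,
                              translated.src_ip, translated.src_port,
                              translated.protocol)
        self.inbound[reply_key] = original
        return translated

    def reverse(self, reply: FiveTuple) -> Optional[FiveTuple]:
        """Rewrite a reply so it targets the VM's DIP and original port."""
        original = self.inbound.get(reply)
        if original is None:
            return None  # no matching entry
        return reply._replace(dst_ip=original.src_ip, dst_port=original.src_port)
```

A reply addressed to the VIP and translated port is mapped back to the DIP and the port the client OS originally chose; a reply with no matching entry yields `None`.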
Removing entries from the table
As with any other NAT table, you can’t store these entries forever, so there must be rules for removing an entry. The most evident ones are:
- If the connection has been closed with a FIN/ACK exchange, we will wait a few minutes (2xMSL (Maximum Segment Lifetime) - https://www.rfc-editor.org/rfc/rfc793.txt) before removing the entry.
- If the connection has been closed with a RST we will remove the entry straightaway.
At this point, I'll bet you’ve already spotted an issue. How do we decide when to remove a UDP connection, or a TCP connection whose peers have just disappeared (e.g. crashed or simply stopped responding)?
In that case we’ve got a hardcoded timeout value. Every time a packet for a specific connection goes through the SNAT process, we start a four-minute countdown on that connection. If we reach zero before another packet goes through, we delete the SNAT entry from the table, as we consider the connection to be finished. This is a very important point: if your application keeps a connection idle for 4 minutes, its entry in the SNAT table will be deleted. Most applications won’t handle losing a connection they thought was still active, so it is prudent to manage your connection lifetime wisely and not let connections go idle.
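The countdown logic can be sketched as below. This is a simplified model rather than Azure's implementation: `IdleTracker` is a hypothetical name, and time is passed in explicitly (in seconds) so the behaviour is easy to follow.

```python
IDLE_TIMEOUT_SECONDS = 4 * 60  # the four-minute timeout described in the post


class IdleTracker:
    """Tracks when each connection last had a packet pass through SNAT."""

    def __init__(self):
        self.last_seen = {}  # connection key -> timestamp of the last packet

    def touch(self, key, now):
        """A packet went through: restart the countdown for this connection."""
        self.last_seen[key] = now

    def expire(self, now):
        """Remove and return every connection idle for the full timeout."""
        expired = [key for key, seen in self.last_seen.items()
                   if now - seen >= IDLE_TIMEOUT_SECONDS]
        for key in expired:
            del self.last_seen[key]
        return expired
```

Each call to `touch` restarts the countdown, so a connection that sends something at least once within every four-minute window is never returned by `expire`.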
Long-time idle connections considered harmful
Sometimes it’s easier to explain a complex situation with an example, so here is what could go wrong and why you shouldn’t keep TCP connections idle. This is how an active HTTP connection would look in the connection tables of the client (a VM in a Cloud Service), the SNAT layer, and the server:
Client
SRC IP | SRC PORT | DST IP | DST PORT | TCP STATE
CLIENT DIP | 12345 | SERVER VIP | 80 | ESTABLISHED
Source port is randomly chosen by the client OS.
SNAT table
SRC IP | SRC PORT | DST IP | DST PORT
DIP -> VIP | 12345 -> 54321 | SERVER IP | 80
The DIP has been translated into the VIP, and the source port has been translated to the first available of the 160 pre-allocated ports.
Server
SRC IP | SRC PORT | DST IP | DST PORT | TCP STATE
VIP | 54321 | SERVER IP | 80 | ESTABLISHED
The server doesn’t know the client’s DIP or the original source port as these are hidden behind the VIP because of the SNAT.
So far, so good.
Let’s now imagine that this connection has been idle for just a bit more than 4 minutes. How would the tables look?
Client
SRC IP | SRC PORT | DST IP | DST PORT | TCP STATE
CLIENT DIP | 12345 | SERVER VIP | 80 | ESTABLISHED
There are no changes here. The client has the connection ready for when more data is needed, but there’s no data pending from the server.
SNAT table
SRC IP | SRC PORT | DST IP | DST PORT
REMOVED | REMOVED | REMOVED | REMOVED
What happened here?!
The SNAT table entry has expired because it has been 4+ minutes idle, so it’s gone from the SNAT table.
Server
SRC IP | SRC PORT | DST IP | DST PORT | TCP STATE
CLIENT VIP | 54321 | SERVER IP | 80 | ESTABLISHED
As expected, nothing changed on the server. It has sent all the data that the client requested and it’s been 4+ minutes awaiting new requests on that TCP connection.
Now comes the problem. Let’s say the client resumes its operations and decides to request more data from the server using the same connection. Unfortunately, that won’t work because Azure will drop the traffic at the SNAT layer, because the packet does not meet any of these criteria:
- It belongs to an existing connection (nope: the entry had expired and was removed from the table)
- It is a SYN packet, i.e. the start of a new connection (nope: it isn't a SYN packet)
This means that the attempt to send data on this connection will fail. OK, this is a problem but not the end of the world, right? The client will just open a new TCP connection (i.e. a SYN packet will go through the SNAT) and send the HTTP request inside that one. That’s correct, but there are situations where we could face another consequence of that SNAT entry expiry.
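The two acceptance criteria can be expressed as a small decision function. The sketch below is only a model of the behaviour described in this post; `Flow`, `snat_accepts`, and the addresses are illustrative names and values, and the table is reduced to a plain set of flows.

```python
from typing import NamedTuple, Set


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def snat_accepts(flow: Flow, is_syn: bool, table: Set[Flow]) -> bool:
    """Return True if the SNAT layer would let this outbound packet through."""
    if flow in table:
        return True          # belongs to an existing connection
    if is_syn:
        table.add(flow)      # a SYN starts a new connection and creates an entry
        return True
    return False             # neither criterion met: the packet is dropped


table: Set[Flow] = set()

# The idle connection: its entry existed once but has since expired.
idle = Flow("10.0.0.4", 12345, "198.51.100.7", 80)
table.add(idle)
table.discard(idle)
data_packet_ok = snat_accepts(idle, is_syn=False, table=table)

# A brand-new connection from a different client port starts with a SYN.
fresh = Flow("10.0.0.4", 12346, "198.51.100.7", 80)
syn_ok = snat_accepts(fresh, is_syn=True, table=table)
```

The data packet on the expired flow is rejected, while the SYN for the new flow is accepted and gets its own table entry.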
If the client is opening new connections to the same server and port (SERVER IP:80) fast enough to cycle through the 160 assigned ports, but not explicitly closing them, we will run through the original 160 ports in no time, and port 54321 will be free to use again (remember: the translation 12345 -> 54321 has expired). Sooner rather than later, port 54321 will be used again for a new translation with a source port other than 12345, but for the same source and destination IP addresses (and the same destination port!). Here’s how it will look:
Client
SRC IP | SRC PORT | DST IP | DST PORT | TCP STATE
CLIENT DIP | 12346 | SERVER VIP | 80 | SYN_SENT
The client decides to create a new connection, so it sends a SYN packet to SERVER IP:80.
SNAT table
SRC IP | SRC PORT | DST IP | DST PORT
DIP -> VIP | 12346 -> 54321 | SERVER IP | 80
Azure sees the packet and finds no matching entry in the table, but accepts it because it is a SYN packet (i.e. a new connection). It translates 12346 to 54321, since that is again the first available of the 160 pre-allocated ports.
Server
SRC IP | SRC PORT | DST IP | DST PORT | TCP STATE
CLIENT VIP | 54321 | SERVER IP | 80 | ESTABLISHED
The server already has an ESTABLISHED connection for this tuple, so when it receives a SYN packet from CLIENT VIP:54321 it will ignore and drop it. At this point, we’ve ended up with two broken connections: the original one that has been idle for 4+ minutes, and the new one.
The best way to avoid this issue, and many similar issues on other kinds of platforms, is to have a sensible keep-alive at the application level (https://msdn.microsoft.com/en-us/library/windows/desktop/ee470551(v=vs.85).aspx). Consider sending a packet through an idle connection every 30 seconds to 1 minute, as this resets any idle timers both in Azure and in on-premises firewalls.
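As one possible way to do this from Python, the sketch below turns on TCP keep-alive probes for a socket. The helper name `enable_keepalive` and the 30-second and 10-second defaults are assumptions for illustration. The fine-grained timing options are platform specific, so the code only sets the ones the running platform exposes; an application-level heartbeat message is an equally valid alternative.

```python
import socket


def enable_keepalive(sock, idle_seconds=30, interval_seconds=10):
    """Ask the OS to probe an idle TCP connection so it never looks idle."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms)
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, idle_seconds * 1000, interval_seconds * 1000))
    else:
        # Linux and similar: set each option only if this platform defines it.
        if hasattr(socket, "TCP_KEEPIDLE"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
        if hasattr(socket, "TCP_KEEPINTVL"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
```

Calling `enable_keepalive(sock)` on a TCP socket before or right after connecting keeps the idle time between packets well under the four-minute limit, provided the platform honours the timing options.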
Need a quick workaround?
There’s a quick workaround you can use in Azure. You can avoid using the VIP for your outgoing traffic (and incoming too) by assigning an Instance-Level IP address, known as a PIP. The PIP is assigned to only one instance, so there is no need for SNAT to accommodate the requests of different VMs. Traffic still goes through the software load balancer (SLB), but as there’s no SNAT applied, there’s no SNAT table and you can happily keep your connections idle... until the SLB kills them (https://azure.microsoft.com/blog/2014/08/14/new-configurable-idle-timeout-for-azure-load-balancer/), but that’s another story. :)
Before we go, we should probably also acknowledge another long-standing problem with this design. Since Azure allocates these outgoing ports in batches of 160, it is possible that the creation of a new batch of 160 may not happen fast enough and an outgoing connection attempt will fail. We typically only see this under very high load (almost always load testing), but if you fall victim to this, the solution is the same – use a PIP.