@@ -25,8 +25,10 @@ This article helps administrators diagnose and resolve connectivity problems whe
25
25
26
26
## Problems
27
27
28
-
-[SNAT exhaustion](#snat-exhaustion).
29
-
-[ICMP ping is failing](#icmp-ping-is-failing).
28
+
* [SNAT exhaustion](#snat-exhaustion)
29
+
* [ICMP ping is failing](#icmp-ping-is-failing)
30
+
* [Connectivity failures](#connectivity-failures)
31
+
* [IPv6 coexistence](#ipv6-coexistence)
30
32
31
33
To resolve these problems, follow the steps in the following section.
32
34
@@ -36,36 +38,43 @@ To resolve these problems, follow the steps in the following section.
36
38
37
39
A single [NAT gateway resource](nat-gateway-resource.md) supports from 64,000 up to 1 million concurrent flows. Each IP address provides 64,000 SNAT ports to the available inventory. You can use up to 16 IP addresses per NAT gateway resource. The SNAT mechanism is described [here](nat-gateway-resource.md#source-network-address-translation) in more detail.
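The relationship between IP addresses and SNAT port inventory can be checked with a short calculation. The following is a minimal sketch that uses only the two figures stated above (64,000 SNAT ports per IP address, up to 16 IP addresses); the function and constant names are illustrative and aren't part of any Azure SDK.

```python
# Figures taken from the paragraph above.
SNAT_PORTS_PER_IP = 64_000
MAX_IPS_PER_NAT_GATEWAY = 16


def snat_port_inventory(ip_count: int) -> int:
    """Return the SNAT port inventory for a NAT gateway with ip_count IP addresses."""
    if not 1 <= ip_count <= MAX_IPS_PER_NAT_GATEWAY:
        raise ValueError(
            f"ip_count must be between 1 and {MAX_IPS_PER_NAT_GATEWAY}"
        )
    return ip_count * SNAT_PORTS_PER_IP


for ip_count in (1, 4, 16):
    print(ip_count, snat_port_inventory(ip_count))
```

With 16 IP addresses the inventory is 1,024,000 ports, which is where the "up to 1 million" figure comes from.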
38
40
39
-
#### Steps:
41
+
Frequently the root cause of SNAT exhaustion is an anti-pattern for how outbound connectivity is established and managed. Review this section carefully.
42
+
43
+
#### Steps
40
44
41
45
1. Investigate how your application is creating outbound connectivity (for example, code review or packet capture).
42
-
2. Determine if this activity is expected behavior or whether the application is misbehaving. Use [metrics](nat-metrics.md) in Azure Monitor to substantiate your findings.
46
+
2. Determine if this activity is expected behavior or whether the application is misbehaving. Use [metrics](nat-metrics.md) in Azure Monitor to substantiate your findings. Use the "Failed" category of the SNAT Connections metric.
43
47
3. Evaluate if appropriate patterns are followed.
44
48
4. Evaluate if SNAT port exhaustion should be mitigated with additional IP addresses assigned to the NAT gateway resource.
45
49
46
-
#### Design pattern:
50
+
#### Design patterns
51
+
52
+
Always take advantage of connection reuse and connection pooling whenever possible. These patterns will avoid resource exhaustion problems outright and result in predictable, reliable, and scalable behavior. Primitives for these patterns can be found in many development libraries and frameworks.
53
+
54
+
_**Solution:**_ Use appropriate patterns
47
55
48
-
Always take advantage of connection reuse and connection pooling whenever possible. This pattern will avoid resource exhaustion problems outright and result in predictable behavior. Primitives for these patterns can be found in many development libraries and frameworks.
49
56
- Consider [asynchronous polling patterns](https://docs.microsoft.com/azure/architecture/patterns/async-request-reply) for long-running operations to free up connection resources for other operations.
50
57
- Long-lived flows (for example, reused TCP connections) should use TCP keepalives or application layer keepalives to avoid intermediate systems timing out.
51
58
- Graceful [retry patterns](https://docs.microsoft.com/azure/architecture/patterns/retry) should be used to avoid aggressive retries/bursts during transient failure or failure recovery.
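A graceful retry can be sketched as exponential back-off with random jitter, so that many clients recovering from the same transient failure don't retry in lockstep. This is a minimal illustration, not the implementation from the linked retry pattern article; the function name, parameters, and default values are assumptions chosen for the example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() and retry on connection errors with jittered back-off.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Upper bound doubles each attempt, capped at max_delay.
            upper_bound = min(max_delay, base_delay * (2 ** attempt))
            # "Full jitter": wait a random time in [0, upper_bound].
            sleep(random.uniform(0, upper_bound))
```

The `sleep` parameter exists only so that the waiting behavior can be replaced in tests.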
52
59
Creating a new TCP connection for every HTTP operation (also known as "atomic connections") is an anti-pattern. Atomic connections will prevent your application from scaling well and waste resources. Always pipeline multiple operations into the same connection. Your application will benefit in transaction speed and resource costs. When your application uses transport layer encryption (for example TLS), there's a significant cost associated with the processing of new connections. Review [Azure Cloud Design Patterns](https://docs.microsoft.com/azure/architecture/patterns/) for additional best practice patterns.
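The difference between atomic connections and connection reuse can be seen in a small, self-contained sketch. It assumes a throwaway local HTTP server (the hypothetical `EchoHandler` below) in place of a real destination and sends several requests sequentially over one persistent connection by using Python's standard `http.client` module. The server records the client source port of every request, so a single recorded port means a single TCP flow was used.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local test server; stands in for a real destination."""
    protocol_version = "HTTP/1.1"  # allows persistent (keep-alive) connections
    seen_client_ports = set()

    def do_GET(self):
        # Each distinct client port corresponds to a distinct TCP connection.
        EchoHandler.seen_client_ports.add(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch_many_reusing_connection(host, port, paths):
    """Send one GET per path over a single TCP connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body before reusing the connection
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses


server = ThreadingHTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
statuses = fetch_many_reusing_connection(
    "127.0.0.1", server.server_address[1], ["/a", "/b", "/c"]
)
server.shutdown()
server.server_close()
print(statuses, len(EchoHandler.seen_client_ports))
```

Behind a NAT gateway, each new outbound TCP connection consumes a SNAT port, so the reused connection in this sketch would consume one port instead of three. Most HTTP client libraries offer equivalent connection pooling primitives.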
53
60
54
-
#### Mitigations
61
+
#### Possible mitigations
55
62
56
-
You can scale outbound connectivity as follows:
63
+
_**Solution:**_ Scale outbound connectivity as follows:
57
64
58
-
| Scenario | Mitigation |
59
-
|---|---|
60
-
| You're experiencing contention for SNAT ports and SNAT port exhaustion during periods of high usage. | Determine if you can add additional public IP address resources or public IP prefix resources. This addition will allow for up to 16 IP addresses in total to your NAT gateway. This addition will provide more inventory for available SNAT ports (64,000 per IP address) and allow you to scale your scenario further.|
61
-
| You've already given 16 IP addresses and still are experiencing SNAT port exhaustion. | Distribute your application environment across multiple subnets and provide a NAT gateway resource for each subnet. |
65
+
| Scenario |Evidence |Mitigation |
66
+
|---|---|---|
67
+
| You're experiencing contention for SNAT ports and SNAT port exhaustion during periods of high usage. |The "Failed" category of the SNAT Connections [metric](nat-metrics.md) in Azure Monitor shows transient or persistent failures over time and high connection volume. |Determine if you can add public IP address resources or public IP prefix resources, up to a total of 16 IP addresses on your NAT gateway. Each added IP address provides more inventory of available SNAT ports (64,000 per IP address) and allows you to scale your scenario further.|
68
+
| You've already assigned 16 IP addresses and are still experiencing SNAT port exhaustion. |An attempt to add another IP address fails because the total number of IP addresses from public IP address resources or public IP prefix resources would exceed 16. |Distribute your application environment across multiple subnets and provide a NAT gateway resource for each subnet. Reevaluate your design patterns to optimize based on the preceding [guidance](#design-patterns). |
62
69
63
70
>[!NOTE]
64
71
>It is important to understand why SNAT exhaustion occurs. Make sure you are using the right patterns for scalable and reliable scenarios. Adding more SNAT ports to a scenario without understanding the cause of the demand should be a last resort. If you do not understand why your scenario is applying pressure on SNAT port inventory, adding more SNAT ports to the inventory by adding more IP addresses will only delay the same exhaustion failure as your application scales. You may be masking other inefficiencies and anti-patterns.
65
72
66
73
### ICMP ping is failing
67
74
68
-
[Virtual Network NAT](nat-overview.md) supports IPv4 UDP and TCP protocols. ICMP isn't supported and expected to fail. Instead, use TCP connection tests (for example "TCP ping") and UDP-specific application layer tests to validate end to end connectivity.
75
+
[Virtual Network NAT](nat-overview.md) supports IPv4 UDP and TCP protocols. ICMP isn't supported and is expected to fail.
76
+
77
+
_**Solution:**_ Instead, use TCP connection tests (for example, "TCP ping") and UDP-specific application layer tests to validate end-to-end connectivity.
69
78
70
79
The following table can be used as a starting point for choosing which tools to use for tests.
71
80
@@ -74,8 +83,86 @@ The following table can be used a starting point for which tools to use to start
74
83
| Linux | nc (generic connection test) | curl (TCP application layer test) | application specific |
75
84
| Windows |[PsPing](https://docs.microsoft.com/sysinternals/downloads/psping)| PowerShell [Invoke-WebRequest](https://docs.microsoft.com/powershell/module/microsoft.powershell.utility/invoke-webrequest)| application specific |
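If none of the tools in the table are available, a generic TCP connection test can be approximated in a few lines of Python. This is a minimal sketch: the function name and the 10-second default timeout are assumptions for illustration, and the demonstration connects to a temporary local listener so that it runs without Internet access. In practice you would pass the host and port of the destination you want to validate.

```python
import socket
import time


def tcp_connect_test(host, port, timeout=10.0):
    """Attempt a TCP handshake; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False, time.monotonic() - start


if __name__ == "__main__":
    # Demonstration against a temporary local listener (an assumption made
    # so that the example is self-contained).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        ok, elapsed = tcp_connect_test("127.0.0.1", listener.getsockname()[1])
        print(ok, round(elapsed, 3))
```

The test only confirms that the TCP handshake completes. It doesn't validate application layer behavior, which is what tools like curl or Invoke-WebRequest cover.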
76
85
86
+
### Connectivity failures
87
+
88
+
Connectivity issues with [Virtual Network NAT](nat-overview.md) can have several different causes:
89
+
90
+
* transient or persistent [SNAT exhaustion](#snat-exhaustion) of the NAT gateway,
91
+
* transient failures in the Azure infrastructure,
92
+
* transient failures in the path between Azure and the public Internet destination,
93
+
* transient or persistent failures at the public Internet destination.
94
+
95
+
Use tools like the following to validate connectivity. [ICMP ping is not supported](#icmp-ping-is-failing).
96
+
97
+
| Operating system | Generic TCP connection test | TCP application layer test | UDP |
98
+
|---|---|---|---|
99
+
| Linux | nc (generic connection test) | curl (TCP application layer test) | application specific |
100
+
| Windows |[PsPing](https://docs.microsoft.com/sysinternals/downloads/psping)| PowerShell [Invoke-WebRequest](https://docs.microsoft.com/powershell/module/microsoft.powershell.utility/invoke-webrequest)| application specific |
101
+
102
+
#### SNAT exhaustion
103
+
104
+
Review section on [SNAT exhaustion](#snat-exhaustion) in this article.
105
+
106
+
#### Azure infrastructure
107
+
108
+
Even though Azure monitors and operates its infrastructure with great care, transient failures can occur as there is no guarantee that transmissions are lossless. Use design patterns that allow for SYN retransmissions for TCP applications. Use connection timeouts large enough to permit TCP SYN retransmission to reduce transient impacts caused by a lost SYN packet.
109
+
110
+
_**Solution:**_
111
+
112
+
* Check for [SNAT exhaustion](#snat-exhaustion).
113
+
* The configuration parameter in a TCP stack that controls SYN retransmission behavior is called RTO ([Retransmission Time-Out](https://tools.ietf.org/html/rfc793)). The RTO value is adjustable, but by default it's typically 1 second or higher with exponential back-off. If your application's connection time-out is too short (for example, 1 second), you may see sporadic connection timeouts. Increase the application connection time-out.
114
+
* If you observe longer, unexpected timeouts with default application behaviors, open a support case for further troubleshooting.
115
+
116
+
We do not recommend artificially reducing the TCP connection timeout or tuning the RTO parameter.
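To see why a short connection timeout leaves no room for SYN retransmission, it can help to tabulate when each SYN would be sent. The following is a simplified model, assuming an initial RTO of 1 second that doubles after every retransmission, as described above. The retry count of 5 and the function names are assumptions for illustration; real TCP stacks differ in their defaults.

```python
def syn_send_times(initial_rto=1.0, retries=5):
    """Seconds after the first SYN at which each SYN is sent.

    Index 0 is the original SYN; later entries are retransmissions,
    with the RTO doubling each time (exponential back-off).
    """
    times = [0.0]
    rto = initial_rto
    for _ in range(retries):
        times.append(times[-1] + rto)
        rto *= 2
    return times


def syn_attempts_within(timeout, initial_rto=1.0, retries=5):
    """Number of SYNs (original plus retransmissions) sent before timeout."""
    return sum(1 for t in syn_send_times(initial_rto, retries) if t < timeout)


print(syn_send_times())           # send times in seconds
print(syn_attempts_within(1.0))   # SYNs sent with a 1-second timeout
print(syn_attempts_within(10.0))  # SYNs sent with a 10-second timeout
```

In this model, an application timeout of 1 second expires before the first retransmission is sent, so a single lost SYN fails the connection attempt. A 10-second timeout allows the original SYN plus three retransmissions.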
117
+
118
+
#### Public Internet transit
119
+
120
+
The probability of transient failures increases with a longer path to the destination and more intermediate systems. Expect transient failures to occur more frequently on the public Internet than within the [Azure infrastructure](#azure-infrastructure).
121
+
122
+
Follow the same guidance as in the preceding [Azure infrastructure](#azure-infrastructure) section.
123
+
124
+
#### Internet endpoint
125
+
126
+
The preceding sections apply in addition to considerations related to the Internet endpoint your communication is established with. Other factors that can impact connectivity success are:
127
+
128
+
* traffic management on the destination side, including
129
+
- API rate limiting imposed by the destination side
130
+
- Volumetric DDoS mitigations or transport layer traffic shaping
131
+
* firewall or other components at the destination
132
+
133
+
Usually packet captures at the source as well as destination (if available) are required to determine what is taking place.
134
+
135
+
_**Solution:**_
136
+
137
+
* Check for [SNAT exhaustion](#snat-exhaustion).
138
+
* Validate connectivity to an endpoint in the same region or elsewhere for comparison.
139
+
* If you're creating high volume or transaction rate testing, explore if reducing the rate reduces the occurrence of failures.
140
+
* If changing rate impacts the rate of failures, check if API rate limits or other constraints on the destination side might have been reached.
141
+
* If your investigation is inconclusive, open a support case for further troubleshooting.
142
+
143
+
#### TCP Resets received
144
+
145
+
If you observe TCP Resets (TCP RST packets) received on the source VM, they can be generated by the NAT gateway on the private side for flows that are not recognized as in progress. One possible reason is that the TCP connection has reached its idle timeout. You can adjust the idle timeout from 4 minutes up to 120 minutes.
146
+
147
+
TCP Resets are not generated on the public side of NAT gateway resources. If you receive TCP Resets on the destination side, they are generated by the source VM's stack and not the NAT gateway resource.
* Open a support case for further troubleshooting if necessary.
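As an alternative to raising the idle timeout, long-lived connections can send TCP keepalives so that the flow isn't considered idle. The following is a minimal sketch that uses Python's standard `socket` module. The function name and the interval values are assumptions for illustration (chosen to stay well under a 4-minute idle timeout), and the `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options aren't available on every platform, so each one is guarded.

```python
import socket


def enable_tcp_keepalive(sock, idle_seconds=120, interval_seconds=30,
                         probe_count=4):
    """Enable TCP keepalives on sock; values are illustrative assumptions."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below exist on Linux and some other
    # platforms. Where they're missing, the operating system defaults apply.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Idle time before the first keepalive probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Time between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL,
                        interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_tcp_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

Many client libraries expose an equivalent setting, and application layer keepalives serve the same purpose where socket options aren't accessible.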
153
+
154
+
### IPv6 coexistence
155
+
156
+
[Virtual Network NAT](nat-overview.md) supports IPv4 UDP and TCP protocols. Deployment on a [subnet with an IPv6 prefix is not supported](nat-overview.md#limitations).
157
+
158
+
_**Solution:**_ Deploy NAT gateway on a subnet without an IPv6 prefix.
159
+
160
+
You can indicate interest in additional capabilities through [Virtual Network NAT UserVoice](https://aka.ms/natuservoice).
161
+
77
162
## Next steps
78
163
79
-
- Learn about [Virtual Network NAT](nat-overview.md)
80
-
- Learn about [NAT gateway resource](nat-gateway-resource.md)
81
-
- Learn about [metrics and alerts for NAT gateway resources](nat-metrics.md).
164
+
* Learn about [Virtual Network NAT](nat-overview.md)
165
+
* Learn about [NAT gateway resource](nat-gateway-resource.md)
166
+
* Learn about [metrics and alerts for NAT gateway resources](nat-metrics.md).
167
+
* [Tell us what to build next for Virtual Network NAT in UserVoice](https://aka.ms/natuservoice).