Commit 63e1cdc

Merge pull request #235685 from asudbring/linux-fixes
[Doc-a-thon] Linux doc-athon fixes Troubleshoot Azure NAT Gateway connectivity
2 parents 8de4158 + 7f219ac commit 63e1cdc

1 file changed

+27
-19
lines changed

articles/virtual-network/nat-gateway/troubleshoot-nat-connectivity.md

Lines changed: 27 additions & 19 deletions
@@ -1,13 +1,12 @@
11
---
22
title: Troubleshoot Azure NAT Gateway connectivity
3-
titleSuffix: Azure Virtual Network
4-
description: Troubleshoot connectivity issues with NAT gateway.
3+
description: Troubleshoot connectivity issues with a NAT gateway.
54
author: asudbring
65
ms.service: virtual-network
76
ms.subservice: nat
87
ms.custom: ignite-2022
98
ms.topic: troubleshooting
10-
ms.date: 08/29/2022
9+
ms.date: 04/24/2023
1110
ms.author: allensu
1211
---
1312

@@ -27,36 +26,40 @@ SNAT exhaustion issues with NAT gateway typically have to do with the configurat
2726

2827
Each public IP address provides 64,512 SNAT ports for connecting outbound with NAT gateway. From those available SNAT ports, NAT gateway can support up to 50,000 concurrent connections to the same destination endpoint. If outbound connections are dropping because SNAT ports are being exhausted, then NAT gateway may not be scaled out enough to handle the workload. More public IP addresses on NAT gateway may be required in order to provide more SNAT ports for outbound connectivity.
2928

30-
The table below describes two common outbound connectivity failure scenarios due to scalability issues and how to validate and mitigate these issues:
29+
The following table describes two common outbound connectivity failure scenarios due to scalability issues and how to validate and mitigate these issues:
3130

3231
| Scenario | Evidence |Mitigation |
3332
|---|---|---|
34-
| You're experiencing contention for SNAT ports and SNAT port exhaustion during periods of high usage. | You run the following [metrics](nat-metrics.md) in Azure Monitor: **Total SNAT Connection Count**: "Sum" aggregation shows high connection volume. For **SNAT Connection Count**, "Failed" connection state shows transient or persistent failures over time. **Dropped Packets**: "Sum" aggregation shows packets dropping consistent with high connection volume and connection failures. | Add more public IP addresses or public IP prefixes as need (assign up to 16 IP addresses in total to your NAT gateway). This addition will provide more SNAT port inventory and allow you to scale your scenario further. |
35-
| You've already assigned 16 IP addresses to your NAT gateway and still are experiencing SNAT port exhaustion. | Attempt to add more IP addresses fails. Total number of IP addresses from public IP address or public IP prefix resources exceeds a total of 16. | Distribute your application environment across multiple subnets and provide a NAT gateway resource for each subnet. |
33+
| You're experiencing contention for SNAT ports and SNAT port exhaustion during periods of high usage. | You run the following [metrics](nat-metrics.md) in Azure Monitor: **Total SNAT Connection Count**: "Sum" aggregation shows high connection volume. For **SNAT Connection Count**, "Failed" connection state shows transient or persistent failures over time. **Dropped Packets**: "Sum" aggregation shows packets dropping consistent with high connection volume and connection failures. | Add more public IP addresses or public IP prefixes as needed (assign up to 16 IP addresses in total to your NAT gateway). This addition provides more SNAT port inventory and allows you to scale your scenario further. |
34+
| You have already assigned 16 IP addresses to your NAT gateway and still are experiencing SNAT port exhaustion. | Attempt to add more IP addresses fails. Total number of IP addresses from public IP address or public IP prefix resources exceeds a total of 16. | Distribute your application environment across multiple subnets and provide a NAT gateway resource for each subnet. |
3635

3736
>[!NOTE]
38-
>It is important to understand why SNAT exhaustion occurs. Make sure you are using the right patterns for scalable and reliable scenarios. Adding more SNAT ports to a scenario without understanding the cause of the demand should be a last resort. If you do not understand why your scenario is applying pressure on SNAT port inventory, adding more SNAT ports by adding more IP addresses will only delay the same exhaustion failure as your application scales. You may be masking other inefficiencies and anti-patterns. See [best practices for efficient use of outbound connections](#outbound-connectivity-best-practices) for additional guidance.
37+
>It is important to understand why SNAT exhaustion occurs. Make sure you are using the right patterns for scalable and reliable scenarios. Adding more SNAT ports to a scenario without understanding the cause of the demand should be a last resort. If you do not understand why your scenario is applying pressure on SNAT port inventory, adding more SNAT ports by adding more IP addresses will only delay the same exhaustion failure as your application scales. You may be masking other inefficiencies and anti-patterns. For more information, see [best practices for efficient use of outbound connections](#outbound-connectivity-best-practices).
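
The port arithmetic behind the scaling guidance above can be sketched as follows. This is an illustrative calculation only, not part of any Azure tooling. The constants come from the figures quoted in this section (64,512 SNAT ports per public IP address, at most 16 IP addresses per NAT gateway). The function names are hypothetical, and treating each concurrent flow as consuming one SNAT port is a deliberate simplification, so read the result as a rough upper bound on demand rather than a sizing rule.

```python
# Figures quoted in the article text above.
SNAT_PORTS_PER_PUBLIC_IP = 64_512
MAX_PUBLIC_IPS_PER_NAT_GATEWAY = 16


def total_snat_ports(public_ip_count: int) -> int:
    """Return the SNAT port inventory for a NAT gateway with this many public IPs."""
    if not 1 <= public_ip_count <= MAX_PUBLIC_IPS_PER_NAT_GATEWAY:
        raise ValueError("a NAT gateway supports between 1 and 16 public IP addresses")
    return public_ip_count * SNAT_PORTS_PER_PUBLIC_IP


def public_ips_needed(peak_concurrent_flows: int) -> int:
    """Hypothetical helper: smallest IP count whose port inventory covers the peak.

    Assumes one SNAT port per concurrent flow (a simplification).
    """
    # Ceiling division without importing math.
    needed = -(-peak_concurrent_flows // SNAT_PORTS_PER_PUBLIC_IP)
    needed = max(needed, 1)
    if needed > MAX_PUBLIC_IPS_PER_NAT_GATEWAY:
        # Matches the second table row: 16 IPs are not enough for this workload.
        raise ValueError("peak exceeds one NAT gateway; consider splitting across subnets")
    return needed
```

If `public_ips_needed` raises for your estimated peak, that corresponds to the second scenario in the table: distribute the workload across multiple subnets, each with its own NAT gateway, and review the connection patterns before adding more ports.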
3938
4039
### TCP idle timeout timers set higher than the default value
4140

42-
The NAT gateway TCP idle timeout timer is set to 4 minutes by default but is configurable up to 120 minutes. If the timer is set to a higher value than the default, NAT gateway will hold on to flows longer, and can create [extra pressure on SNAT port inventory](./nat-gateway-resource.md#timers). The table below describes a scenario where a long TCP idle timeout timer is causing SNAT exhaustion and provides mitigation steps to take:
41+
The NAT gateway TCP idle timeout timer is set to 4 minutes by default but is configurable up to 120 minutes. If the timer is set to a higher value than the default, NAT gateway holds on to flows longer, and can create [extra pressure on SNAT port inventory](./nat-gateway-resource.md#timers).
42+
43+
The following table describes a scenario where a long TCP idle timeout timer is causing SNAT exhaustion and provides mitigation steps to take:
4344

4445
| Scenario | Evidence | Mitigation |
4546
|---|---|---|
46-
| You want to ensure that TCP connections stay active for long periods of time without going idle and timing out. You increase the TCP idle timeout timer setting. After a period of time, you start to notice that connection failures occur more often. You suspect that you may be exhausting your inventory of SNAT ports since connections are holding on to them longer. | You check the following [NAT gateway metrics](nat-metrics.md) in Azure Monitor to determine if SNAT port exhaustion is happening: **Total SNAT Connection Count**: "Sum" aggregation shows high connection volume. For **SNAT Connection Count**, "Failed" connection state shows transient or persistent failures over time. **Dropped Packets**: "Sum" aggregation shows packets dropping consistent with high connection volume and connection failures. | Some possible steps you can take to resolve SNAT port exhaustion include: </br></br> **Reduce the TCP idle timeout** to a lower value to free up SNAT port inventory earlier. The TCP idle timeout timer can't be set lower than 4 minutes. </br></br> Consider **[asynchronous polling patterns](/azure/architecture/patterns/async-request-reply)** to free up connection resources for other operations. </br></br> **Use TCP keepalives or application layer keepalives** to avoid intermediate systems timing out. For examples, see [.NET examples](/dotnet/api/system.net.servicepoint.settcpkeepalive). </br></br> Make connections to Azure PaaS services over the Azure backbone using **[Private Link](../../private-link/private-link-overview.md)**. This frees up SNAT ports for outbound connections to the internet. |
47+
| You want to ensure that TCP connections stay active for long periods of time without idling and timing out. You increase the TCP idle timeout timer setting. After a period of time, you start to notice that connection failures occur more often. You suspect that you may be exhausting your inventory of SNAT ports since connections are holding on to them longer. | You check the following [NAT gateway metrics](nat-metrics.md) in Azure Monitor to determine if SNAT port exhaustion is happening: **Total SNAT Connection Count**: "Sum" aggregation shows high connection volume. For **SNAT Connection Count**, "Failed" connection state shows transient or persistent failures over time. **Dropped Packets**: "Sum" aggregation shows packets dropping consistent with high connection volume and connection failures. | Some possible steps you can take to resolve SNAT port exhaustion include: </br></br> **Reduce the TCP idle timeout** to a lower value to free up SNAT port inventory earlier. The TCP idle timeout timer can't be set lower than 4 minutes. </br></br> Consider **[asynchronous polling patterns](/azure/architecture/patterns/async-request-reply)** to free up connection resources for other operations. </br></br> **Use TCP keepalives or application layer keepalives** to avoid intermediate systems timing out. For examples, see [.NET examples](/dotnet/api/system.net.servicepoint.settcpkeepalive). </br></br> Make connections to Azure PaaS services over the Azure backbone using **[Private Link](../../private-link/private-link-overview.md)**. The use of private link frees up SNAT ports for outbound connections to the internet. |
4748

4849
## Connection failures due to idle timeouts
4950

5051
### TCP idle timeout
5152

52-
As described in the [TCP timers](#tcp-idle-timeout-timers-set-higher-than-the-default-value) section above, TCP keepalives should be used to refresh idle flows and reset the idle timeout. TCP keepalives only need to be enabled from one side of a connection in order to keep a connection alive from both sides. When a TCP keepalive is sent from one side of a connection, the other side automatically sends an ACK packet. The idle timeout timer is then reset on both sides of the connection. To learn more, see [Timer considerations](./nat-gateway-resource.md#timer-considerations).
53+
As described in the [TCP timers](#tcp-idle-timeout-timers-set-higher-than-the-default-value) section earlier in this article, TCP keepalives should be used to refresh idle flows and reset the idle timeout. TCP keepalives only need to be enabled from one side of a connection in order to keep a connection alive from both sides. When a TCP keepalive is sent from one side of a connection, the other side automatically sends an ACK packet. The idle timeout timer is then reset on both sides of the connection. To learn more, see [Timer considerations](./nat-gateway-resource.md#timer-considerations).
5354

5455
>[!Note]
5556
>Increasing the TCP idle timeout is a last resort and may not resolve the root cause. A long timeout can cause low-rate failures when timeout expires and introduce delay and unnecessary failures.
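
As a rough illustration of the keepalive guidance above, the following Python sketch enables TCP keepalives on a client socket so that probes are sent before the 4-minute default idle timeout elapses. It is a minimal sketch under stated assumptions: the helper name and the default values (120-second idle time, 30-second probe interval, 4 probes) are illustrative choices, not values from the article. The per-socket tuning options `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are available on Linux, so the code guards for platforms where they are missing.

```python
import socket

# Default NAT gateway TCP idle timeout from the article: 4 minutes.
NAT_GATEWAY_DEFAULT_IDLE_TIMEOUT_S = 240


def enable_tcp_keepalive(sock, idle_s=120, interval_s=30, probes=4,
                         nat_idle_timeout_s=NAT_GATEWAY_DEFAULT_IDLE_TIMEOUT_S):
    """Hypothetical helper: turn on keepalives that fire before the NAT idle timeout."""
    if idle_s >= nat_idle_timeout_s:
        raise ValueError("first keepalive must be sent before the NAT idle timeout")
    # Enable keepalive probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning; these constants exist on Linux but not on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

A caller would typically create the socket, call `enable_tcp_keepalive(sock)`, and then connect as usual. Because the peer acknowledges each probe, enabling keepalives on one side is enough to refresh the flow, as the paragraph above describes.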
5657
5758
### UDP idle timeout
5859

59-
UDP idle timeout timers are set to 4 minutes. Unlike TCP idle timeout timers for NAT gateway, UDP idle timeout timers aren't configurable. The table below describes a common scenario encountered with connections dropping due to UDP traffic idle timing out and steps to take to mitigate the issue.
60+
UDP idle timeout timers are set to 4 minutes. Unlike TCP idle timeout timers for NAT gateway, UDP idle timeout timers aren't configurable.
61+
62+
The following table describes a common scenario encountered with connections dropping due to UDP traffic idle timing out and steps to take to mitigate the issue.
6063

6164
| Scenario | Evidence | Mitigation |
6265
|---|---|---|
@@ -66,13 +69,13 @@ UDP idle timeout timers are set to 4 minutes. Unlike TCP idle timeout timers for
6669

6770
### VMs hold on to prior SNAT IP with active connection after NAT gateway added to a virtual network
6871

69-
[NAT gateway](nat-overview.md) becomes the default route to the internet when configured to a subnet. Migration from default outbound access or load balancer to NAT gateway results in new connections immediately using the IP address(es) associated with the NAT gateway resource. If a virtual machine has an established connection during the migration, the connection will continue to use the old SNAT IP address that was assigned when the connection was established.
72+
[NAT gateway](nat-overview.md) becomes the default route to the internet when configured to a subnet. Migration from default outbound access or load balancer to NAT gateway results in new connections immediately using the IP address(es) associated with the NAT gateway resource. If a virtual machine has an established connection during the migration, the connection continues to use the old SNAT IP address that was assigned when the connection was established.
7073

7174
Test and resolve issues with VMs holding on to old SNAT IP addresses by:
7275

7376
- Ensure you've established a new connection and that existing connections aren't being reused in the OS or that the browser is caching the connections. For example, when using curl in PowerShell, make sure to specify the -DisableKeepalive parameter to force a new connection. If you're using a browser, connections may also be pooled.
7477

75-
- It isn't necessary to reboot a virtual machine in a subnet configured to NAT gateway. However, if a virtual machine is rebooted, the connection state is flushed. When the connection state has been flushed, all connections will begin using the NAT gateway resource's IP address(es). This behavior is a side effect of the virtual machine reboot and not an indicator that a reboot is required.
78+
- It isn't necessary to reboot a virtual machine in a subnet configured to NAT gateway. However, if a virtual machine is rebooted, the connection state is flushed. When the connection state has been flushed, all connections begin using the NAT gateway resource's IP address(es). This behavior is a side effect of the virtual machine reboot and not an indicator that a reboot is required.
7679

7780
If you're still having trouble, open a support case for further troubleshooting.
7881

@@ -95,7 +98,7 @@ Once the custom UDR is removed from the routing table, the NAT gateway public IP
9598

9699
### Private IPs are used to connect to Azure services by Private Link
97100

98-
[Private Link](../../private-link/private-link-overview.md) connects your Azure virtual networks privately to Azure PaaS services such as Azure Storage, Azure SQL, or Azure Cosmos DB over the Azure backbone network instead of over the internet. Private Link uses the private IP addresses of virtual machine instances in your virtual network to connect to these Azure platform services instead of the public IP of NAT gateway. As a result, when looking at the source IP address used to connect to these Azure services, you'll notice that the private IPs of your instances are used. See [Azure services listed here](../../private-link/availability.md) for all services supported by Private Link.
101+
[Private Link](../../private-link/private-link-overview.md) connects your Azure virtual networks privately to Azure PaaS services such as Azure Storage, Azure SQL, or Azure Cosmos DB over the Azure backbone network instead of over the internet. Private Link uses the private IP addresses of virtual machine instances in your virtual network to connect to these Azure platform services instead of the public IP of NAT gateway. As a result, when looking at the source IP address used to connect to these Azure services, you notice that the private IPs of your instances are used. See [Azure services listed here](../../private-link/availability.md) for all services supported by Private Link.
99102

100103
To check which Private Endpoints you have set up with Private Link:
101104

@@ -107,7 +110,7 @@ Service endpoints can also be used to connect your virtual network to Azure PaaS
107110

108111
1. From the Azure portal, navigate to your virtual network and select "Service endpoints" from Settings.
109112

110-
2. All Service endpoints created will be listed along with which subnets they're configured. For more information, see [logging and troubleshooting Service endpoints](../virtual-network-service-endpoints-overview.md#logging-and-troubleshooting).
113+
2. All Service endpoints created are listed along with the subnets where they're configured. For more information, see [logging and troubleshooting Service endpoints](../virtual-network-service-endpoints-overview.md#logging-and-troubleshooting).
111114

112115
>[!NOTE]
113116
>Private Link is the recommended option over Service endpoints for private access to Azure hosted services.
@@ -126,7 +129,7 @@ Use NAT gateway [metrics](nat-metrics.md) in Azure monitor to diagnose connectio
126129

127130
* Look at packet count at the source and the destination (if available) to determine how many connection attempts were made.
128131

129-
* Look at dropped packets to see how many packets were dropped by NAT gateway.
132+
* Look at dropped packets to see how many packets were dropped by NAT gateway.
130133

131134
What else to check for:
132135

@@ -144,8 +147,10 @@ Outbound Passive FTP may not work for NAT gateway with multiple public IP addres
144147

145148
Passive FTP establishes different connections for control and data channels. When a NAT gateway with multiple public IP addresses sends traffic outbound, it randomly selects one of its public IP addresses for the source IP address. FTP may fail when data and control channels use different source IP addresses, depending on your FTP server configuration.
146149

147-
To prevent possible passive FTP connection failures, make sure to do the following:
150+
To prevent possible passive FTP connection failures, do the following steps:
151+
148152
1. Check that your NAT gateway is attached to a single public IP address rather than multiple IP addresses or a prefix.
153+
149154
2. Make sure that the passive port range from your NAT gateway is allowed to pass any firewalls that may be at the destination endpoint.
150155

151156
### Extra network captures
@@ -158,7 +163,7 @@ If your investigation is inconclusive, open a support case for further troublesh
158163

159164
## Outbound connectivity best practices
160165

161-
Azure monitors and operates its infrastructure with great care. However, transient failures can still occur from deployed applications, there's no guarantee that transmissions are lossless. NAT gateway is the preferred option to connect outbound from Azure deployments in order to ensure highly reliable and resilient outbound connectivity. In addition to using NAT gateway to connect outbound, use the guidance below for how to ensure that applications are using connections efficiently.
166+
Azure monitors and operates its infrastructure with great care. However, transient failures can still occur from deployed applications, and there's no guarantee that transmissions are lossless. NAT gateway is the preferred option to connect outbound from Azure deployments in order to ensure highly reliable and resilient outbound connectivity. In addition to using NAT gateway to connect outbound, use the guidance later in the article to ensure that applications are using connections efficiently.
162167

163168
### Modify the application to use connection pooling
164169

@@ -189,14 +194,17 @@ When possible, Private Link should be used to connect directly from your virtual
189194
To create a Private Link, see the following Quickstart guides to get started:
190195

191196
* [Create a Private Endpoint](../../private-link/create-private-endpoint-portal.md?tabs=dynamic-ip)
197+
192198
* [Create a Private Link](../../private-link/create-private-link-service-portal.md)
193199

194200
## Next steps
195201

196-
We're always looking to improve the experience of our customers. If you're experiencing issues with NAT gateway that aren't listed or resolved by this article, submit feedback through GitHub via the bottom of this page. We'll address your feedback as soon as possible.
202+
We always strive to enhance our customers' experience. If you encounter NAT gateway issues that aren't addressed or resolved by this article, provide feedback through GitHub at the bottom of this page.
197203

198204
To learn more about NAT gateway, see:
199205

200206
* [Azure NAT Gateway](./nat-overview.md)
207+
201208
* [NAT gateway resource](./nat-gateway-resource.md)
209+
202210
* [Metrics and alerts for NAT gateway resources](./nat-metrics.md)
