Commit 863d0bb

Merge pull request #108807 from v-miegge/v-miegge/troubleshoot-intermittent-outbound-connection-errors
CI 115826 - Created file and TOC entry
2 parents ffc2353 + f8431d6 commit 863d0bb

2 files changed

+176
-0
lines changed

articles/app-service/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -265,6 +265,8 @@
265 265
href: https://azure.microsoft.com/documentation/scripts/
266 266
- name: Troubleshooting
267 267
items:
268+
- name: Troubleshoot intermittent outbound connection errors
269+
href: troubleshoot-intermittent-outbound-connection-errors.md
268 270
- name: Troubleshoot with Visual Studio
269 271
href: troubleshoot-dotnet-visual-studio.md
270 272
- name: Troubleshoot Node.js app
Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
1+
---
2+
title: Troubleshooting intermittent outbound connection errors in Azure App Service
3+
description: Troubleshoot intermittent connection errors and related performance issues in Azure App Service
4+
author: v-miegge
5+
manager: barbkess
6+
7+
ms.topic: troubleshooting
8+
ms.date: 03/24/2020
9+
ms.author: ramakoni
10+
ms.custom: security-recommendations
11+
12+
---
13+
14+
# Troubleshooting intermittent outbound connection errors in Azure App Service
15+
16+
This article helps you troubleshoot intermittent connection errors and related performance issues in [Azure App Service](https://docs.microsoft.com/azure/app-service/overview). It explains source network address translation (SNAT) port exhaustion and describes methods for troubleshooting it. If you need more help at any point in this article, contact the Azure experts on the [MSDN Azure and Stack Overflow forums](https://azure.microsoft.com/support/forums/). Alternatively, file an Azure support incident: go to the [Azure Support site](https://azure.microsoft.com/support/options/) and select **Get Support**.
17+
18+
## Symptoms
19+
20+
Applications and functions hosted on Azure App Service may exhibit one or more of the following symptoms:
21+
22+
* Slow response times on all or some of the instances in a service plan
* Intermittent 5xx or **Bad Gateway** errors
* Timeout error messages
* Failure to connect to external endpoints (such as SQL Database, Service Fabric, or other App Service apps)
26+
27+
## Cause
28+
29+
A major cause of these symptoms is that the application instance is not able to open a new connection to the external endpoint because it has reached one of the following limits:
30+
31+
* TCP Connections: There is a limit on the number of outbound connections that can be made. This is associated with the size of the worker used.
32+
* SNAT ports: As discussed in [Outbound connections in Azure](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections), Azure uses source network address translation (SNAT) and a load balancer (not exposed to customers) to communicate with endpoints outside Azure in the public IP address space. Each instance on Azure App Service is initially given a pre-allocated number of **128** SNAT ports. That limit affects opening connections to the same host and port combination. If your app creates connections to a mix of address and port combinations, you will not use up your SNAT ports. The SNAT ports are used up when you make repeated calls to the same address and port combination. Once a port has been released, it is available for reuse as needed. The Azure Network load balancer reclaims a SNAT port from a closed connection only after waiting 4 minutes.
33+
34+
When applications or functions rapidly open new connections, they can quickly exhaust their pre-allocated quota of 128 ports. They are then blocked until a new SNAT port becomes available, either through dynamic allocation of additional SNAT ports or through reuse of a reclaimed SNAT port. Applications or functions that are blocked because they cannot create new connections will begin experiencing one or more of the issues described in the **Symptoms** section of this article.
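To make the arithmetic behind this concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the two figures quoted above (128 pre-allocated ports and a 4-minute reclaim wait). The function names are hypothetical, and the model is deliberately simplistic: it assumes every connection goes to the same host and port, is closed immediately, and that no additional ports are dynamically allocated.

```python
SNAT_PORTS_PER_INSTANCE = 128   # pre-allocated ports per instance (from the text above)
RECLAIM_DELAY_SECONDS = 240     # 4-minute wait before a closed connection's port is reclaimed


def sustainable_new_connections_per_second(ports=SNAT_PORTS_PER_INSTANCE,
                                           reclaim_delay=RECLAIM_DELAY_SECONDS):
    """Upper bound on the steady rate of brand-new connections to ONE host:port.

    Simplifying assumption: each port stays unavailable for the full reclaim
    delay after its connection closes, and no extra ports are allocated.
    """
    return ports / reclaim_delay


def seconds_until_exhaustion(new_connections_per_second, ports=SNAT_PORTS_PER_INSTANCE):
    """Time for a burst of unpooled connections to consume the pre-allocated ports.

    Only meaningful while the burst is shorter than the reclaim delay.
    """
    return ports / new_connections_per_second


print(round(sustainable_new_connections_per_second(), 2))  # 0.53
print(seconds_until_exhaustion(10))                         # 12.8
```

Under these assumptions, an app that opens ten fresh connections per second to a single endpoint runs out of pre-allocated ports in under 13 seconds, which is why the mitigations in this article focus on reusing connections rather than opening new ones.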
35+
36+
## Avoiding the problem
37+
38+
Avoiding the SNAT port problem means avoiding repeatedly creating new connections to the same host and port.
39+
40+
General strategies for mitigating SNAT port exhaustion are discussed in the [Problem-solving section](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#problemsolving) of the **Outbound connections in Azure** documentation. Of these strategies, the following apply to apps and functions hosted on Azure App Service.
41+
42+
### Modify the application to use connection pooling
43+
44+
* For pooling HTTP connections, review [Pool HTTP connections with HttpClientFactory](https://docs.microsoft.com/aspnet/core/performance/performance-best-practices#pool-http-connections-with-httpclientfactory).
45+
* For information on SQL Server connection pooling, review [SQL Server Connection Pooling (ADO.NET)](https://docs.microsoft.com/dotnet/framework/data/adonet/sql-server-connection-pooling).
46+
* For implementing pooling with Entity Framework applications, review [DbContext pooling](https://docs.microsoft.com/ef/core/what-is-new/ef-core-2.0#dbcontext-pooling).
47+
48+
The following links cover implementing connection pooling, grouped by solution stack.
49+
50+
#### Node
51+
52+
By default, connections for Node.js are not kept alive. Below are popular databases and packages for connection pooling, each with examples of how to implement it.
53+
54+
* [MySQL](https://github.com/mysqljs/mysql#pooling-connections)
55+
* [MongoDB](https://blog.mlab.com/2017/05/mongodb-connection-pooling-for-express-applications/)
56+
* [PostgreSQL](https://node-postgres.com/features/pooling)
57+
* [SQL Server](https://github.com/tediousjs/node-mssql#connection-pools)
58+
59+
HTTP Keep-alive
60+
61+
* [agentkeepalive](https://www.npmjs.com/package/agentkeepalive)
62+
* [Node.js v13.9.0 Documentation](https://nodejs.org/api/http.html)
63+
64+
#### Java
65+
66+
Below are popular libraries for connection pooling, each with examples of how to implement it.

JDBC Connection Pooling
68+
69+
* [Tomcat 8](https://tomcat.apache.org/tomcat-8.0-doc/jdbc-pool.html)
70+
* [C3p0](https://github.com/swaldman/c3p0)
71+
* [HikariCP](https://github.com/brettwooldridge/HikariCP)
72+
* [Apache DBCP](https://commons.apache.org/proper/commons-dbcp/)
73+
74+
HTTP Connection Pooling
75+
76+
* [Apache Connection Management](https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html)
77+
* [Class PoolingHttpClientConnectionManager](http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/conn/PoolingHttpClientConnectionManager.html)
78+
79+
#### PHP
80+
81+
Although PHP does not support connection pooling, you can try using persistent database connections to your back-end server.
82+
83+
* MySQL server
  * [MySQLi connections](https://www.php.net/manual/mysqli.quickstart.connections.php) for newer versions
  * [mysql_pconnect](https://www.php.net/manual/function.mysql-pconnect.php) for older versions of PHP
* Other data sources
  * [PHP Connection Management](https://www.php.net/manual/en/pdo.connections.php)
91+
92+
#### Python
93+
94+
* [MySQL](https://github.com/mysqljs/mysql#pooling-connections)
95+
* [MongoDB](https://blog.mlab.com/2017/05/mongodb-connection-pooling-for-express-applications/)
96+
* [PostgreSQL](https://node-postgres.com/features/pooling)
97+
* [SQL Server](https://github.com/tediousjs/node-mssql#connection-pools) (NOTE: SQLAlchemy can be used with other databases besides Microsoft SQL Server)
* [HTTP Keep-alive](https://requests.readthedocs.io/en/master/user/advanced/#keep-alive) (Keep-Alive is automatic when using sessions; see [session-objects](https://requests.readthedocs.io/en/master/user/advanced/#keep-alive)).
99+
100+
For other environments, review provider or driver-specific documents for implementing connection pooling in your applications.
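Whatever the stack, the underlying idea is the same: open a small, bounded number of connections once and hand them out repeatedly instead of opening a new one per request. The following is a minimal, illustrative Python sketch of that idea using only the standard library. `ConnectionPool` and `fake_connect` are hypothetical names introduced here, not part of any driver; a real application should prefer the pooling built into its database driver or HTTP client.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Tiny bounded pool: connections are opened lazily and then reused."""

    def __init__(self, factory, max_size=10):
        self._factory = factory                       # callable that opens one connection
        self._idle = queue.LifoQueue(maxsize=max_size)
        self._created = 0
        self._max_size = max_size
        self._lock = threading.Lock()

    def _acquire(self, timeout):
        try:
            return self._idle.get_nowait()            # reuse an idle connection if any
        except queue.Empty:
            pass
        with self._lock:
            if self._created < self._max_size:
                conn = self._factory()                # open a new one only below the cap
                self._created += 1
                return conn
        # At capacity: wait until another caller returns a connection.
        return self._idle.get(timeout=timeout)

    @contextmanager
    def connection(self, timeout=None):
        conn = self._acquire(timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)                      # return it for reuse, do not close


# Demonstration with a stand-in factory that only counts how often it is called.
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, max_size=2)
for _ in range(100):
    with pool.connection() as conn:
        pass  # a real app would issue a query or request on `conn` here

print(len(opened))  # 1
```

In this sketch, one hundred sequential operations open a single connection, so they would consume one SNAT port rather than one hundred. The sketch does not detect or replace broken connections, which production pools do.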
101+
102+
### Modify the application to reuse connections
103+
104+
* For additional pointers and examples on managing connections in Azure Functions, review [Manage connections in Azure Functions](https://docs.microsoft.com/azure/azure-functions/manage-connections).
105+
106+
### Modify the application to use less aggressive retry logic
107+
108+
* For additional guidance and examples, review [Retry pattern](https://docs.microsoft.com/azure/architecture/patterns/retry).
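As one possible illustration of "less aggressive" retries, the sketch below retries a failing call with exponential backoff and random jitter, so that a struggling endpoint is not hit with a burst of new connection attempts. The function name `call_with_backoff`, its parameters, and the default values are assumptions chosen for this example, not recommendations from the linked guidance.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff.

    The wait before retry N is a random value between 0 and
    min(max_delay, base_delay * 2**(N-1)), often called "full jitter".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise                                  # give up: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Demonstration: an operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated outbound failure")
    return "ok"


delays = []                                            # record waits instead of sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))  # ok 2
```

Capping both the number of attempts and the maximum delay keeps a failing dependency from turning into a steady stream of new outbound connections.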
109+
110+
### Use keepalives to reset the outbound idle timeout
111+
112+
* For implementing keepalives for Node.js apps, review [My node application is making excessive outbound calls](https://docs.microsoft.com/azure/app-service/app-service-web-nodejs-best-practices-and-troubleshoot-guide#my-node-application-is-making-excessive-outbound-calls).
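For a rough idea of what enabling keepalives looks like at the socket level, here is a small Python sketch. `enable_tcp_keepalive` is a hypothetical helper, and the timing values are placeholders: in practice the idle interval would need to be shorter than the outbound idle timeout. The fine-grained options are platform-specific, so the sketch applies each one only if the running platform exposes it.

```python
import socket


def enable_tcp_keepalive(sock, idle_seconds=60, interval_seconds=15, probe_count=4):
    """Turn on TCP keepalive probes for `sock` and return it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning options are not defined on every platform, so guard each one.
    tuning = (
        ("TCP_KEEPIDLE", idle_seconds),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_seconds),  # time between probes
        ("TCP_KEEPCNT", probe_count),         # failed probes before the connection drops
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


# Demonstration on an unconnected socket (no network traffic is generated).
demo_sock = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(demo_sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
demo_sock.close()
```

Most HTTP and database client libraries expose their own keepalive settings, and those are usually the better place to configure this than raw sockets.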
113+
114+
### Additional guidance specific to App Service
115+
116+
* A [load test](https://docs.microsoft.com/azure/devops/test/load-test/app-service-web-app-performance-test) should simulate real-world data at a steady rate. Testing apps and functions under real-world stress can identify SNAT port exhaustion issues ahead of time so that you can resolve them.
* Ensure that the back-end services can return responses quickly. For troubleshooting performance issues with Azure SQL Database, review [Troubleshoot Azure SQL Database performance issues with Intelligent Insights](https://docs.microsoft.com/azure/sql-database/sql-database-intelligent-insights-troubleshoot-performance#recommended-troubleshooting-flow).
* Scale out the App Service plan to more instances. For more information on scaling, see [Scale an app in Azure App Service](https://docs.microsoft.com/azure/app-service/manage-scale-up). Each worker instance in an App Service plan is allocated a number of SNAT ports. If you spread your usage across more instances, you might get the SNAT port usage per instance below the recommended limit of 100 outbound connections per unique remote endpoint.
* Consider moving to [App Service Environment (ASE)](https://docs.microsoft.com/azure/app-service/environment/using-an-ase), where you are allotted a single outbound IP address, and the limits for connections and SNAT ports are much higher.
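The scale-out guidance above can be turned into a quick estimate. The sketch below is illustrative only: `instances_needed` is a hypothetical helper, the endpoint names and connection counts are invented, and it assumes outbound connections spread evenly across instances, which real traffic may not do.

```python
import math

RECOMMENDED_MAX_PER_ENDPOINT = 100  # per instance, per unique remote endpoint (from the text above)


def instances_needed(peak_connections_by_endpoint,
                     per_instance_limit=RECOMMENDED_MAX_PER_ENDPOINT):
    """Estimate instances needed to keep the busiest endpoint under the limit.

    `peak_connections_by_endpoint` maps (host, port) to the plan-wide peak
    number of concurrent outbound connections to that endpoint.
    """
    busiest = max(peak_connections_by_endpoint.values(), default=0)
    return max(1, math.ceil(busiest / per_instance_limit))


# Invented example figures for two remote endpoints.
demo_peaks = {
    ("mydb.example.net", 1433): 450,
    ("api.example.com", 443): 120,
}
print(instances_needed(demo_peaks))  # 5
```

An estimate like this is a starting point for a load test, not a substitute for one. Connection pooling usually lowers the per-endpoint count far more cheaply than adding instances.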
120+
121+
The outbound TCP limits are easier to avoid, because they are set by the size of your worker. You can see the limits in [Sandbox Cross VM Numerical Limits - TCP Connections](https://github.com/projectkudu/kudu/wiki/Azure-Web-App-sandbox#cross-vm-numerical-limits).
122+
123+
|Limit name|Description|Small (A1)|Medium (A2)|Large (A3)|Isolated tier (ASE)|
|---|---|---|---|---|---|
|Connections|Number of connections across entire VM|1,920|3,968|8,064|16,000|
126+
127+
To avoid outbound TCP limits, you can either increase the size of your workers, or scale out horizontally.
128+
129+
## Troubleshooting
130+
131+
Knowing the two types of outbound connection limits, and what your app does, should make it easier to troubleshoot. If you know that your app makes many calls to the same storage account, you might suspect a SNAT limit. If your app creates a great many calls to endpoints all over the internet, you would suspect you are reaching the VM limit.
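The reasoning above can be sketched as a small classifier. This is an illustration under stated assumptions, not a diagnostic tool: `suspect_limit` is a hypothetical function, and its input (a list of `(host, port)` pairs for currently open outbound connections) is something the application would have to collect itself. The thresholds are the 128-port and per-worker TCP figures quoted in this article.

```python
from collections import Counter

SNAT_PORTS_PER_INSTANCE = 128
TCP_CONNECTION_LIMITS = {"small": 1920, "medium": 3968, "large": 8064, "isolated": 16000}


def suspect_limit(destinations, worker_size="small"):
    """Return a list of (limit_name, detail) findings for open outbound connections."""
    findings = []
    per_destination = Counter(destinations)
    # Many connections to ONE host:port points at SNAT port exhaustion.
    hot = sorted(d for d, n in per_destination.items() if n >= SNAT_PORTS_PER_INSTANCE)
    if hot:
        findings.append(("snat", hot))
    # A very large total, regardless of destination, points at the VM-wide TCP limit.
    if len(destinations) >= TCP_CONNECTION_LIMITS[worker_size]:
        findings.append(("tcp", len(destinations)))
    return findings


# Invented sample: 150 connections to one storage endpoint plus a few others.
sample = [("mystorage.blob.core.windows.net", 443)] * 150
sample += [("host%d.example.com" % i, 443) for i in range(10)]
print(suspect_limit(sample))  # [('snat', [('mystorage.blob.core.windows.net', 443)])]
```

The sample trips only the SNAT check because all of the pressure is on a single endpoint, even though the total of 160 connections is far below the TCP limit for a small worker.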
132+
133+
If you do not know the application behavior well enough to determine the cause quickly, App Service provides tools and techniques to help with that determination.
134+
135+
### Find SNAT port allocation information
136+
137+
You can use [App Service Diagnostics](https://docs.microsoft.com/azure/app-service/overview-diagnostics) to find SNAT port allocation information and observe the SNAT port allocation metric of an App Service site. To find SNAT port allocation information, follow these steps:
138+
139+
1. To access App Service diagnostics, navigate to your App Service web app or App Service Environment in the [Azure portal](https://portal.azure.com/). In the left navigation, select **Diagnose and solve problems**.
2. Select the **Availability and Performance** category.
3. Select the **SNAT Port Exhaustion** tile in the list of available tiles under the category. The recommended practice is to keep SNAT port allocation below 128.
   If you need the metric itself, you can still open a support ticket, and the support engineer will get it from the back end for you.
143+
144+
Because SNAT port usage is not available as a metric, you cannot autoscale based on SNAT port usage or configure autoscale rules based on the SNAT port allocation metric.
145+
146+
### TCP Connections and SNAT Ports
147+
148+
TCP connections and SNAT ports are not directly related. A TCP connections usage detector is included in the Diagnose and Solve Problems blade of any App Service site. Search for the phrase "TCP connections" to find it.
149+
150+
* The SNAT Ports are only used for external network flows, while the total TCP Connections includes local loopback connections.
151+
* A SNAT port can be shared by different flows, if the flows are different in either protocol, IP address or port. The TCP Connections metric counts every TCP connection.
152+
* The TCP connections limit happens at the worker instance level. The Azure Network outbound load balancing doesn't use the TCP Connections metric for SNAT port limiting.
153+
* The TCP connections limits are described in [Sandbox Cross VM Numerical Limits - TCP Connections](https://github.com/projectkudu/kudu/wiki/Azure-Web-App-sandbox#cross-vm-numerical-limits).
154+
155+
|Limit name|Description|Small (A1)|Medium (A2)|Large (A3)|Isolated tier (ASE)|
|---|---|---|---|---|---|
|Connections|Number of connections across entire VM|1,920|3,968|8,064|16,000|
158+
159+
### WebJobs and Database connections
160+
161+
If SNAT ports are exhausted and WebJobs are unable to connect to Azure SQL Database, there is no metric that shows how many connections each individual web application process has opened. To find the problematic WebJob, move several WebJobs to another App Service plan and see whether the situation improves or whether the issue remains in one of the plans. Repeat the process until you find the problematic WebJob.
162+
163+
### Releasing SNAT ports sooner
164+
165+
You cannot change any Azure settings to release used SNAT ports sooner. By design, SNAT ports are released under the following conditions:
166+
167+
* If either the server or the client sends FIN/ACK, the [SNAT port will be released](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#tcp-snat-port-release) after 240 seconds.
168+
* If an RST is seen, the SNAT port will be released after 15 seconds.
169+
* If idle timeout has been reached, the port is released.
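The release rules listed above can be modeled in a few lines to estimate how many ports remain held after a burst of closed connections. This is a simplified sketch: `ports_still_held` and its input format are hypothetical, it covers only the FIN/ACK and RST cases, and it uses the 240-second and 15-second delays stated above.

```python
RELEASE_DELAY_SECONDS = {"fin": 240, "rst": 15}  # delays quoted in the list above


def ports_still_held(closed_connections, now):
    """Count SNAT ports not yet released at time `now` (in seconds).

    `closed_connections` is a list of (close_time_seconds, how) pairs,
    where `how` is "fin" or "rst".
    """
    return sum(
        1
        for close_time, how in closed_connections
        if now < close_time + RELEASE_DELAY_SECONDS[how]
    )


# Invented timeline: two connections closed at t=0, one more at t=100.
closed = [(0, "fin"), (0, "rst"), (100, "fin")]
print(ports_still_held(closed, now=120))  # 2
print(ports_still_held(closed, now=250))  # 1
```

The model shows why gracefully closed connections keep ports tied up for minutes: at 120 seconds, both FIN-closed connections still hold their ports, while the reset connection released its port long before.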
170+
171+
## Additional information
172+
173+
* [SNAT with App Service](https://4lowtherabbit.github.io/blogs/2019/10/SNAT/)
174+
* [Troubleshoot slow app performance issues in Azure App Service](https://docs.microsoft.com/azure/app-service/troubleshoot-performance-degradation)
