Commit 26e01cc

Merge pull request #107331 from ealsur/users/ealsur/perfnetworking
Cosmos DB - Troubleshooting Accelerated Networking
2 parents 3fd4bda + 7802290 commit 26e01cc

1 file changed: +37 -21 lines changed

articles/cosmos-db/troubleshoot-dot-net-sdk.md

Lines changed: 37 additions & 21 deletions
@@ -3,7 +3,7 @@ title: Diagnose and troubleshoot issues when using Azure Cosmos DB .NET SDK
 description: Use features like client-side logging and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues when using .NET SDK.
 author: j82w
 ms.service: cosmos-db
-ms.date: 05/28/2019
+ms.date: 03/11/2020
 ms.author: jawilley
 ms.subservice: cosmosdb-sql
 ms.topic: troubleshooting
@@ -16,13 +16,13 @@ The .NET SDK provides client-side logical representation to access the Azure Cos
 ## Checklist for troubleshooting issues:
 Consider the following checklist before you move your application to production. Using the checklist will prevent several common issues you might see. You can also quickly diagnose when an issue occurs:
 
-* Use the latest [SDK](https://github.com/Azure/azure-cosmos-dotnet-v2/blob/master/changelog.md). Preview SDKs should not be used for production. This will prevent hitting known issues that are already fixed.
-* Review the [performance tips](performance-tips.md), and follow the suggested practices. This will help prevent scaling, latency, and other performance issues.
-* Enable the SDK logging to help you troubleshoot an issue. Enabling the logging may affect performance so its best to enable it only when troubleshooting issues. You can enable the following logs:
-* [Log metrics](monitor-accounts.md) by using the Azure portal. Portal metrics show the Azure Cosmos DB telemetry, which is helpful to determine if the issue corresponds to Azure Cosmos DB or if its from the client side.
-* Log the [diagnostics string](https://docs.microsoft.com/dotnet/api/microsoft.azure.documents.client.resourceresponsebase.requestdiagnosticsstring?view=azure-dotnet) from the point operation responses.
-* Log the [SQL Query Metrics](sql-api-query-metrics.md) from all the query responses
-* Follow the setup for [SDK logging]( https://github.com/Azure/azure-cosmos-dotnet-v2/blob/master/docs/documentdb-sdk_capture_etl.md)
+* Use the latest [SDK](sql-api-sdk-dotnet-standard.md). Preview SDKs should not be used for production. This will prevent hitting known issues that are already fixed.
+* Review the [performance tips](performance-tips.md), and follow the suggested practices. This will help prevent scaling, latency, and other performance issues.
+* Enable the SDK logging to help you troubleshoot an issue. Enabling the logging may affect performance so it's best to enable it only when troubleshooting issues. You can enable the following logs:
+* [Log metrics](monitor-accounts.md) by using the Azure portal. Portal metrics show the Azure Cosmos DB telemetry, which is helpful to determine if the issue corresponds to Azure Cosmos DB or if it's from the client side.
+* Log the [diagnostics string](https://docs.microsoft.com/dotnet/api/microsoft.azure.documents.client.resourceresponsebase.requestdiagnosticsstring) in the V2 SDK or [diagnostics](https://docs.microsoft.com/dotnet/api/microsoft.azure.cosmos.responsemessage.diagnostics) in V3 SDK from the point operation responses.
+* Log the [SQL Query Metrics](sql-api-query-metrics.md) from all the query responses
+* Follow the setup for [SDK logging]( https://github.com/Azure/azure-cosmos-dotnet-v2/blob/master/docs/documentdb-sdk_capture_etl.md)
 
 Take a look at the [Common issues and workarounds](#common-issues-workarounds) section in this article.
 
@@ -36,15 +36,15 @@ Check the [GitHub issues section](https://github.com/Azure/azure-cosmos-dotnet-v
 * You may run into connectivity/availability issues due to lack of resources on your client machine. We recommend monitoring your CPU utilization on nodes running the Azure Cosmos DB client, and scaling up/out if they're running at high load.
 
 ### Check the portal metrics
-Checking the [portal metrics](monitor-accounts.md) will help determine if it's a client side issue or if there is an issue with the service. For example if the metrics contain a high rate of rate-limited requests(HTTP status code 429) which means the request is getting throttled then check the [Request rate too large] section.
+Checking the [portal metrics](monitor-accounts.md) will help determine if it's a client-side issue or if there is an issue with the service. For example, if the metrics contain a high rate of rate-limited requests(HTTP status code 429) which means the request is getting throttled then check the [Request rate too large] section.
 
 ### <a name="request-timeouts"></a>Requests timeouts
-RequestTimeout usually happens when using Direct/TCP, but can happen in Gateway mode. These are the common known causes, and suggestions on how to fix the problem.
+RequestTimeout usually happens when using Direct/TCP, but can happen in Gateway mode. These errors are the common known causes, and suggestions on how to fix the problem.
 
 * CPU utilization is high, which will cause latency and/or request timeouts. The customer can scale up the host machine to give it more resources, or the load can be distributed across more machines.
-* Socket / Port availability might be low. When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. To reduce the chance of hitting this issue, use the latest version 2.x or 3.x of the .NET SDK. This is an example of why it is recommend to always run the latest SDK version.
+* Socket / Port availability might be low. When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. To reduce the chance of hitting this issue, use the latest version 2.x or 3.x of the .NET SDK. This is an example of why it is recommended to always run the latest SDK version.
 * Creating multiple DocumentClient instances might lead to connection contention and timeout issues. Follow the [performance tips](performance-tips.md), and use a single DocumentClient instance across an entire process.
-* Users sometimes see elevated latency or request timeouts because their collections are provisioned insufficiently, the back-end throttles requests, and the client retries internally without surfacing this to the caller. Check the [portal metrics](monitor-accounts.md).
+* Users sometimes see elevated latency or request timeouts because their collections are provisioned insufficiently, the back-end throttles requests, and the client retries internally. Check the [portal metrics](monitor-accounts.md).
 * Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot [partition key](partition-data.md). This will cause the aggregate consumed throughput (RU/s) to be appear to be under the provisioned RUs, but a single partition consumed throughput (RU/s) will exceed the provisioned throughput.
 * Additionally, the 2.0 SDK adds channel semantics to direct/TCP connections. One TCP connection is used for multiple requests at the same time. This can lead to two issues under specific cases:
 * A high degree of concurrency can lead to contention on the channel.
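
Editorial illustration for the hot partition bullet in the hunk above: since provisioned throughput is split evenly across physical partitions, a workload can sit well under the total while one partition is over its share. The following is a minimal Python sketch of that arithmetic, not SDK code. The function name `hot_partitions` and every number in it are hypothetical; in practice the per-partition figures would be read from the portal metrics.

```python
from typing import List


def hot_partitions(provisioned_rus: float, consumed_per_partition: List[float]) -> List[int]:
    """Return indexes of partitions consuming more than their even share."""
    if not consumed_per_partition:
        return []
    # Provisioned throughput is divided evenly across physical partitions.
    per_partition_budget = provisioned_rus / len(consumed_per_partition)
    return [
        index
        for index, consumed in enumerate(consumed_per_partition)
        if consumed > per_partition_budget
    ]


# Hypothetical figures: 10,000 RU/s provisioned over 5 physical partitions,
# so each partition has a 2,000 RU/s share.
PROVISIONED = 10_000
CONSUMED = [2_600, 900, 800, 700, 600]

print(sum(CONSUMED) < PROVISIONED)            # True: aggregate looks healthy
print(hot_partitions(PROVISIONED, CONSUMED))  # [0]: partition 0 exceeds its share
```

Under these assumed numbers the aggregate (5,600 RU/s) is far below the provisioned total, yet partition 0 would still be throttled.
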
@@ -53,26 +53,42 @@ RequestTimeout usually happens when using Direct/TCP, but can happen in Gateway
 * Try to scale the application up/out.
 * Additionally, SDK logs can be captured through [Trace Listener](https://github.com/Azure/azure-cosmosdb-dotnet/blob/master/docs/documentdb-sdk_capture_etl.md) to get more details.
 
-### Connection throttling
-Connection throttling can happen because of a connection limit on a host machine. Previous to 2.0, clients running in Azure could hit the [Azure SNAT (PAT) port exhaustion].
+### <a name="high-network-latency"></a>High network latency
+High network latency can be identified by using the [diagnostics string](https://docs.microsoft.com/dotnet/api/microsoft.azure.documents.client.resourceresponsebase.requestdiagnosticsstring?view=azure-dotnet) in the V2 SDK or [diagnostics](https://docs.microsoft.com/dotnet/api/microsoft.azure.cosmos.responsemessage.diagnostics?view=azure-dotnet#Microsoft_Azure_Cosmos_ResponseMessage_Diagnostics) in V3 SDK.
+
+If no [timeouts](#request-timeouts) are present and the diagnostics show single requests where the high latency is evident on the difference between `ResponseTime` and `RequestStartTime`, like so (>300 milliseconds in this example):
+
+```bash
+RequestStartTime: 2020-03-09T22:44:49.5373624Z, RequestEndTime: 2020-03-09T22:44:49.9279906Z, Number of regions attempted:1
+ResponseTime: 2020-03-09T22:44:49.9279906Z, StoreResult: StorePhysicalAddress: rntbd://..., ...
+```
+
+This latency can have multiple causes:
+
+* Your application is not running in the same region as your Azure Cosmos DB account.
+* Your [PreferredLocations](https://docs.microsoft.com/dotnet/api/microsoft.azure.documents.client.connectionpolicy.preferredlocations) or [ApplicationRegion](https://docs.microsoft.com/dotnet/api/microsoft.azure.cosmos.cosmosclientoptions.applicationregion) configuration is incorrect and is trying to connect to a different region to where your application is currently running on.
+* There might be a bottleneck on the Network interface because of high traffic. If the application is running on Azure Virtual Machines, there are possible workarounds:
+* Consider using a [Virtual Machine with Accelerated Networking enabled](../virtual-network/create-vm-accelerated-networking-powershell.md).
+* Enable [Accelerated Networking on an existing Virtual Machine](../virtual-network/create-vm-accelerated-networking-powershell.md#enable-accelerated-networking-on-existing-vms).
+* Consider using a [higher end Virtual Machine](../virtual-machines/windows/sizes.md).
 
 ### <a name="snat"></a>Azure SNAT (PAT) port exhaustion
 
-If your app is deployed on Azure Virtual Machines without a public IP address, by default [Azure SNAT ports](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports) establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the [Azure SNAT configuration](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports).
+If your app is deployed on [Azure Virtual Machines without a public IP address](../load-balancer/load-balancer-outbound-connections.md#defaultsnat), by default [Azure SNAT ports](../load-balancer/load-balancer-outbound-connections.md#preallocatedports) establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the [Azure SNAT configuration](../load-balancer/load-balancer-outbound-connections.md#preallocatedports). This situation can lead to connection throttling, connection closure, or the above mentioned [Request timeouts](#request-timeouts).
 
-Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. There are two workarounds to avoid Azure SNAT limitation:
+Azure SNAT ports are used only when your VM has a private IP address is connecting to a public IP address. There are two workarounds to avoid Azure SNAT limitation (provided you already are using a single client instance across the entire application):
 
-* Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview).
+* Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](../virtual-network/virtual-network-service-endpoints-overview.md).
 
-When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl).
-* Assign a public IP to your Azure VM.
+When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](../virtual-network/virtual-networks-acl.md).
+* Assign a [public IP to your Azure VM](../load-balancer/load-balancer-outbound-connections.md#assignilpip).
 
 ### HTTP proxy
 If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK `ConnectionPolicy`.
 Otherwise, you face connection issues.
 
-### Request rate too large<a name="request-rate-too-large"></a>
-'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the provisioned throughput. The SDK will automatically retry requests based on the specified [retry policy](https://docs.microsoft.com/dotnet/api/microsoft.azure.documents.client.connectionpolicy.retryoptions?view=azure-dotnet). If you get this failure often, consider increasing the throughput on the collection. Check the [portals metrics](use-metrics.md) to see if you are getting 429 errors. Review your [partition key](https://docs.microsoft.com/azure/cosmos-db/partitioning-overview#choose-partitionkey) to ensure it results in an even distribution of storage and request volume.
+### <a name="request-rate-too-large"></a>Request rate too large
+'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the [provisioned throughput](set-throughput.md). The SDK will automatically retry requests based on the specified [retry policy](https://docs.microsoft.com/dotnet/api/microsoft.azure.documents.client.connectionpolicy.retryoptions?view=azure-dotnet). If you get this failure often, consider increasing the throughput on the collection. Check the [portal's metrics](use-metrics.md) to see if you are getting 429 errors. Review your [partition key](partitioning-overview.md#choose-partitionkey) to ensure it results in an even distribution of storage and request volume.
 
 ### Slow query performance
 The [query metrics](sql-api-query-metrics.md) will help determine where the query is spending most of the time. From the query metrics, you can see how much of it is being spent on the back-end vs the client.
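
Editorial illustration for the "High network latency" section added in this commit: the difference between `ResponseTime` and `RequestStartTime` can be computed mechanically from a logged diagnostics string. The following is a minimal Python sketch, assuming the diagnostics text has the shape of the sample in the diff above. The helper names are hypothetical, and the real diagnostics format may differ between SDK versions.

```python
import re
from datetime import datetime

# Timestamp shape taken from the sample above: 7 fractional digits, "Z" suffix.
_TIMESTAMP = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z"


def _parse(match):
    """Turn a regex match into a datetime, truncated to microseconds."""
    seconds, fraction = match.group(1), match.group(2) or "0"
    # strptime's %f accepts at most 6 digits, so drop the 7th (100 ns) digit.
    micros = fraction[:6].ljust(6, "0")
    return datetime.strptime(f"{seconds}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")


def request_latency_ms(diagnostics):
    """Milliseconds between RequestStartTime and ResponseTime in the text."""
    start = re.search(r"RequestStartTime:\s*" + _TIMESTAMP, diagnostics)
    end = re.search(r"ResponseTime:\s*" + _TIMESTAMP, diagnostics)
    if start is None or end is None:
        raise ValueError("diagnostics text lacks RequestStartTime or ResponseTime")
    return (_parse(end) - _parse(start)).total_seconds() * 1000.0


SAMPLE = (
    "RequestStartTime: 2020-03-09T22:44:49.5373624Z, "
    "RequestEndTime: 2020-03-09T22:44:49.9279906Z, Number of regions attempted:1\n"
    "ResponseTime: 2020-03-09T22:44:49.9279906Z, StoreResult: ..."
)

latency = request_latency_ms(SAMPLE)
print(f"{latency:.1f} ms")  # 390.6 ms, above the 300 ms mentioned in the article
```

A value like this on single requests, with no timeouts present, is the pattern the article associates with region mismatch or a network interface bottleneck.
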
