Commit aca2dc4

Merge pull request #209925 from ealsur/users/ealsur/cancellationtoken
Cosmos DB: SDK - Adding note about CT
2 parents cb0765f + 8e2c231

1 file changed: +44 additions, -12 deletions

articles/cosmos-db/sql/troubleshoot-dot-net-sdk-request-timeout.md

description: Learn how to diagnose and fix .NET SDK request timeout exceptions.
author: rothja
ms.service: cosmos-db
ms.subservice: cosmosdb-sql
ms.date: 09/01/2022
ms.author: jroth
ms.topic: troubleshooting
ms.reviewer: mjbrown

All the async operations in the SDK have an optional CancellationToken parameter.

> The `CancellationToken` parameter is a mechanism where the library will check the cancellation when it [won't cause an invalid state](https://devblogs.microsoft.com/premier-developer/recommended-patterns-for-cancellationtoken/). The operation might not cancel exactly when the time defined in the cancellation is up. Instead, after the time is up, it cancels when it's safe to do so.
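
As a minimal sketch of what that looks like in practice (the `Container` instance, item id, partition key value, and 10-second budget are placeholders for illustration, not values from the article):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class TimeoutExamples
{
    public static async Task<dynamic> ReadWithBudgetAsync(Container container)
    {
        // Cancel the operation if it hasn't completed within 10 seconds (illustrative budget).
        using CancellationTokenSource cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));

        // The SDK checks this token between retries and cancels only when it's safe to do so.
        ItemResponse<dynamic> response = await container.ReadItemAsync<dynamic>(
            id: "item-id",
            partitionKey: new PartitionKey("partition-key-value"),
            cancellationToken: cts.Token);

        return response.Resource;
    }
}
```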

## Troubleshooting steps

The following list contains known causes and solutions for request timeout exceptions.

### CosmosOperationCanceledException

This type of exception is common when your application passes [CancellationTokens](#cancellationtoken) to the SDK operations. The SDK checks the state of the `CancellationToken` between [retries](conceptual-resilient-sdk-applications.md#should-my-application-retry-on-errors), and if the `CancellationToken` is canceled, it aborts the current operation with this exception.

The exception's `Message` / `ToString()` also indicates the state of your `CancellationToken` through `Cancellation Token has expired: true`, and contains [Diagnostics](troubleshoot-dot-net-sdk.md#capture-diagnostics) with the context of the cancellation for the involved requests.

These exceptions are safe to retry on and can be treated as [timeouts](conceptual-resilient-sdk-applications.md#timeouts-and-connectivity-related-failures-http-408503) from the retrying perspective.

#### Solution

Verify the configured time in your `CancellationToken`; make sure that it's greater than your [RequestTimeout](#requesttimeout) and the [CosmosClientOptions.OpenTcpConnectionTimeout](/dotnet/api/microsoft.azure.cosmos.cosmosclientoptions.opentcpconnectiontimeout) (if you're using [Direct mode](sql-sdk-connection-modes.md)).
If the available time in the `CancellationToken` is less than the configured timeouts, and the SDK is facing [transient connectivity issues](conceptual-resilient-sdk-applications.md#timeouts-and-connectivity-related-failures-http-408503), the SDK won't be able to retry and will throw `CosmosOperationCanceledException`.
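
A minimal sketch of that check, with illustrative values only: the token's budget stays above both `RequestTimeout` and `OpenTcpConnectionTimeout`, and the exception is handled the way a timeout would be. The endpoint, key, database/container names, and timeout values are placeholders, not recommendations.

```csharp
using System;
using System.Threading;
using Microsoft.Azure.Cosmos;

CosmosClient client = new CosmosClient(
    accountEndpoint: "<account-endpoint>",
    authKeyOrResourceToken: "<account-key>",
    clientOptions: new CosmosClientOptions
    {
        ConnectionMode = ConnectionMode.Direct,
        RequestTimeout = TimeSpan.FromSeconds(10),          // illustrative value
        OpenTcpConnectionTimeout = TimeSpan.FromSeconds(5)  // illustrative value
    });

Container container = client.GetContainer("databaseId", "containerId");

// Keep the cancellation budget larger than both timeouts above, so the SDK
// still has room to retry transient connectivity issues.
using CancellationTokenSource cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

try
{
    await container.ReadItemAsync<dynamic>(
        "item-id", new PartitionKey("pk-value"), cancellationToken: cts.Token);
}
catch (CosmosOperationCanceledException ex)
{
    // Safe to retry; the diagnostics describe the cancellation context.
    Console.WriteLine(ex.ToString());
}
```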

### High CPU utilization

High CPU utilization is the most common case. For optimal latency, CPU usage should be roughly 40 percent. Use 10 seconds as the interval to monitor maximum (not average) CPU utilization. CPU spikes are more common with cross-partition queries, where a single query might open multiple connections.

# [3.21 and 2.16 or greater SDK](#tab/cpu-new)

The timeouts will contain *Diagnostics*, which contain:

* If the `cpu` values are over 70%, the timeout is likely to be caused by CPU exhaustion. In this case, the solution is to investigate the source of the high CPU utilization and reduce it, or scale the machine to a larger resource size.
* If the `threadInfo/isThreadStarving` nodes have `True` values, the cause is thread starvation. In this case, the solution is to investigate the source(s) of the thread starvation (potentially locked threads), or scale the machine(s) to a larger resource size.
* If the `dateUtc` time between measurements isn't approximately 10 seconds, it would also indicate contention on the thread pool. CPU is measured as an independent Task that is enqueued in the thread pool every 10 seconds; if the time between measurements is longer, the async Tasks aren't able to be processed in a timely fashion. The most common cause is [blocking calls over async code](https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/blob/master/AsyncGuidance.md#avoid-using-taskresult-and-taskwait) in the application code (see the sketch after the solution below).

# [Older SDK](#tab/cpu-old)

```
...
CPU count: 8)
```

* If the CPU measurements are over 70%, the timeout is likely to be caused by CPU exhaustion. In this case, the solution is to investigate the source of the high CPU utilization and reduce it, or scale the machine to a larger resource size.
* If the CPU measurements aren't happening every 10 seconds (for example, there are gaps, or the measurements are more than 10 seconds apart), the cause is thread starvation. In this case, the solution is to investigate the source(s) of the thread starvation (potentially locked threads), or scale the machine(s) to a larger resource size.

---

#### Solution

The client application that uses the SDK should be scaled up or out.
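
Where the measurements point at thread starvation rather than raw CPU, the usual culprit is the blocking-over-async pattern linked above. A sketch of the antipattern and the non-blocking alternative (the class and method names are illustrative, not from the article):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class OrderReader
{
    private readonly Container container;

    public OrderReader(Container container) => this.container = container;

    // Antipattern: blocking on async work holds a thread-pool thread while waiting,
    // which starves other tasks (including the SDK's background work and the
    // 10-second CPU measurements described above).
    public dynamic ReadBlocking(string id, string pk) =>
        container.ReadItemAsync<dynamic>(id, new PartitionKey(pk)).Result.Resource;

    // Preferred: stay async end to end so the thread is released while waiting.
    public async Task<dynamic> ReadAsync(string id, string pk) =>
        (await container.ReadItemAsync<dynamic>(id, new PartitionKey(pk))).Resource;
}
```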

### Socket or port availability might be low

When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion.

#### Solution 1

If you're running on Azure VMs, follow the [SNAT port exhaustion guide](troubleshoot-dot-net-sdk.md#snat).

#### Solution 2

If you're running on Azure App Service, follow the [connection errors troubleshooting guide](../../app-service/troubleshoot-intermittent-outbound-connection-errors.md#cause) and [use App Service diagnostics](https://azure.github.io/AppService/2018/03/01/Deep-Dive-into-TCP-Connections-in-App-Service-Diagnostics.html).

#### Solution 3

If you're running on Azure Functions, verify you're following the [Azure Functions recommendation](../../azure-functions/manage-connections.md#static-clients) of maintaining singleton or static clients for all of the involved services (including Azure Cosmos DB). Check the [service limits](../../azure-functions/functions-scale.md#service-limits) based on the type and size of your Function App hosting.

#### Solution 4

If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK `ConnectionPolicy`. Otherwise, you'll face connection issues.

### Create multiple client instances

Creating multiple client instances might lead to connection contention and timeout issues.

#### Solution

Follow the [performance tips](performance-tips-dotnet-sdk-v3-sql.md#sdk-usage), and use a single CosmosClient instance across an entire process.
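
A rough sketch of that guidance, using a static field as one simple way to hold the single instance (a dependency-injection singleton registration achieves the same thing); the endpoint and key are placeholders:

```csharp
using Microsoft.Azure.Cosmos;

public static class CosmosClientHolder
{
    // One CosmosClient per process: the client is thread-safe and manages its own
    // connection pools, so create it once and reuse it for every request.
    public static readonly CosmosClient Client = new CosmosClient(
        "<account-endpoint>",
        "<account-key>");
}
```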

### Hot partition key

Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. When there's a hot partition, one or more logical partition keys on a physical partition are consuming all the physical partition's Request Units per second (RU/s). At the same time, the RU/s on other physical partitions are going unused. As a symptom, the total RU/s consumed will be less than the overall provisioned RU/s at the database or container, but you'll still see throttling (429s) on the requests against the hot logical partition key. Use the [Normalized RU Consumption metric](../monitor-normalized-request-units.md) to see if the workload is encountering a hot partition.

#### Solution

Choose a good partition key that evenly distributes request volume and storage. Learn how to [change your partition key](https://devblogs.microsoft.com/cosmosdb/how-to-change-your-partition-key/).
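
As an illustrative sketch only, assuming `/userId` has high cardinality for the workload (the container name and key path aren't a recommendation from the article):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class PartitioningExample
{
    public static async Task<Container> CreateOrdersContainerAsync(Database database)
    {
        // A partition key with many distinct values (for example, one value per user)
        // spreads request volume and storage across physical partitions instead of
        // concentrating them on a single hot logical partition.
        ContainerResponse response = await database.CreateContainerIfNotExistsAsync(
            new ContainerProperties(id: "orders", partitionKeyPath: "/userId"));

        return response.Container;
    }
}
```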

### High degree of concurrency

The application is doing a high level of concurrency, which can lead to contention on the channel.

#### Solution

The client application that uses the SDK should be scaled up or out.

### Large requests or responses

Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency.

#### Solution

The client application that uses the SDK should be scaled up or out.

### Failure rate is within the Azure Cosmos DB SLA

The application should be able to handle transient failures and retry when necessary. Any 408 exceptions aren't retried, because on create paths it's impossible to know whether the service created the item or not: sending the same item again for create will cause a conflict exception. A user application's business logic might have custom logic to handle conflicts, which would break because of the ambiguity between an existing item and a conflict from a create retry.
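
A hedged sketch of the kind of custom handling that paragraph describes; the retry-once policy and the decision to treat the resulting conflict as a probable earlier success are assumptions for illustration, not guidance from the article:

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical item type for illustration.
public record OrderItem(string id, string userId);

public static class CreateRetryExample
{
    public static async Task CreateWithRetryAsync(Container container, OrderItem item)
    {
        try
        {
            await container.CreateItemAsync(item, new PartitionKey(item.userId));
        }
        catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.RequestTimeout)
        {
            // A 408 on create is ambiguous: the service may or may not have stored the item.
            try
            {
                await container.CreateItemAsync(item, new PartitionKey(item.userId));
            }
            catch (CosmosException retryEx) when (retryEx.StatusCode == HttpStatusCode.Conflict)
            {
                // The conflict is likely the first attempt having succeeded rather than a
                // genuine duplicate; only the application's business logic can tell which.
            }
        }
    }
}
```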

### Failure rate violates the Azure Cosmos DB SLA

Contact [Azure Support](https://aka.ms/azure-support).

## Next steps

* [Diagnose and troubleshoot](troubleshoot-dot-net-sdk.md) issues when you use the Azure Cosmos DB .NET SDK.
* Learn about performance guidelines for [.NET v3](performance-tips-dotnet-sdk-v3-sql.md) and [.NET v2](performance-tips.md).
