Commit eb35a73

authored
Merge pull request #57583 from paulth1/troubleshoot-java-async-sdk
edit pass: Troubleshoot java async sdk
2 parents 783e828 + 18b767e commit eb35a73

1 file changed

+54
-48
lines changed

@@ -1,6 +1,6 @@
11
---
22
title: Diagnose and troubleshoot Azure Cosmos DB Java Async SDK| Microsoft Docs
3-
description: Use features like client-side logging, and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues.
3+
description: Use features like client-side logging and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues.
44
services: cosmos-db
55
author: moderakh
66

@@ -13,54 +13,60 @@ ms.component: cosmosdb-sql
1313
ms.topic: troubleshooting
1414
---
1515

16-
# Troubleshooting issues when using Java Async SDK with Azure Cosmos DB SQL API accounts
17-
This article covers common issues, workarounds, diagnostics steps, and tools when using [Java Async ADK](sql-api-sdk-async-java.md) with Azure Cosmos DB SQL API accounts.
18-
Java Async SDK provides client-side logical representation to access Azure Cosmos DB SQL API. This article describes the tools and approaches to help you if you run into any issues.
16+
# Troubleshoot issues when you use the Java Async SDK with Azure Cosmos DB SQL API accounts
17+
This article covers common issues, workarounds, diagnostic steps, and tools when you use the [Java Async SDK](sql-api-sdk-async-java.md) with Azure Cosmos DB SQL API accounts.
18+
The Java Async SDK provides client-side logical representation to access the Azure Cosmos DB SQL API. This article describes tools and approaches to help you if you run into any issues.
1919

2020
Start with this list:
21-
* Take a look at the [Common issues and workarounds] section in this article.
22-
* Our SDK is [open-source on github](https://github.com/Azure/azure-cosmosdb-java) and we have [issues section](https://github.com/Azure/azure-cosmosdb-java/issues) that we actively monitor. Check if you find any similar issue with a workaround already filed.
23-
* Review [performance tips](performance-tips-async-java.md) and follow the suggested practices.
24-
* Follow the rest of this article, if you didn't find a solution, file a [GitHub issue](https://github.com/Azure/azure-cosmosdb-java/issues).
21+
22+
* Take a look at the [Common issues and workarounds] section in this article.
23+
* Look at the SDK, which is available [open source on GitHub](https://github.com/Azure/azure-cosmosdb-java). It has an [issues section](https://github.com/Azure/azure-cosmosdb-java/issues) that's actively monitored. Check to see if any similar issue with a workaround is already filed.
24+
* Review the [performance tips](performance-tips-async-java.md), and follow the suggested practices.
25+
* Read the rest of this article, if you didn't find a solution. Then file a [GitHub issue](https://github.com/Azure/azure-cosmosdb-java/issues).
2526

2627
## <a name="common-issues-workarounds"></a>Common issues and workarounds
2728

2829
### Network issues, Netty read timeout failure, low throughput, high latency
2930

3031
#### General suggestions
31-
* Make sure the app is running on the same region as your Cosmos DB account.
32-
* Check the CPU usage on the host where the app is running. If CPU usage is 90% or more, consider running your app on a host with higher configuration or distribute the load on more machines.
32+
* Make sure the app is running on the same region as your Azure Cosmos DB account.
33+
* Check the CPU usage on the host where the app is running. If CPU usage is 90 percent or more, run your app on a host with a higher configuration. Or you can distribute the load on more machines.
3334

3435
#### Connection throttling
35-
Connection throttling can happen due to either [Connection limit on host machine], or [Azure SNAT (PAT) port exhaustion]:
36+
Connection throttling can happen because of either a [connection limit on a host machine] or [Azure SNAT (PAT) port exhaustion].
3637

37-
##### <a name="connection-limit-on-host"></a>Connection limit on host machine
38-
Some Linux systems (like 'Red Hat') have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this number limits the total number of connections too.
39-
Run the following command:
38+
##### <a name="connection-limit-on-host"></a>Connection limit on a host machine
39+
Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this number limits the total number of connections, too.
40+
Run the following command.
4041

4142
```bash
4243
ulimit -a
4344
```
44-
The number of open files ("nofile") needs to be large enough (at least as double as your connection pool size). Read more detail in [performance tips](performance-tips-async-java.md).
45+
The number of max allowed open files, which are identified as "nofile," needs to be at least double your connection pool size. For more information, see [Performance tips](performance-tips-async-java.md).
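
The "at least double" rule can also be checked from inside the JVM. The following is a minimal sketch, not part of the SDK: it assumes a HotSpot or OpenJDK runtime on a Unix-like host, where `com.sun.management.UnixOperatingSystemMXBean` is available. The class name `OpenFileLimitCheck` and the pool size of 1000 are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class OpenFileLimitCheck {

    // Rule of thumb from the text: the open-file limit should be
    // at least double the connection pool size.
    static boolean isLimitSufficient(long maxOpenFiles, int connectionPoolSize) {
        return maxOpenFiles >= 2L * connectionPoolSize;
    }

    // Reads the open-file limit that this JVM process sees, or returns -1
    // when the runtime does not expose it (for example, on non-Unix hosts).
    static long maxOpenFiles() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getMaxFileDescriptorCount();
        }
        return -1L;
    }

    public static void main(String[] args) {
        int connectionPoolSize = 1000; // Assumption: replace with your configured pool size.
        long limit = maxOpenFiles();
        if (limit < 0) {
            System.out.println("Open-file limit is not available on this JVM.");
        } else {
            System.out.println("nofile limit: " + limit + ", sufficient: "
                    + isLimitSufficient(limit, connectionPoolSize));
        }
    }
}
```

Running a check like this at app startup might surface a low limit before it shows up as connection failures under load.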
4546

4647
##### <a name="snat"></a>Azure SNAT (PAT) port exhaustion
4748

48-
If your app is deployed on Azure VM without a public IP address, by default [Azure SNAT ports](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports) are used to establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Cosmos DB endpoint is limited by the [Azure SNAT configuration](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports).
49+
If your app is deployed on Azure Virtual Machines without a public IP address, by default [Azure SNAT ports](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports) establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the [Azure SNAT configuration](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports).
50+
51+
Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. There are two workarounds to avoid Azure SNAT limitation:
4952

50-
The Azure SNAT ports are used only when your Azure VM has a private IP address and a process from the VM attempts to establish a connection to a public IP address. So, there are two workarounds to avoid Azure SNAT limitation:
51-
* Add your Azure Cosmos DB service endpoint to the subnet of your Azure VM VNET as explained in [Enabling VNET Service Endpoint](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview). When service endpoint is enabled, the requests no longer are sent from a public IP to cosmos DB instead the VNET and subnet identity is sent. This change may result in firewall drops if only public IPs are allowed. If you are using firewall, when enabling service endpoint, add subnet to firewall using [VNET ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl).
52-
* Assign a public IP to your Azure VM.
53+
* Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview).
5354

54-
#### Http proxy
55+
When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl).
56+
* Assign a public IP to your Azure VM.
5557

56-
If you use an HttpProxy, make sure your HttpProxy can support the number of connections configured in the SDK `ConnectionPolicy`.
58+
#### HTTP proxy
59+
60+
If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK `ConnectionPolicy`.
5761
Otherwise, you face connection issues.
5862

5963
#### Invalid coding pattern: Blocking Netty IO thread
6064

61-
The SDK uses [Netty](https://netty.io/) IO library for communicating to Azure Cosmos DB Service. We have Async API and we use non-blocking IO APIs of netty. The SDK's IO work is performed on IO netty threads. The number of IO netty threads is configured to be the same as the number of the CPU cores of the app machine. The netty IO threads are only meant to be used for non blocking netty IO work. The SDK returns the API invocation result on one of the netty IO threads to the apps's code. If the app after receiving results on the netty thread performs a long lasting operation on the netty thread, that may result in SDK to not have enough number of IO threads for performing its internal IO work. Such app coding may result in low throughput, high latency, and `io.netty.handler.timeout.ReadTimeoutException` failures. The workaround is to switch the thread when you know the operation will take time.
65+
The SDK uses the [Netty](https://netty.io/) IO library to communicate with Azure Cosmos DB. The SDK has Async APIs and uses non-blocking IO APIs of Netty. The SDK's IO work is performed on IO Netty threads. The number of IO Netty threads is configured to be the same as the number of CPU cores of the app machine.
66+
67+
The Netty IO threads are meant to be used only for non-blocking Netty IO work. The SDK returns the API invocation result on one of the Netty IO threads to the app's code. If the app performs a long-lasting operation after it receives results on the Netty thread, the SDK might not have enough IO threads to perform its internal IO work. Such app coding might result in low throughput, high latency, and `io.netty.handler.timeout.ReadTimeoutException` failures. The workaround is to switch the thread when you know the operation takes time.
6268

63-
For example, the following code snippet shows that if you perform long lasting work, which takes more than a few milliseconds, on the netty thread, you eventually can get into a state where no netty IO thread is present to process IO work, and as a result you get ReadTimeoutException:
69+
For example, take a look at the following code snippet. You might perform long-lasting work that takes more than a few milliseconds on the Netty thread. If so, you eventually can get into a state where no Netty IO thread is present to process IO work. As a result, you get a ReadTimeoutException failure.
6470
```java
6571
@Test
6672
public void badCodeWithReadTimeoutException() throws Exception {
@@ -86,19 +92,19 @@ public void badCodeWithReadTimeoutException() throws Exception {
8692
.createDocument(getCollectionLink(), docDefinition, null, false);
8793
createObservable.subscribe(r -> {
8894
try {
89-
// time consuming work. For example:
90-
// writing to a file, computationally heavy work, or just sleep
91-
// basically anything which takes more than a few milliseconds
92-
// doing such operation on the IO netty thread
93-
// without a proper scheduler, will cause problems.
94-
// The subscriber will get ReadTimeoutException failure.
95+
// Time-consuming work is, for example,
96+
// writing to a file, computationally heavy work, or just sleep.
97+
// Basically, it's anything that takes more than a few milliseconds.
98+
// Doing such operations on the IO Netty thread
99+
// without a proper scheduler will cause problems.
100+
// The subscriber will get a ReadTimeoutException failure.
95101
TimeUnit.SECONDS.sleep(2 * requestTimeoutInSeconds);
96102
} catch (Exception e) {
97103
}
98104
},
99105

100106
exception -> {
101-
//will be io.netty.handler.timeout.ReadTimeoutException
107+
//It will be io.netty.handler.timeout.ReadTimeoutException.
102108
exception.printStackTrace();
103109
failureCount.incrementAndGet();
104110
latch.countDown();
@@ -112,43 +118,43 @@ public void badCodeWithReadTimeoutException() throws Exception {
112118
assertThat(failureCount.get()).isGreaterThan(0);
113119
}
114120
```
115-
The workaround is to change the thread on which you perform time taking work. Define a singleton instance of Scheduler for your app:
121+
The workaround is to change the thread on which you perform work that takes time. Define a singleton instance of the scheduler for your app.
116122
```java
117-
// have a singleton instance of executor and scheduler
123+
// Have a singleton instance of an executor and a scheduler.
118124
ExecutorService ex = Executors.newFixedThreadPool(30);
119125
Scheduler customScheduler = rx.schedulers.Schedulers.from(ex);
120126
```
121-
Whenever you need to do time taking work (for example, computationally heavy work, blocking IO), switch the thread to a worker provided by your `customScheduler` using `.observeOn(customScheduler)` API.
127+
You might need to do work that takes time, for example, computationally heavy work or blocking IO. In this case, switch the thread to a worker provided by your `customScheduler` by using the `.observeOn(customScheduler)` API.
122128
```java
123129
Observable<ResourceResponse<Document>> createObservable = client
124130
.createDocument(getCollectionLink(), docDefinition, null, false);
125131

126132
createObservable
127-
.observeOn(customScheduler) // switches the thread.
133+
.observeOn(customScheduler) // Switches the thread.
128134
.subscribe(
129135
// ...
130136
);
131137
```
132-
By using `observeOn(customScheduler)`, you release the netty IO thread and switch to your own custom thread provided by customScheduler.
133-
This modification will solve the problem, and you won't get `io.netty.handler.timeout.ReadTimeoutException` failure anymore.
138+
By using `observeOn(customScheduler)`, you release the Netty IO thread and switch to your own custom thread provided by the custom scheduler.
139+
This modification solves the problem. You won't get a `io.netty.handler.timeout.ReadTimeoutException` failure anymore.
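
The hand-off can be observed without the SDK. The following sketch is an analogy only: it uses `CompletableFuture` from the JDK rather than RxJava, with a single-threaded executor standing in for a Netty IO thread. The class name, method name, and thread names are hypothetical. `thenApplyAsync` with an explicit executor plays the role that `observeOn(customScheduler)` plays in the snippets above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadHandOffSketch {

    // Produces a result on the "IO" executor, then runs the follow-up work on
    // the worker executor. Returns the name of the thread that ran the follow-up.
    static String runSlowWorkOffIoThread(ExecutorService ioThread, ExecutorService workers) {
        return CompletableFuture
                .supplyAsync(() -> "response", ioThread)      // Simulated IO completion.
                .thenApplyAsync(response -> {
                    // Time-consuming work would go here, away from the IO thread.
                    return Thread.currentThread().getName();
                }, workers)
                .join();
    }

    public static void main(String[] args) {
        ExecutorService ioThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        ExecutorService workers =
                Executors.newFixedThreadPool(4, r -> new Thread(r, "worker-thread"));
        try {
            System.out.println("Slow work ran on: "
                    + runSlowWorkOffIoThread(ioThread, workers));
        } finally {
            ioThread.shutdown();
            workers.shutdown();
        }
    }
}
```

Because the follow-up stage is bound to the worker executor, the simulated IO thread is free as soon as it delivers the result.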
134140

135141
### Connection pool exhausted issue
136142

137-
`PoolExhaustedException` is a client-side failure. If you get this failure often, that's indication that your app workload is higher than what the SDK connection pool can serve. Increasing connection pool size or distributing the load on multiple apps may help.
143+
`PoolExhaustedException` is a client-side failure. This failure indicates that your app workload is higher than what the SDK connection pool can serve. Increase the connection pool size or distribute the load on multiple apps.
138144

139145
### Request rate too large
140-
This failure is a server-side failure indicating that you consumed your provisioned throughput and should retry later. If you get this failure often, consider increasing the collection throughput.
146+
This failure is a server-side failure. It indicates that you consumed your provisioned throughput. Retry later. If you get this failure often, consider an increase in the collection throughput.
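
One way to "retry later" is to wait between attempts and lengthen the wait each time. The following is a generic sketch, not the SDK's own retry logic: `RequestRateTooLargeException` is a hypothetical stand-in for whatever failure your code observes, and the attempt count and delays are illustrative values.

```java
import java.util.function.Supplier;

final class RetryWithBackoff {

    // Hypothetical stand-in for a "request rate too large" failure.
    static final class RequestRateTooLargeException extends RuntimeException {
        RequestRateTooLargeException() {
            super("Request rate too large");
        }
    }

    // Calls the operation until it succeeds or maxAttempts is reached,
    // doubling the delay after every throttled attempt.
    static <T> T callWithRetry(Supplier<T> operation, int maxAttempts, long initialDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (RequestRateTooLargeException e) {
                if (attempt >= maxAttempts) {
                    throw e; // Give up and surface the last failure.
                }
                sleep(delay);
                delay *= 2;
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated operation: throttled twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RequestRateTooLargeException();
            }
            return "created";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

This blocks the calling thread while it waits, so a helper like this belongs on a worker thread, not on a Netty IO thread.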
141147

142148
### Failure connecting to Azure Cosmos DB emulator
143149

144-
Cosmos DB emulator HTTPS certificate is self-signed. For SDK to work with emulator you should import the emulator certificate to Java TrustStore. As explained [here](local-emulator-export-ssl-certificates.md).
150+
The Azure Cosmos DB emulator HTTPS certificate is self-signed. For the SDK to work with the emulator, import the emulator certificate to a Java TrustStore. For more information, see [Export Azure Cosmos DB emulator certificates](local-emulator-export-ssl-certificates.md).
145151

146152

147153
## <a name="enable-client-sice-logging"></a>Enable client SDK logging
148154

149155
The Java Async SDK uses SLF4j as the logging facade that supports logging into popular logging frameworks such as log4j and logback.
150156

151-
For example, if you want to use log4j as the logging framework, add the following libs in your Java classpath:
157+
For example, if you want to use log4j as the logging framework, add the following libs in your Java classpath.
152158

153159
```xml
154160
<dependency>
@@ -163,7 +169,7 @@ For example, if you want to use log4j as the logging framework, add the followin
163169
</dependency>
164170
```
165171

166-
Also add a log4j config:
172+
Also add a log4j config.
167173
```
168174
# this is a sample log4j configuration
169175
@@ -181,25 +187,25 @@ log4j.appender.A1.layout=org.apache.log4j.PatternLayout
181187
log4j.appender.A1.layout.ConversionPattern=%d %5X{pid} [%t] %-5p %c - %m%n
182188
```
183189

184-
Review [sfl4j logging manual](https://www.slf4j.org/manual.html) for more information.
190+
For more information, see the [sfl4j logging manual](https://www.slf4j.org/manual.html).
185191

186192
## <a name="netstats"></a>OS network statistics
187-
Run netstat command to get a sense of how many connections are in `Established` state, `CLOSE_WAIT` state, etc.
193+
Run the netstat command to get a sense of how many connections are in states such as `ESTABLISHED` and `CLOSE_WAIT`.
188194

189-
On Linux you can run the following command:
195+
On Linux, you can run the following command.
190196
```bash
191197
netstat -nap
192198
```
193-
Filter the result to only connections to Cosmos DB endpoint.
199+
Filter the result to only connections to the Azure Cosmos DB endpoint.
194200

195-
Apparently, the number of connections to Cosmos DB endpoint in `Established` state should be not greater than your configured connection pool size.
201+
The number of connections to the Azure Cosmos DB endpoint in the `ESTABLISHED` state can't be greater than your configured connection pool size.
196202

197-
If there are many connections to Cosmos DB endpoint in `CLOSE_WAIT` state, for example more than 1000 connections, that's an indication of connections are established and torn down quickly, which may potentially cause problems. Review [Common issues and workarounds] section for more detail.
203+
Many connections to the Azure Cosmos DB endpoint might be in the `CLOSE_WAIT` state. There might be more than 1,000. A number that high indicates that connections are established and torn down quickly. This situation potentially causes problems. For more information, see the [Common issues and workarounds] section.
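
Counting connections per state by hand gets tedious. The following sketch tallies states from saved `netstat -nap` output. It assumes the usual Linux column layout (protocol, receive queue, send queue, local address, foreign address, state). The class name is hypothetical, and `203.0.113.10` is a documentation-only address standing in for your Azure Cosmos DB endpoint.

```java
import java.util.Map;
import java.util.TreeMap;

final class NetstatStateCounter {

    // Counts TCP connections per state for rows whose foreign address
    // starts with the given prefix (for example, the endpoint IP).
    static Map<String, Integer> countStates(String netstatOutput, String remoteAddressPrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatOutput.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            // Expected columns: Proto Recv-Q Send-Q Local Foreign State [PID/Program]
            if (cols.length < 6 || !cols[0].startsWith("tcp")) {
                continue; // Skip headers, UDP rows, and Unix socket rows.
            }
            if (!cols[4].startsWith(remoteAddressPrefix)) {
                continue; // Skip connections to other endpoints.
            }
            counts.merge(cols[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample rows for illustration only.
        String sample =
                "Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name\n"
              + "tcp 0 0 10.0.0.4:51234 203.0.113.10:443 ESTABLISHED 1234/java\n"
              + "tcp 0 0 10.0.0.4:51235 203.0.113.10:443 ESTABLISHED 1234/java\n"
              + "tcp 0 0 10.0.0.4:51236 203.0.113.10:443 CLOSE_WAIT 1234/java\n"
              + "tcp 0 0 10.0.0.4:22 198.51.100.7:60000 ESTABLISHED 900/sshd\n";
        System.out.println(countStates(sample, "203.0.113.10:"));
    }
}
```

Comparing the `ESTABLISHED` count with your configured connection pool size, and watching the `CLOSE_WAIT` count over time, gives a quick read on connection churn.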
198204

199205
<!--Anchors-->
200206
[Common issues and workarounds]: #common-issues-workarounds
201207
[Enable client SDK logging]: #enable-client-sice-logging
202-
[Connection limit on host machine]: #connection-limit-on-host
208+
[Connection limit on a host machine]: #connection-limit-on-host
203209
[Azure SNAT (PAT) port exhaustion]: #snat
204210

205211
