Skip to content

Commit b813428

Browse files
committed
Added Java SDK v4 troubleshooting; added unambiguous versioning in Java SDK troubleshooting guide titles/TOC
1 parent 1524a36 commit b813428

File tree

3 files changed

+271
-7
lines changed

3 files changed

+271
-7
lines changed

articles/cosmos-db/TOC.yml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -979,7 +979,9 @@
979979
items:
980980
- name: Troubleshoot query performance
981981
href: troubleshoot-query-performance.md
982-
- name: Troubleshoot Java Async SDK
982+
- name: Troubleshoot Java SDK v4
983+
href: troubleshoot-java-sdk-v4-sql.md
984+
- name: Troubleshoot Java Async SDK v2
983985
href: troubleshoot-java-async-sdk.md
984986
- name: Troubleshoot .NET SDK
985987
href: troubleshoot-dot-net-sdk.md

articles/cosmos-db/troubleshoot-java-async-sdk.md

Lines changed: 11 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,22 @@
11
---
2-
title: Diagnose and troubleshoot Azure Cosmos DB Java Async SDK
3-
description: Use features like client-side logging and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues.
4-
author: moderakh
2+
title: Diagnose and troubleshoot Azure Cosmos DB Async Java SDK v2
3+
description: Use features like client-side logging and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues in Async Java SDK v2.
4+
author: anfeldma-ms
55
ms.service: cosmos-db
6-
ms.date: 04/30/2019
7-
ms.author: moderakh
6+
ms.date: 05/04/2020
7+
ms.author: anfeldma
88
ms.devlang: java
99
ms.subservice: cosmosdb-sql
1010
ms.topic: troubleshooting
1111
ms.reviewer: sngun
1212
---
1313

1414
# Troubleshoot issues when you use the Java Async SDK with Azure Cosmos DB SQL API accounts
15+
16+
> [!IMPORTANT]
17+
> This article covers troubleshooting for Async Java SDK v2 only. Please see the __[Release Notes]()__ and __[Maven repository]()__ for more information.
18+
>
19+
1520
This article covers common issues, workarounds, diagnostic steps, and tools when you use the [Java Async SDK](sql-api-sdk-async-java.md) with Azure Cosmos DB SQL API accounts.
1621
The Java Async SDK provides client-side logical representation to access the Azure Cosmos DB SQL API. This article describes tools and approaches to help you if you run into any issues.
1722

@@ -165,7 +170,7 @@ Exception in thread "main" java.lang.NoSuchMethodError: rx.Observable.toSingle()
165170

166171
The above exception suggests you have a dependency on an older version of RxJava lib (e.g., 1.2.2). Our SDK relies on RxJava 1.3.8 which has APIs not available in earlier version of RxJava.
167172

168-
The workaround for such issuses is to identify which other dependency brings in RxJava-1.2.2 and exclude the transitive dependency on RxJava-1.2.2, and allow CosmosDB SDK bring the newer version.
173+
The workaround for such issues is to identify which other dependency brings in RxJava-1.2.2 and exclude the transitive dependency on RxJava-1.2.2, and allow CosmosDB SDK bring the newer version.
169174

170175
To identify which library brings in RxJava-1.2.2 run the following command next to your project pom.xml file:
171176
```bash
Lines changed: 257 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,257 @@
1+
---
2+
title: Diagnose and troubleshoot Azure Cosmos DB Java SDK v4
3+
description: Use features like client-side logging and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues in Java SDK v4.
4+
author: anfeldma-ms
5+
ms.service: cosmos-db
6+
ms.date: 05/04/2020
7+
ms.author: anfeldma
8+
ms.devlang: java
9+
ms.subservice: cosmosdb-sql
10+
ms.topic: troubleshooting
11+
---
12+
13+
# Troubleshoot issues when you use Java SDK v4 with Azure Cosmos DB SQL API accounts
14+
15+
> [!IMPORTANT]
16+
> This article covers troubleshooting for Java SDK v4 only. Please see the __[Release Notes]()__ and __[Maven repository](https://mvnrepository.com/artifact/com.azure/azure-cosmos)__ for more information.
17+
>
18+
19+
This article covers common issues, workarounds, diagnostic steps, and tools when you use [Java SDK v4]() with Azure Cosmos DB SQL API accounts.
20+
Java SDK v4 provides client-side logical representation to access the Azure Cosmos DB SQL API. This article describes tools and approaches to help you if you run into any issues.
21+
22+
Start with this list:
23+
24+
* Take a look at the [Common issues and workarounds] section in this article.
25+
* Look at the SDK, which is available [open source on GitHub](https://github.com/Azure/azure-cosmosdb-java). It has an [issues section](https://github.com/Azure/azure-cosmosdb-java/issues) that's actively monitored. Check to see if any similar issue with a workaround is already filed.
26+
* Review the [performance tips for Java SDK v4](), and follow the suggested practices.
27+
* Read the rest of this article, if you didn't find a solution. Then file a [GitHub issue](https://github.com/Azure/azure-sdk-for-java/issues).
28+
29+
## <a name="common-issues-workarounds"></a>Common issues and workarounds
30+
31+
### Network issues, Netty read timeout failure, low throughput, high latency
32+
33+
#### General suggestions
34+
* Make sure the app is running on the same region as your Azure Cosmos DB account.
35+
* Check the CPU usage on the host where the app is running. If CPU usage is 90 percent or more, run your app on a host with a higher configuration. Or you can distribute the load on more machines.
36+
37+
#### Connection throttling
38+
Connection throttling can happen because of either a [connection limit on a host machine] or [Azure SNAT (PAT) port exhaustion].
39+
40+
##### <a name="connection-limit-on-host"></a>Connection limit on a host machine
41+
Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this number limits the total number of connections, too.
42+
Run the following command.
43+
44+
```bash
45+
ulimit -a
46+
```
47+
The number of max allowed open files, which are identified as "nofile," needs to be at least double your connection pool size. For more information, see [Performance tips]().
48+
49+
##### <a name="snat"></a>Azure SNAT (PAT) port exhaustion
50+
51+
If your app is deployed on Azure Virtual Machines without a public IP address, by default [Azure SNAT ports](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports) establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the [Azure SNAT configuration](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports).
52+
53+
Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. There are two workarounds to avoid Azure SNAT limitation:
54+
55+
* Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview).
56+
57+
When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl).
58+
* Assign a public IP to your Azure VM.
59+
60+
##### <a name="cant-connect"></a>Can't reach the Service - firewall
61+
``ConnectTimeoutException`` indicates that the SDK cannot reach the service.
62+
You may get a failure similar to the following when using the direct mode:
63+
```
64+
GoneException{error=null, resourceAddress='https://cdb-ms-prod-westus-fd4.documents.azure.com:14940/apps/e41242a5-2d71-5acb-2e00-5e5f744b12de/services/d8aa21a5-340b-21d4-b1a2-4a5333e7ed8a/partitions/ed028254-b613-4c2a-bf3c-14bd5eb64500/replicas/131298754052060051p//', statusCode=410, message=Message: The requested resource is no longer available at the server., getCauseInfo=[class: class io.netty.channel.ConnectTimeoutException, message: connection timed out: cdb-ms-prod-westus-fd4.documents.azure.com/101.13.12.5:14940]
65+
```
66+
67+
If you have a firewall running on your app machine, open port range 10,000 to 20,000 which are used by the direct mode.
68+
Also follow the [Connection limit on a host machine](#connection-limit-on-host).
69+
70+
#### HTTP proxy
71+
72+
If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK `ConnectionPolicy`.
73+
Otherwise, you face connection issues.
74+
75+
#### Invalid coding pattern: Blocking Netty IO thread
76+
77+
The SDK uses the [Netty](https://netty.io/) IO library to communicate with Azure Cosmos DB. The SDK has an Async API and uses non-blocking IO APIs of Netty. The SDK's IO work is performed on IO Netty threads. The number of IO Netty threads is configured to be the same as the number of CPU cores of the app machine.
78+
79+
The Netty IO threads are meant to be used only for non-blocking Netty IO work. The SDK returns the API invocation result on one of the Netty IO threads to the app's code. If the app performs a long-lasting operation after it receives results on the Netty thread, the SDK might not have enough IO threads to perform its internal IO work. Such app coding might result in low throughput, high latency, and `io.netty.handler.timeout.ReadTimeoutException` failures. The workaround is to switch the thread when you know the operation takes time.
80+
81+
For example, take a look at the following code snippet. You might perform long-lasting work that takes more than a few milliseconds on the Netty thread. If so, you eventually can get into a state where no Netty IO thread is present to process IO work. As a result, you get a ReadTimeoutException failure.
82+
```java
83+
@Test
84+
public void badCodeWithReadTimeoutException() throws Exception {
85+
int requestTimeoutInSeconds = 10;
86+
87+
ConnectionPolicy policy = new ConnectionPolicy();
88+
policy.setRequestTimeoutInMillis(requestTimeoutInSeconds * 1000);
89+
90+
AsyncDocumentClient testClient = new AsyncDocumentClient.Builder()
91+
.withServiceEndpoint(TestConfigurations.HOST)
92+
.withMasterKeyOrResourceToken(TestConfigurations.MASTER_KEY)
93+
.withConnectionPolicy(policy)
94+
.build();
95+
96+
int numberOfCpuCores = Runtime.getRuntime().availableProcessors();
97+
int numberOfConcurrentWork = numberOfCpuCores + 1;
98+
CountDownLatch latch = new CountDownLatch(numberOfConcurrentWork);
99+
AtomicInteger failureCount = new AtomicInteger();
100+
101+
for (int i = 0; i < numberOfConcurrentWork; i++) {
102+
Document docDefinition = getDocumentDefinition();
103+
Observable<ResourceResponse<Document>> createObservable = testClient
104+
.createDocument(getCollectionLink(), docDefinition, null, false);
105+
createObservable.subscribe(r -> {
106+
try {
107+
// Time-consuming work is, for example,
108+
// writing to a file, computationally heavy work, or just sleep.
109+
// Basically, it's anything that takes more than a few milliseconds.
110+
// Doing such operations on the IO Netty thread
111+
// without a proper scheduler will cause problems.
112+
// The subscriber will get a ReadTimeoutException failure.
113+
TimeUnit.SECONDS.sleep(2 * requestTimeoutInSeconds);
114+
} catch (Exception e) {
115+
}
116+
},
117+
118+
exception -> {
119+
//It will be io.netty.handler.timeout.ReadTimeoutException.
120+
exception.printStackTrace();
121+
failureCount.incrementAndGet();
122+
latch.countDown();
123+
},
124+
() -> {
125+
latch.countDown();
126+
});
127+
}
128+
129+
latch.await();
130+
assertThat(failureCount.get()).isGreaterThan(0);
131+
}
132+
```
133+
The workaround is to change the thread on which you perform work that takes time. Define a singleton instance of the scheduler for your app.
134+
```java
135+
// Have a singleton instance of an executor and a scheduler.
136+
ExecutorService ex = Executors.newFixedThreadPool(30);
137+
Scheduler customScheduler = rx.schedulers.Schedulers.from(ex);
138+
```
139+
You might need to do work that takes time, for example, computationally heavy work or blocking IO. In this case, switch the thread to a worker provided by your `customScheduler` by using the `.observeOn(customScheduler)` API.
140+
```java
141+
Observable<ResourceResponse<Document>> createObservable = client
142+
.createDocument(getCollectionLink(), docDefinition, null, false);
143+
144+
createObservable
145+
.observeOn(customScheduler) // Switches the thread.
146+
.subscribe(
147+
// ...
148+
);
149+
```
150+
By using `observeOn(customScheduler)`, you release the Netty IO thread and switch to your own custom thread provided by the custom scheduler.
151+
This modification solves the problem. You won't get a `io.netty.handler.timeout.ReadTimeoutException` failure anymore.
152+
153+
### Connection pool exhausted issue
154+
155+
`PoolExhaustedException` is a client-side failure. This failure indicates that your app workload is higher than what the SDK connection pool can serve. Increase the connection pool size or distribute the load on multiple apps.
156+
157+
### Request rate too large
158+
This failure is a server-side failure. It indicates that you consumed your provisioned throughput. Retry later. If you get this failure often, consider an increase in the collection throughput.
159+
160+
### Failure connecting to Azure Cosmos DB emulator
161+
162+
The Azure Cosmos DB emulator HTTPS certificate is self-signed. For the SDK to work with the emulator, import the emulator certificate to a Java TrustStore. For more information, see [Export Azure Cosmos DB emulator certificates](local-emulator-export-ssl-certificates.md).
163+
164+
### Dependency Conflict Issues
165+
166+
```console
167+
Exception in thread "main" java.lang.NoSuchMethodError: rx.Observable.toSingle()Lrx/Single;
168+
```
169+
170+
The above exception suggests you have a dependency on an older version of RxJava lib (e.g., 1.2.2). Our SDK relies on RxJava 1.3.8 which has APIs not available in earlier version of RxJava.
171+
172+
The workaround for such issuses is to identify which other dependency brings in RxJava-1.2.2 and exclude the transitive dependency on RxJava-1.2.2, and allow CosmosDB SDK bring the newer version.
173+
174+
To identify which library brings in RxJava-1.2.2 run the following command next to your project pom.xml file:
175+
```bash
176+
mvn dependency:tree
177+
```
178+
For more information, see the [maven dependency tree guide](https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html).
179+
180+
Once you identify RxJava-1.2.2 is transitive dependency of which other dependency of your project, you can modify the dependency on that lib in your pom file and exclude RxJava transitive dependency it:
181+
182+
```xml
183+
<dependency>
184+
<groupId>${groupid-of-lib-which-brings-in-rxjava1.2.2}</groupId>
185+
<artifactId>${artifactId-of-lib-which-brings-in-rxjava1.2.2}</artifactId>
186+
<version>${version-of-lib-which-brings-in-rxjava1.2.2}</version>
187+
<exclusions>
188+
<exclusion>
189+
<groupId>io.reactivex</groupId>
190+
<artifactId>rxjava</artifactId>
191+
</exclusion>
192+
</exclusions>
193+
</dependency>
194+
```
195+
196+
For more information, see the [exclude transitive dependency guide](https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html).
197+
198+
199+
## <a name="enable-client-sice-logging"></a>Enable client SDK logging
200+
201+
The Java Async SDK uses SLF4j as the logging facade that supports logging into popular logging frameworks such as log4j and logback.
202+
203+
For example, if you want to use log4j as the logging framework, add the following libs in your Java classpath.
204+
205+
```xml
206+
<dependency>
207+
<groupId>org.slf4j</groupId>
208+
<artifactId>slf4j-log4j12</artifactId>
209+
<version>${slf4j.version}</version>
210+
</dependency>
211+
<dependency>
212+
<groupId>log4j</groupId>
213+
<artifactId>log4j</artifactId>
214+
<version>${log4j.version}</version>
215+
</dependency>
216+
```
217+
218+
Also add a log4j config.
219+
```
220+
# this is a sample log4j configuration
221+
222+
# Set root logger level to DEBUG and its only appender to A1.
223+
log4j.rootLogger=INFO, A1
224+
225+
log4j.category.com.microsoft.azure.cosmosdb=DEBUG
226+
#log4j.category.io.netty=INFO
227+
#log4j.category.io.reactivex=INFO
228+
# A1 is set to be a ConsoleAppender.
229+
log4j.appender.A1=org.apache.log4j.ConsoleAppender
230+
231+
# A1 uses PatternLayout.
232+
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
233+
log4j.appender.A1.layout.ConversionPattern=%d %5X{pid} [%t] %-5p %c - %m%n
234+
```
235+
236+
For more information, see the [sfl4j logging manual](https://www.slf4j.org/manual.html).
237+
238+
## <a name="netstats"></a>OS network statistics
239+
Run the netstat command to get a sense of how many connections are in states such as `ESTABLISHED` and `CLOSE_WAIT`.
240+
241+
On Linux, you can run the following command.
242+
```bash
243+
netstat -nap
244+
```
245+
Filter the result to only connections to the Azure Cosmos DB endpoint.
246+
247+
The number of connections to the Azure Cosmos DB endpoint in the `ESTABLISHED` state can't be greater than your configured connection pool size.
248+
249+
Many connections to the Azure Cosmos DB endpoint might be in the `CLOSE_WAIT` state. There might be more than 1,000. A number that high indicates that connections are established and torn down quickly. This situation potentially causes problems. For more information, see the [Common issues and workarounds] section.
250+
251+
<!--Anchors-->
252+
[Common issues and workarounds]: #common-issues-workarounds
253+
[Enable client SDK logging]: #enable-client-sice-logging
254+
[Connection limit on a host machine]: #connection-limit-on-host
255+
[Azure SNAT (PAT) port exhaustion]: #snat
256+
257+

0 commit comments

Comments
 (0)