|
| 1 | +--- |
| 2 | +title: Diagnose and troubleshoot Azure Cosmos DB Java SDK v4 |
| 3 | +description: Use features like client-side logging and other third-party tools to identify, diagnose, and troubleshoot Azure Cosmos DB issues in Java SDK v4. |
| 4 | +author: anfeldma-ms |
| 5 | +ms.service: cosmos-db |
| 6 | +ms.date: 05/04/2020 |
| 7 | +ms.author: anfeldma |
| 8 | +ms.devlang: java |
| 9 | +ms.subservice: cosmosdb-sql |
| 10 | +ms.topic: troubleshooting |
| 11 | +--- |
| 12 | + |
| 13 | +# Troubleshoot issues when you use Java SDK v4 with Azure Cosmos DB SQL API accounts |
| 14 | + |
| 15 | +> [!IMPORTANT] |
| 16 | +> This article covers troubleshooting for Java SDK v4 only. Please see the __[Release Notes]()__ and __[Maven repository](https://mvnrepository.com/artifact/com.azure/azure-cosmos)__ for more information. |
| 17 | +> |
| 18 | +
|
| 19 | +This article covers common issues, workarounds, diagnostic steps, and tools when you use [Java SDK v4]() with Azure Cosmos DB SQL API accounts. |
| 20 | +Java SDK v4 provides client-side logical representation to access the Azure Cosmos DB SQL API. This article describes tools and approaches to help you if you run into any issues. |
| 21 | + |
| 22 | +Start with this list: |
| 23 | + |
| 24 | +* Take a look at the [Common issues and workarounds] section in this article. |
| 25 | +* Look at the SDK, which is available [open source on GitHub](https://github.com/Azure/azure-cosmosdb-java). It has an [issues section](https://github.com/Azure/azure-cosmosdb-java/issues) that's actively monitored. Check to see if any similar issue with a workaround is already filed. |
| 26 | +* Review the [performance tips for Java SDK v4](), and follow the suggested practices. |
| 27 | +* Read the rest of this article, if you didn't find a solution. Then file a [GitHub issue](https://github.com/Azure/azure-sdk-for-java/issues). |
| 28 | + |
| 29 | +## <a name="common-issues-workarounds"></a>Common issues and workarounds |
| 30 | + |
| 31 | +### Network issues, Netty read timeout failure, low throughput, high latency |
| 32 | + |
| 33 | +#### General suggestions |
| 34 | +* Make sure the app is running on the same region as your Azure Cosmos DB account. |
| 35 | +* Check the CPU usage on the host where the app is running. If CPU usage is 90 percent or more, run your app on a host with a higher configuration. Or you can distribute the load on more machines. |
| 36 | + |
| 37 | +#### Connection throttling |
| 38 | +Connection throttling can happen because of either a [connection limit on a host machine] or [Azure SNAT (PAT) port exhaustion]. |
| 39 | + |
| 40 | +##### <a name="connection-limit-on-host"></a>Connection limit on a host machine |
| 41 | +Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this number limits the total number of connections, too. |
| 42 | +Run the following command. |
| 43 | + |
| 44 | +```bash |
| 45 | +ulimit -a |
| 46 | +``` |
| 47 | +The number of max allowed open files, which are identified as "nofile," needs to be at least double your connection pool size. For more information, see [Performance tips](). |
| 48 | + |
| 49 | +##### <a name="snat"></a>Azure SNAT (PAT) port exhaustion |
| 50 | + |
| 51 | +If your app is deployed on Azure Virtual Machines without a public IP address, by default [Azure SNAT ports](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports) establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the [Azure SNAT configuration](https://docs.microsoft.com/azure/load-balancer/load-balancer-outbound-connections#preallocatedports). |
| 52 | + |
| 53 | + Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. There are two workarounds to avoid Azure SNAT limitation: |
| 54 | + |
| 55 | +* Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see [Azure Virtual Network service endpoints](https://docs.microsoft.com/azure/virtual-network/virtual-network-service-endpoints-overview). |
| 56 | + |
| 57 | + When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using [Virtual Network ACLs](https://docs.microsoft.com/azure/virtual-network/virtual-networks-acl). |
| 58 | +* Assign a public IP to your Azure VM. |
| 59 | + |
| 60 | +##### <a name="cant-connect"></a>Can't reach the Service - firewall |
| 61 | +``ConnectTimeoutException`` indicates that the SDK cannot reach the service. |
| 62 | +You may get a failure similar to the following when using the direct mode: |
| 63 | +``` |
| 64 | +GoneException{error=null, resourceAddress='https://cdb-ms-prod-westus-fd4.documents.azure.com:14940/apps/e41242a5-2d71-5acb-2e00-5e5f744b12de/services/d8aa21a5-340b-21d4-b1a2-4a5333e7ed8a/partitions/ed028254-b613-4c2a-bf3c-14bd5eb64500/replicas/131298754052060051p//', statusCode=410, message=Message: The requested resource is no longer available at the server., getCauseInfo=[class: class io.netty.channel.ConnectTimeoutException, message: connection timed out: cdb-ms-prod-westus-fd4.documents.azure.com/101.13.12.5:14940] |
| 65 | +``` |
| 66 | + |
| 67 | +If you have a firewall running on your app machine, open port range 10,000 to 20,000 which are used by the direct mode. |
| 68 | +Also follow the [Connection limit on a host machine](#connection-limit-on-host). |
| 69 | + |
| 70 | +#### HTTP proxy |
| 71 | + |
| 72 | +If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK `ConnectionPolicy`. |
| 73 | +Otherwise, you face connection issues. |
| 74 | + |
| 75 | +#### Invalid coding pattern: Blocking Netty IO thread |
| 76 | + |
| 77 | +The SDK uses the [Netty](https://netty.io/) IO library to communicate with Azure Cosmos DB. The SDK has an Async API and uses non-blocking IO APIs of Netty. The SDK's IO work is performed on IO Netty threads. The number of IO Netty threads is configured to be the same as the number of CPU cores of the app machine. |
| 78 | + |
| 79 | +The Netty IO threads are meant to be used only for non-blocking Netty IO work. The SDK returns the API invocation result on one of the Netty IO threads to the app's code. If the app performs a long-lasting operation after it receives results on the Netty thread, the SDK might not have enough IO threads to perform its internal IO work. Such app coding might result in low throughput, high latency, and `io.netty.handler.timeout.ReadTimeoutException` failures. The workaround is to switch the thread when you know the operation takes time. |
| 80 | + |
| 81 | +For example, take a look at the following code snippet. You might perform long-lasting work that takes more than a few milliseconds on the Netty thread. If so, you eventually can get into a state where no Netty IO thread is present to process IO work. As a result, you get a ReadTimeoutException failure. |
| 82 | +```java |
| 83 | +@Test |
| 84 | +public void badCodeWithReadTimeoutException() throws Exception { |
| 85 | + int requestTimeoutInSeconds = 10; |
| 86 | + |
| 87 | + ConnectionPolicy policy = new ConnectionPolicy(); |
| 88 | + policy.setRequestTimeoutInMillis(requestTimeoutInSeconds * 1000); |
| 89 | + |
| 90 | + AsyncDocumentClient testClient = new AsyncDocumentClient.Builder() |
| 91 | + .withServiceEndpoint(TestConfigurations.HOST) |
| 92 | + .withMasterKeyOrResourceToken(TestConfigurations.MASTER_KEY) |
| 93 | + .withConnectionPolicy(policy) |
| 94 | + .build(); |
| 95 | + |
| 96 | + int numberOfCpuCores = Runtime.getRuntime().availableProcessors(); |
| 97 | + int numberOfConcurrentWork = numberOfCpuCores + 1; |
| 98 | + CountDownLatch latch = new CountDownLatch(numberOfConcurrentWork); |
| 99 | + AtomicInteger failureCount = new AtomicInteger(); |
| 100 | + |
| 101 | + for (int i = 0; i < numberOfConcurrentWork; i++) { |
| 102 | + Document docDefinition = getDocumentDefinition(); |
| 103 | + Observable<ResourceResponse<Document>> createObservable = testClient |
| 104 | + .createDocument(getCollectionLink(), docDefinition, null, false); |
| 105 | + createObservable.subscribe(r -> { |
| 106 | + try { |
| 107 | + // Time-consuming work is, for example, |
| 108 | + // writing to a file, computationally heavy work, or just sleep. |
| 109 | + // Basically, it's anything that takes more than a few milliseconds. |
| 110 | + // Doing such operations on the IO Netty thread |
| 111 | + // without a proper scheduler will cause problems. |
| 112 | + // The subscriber will get a ReadTimeoutException failure. |
| 113 | + TimeUnit.SECONDS.sleep(2 * requestTimeoutInSeconds); |
| 114 | + } catch (Exception e) { |
| 115 | + } |
| 116 | + }, |
| 117 | + |
| 118 | + exception -> { |
| 119 | + //It will be io.netty.handler.timeout.ReadTimeoutException. |
| 120 | + exception.printStackTrace(); |
| 121 | + failureCount.incrementAndGet(); |
| 122 | + latch.countDown(); |
| 123 | + }, |
| 124 | + () -> { |
| 125 | + latch.countDown(); |
| 126 | + }); |
| 127 | + } |
| 128 | + |
| 129 | + latch.await(); |
| 130 | + assertThat(failureCount.get()).isGreaterThan(0); |
| 131 | +} |
| 132 | +``` |
| 133 | + The workaround is to change the thread on which you perform work that takes time. Define a singleton instance of the scheduler for your app. |
| 134 | + ```java |
| 135 | +// Have a singleton instance of an executor and a scheduler. |
| 136 | +ExecutorService ex = Executors.newFixedThreadPool(30); |
| 137 | +Scheduler customScheduler = rx.schedulers.Schedulers.from(ex); |
| 138 | + ``` |
| 139 | + You might need to do work that takes time, for example, computationally heavy work or blocking IO. In this case, switch the thread to a worker provided by your `customScheduler` by using the `.observeOn(customScheduler)` API. |
| 140 | +```java |
| 141 | +Observable<ResourceResponse<Document>> createObservable = client |
| 142 | + .createDocument(getCollectionLink(), docDefinition, null, false); |
| 143 | + |
| 144 | +createObservable |
| 145 | + .observeOn(customScheduler) // Switches the thread. |
| 146 | + .subscribe( |
| 147 | + // ... |
| 148 | + ); |
| 149 | +``` |
| 150 | +By using `observeOn(customScheduler)`, you release the Netty IO thread and switch to your own custom thread provided by the custom scheduler. |
| 151 | +This modification solves the problem. You won't get a `io.netty.handler.timeout.ReadTimeoutException` failure anymore. |
| 152 | + |
| 153 | +### Connection pool exhausted issue |
| 154 | + |
| 155 | +`PoolExhaustedException` is a client-side failure. This failure indicates that your app workload is higher than what the SDK connection pool can serve. Increase the connection pool size or distribute the load on multiple apps. |
| 156 | + |
| 157 | +### Request rate too large |
| 158 | +This failure is a server-side failure. It indicates that you consumed your provisioned throughput. Retry later. If you get this failure often, consider an increase in the collection throughput. |
| 159 | + |
| 160 | +### Failure connecting to Azure Cosmos DB emulator |
| 161 | + |
| 162 | +The Azure Cosmos DB emulator HTTPS certificate is self-signed. For the SDK to work with the emulator, import the emulator certificate to a Java TrustStore. For more information, see [Export Azure Cosmos DB emulator certificates](local-emulator-export-ssl-certificates.md). |
| 163 | + |
| 164 | +### Dependency Conflict Issues |
| 165 | + |
| 166 | +```console |
| 167 | +Exception in thread "main" java.lang.NoSuchMethodError: rx.Observable.toSingle()Lrx/Single; |
| 168 | +``` |
| 169 | + |
| 170 | +The above exception suggests you have a dependency on an older version of RxJava lib (e.g., 1.2.2). Our SDK relies on RxJava 1.3.8 which has APIs not available in earlier version of RxJava. |
| 171 | + |
| 172 | +The workaround for such issuses is to identify which other dependency brings in RxJava-1.2.2 and exclude the transitive dependency on RxJava-1.2.2, and allow CosmosDB SDK bring the newer version. |
| 173 | + |
| 174 | +To identify which library brings in RxJava-1.2.2 run the following command next to your project pom.xml file: |
| 175 | +```bash |
| 176 | +mvn dependency:tree |
| 177 | +``` |
| 178 | +For more information, see the [maven dependency tree guide](https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html). |
| 179 | + |
| 180 | +Once you identify RxJava-1.2.2 is transitive dependency of which other dependency of your project, you can modify the dependency on that lib in your pom file and exclude RxJava transitive dependency it: |
| 181 | + |
| 182 | +```xml |
| 183 | +<dependency> |
| 184 | + <groupId>${groupid-of-lib-which-brings-in-rxjava1.2.2}</groupId> |
| 185 | + <artifactId>${artifactId-of-lib-which-brings-in-rxjava1.2.2}</artifactId> |
| 186 | + <version>${version-of-lib-which-brings-in-rxjava1.2.2}</version> |
| 187 | + <exclusions> |
| 188 | + <exclusion> |
| 189 | + <groupId>io.reactivex</groupId> |
| 190 | + <artifactId>rxjava</artifactId> |
| 191 | + </exclusion> |
| 192 | + </exclusions> |
| 193 | +</dependency> |
| 194 | +``` |
| 195 | + |
| 196 | +For more information, see the [exclude transitive dependency guide](https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html). |
| 197 | + |
| 198 | + |
| 199 | +## <a name="enable-client-sice-logging"></a>Enable client SDK logging |
| 200 | + |
| 201 | +The Java Async SDK uses SLF4j as the logging facade that supports logging into popular logging frameworks such as log4j and logback. |
| 202 | + |
| 203 | +For example, if you want to use log4j as the logging framework, add the following libs in your Java classpath. |
| 204 | + |
| 205 | +```xml |
| 206 | +<dependency> |
| 207 | + <groupId>org.slf4j</groupId> |
| 208 | + <artifactId>slf4j-log4j12</artifactId> |
| 209 | + <version>${slf4j.version}</version> |
| 210 | +</dependency> |
| 211 | +<dependency> |
| 212 | + <groupId>log4j</groupId> |
| 213 | + <artifactId>log4j</artifactId> |
| 214 | + <version>${log4j.version}</version> |
| 215 | +</dependency> |
| 216 | +``` |
| 217 | + |
| 218 | +Also add a log4j config. |
| 219 | +``` |
| 220 | +# this is a sample log4j configuration |
| 221 | +
|
| 222 | +# Set root logger level to DEBUG and its only appender to A1. |
| 223 | +log4j.rootLogger=INFO, A1 |
| 224 | +
|
| 225 | +log4j.category.com.microsoft.azure.cosmosdb=DEBUG |
| 226 | +#log4j.category.io.netty=INFO |
| 227 | +#log4j.category.io.reactivex=INFO |
| 228 | +# A1 is set to be a ConsoleAppender. |
| 229 | +log4j.appender.A1=org.apache.log4j.ConsoleAppender |
| 230 | +
|
| 231 | +# A1 uses PatternLayout. |
| 232 | +log4j.appender.A1.layout=org.apache.log4j.PatternLayout |
| 233 | +log4j.appender.A1.layout.ConversionPattern=%d %5X{pid} [%t] %-5p %c - %m%n |
| 234 | +``` |
| 235 | + |
| 236 | +For more information, see the [sfl4j logging manual](https://www.slf4j.org/manual.html). |
| 237 | + |
| 238 | +## <a name="netstats"></a>OS network statistics |
| 239 | +Run the netstat command to get a sense of how many connections are in states such as `ESTABLISHED` and `CLOSE_WAIT`. |
| 240 | + |
| 241 | +On Linux, you can run the following command. |
| 242 | +```bash |
| 243 | +netstat -nap |
| 244 | +``` |
| 245 | +Filter the result to only connections to the Azure Cosmos DB endpoint. |
| 246 | + |
| 247 | +The number of connections to the Azure Cosmos DB endpoint in the `ESTABLISHED` state can't be greater than your configured connection pool size. |
| 248 | + |
| 249 | +Many connections to the Azure Cosmos DB endpoint might be in the `CLOSE_WAIT` state. There might be more than 1,000. A number that high indicates that connections are established and torn down quickly. This situation potentially causes problems. For more information, see the [Common issues and workarounds] section. |
| 250 | + |
| 251 | + <!--Anchors--> |
| 252 | +[Common issues and workarounds]: #common-issues-workarounds |
| 253 | +[Enable client SDK logging]: #enable-client-sice-logging |
| 254 | +[Connection limit on a host machine]: #connection-limit-on-host |
| 255 | +[Azure SNAT (PAT) port exhaustion]: #snat |
| 256 | + |
| 257 | + |
0 commit comments