Skip to content

Conversation

LikeTheSalad
Copy link
Contributor

Fixes: open-telemetry/opentelemetry-android#1134

Summary

This is to avoid the following crash from happening when a grpc exporter is shut down while okhttp is waiting to establish a connection with the server.

java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1765)
	at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515)
	at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677)
	at okhttp3.internal.connection.FastFallbackExchangeFinder.awaitTcpConnect(FastFallbackExchangeFinder.kt:162)
	at okhttp3.internal.connection.FastFallbackExchangeFinder.find(FastFallbackExchangeFinder.kt:69)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:280)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:126)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:101)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:126)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:85)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:126)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:126)
	at io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor.intercept(RetryInterceptor.java:96)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:126)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:208)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:530)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:651)
	at java.lang.Thread.run(Thread.java:1119)

Description

The crash happens when okhttp gets an unhandled interrupted exception here which propagates to the host app.

The proposed solution is to avoid interrupting the thread, so that the execution will return here after a timeout, and then finish here as the call is cancelled prior to attempting to shut down the executor.

Alternative solution

Asking okhttp to handle possible java.lang.InterruptedExceptions here.

@@ -214,7 +214,7 @@ private static String grpcMessage(Response response) {
public CompletableResultCode shutdown() {
client.dispatcher().cancelAll();
if (managedExecutor) {
client.dispatcher().executorService().shutdownNow();
client.dispatcher().executorService().shutdown();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if the executor is using daemon threads, or do we need to worry about this possibly impacting application shutdown?

Copy link

@bidetofevil bidetofevil Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So managedExecutor is only true if no ExecutorService is NOT passed into this class, which means we are purging the ExecutorService only when we are using OktHttp default ExecutorService for the Dispatcher.

Shouldn't it be the other way around, i.e. we only explicitly shut down the executor if we pass it in ourselves, and leave the clean up to OkHttp if it's providing its own executor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So managedExecutor is only true if no ExecutorService is passed into this class, which means we are purging the ExecutorService only when we are using OktHttp default ExecutorService for the Dispatcher.

I think we never use okhttp's default service, as we set one here when users don't provide any.

do you know if the executor is using daemon threads, or do we need to worry about this possibly impacting application shutdown?

We are using daemon threads, but it's not guaranteed that it will always be the case. I didn't think of the app not closing due to a non-daemon thread scenario, because Android will still close an app regardless of existing non-daemon threads running, but it's a great point. Also, @laurit mentioned below that we're using shutdownNow in many places, so this doesn't seem to be a proper fix overall. I'll work on an alternative.

Copy link

@bidetofevil bidetofevil Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL, sorry, I flipped the logic yesterday. What I meant to say is that managedExecutor is only true if you DON'T pass in an executor, so this codepath should never be hit if we always passed one in.

THAT was the logic that I thought seems a bit off.

That being said, handing the uncaught exception explicitly seems like a good idea. Android won't crash the app until the uncaught exception is thrown on the main thread, but silently crashing a background thread isn't great either if you can deal with it explicitly.

@laurit
Copy link
Contributor

laurit commented Aug 12, 2025

I guess the issue here is that the default uncaught exception handler on android kills the application while in java it prints the exception to stderr. I doubt that this issue is specific only to the grpc+okhttp, there are other places in the sdk that call shutdownNow. If the user is passing an executor the user could use a custom thread factory that overrides the default uncaught exception behavior. For the default case https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/common/src/main/java/io/opentelemetry/sdk/internal/DaemonThreadFactory.java could be modified to set the uncaught exception handler or wrap the runnable.

@LikeTheSalad
Copy link
Contributor Author

For the default case https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/common/src/main/java/io/opentelemetry/sdk/internal/DaemonThreadFactory.java could be modified to set the uncaught exception handler or wrap the runnable.

This is a better alternative, so I've created this PR to go with this approach instead. Thanks, @laurit.

@LikeTheSalad LikeTheSalad deleted the fix-grpc-shutdown branch August 13, 2025 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

App crashes after some period of downtime of the otel endpoint
4 participants