
Stale check command for async connections #547


Status: Open. ok2c wants to merge 1 commit into master from stale_conn_check_async.

Conversation

ok2c (Member) commented Aug 10, 2025

@rschmitt This change-set introduces a relatively cheap 'stale' connection check command that works with both HTTP/1.1 and H2 protocols and can be used instead of a more expensive Ping command.

Please take a look.
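
For readers following along, here is a minimal sketch of what such a command might look like, assuming it carries a FutureCallback&lt;Boolean&gt; that is completed with the outcome of the check. The class name, fields, and completion logic below are illustrative and are not copied from this change set; the inline review further down quotes the actual fragments under discussion.

    import java.util.concurrent.atomic.AtomicBoolean;

    import org.apache.hc.core5.concurrent.FutureCallback;
    import org.apache.hc.core5.reactor.Command;

    // Illustrative sketch only; not the code proposed in this PR.
    final class StaleCheckCommandSketch implements Command {

        private final FutureCallback<Boolean> callback;
        private final AtomicBoolean done = new AtomicBoolean();

        StaleCheckCommandSketch(final FutureCallback<Boolean> callback) {
            this.callback = callback;
        }

        // Invoked by the I/O event handler once the connection state is known.
        void complete(final boolean usable) {
            if (done.compareAndSet(false, true)) {
                callback.completed(usable);
            }
        }

        @Override
        public boolean cancel() {
            // A cancelled check reports the connection as not reusable rather
            // than leaving the caller waiting (see the review discussion below).
            complete(false);
            return true;
        }
    }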

rschmitt (Contributor) commented:

Do you have the corresponding client changes? I tried integrating this into PoolingAsyncClientConnectionManager after the fashion of the H2 PingCommand code path, and now my requests don't finish.

I think there might be a race condition here involving command execution and connection closure. I remember a few years ago there was an issue where enabling inactive connection validation would cause the client to hang by submitting a PingCommand on a connection that had already received a GOAWAY. My records indicate that this bug was fixed, but now I'm not so sure, or maybe it was reintroduced. In the debugger, I don't even see the IOReactor wake up when the command is submitted; readyCount comes back as 0. It doesn't look like there's anything in IOSessionImpl::enqueue that prevents a Command from being submitted against an already-closed connection.


A Contributor commented on this part of the diff:

    @Override
    public boolean cancel() {
        return true;

Cancel the callback.

A Contributor commented on this part of the diff:

        callback.completed(false);
    }
    final ByteBuffer buffer = ByteBuffer.allocate(0);
    final int bytesRead = ioSession.channel().read(buffer);

I think this read is actually redundant. Command is modeled as a write event, and reads are processed before writes. So if the connection is closed, we'll already know by the time we get here, provided that we include the changes from #543.

The end result is very similar to the "synchronization barrier" between the event loop and the connection pool that I was talking about in that PR. The internal race condition basically goes away as long as connection reuse is completed through the event loop, which then has a chance to update all the relevant bookkeeping with respect to whatever IO events are pending.
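
As a rough illustration of the ordering being described (this is a hypothetical simplification, not the actual IOReactor internals), read readiness on a selected key is handled before write readiness, so a command processed as a write event already sees a session that the read path has marked as closed:

    import java.nio.channels.SelectionKey;

    // Hypothetical, simplified dispatch loop; the Session methods are illustrative.
    final class DispatchOrderSketch {

        interface Session {
            void onInputReady();   // read path: detects EOF or GOAWAY and marks the session closed
            void onOutputReady();  // write path: drains queued commands such as the stale check
        }

        static void dispatch(final int readyOps, final Session session) {
            if ((readyOps & SelectionKey.OP_READ) != 0) {
                session.onInputReady();
            }
            if ((readyOps & SelectionKey.OP_WRITE) != 0) {
                session.onOutputReady();
            }
        }
    }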

rschmitt (Contributor) commented Aug 11, 2025

I made the following changes locally:

  1. I pulled in this change
  2. I added #543 ("Mark HTTP/1.1 async connection as not open (non-reusable) as soon as it becomes closed by the opposite endpoint"), specifically dfa2cd5
  3. I taught PoolingAsyncClientConnectionManager to submit a StaleCheckCommand as the implementation of validateAfterInactivity on HTTP/1.1 connections (a rough sketch follows this list)
  4. I set setValidateAfterInactivity(ZERO_MILLISECONDS) in TestConnectionClosureRace
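
A rough sketch of what step 3 amounts to, reusing the hypothetical StaleCheckCommandSketch from the earlier sketch; the real PoolingAsyncClientConnectionManager wiring is more involved, and none of these names are taken from the actual client patch:

    import java.util.concurrent.CompletableFuture;

    import org.apache.hc.core5.concurrent.FutureCallback;
    import org.apache.hc.core5.reactor.Command;
    import org.apache.hc.core5.reactor.IOSession;

    // Illustrative only: gate reuse of an idle connection on a stale check
    // executed through the event loop instead of handing the connection out directly.
    final class ValidateAfterInactivitySketch {

        static CompletableFuture<Boolean> validate(final IOSession ioSession,
                                                   final long idleMillis,
                                                   final long validateAfterInactivityMillis) {
            final CompletableFuture<Boolean> result = new CompletableFuture<>();
            if (idleMillis < validateAfterInactivityMillis) {
                result.complete(true); // used recently enough; skip the check
                return result;
            }
            ioSession.enqueue(new StaleCheckCommandSketch(new FutureCallback<Boolean>() {

                @Override
                public void completed(final Boolean usable) {
                    result.complete(usable);
                }

                @Override
                public void failed(final Exception ex) {
                    result.complete(false);
                }

                @Override
                public void cancelled() {
                    result.complete(false);
                }

            }), Command.Priority.NORMAL);
            return result;
        }
    }

The design point raised above is that the usable/not-usable decision is made on the event loop, so it can take pending IO events into account before the pool hands the connection back out.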

When I do all of this, the results are dramatic:

Http: Sequential requests (rapid): 2,500 succeeded; 0 failed (100.00% success rate)
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Http: Single large batch: 30 succeeded; 0 failed (100.00% success rate)
Http: Multiple small batches: 15 succeeded; 0 failed (100.00% success rate)

Https: Sequential requests (rapid): 2,499 succeeded; 1 failed (99.96% success rate, 0.00% retriable)
Https: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Https: Single large batch: 29 succeeded; 1 failed (96.67% success rate, 0.00% retriable)
Https: Multiple small batches: 13 succeeded; 2 failed (86.67% success rate, 0.00% retriable)

When I disable inactive connection validation, I get the same results I've been getting:

Http: Sequential requests (rapid): 2,494 succeeded; 6 failed (99.76% success rate, 0.24% retriable)
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Http: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 50.00% retriable)
Http: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 33.33% retriable)

Https: Sequential requests (rapid): 2,476 succeeded; 24 failed (99.04% success rate, 0.96% retriable)
Https: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Https: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 50.00% retriable)
Https: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 26.67% retriable)

We're definitely on the right track. If I'm right about the IO in doStalecheck, that code can probably all be deleted, and StaleCheckCommand can be renamed ConnectionLeaseCommand or something.

rschmitt (Contributor) commented:

> I think there might be a race condition here involving command execution and connection closure.

I think I was mistaken about this. The actual issue might have been the no-op implementation of StaleCheckCommand::cancel. IOSessionImpl::enqueue enqueues the command, but then cancels it if isStatusClosed(), which seems like it should be okay.
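
A simplified sketch of the enqueue behaviour described here (hypothetical names; not the actual IOSessionImpl code): the command is queued unconditionally and then cancelled if the session is already closed, which is exactly why a cancel() that does not complete the callback would leave the submitter waiting forever.

    import java.util.Deque;
    import java.util.concurrent.ConcurrentLinkedDeque;

    import org.apache.hc.core5.reactor.Command;

    // Hypothetical simplification for illustration; field and method names are assumed.
    final class EnqueueSketch {

        private final Deque<Command> commandQueue = new ConcurrentLinkedDeque<>();
        private volatile boolean closed;

        void enqueue(final Command command, final Command.Priority priority) {
            if (priority == Command.Priority.IMMEDIATE) {
                commandQueue.addFirst(command);
            } else {
                commandQueue.addLast(command);
            }
            requestWriteEvent();        // wake the reactor so the command gets processed
            if (closed) {
                command.cancel();       // relies on cancel() completing or failing the callback
            }
        }

        void markClosed() {
            closed = true;
        }

        private void requestWriteEvent() {
            // In the real session this registers write interest and wakes the selector.
        }
    }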

ok2c (Member, Author) commented Aug 12, 2025

> We're definitely on the right track. If I'm right about the IO in doStalecheck, that code can probably all be deleted, and StaleCheckCommand can be renamed ConnectionLeaseCommand or something.

@rschmitt The results look encouraging.

ok2c (Member, Author) commented Aug 12, 2025

@rschmitt I will fix the problem with #cancel and see what else could be improved.

ok2c force-pushed the stale_conn_check_async branch from e67636b to f70f0fd on August 12, 2025 at 12:48.