
Stale check command for async connections #547


Status: Open. ok2c wants to merge 1 commit into master from stale_conn_check_async.

Conversation

ok2c (Member) commented Aug 10, 2025

@rschmitt This change-set introduces a relatively cheap 'stale' connection check command that works with both HTTP/1.1 and H2 protocols and can be used instead of a more expensive Ping command.

Please take a look.
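
For readers following along, here is a minimal sketch of what such a command might look like, assuming it carries a FutureCallback&lt;Boolean&gt; that is completed with the outcome of the check. The class name, fields, and completion logic below are illustrative and are not copied from this change set; the inline review further down quotes the actual fragments under discussion.

    import java.util.concurrent.atomic.AtomicBoolean;

    import org.apache.hc.core5.concurrent.FutureCallback;
    import org.apache.hc.core5.reactor.Command;

    // Illustrative sketch only; not the code proposed in this PR.
    final class StaleCheckCommandSketch implements Command {

        private final FutureCallback<Boolean> callback;
        private final AtomicBoolean done = new AtomicBoolean();

        StaleCheckCommandSketch(final FutureCallback<Boolean> callback) {
            this.callback = callback;
        }

        // Invoked by the I/O event handler once the connection state is known.
        void complete(final boolean usable) {
            if (done.compareAndSet(false, true)) {
                callback.completed(usable);
            }
        }

        @Override
        public boolean cancel() {
            // A cancelled check reports the connection as not reusable rather
            // than leaving the caller waiting (see the review discussion below).
            complete(false);
            return true;
        }
    }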

rschmitt (Contributor) commented:

Do you have the corresponding client changes? I tried integrating this into PoolingAsyncClientConnectionManager after the fashion of the H2 PingCommand code path, and now my requests don't finish.

I think there might be a race condition here involving command execution and connection closure. I remember a few years ago there was an issue where enabling inactive connection validation would cause the client to hang by submitting a PingCommand on a connection that had already received a GOAWAY. My records indicate that this bug was fixed, but now I'm not so sure, or maybe it was reintroduced. In the debugger, I don't even see the IOReactor wake up when the command is submitted; readyCount comes back as 0. It doesn't look like there's anything in IOSessionImpl::enqueue that prevents a Command from being submitted against an already-closed connection.


A Contributor commented on this part of the diff:

    @Override
    public boolean cancel() {
        return true;

Cancel the callback.

A Contributor commented on this part of the diff:

        callback.completed(false);
    }
    final ByteBuffer buffer = ByteBuffer.allocate(0);
    final int bytesRead = ioSession.channel().read(buffer);

I think this read is actually redundant. Command is modeled as a write event, and reads are processed before writes. So if the connection is closed, we'll already know by the time we get here, provided that we include the changes from #543.

The end result is very similar to the "synchronization barrier" between the event loop and the connection pool that I was talking about in that PR. The internal race condition basically goes away as long as connection reuse is completed through the event loop, which then has a chance to update all the relevant bookkeeping with respect to whatever IO events are pending.
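
As a rough illustration of the ordering being described (this is a hypothetical simplification, not the actual IOReactor internals), read readiness on a selected key is handled before write readiness, so a command processed as a write event already sees a session that the read path has marked as closed:

    import java.nio.channels.SelectionKey;

    // Hypothetical, simplified dispatch loop; the Session methods are illustrative.
    final class DispatchOrderSketch {

        interface Session {
            void onInputReady();   // read path: detects EOF or GOAWAY and marks the session closed
            void onOutputReady();  // write path: drains queued commands such as the stale check
        }

        static void dispatch(final int readyOps, final Session session) {
            if ((readyOps & SelectionKey.OP_READ) != 0) {
                session.onInputReady();
            }
            if ((readyOps & SelectionKey.OP_WRITE) != 0) {
                session.onOutputReady();
            }
        }
    }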

rschmitt (Contributor) commented Aug 11, 2025

I made the following changes locally:

  1. I pulled in this change
  2. I added #543 ("Mark HTTP/1.1 async connection as not open (non-reusable) as soon as it becomes closed by the opposite endpoint"), specifically dfa2cd5
  3. I taught PoolingAsyncClientConnectionManager to submit a StaleCheckCommand as the implementation of validateAfterInactivity on HTTP/1.1 connections (a rough sketch follows this list)
  4. I set setValidateAfterInactivity(ZERO_MILLISECONDS) in TestConnectionClosureRace
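
A rough sketch of what step 3 amounts to, reusing the hypothetical StaleCheckCommandSketch from the earlier sketch; the real PoolingAsyncClientConnectionManager wiring is more involved, and none of these names are taken from the actual client patch:

    import java.util.concurrent.CompletableFuture;

    import org.apache.hc.core5.concurrent.FutureCallback;
    import org.apache.hc.core5.reactor.Command;
    import org.apache.hc.core5.reactor.IOSession;

    // Illustrative only: gate reuse of an idle connection on a stale check
    // executed through the event loop instead of handing the connection out directly.
    final class ValidateAfterInactivitySketch {

        static CompletableFuture<Boolean> validate(final IOSession ioSession,
                                                   final long idleMillis,
                                                   final long validateAfterInactivityMillis) {
            final CompletableFuture<Boolean> result = new CompletableFuture<>();
            if (idleMillis < validateAfterInactivityMillis) {
                result.complete(true); // used recently enough; skip the check
                return result;
            }
            ioSession.enqueue(new StaleCheckCommandSketch(new FutureCallback<Boolean>() {

                @Override
                public void completed(final Boolean usable) {
                    result.complete(usable);
                }

                @Override
                public void failed(final Exception ex) {
                    result.complete(false);
                }

                @Override
                public void cancelled() {
                    result.complete(false);
                }

            }), Command.Priority.NORMAL);
            return result;
        }
    }

The design point raised above is that the usable/not-usable decision is made on the event loop, so it can take pending IO events into account before the pool hands the connection back out.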

When I do all of this, the results are dramatic:

Http: Sequential requests (rapid): 2,500 succeeded; 0 failed (100.00% success rate)
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Http: Single large batch: 30 succeeded; 0 failed (100.00% success rate)
Http: Multiple small batches: 15 succeeded; 0 failed (100.00% success rate)

Https: Sequential requests (rapid): 2,499 succeeded; 1 failed (99.96% success rate, 0.00% retriable)
Https: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Https: Single large batch: 29 succeeded; 1 failed (96.67% success rate, 0.00% retriable)
Https: Multiple small batches: 13 succeeded; 2 failed (86.67% success rate, 0.00% retriable)

When I disable inactive connection validation, I get the same results I've been getting:

Http: Sequential requests (rapid): 2,494 succeeded; 6 failed (99.76% success rate, 0.24% retriable)
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Http: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 50.00% retriable)
Http: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 33.33% retriable)

Https: Sequential requests (rapid): 2,476 succeeded; 24 failed (99.04% success rate, 0.96% retriable)
Https: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Https: Single large batch: 15 succeeded; 15 failed (50.00% success rate, 50.00% retriable)
Https: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate, 26.67% retriable)

We're definitely on the right track. If I'm right about the IO in doStalecheck, that code can probably all be deleted, and StaleCheckCommand can be renamed ConnectionLeaseCommand or something.

rschmitt (Contributor) commented:

> I think there might be a race condition here involving command execution and connection closure.

I think I was mistaken about this. The actual issue might have been the no-op implementation of StaleCheckCommand::cancel. IOSessionImpl::enqueue enqueues the command, but then cancels it if isStatusClosed(), which seems like it should be okay.
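
A simplified sketch of the enqueue behaviour described here (hypothetical names; not the actual IOSessionImpl code): the command is queued unconditionally and then cancelled if the session is already closed, which is exactly why a cancel() that does not complete the callback would leave the submitter waiting forever.

    import java.util.Deque;
    import java.util.concurrent.ConcurrentLinkedDeque;

    import org.apache.hc.core5.reactor.Command;

    // Hypothetical simplification for illustration; field and method names are assumed.
    final class EnqueueSketch {

        private final Deque<Command> commandQueue = new ConcurrentLinkedDeque<>();
        private volatile boolean closed;

        void enqueue(final Command command, final Command.Priority priority) {
            if (priority == Command.Priority.IMMEDIATE) {
                commandQueue.addFirst(command);
            } else {
                commandQueue.addLast(command);
            }
            requestWriteEvent();        // wake the reactor so the command gets processed
            if (closed) {
                command.cancel();       // relies on cancel() completing or failing the callback
            }
        }

        void markClosed() {
            closed = true;
        }

        private void requestWriteEvent() {
            // In the real session this registers write interest and wakes the selector.
        }
    }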

ok2c (Member, Author) commented Aug 12, 2025

> We're definitely on the right track. If I'm right about the IO in doStalecheck, that code can probably all be deleted, and StaleCheckCommand can be renamed ConnectionLeaseCommand or something.

@rschmitt The results look encouraging.

ok2c (Member, Author) commented Aug 12, 2025

@rschmitt I will fix the problem with #cancel and see what else could be improved.

ok2c force-pushed the stale_conn_check_async branch from e67636b to f70f0fd on August 12, 2025 at 12:48.