Context
Follow-up from #2202 (LoadBalance.ROUND_ROBIN), raised in this review comment.
In round-robin mode each request is pinned to a resolved IP via RoundRobinPartitionKey(base, IP), but the connection permit is still per host (maxConnectionsPerHost, keyed by the base host key). These two facts interact badly under a configured connection cap.
Problem
- Request A opens an HTTP/2 connection to IP_A. It is registered in the H2 registry under its per-IP key (base, IP_A): NettyConnectListener calls registerHttp2Connection(future.getPartitionKey(), …), and getPartitionKey() returns the per-IP override.
- Request B is pinned to IP_B. The host is already at maxConnectionsPerHost, so acquirePartitionLockLazily() fails and B enters waitForHttp2Connection.
- B polls the registry with its own key (base, IP_B) and finds nothing, because A's connection lives only under (base, IP_A).
- B can neither open a new connection (host permit exhausted) nor reuse A's sibling connection. Off the event loop it spins for the full connectTimeout in waitForHttp2Connection and then fails with the original permit exception: a stall followed by a failure.
This only bites when maxConnectionsPerHost is configured; the default is unlimited, so most users never hit it. That is why #2202 ships an accurate doc as the stopgap rather than a behavioral change.
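The sequence above can be reproduced in a few lines. This is a minimal model, not the library's code: PerIpKey, hostPermits, h2Registry and request(...) are hypothetical stand-ins for RoundRobinPartitionKey, the per-host permit, the H2 registry and the request path, and the connectTimeout polling loop is collapsed into a single lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class RoundRobinStallSketch {
    // Hypothetical stand-in for RoundRobinPartitionKey(base, IP).
    record PerIpKey(String base, String ip) {}

    // Permit is per host: one semaphore per base key.
    static final Map<String, Semaphore> hostPermits = new ConcurrentHashMap<>();
    // Registry is exact-key: connections live under their per-IP key only.
    static final Map<PerIpKey, String> h2Registry = new ConcurrentHashMap<>();

    /** Returns the connection the request ends up with, or null if it is stuck. */
    static String request(String base, String ip, int maxConnectionsPerHost) {
        PerIpKey key = new PerIpKey(base, ip);
        Semaphore permit = hostPermits.computeIfAbsent(
                base, b -> new Semaphore(maxConnectionsPerHost));
        if (permit.tryAcquire()) {
            // Permit granted: open a connection and register it under the per-IP key.
            String conn = "h2-conn-to-" + ip;
            h2Registry.put(key, conn);
            return conn;
        }
        // Permit exhausted: the only remaining option is reuse under the request's own key.
        return h2Registry.get(key);
    }
}
```

With a cap of 1, a request to IP_A gets a connection, and a following request to IP_B on the same host gets null: the permit is gone and the lookup under (base, IP_B) misses.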
Why the obvious fix does not work
Falling back to a poll on the per-host base key is a no-op: the H2 registry (ChannelManager.http2Connections) is an exact-key ConcurrentHashMap, and nothing is ever registered under the base key in round-robin mode. A one-line "poll the base key instead" swap would compile and change nothing.
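To make the no-op concrete, here is a small sketch of an exact-key map with a base-key fallback bolted on. The map name mirrors ChannelManager.http2Connections, but PerIpKey, poll and pollWithBaseFallback are illustrative assumptions rather than the real signatures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BaseKeyPollSketch {
    // Hypothetical stand-in for RoundRobinPartitionKey(base, IP).
    record PerIpKey(Object base, String ip) {}

    // Exact-key registry: a lookup hits only if the key equals a registered key.
    static final Map<Object, String> http2Connections = new ConcurrentHashMap<>();

    static String poll(Object partitionKey) {
        return http2Connections.get(partitionKey);
    }

    // The "obvious fix": try the request's own key, then fall back to the base key.
    static String pollWithBaseFallback(PerIpKey own) {
        String conn = poll(own);
        return conn != null ? conn : poll(own.base());
    }
}
```

If the only registration is under (base, IP_A), pollWithBaseFallback for (base, IP_B) still returns null. The base key is not equal to any per-IP key, so the second lookup can never hit.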
Real fix
For B to reuse A's connection, the registry must let a request find any open H2 connection for the host across its per-IP keys — i.e. index the H2 registry by base key (or scan sibling round-robin keys for the same base). This is a genuine change to registry indexing, not a trivial poll swap, and should be scoped/reviewed as such.
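One possible shape for such a registry, sketched under assumptions: a secondary index from base key to the per-IP keys registered for it. None of these names exist in the codebase. BaseIndexedRegistrySketch, PerIpKey and pollPinnedOrSibling are hypothetical, and connections are plain strings here instead of channels.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class BaseIndexedRegistrySketch {
    // Hypothetical stand-in for RoundRobinPartitionKey(base, IP).
    record PerIpKey(String base, String ip) {}

    private final Map<PerIpKey, String> byExactKey = new ConcurrentHashMap<>();
    // Secondary index: base key -> per-IP keys currently registered for that host.
    private final Map<String, Set<PerIpKey>> keysByBase = new ConcurrentHashMap<>();

    void register(PerIpKey key, String conn) {
        byExactKey.put(key, conn);
        keysByBase.computeIfAbsent(key.base(), b -> ConcurrentHashMap.newKeySet()).add(key);
    }

    void unregister(PerIpKey key) {
        byExactKey.remove(key);
        Set<PerIpKey> siblings = keysByBase.get(key.base());
        if (siblings != null) {
            siblings.remove(key);
        }
    }

    /** Prefers the pinned IP; otherwise returns any open connection for the same base. */
    String pollPinnedOrSibling(PerIpKey own) {
        String conn = byExactKey.get(own);
        if (conn != null) {
            return conn;
        }
        for (PerIpKey sibling : keysByBase.getOrDefault(own.base(), Set.of())) {
            String siblingConn = byExactKey.get(sibling);
            if (siblingConn != null) {
                return siblingConn;
            }
        }
        return null;
    }
}
```

In this sketch a request pinned to IP_B finds A's connection once A is registered, and finds nothing again after A is unregistered. A real change would also have to decide when falling back to a sibling is acceptable, since doing so relaxes the per-IP pinning.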
Notes
- The stall-then-fail on connectTimeout (off the event loop) is worth covering in any fix or test.
- Active health checks / failed-IP handling are tracked separately.