Detect stale server connections during client idle in transaction#144
Open
Add server socket monitoring and a client idle timeout to prevent pool slot exhaustion when clients abandon transactions or servers terminate backends. Three protection mechanisms:

1. Server socket monitoring via `tokio::select!` in the transaction loop: when PostgreSQL kills a backend (`idle_in_transaction_session_timeout`, `pg_terminate_backend`), pg_doorman detects it immediately through `server_readable()` and releases the pool slot.
2. New config option `client_idle_in_transaction_timeout` (default: 0/disabled): if a client holds a server connection in a transaction without sending data for longer than this timeout, pg_doorman closes the connection and frees the slot. Uses a 3-branch `select!` only when the timeout is > 0 and a 2-branch `select!` otherwise (no timer wheel overhead in the default config).
3. Fix a `CLIENTS_IN_TRANSACTIONS` counter leak on early returns from the transaction loop (the `Terminate` handler, client read errors, the new `select!` branches) — all paths now properly decrement the counter.

Performance: the `select!` runs only while the client is idle between queries in a transaction (already waiting on I/O), so the overhead is negligible. A `biased` select ensures client data is always checked first.
added 4 commits · March 3, 2026 15:30
After `execute_server_roundtrip`, `BufStream`'s `BufReader` may have drained all protocol data without reading the underlying socket until `WouldBlock`. This leaves a stale readiness flag on the raw socket, causing `server_readable()` to fire immediately in the `select!` — falsely detecting a "dead server" and resetting the client connection.

Fix: when `server_readable()` fires, verify with `try_read()` on the raw socket. If `WouldBlock` — spurious readiness, continue the loop. If EOF/data/error — genuine server event, handle as before. Add `StreamInner::try_read()` for non-blocking socket verification.
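The `WouldBlock` verification described in this commit can be modeled with a plain non-blocking socket probe. A hedged sketch: `verify_readiness` and `Probe` are hypothetical names, and it uses `std::net` with `peek` (so no buffered bytes are consumed) rather than the PR's tokio `try_read`.

```rust
use std::io::ErrorKind;
use std::net::TcpStream;

/// Outcome of probing the raw server socket after a readiness event
/// (hypothetical names; mirrors the decision table in the commit message).
#[derive(Debug, PartialEq)]
enum Probe {
    SpuriousReadiness,  // WouldBlock: stale readiness flag, keep looping
    DataPending(usize), // real protocol bytes are waiting
    Closed,             // EOF: the backend is gone
    Failed,             // genuine I/O error
}

/// Verify a readiness event on a socket already in non-blocking mode.
/// `peek` is used so the sketch never consumes bytes the protocol
/// parser still expects to read.
fn verify_readiness(sock: &TcpStream, buf: &mut [u8]) -> Probe {
    match sock.peek(buf) {
        Ok(0) => Probe::Closed,
        Ok(n) => Probe::DataPending(n),
        Err(e) if e.kind() == ErrorKind::WouldBlock => Probe::SpuriousReadiness,
        Err(_) => Probe::Failed,
    }
}
```

The key property is that a `WouldBlock` result is not an error here: it is the expected answer for a spurious wakeup, and only EOF or a hard error marks the server as dead.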
Keep only server socket monitoring via tokio::select! + server_readable(). The idle timeout feature can be added separately if needed.
The pg_terminate_backend scenario covers the same pg_doorman behavior more directly and without the 3s wait.
…essage

Replace the manual `CLIENTS_IN_TRANSACTIONS` `fetch_add`/`fetch_sub` calls (separated by ~200 lines with multiple early return paths) with a `TransactionGuard` that increments on creation and decrements on drop. Fixes counter leaks on early returns from `sync_parameters`, deferred `BEGIN`, and `write_all_flush`. Extract the inlined `select!` into a `wait_for_next_message` method returning a `NextClientMessage` enum. Encapsulate server liveness verification in `Server::check_server_alive()` — no more `server.stream.get_ref().try_read()` from `transaction.rs`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
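The guard described in this commit is the standard RAII idiom. A self-contained sketch with an illustrative global gauge; the names mirror, but are not taken from, pg_doorman's code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative stand-in for pg_doorman's global gauge.
static CLIENTS_IN_TRANSACTIONS: AtomicUsize = AtomicUsize::new(0);

/// RAII guard: increments the gauge on creation and decrements it on drop,
/// so every early `return` and `?` path decrements exactly once.
struct TransactionGuard;

impl TransactionGuard {
    fn new() -> Self {
        CLIENTS_IN_TRANSACTIONS.fetch_add(1, Ordering::Relaxed);
        TransactionGuard
    }
}

impl Drop for TransactionGuard {
    fn drop(&mut self) {
        CLIENTS_IN_TRANSACTIONS.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Toy transaction loop: one decrement site covers both the success path
/// and every early-return error path.
fn transaction_loop(fail: bool) -> Result<(), &'static str> {
    let _guard = TransactionGuard::new();
    if fail {
        return Err("client read error"); // guard still drops here
    }
    Ok(())
}
```

Compared with paired `fetch_add`/`fetch_sub` calls, the guard makes a leak structurally impossible: any new error path added later decrements automatically when `_guard` goes out of scope.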
```rust
match self.wait_for_next_message(server).await {
    Ok(NextClientMessage::Message(msg)) => msg,
    Ok(NextClientMessage::ServerDead) => {
        warn!(
```
Check failure (Code scanning / CodeQL): Cleartext logging of sensitive information (High)
Problem
A client (TLS/SSL) was holding a transaction through pg_doorman. The client connection dropped (cause unknown — crash, network, freeze), but pg_doorman did not detect it. The server connection to PostgreSQL remained in an "active/idle" state. PostgreSQL eventually killed the backend via `idle_in_transaction_session_timeout` (~6 min), but pg_doorman still considered the slot occupied. New clients received `QueryWaitTimeout`.

Root cause: in the transaction loop (`src/client/transaction.rs`), pg_doorman only reads from the client socket. When PostgreSQL killed the backend (sent FATAL + TCP FIN), pg_doorman didn't notice because the server socket was not monitored between queries.

Solution
1. Server socket monitoring via `select!`

In the transaction loop, where pg_doorman waits for the next client message, we now also monitor the server socket. This is encapsulated in `Client::wait_for_next_message()`. When PostgreSQL kills a backend (`idle_in_transaction_session_timeout`, `pg_terminate_backend`), pg_doorman immediately detects it and releases the pool slot.

2. Server liveness check encapsulated in `Server`

New methods on `Server`:
- `server_readable()` — async future for `select!`; waits until the socket becomes readable
- `check_server_alive()` — synchronous probe: `try_read` returning `WouldBlock` means alive, anything else (EOF/error) means dead

New methods on `StreamInner`:
- `readable()` — cancel-safe readiness notification on the underlying socket
- `try_read()` — non-blocking read for spurious readiness verification

No more `server.stream.get_ref().try_read()` leaking from `transaction.rs` — the liveness logic is fully encapsulated.

3. RAII guard for the `CLIENTS_IN_TRANSACTIONS` counter

The old code managed `CLIENTS_IN_TRANSACTIONS` with manual `fetch_add`/`fetch_sub` calls separated by ~200 lines and multiple early return paths. This leaked the counter on several error paths. Fix: `TransactionGuard` — increments on creation, decrements on drop. Counter leaks are fixed for free on these paths:
- `sync_parameters().await?` error
- `write_all_flush` error after a transaction
- Terminate (`X`) handler (existing bug — was missing the decrement)

4. Spurious readiness handling

`server_readable()` can fire spuriously due to stale epoll readiness flags from `BufStream` operations. `check_server_alive()` handles this: `try_read` returning `WouldBlock` clears the stale flag, and the next `readable()` poll blocks correctly.

Performance Analysis
The `tokio::select!` is on the idle path, not the hot path; it adds no extra `read()` syscall per query roundtrip.

Cancel safety

All futures in the `select!` are cancel-safe:
- `read_message()` — blocked on `read_u8()` (first byte); no bytes are consumed until that returns
- `server_readable()` — readiness notification only; no data consumed

BDD Test

`tests/bdd/features/stale-server-detection.feature`:
- `@stale-server-pg-terminate-backend` — backend killed via `pg_terminate_backend()` while a client holds a transaction with `pool_size = 1`. pg_doorman detects the dead server, releases the pool slot, and a new client successfully gets a connection.

Test plan
- `cargo build` compiles
- `cargo clippy` — no warnings
- `cargo test --lib` — 259 passed
- `@stale-server-detection` — 16 steps passed
- `grep CLIENTS_IN_TRANSACTIONS.fetch_sub` — only in `TransactionGuard::drop`
- `grep server.stream.get_ref()` in transaction.rs — zero occurrences

🤖 Generated with Claude Code