
Detect stale server connections during client idle in transaction#144

Open
vadv wants to merge 6 commits into `master` from `fix/detect-stale-client-in-transaction`

Conversation


@vadv vadv commented Mar 3, 2026

Problem

A client (TLS/SSL) was holding a transaction through pg_doorman. The client connection dropped (cause unknown — crash, network, freeze), but pg_doorman did not detect it. The server connection to PostgreSQL remained in "active/idle" state. PostgreSQL eventually killed the backend via idle_in_transaction_session_timeout (~6 min), but pg_doorman still considered the slot occupied. New clients received QueryWaitTimeout.

Root cause: In the transaction loop (src/client/transaction.rs), pg_doorman only reads from the client socket. When PostgreSQL killed the backend (sent FATAL + TCP FIN), pg_doorman didn't notice because the server socket was not monitored between queries.

Solution

1. Server socket monitoring via select!

In the transaction loop, where pg_doorman waits for the next client message, we now also monitor the server socket. This is encapsulated in Client::wait_for_next_message():

```rust
async fn wait_for_next_message(&mut self, server: &Server) -> Result<NextClientMessage, Error> {
    loop {
        tokio::select! {
            biased;
            result = read_message(&mut self.read, self.max_memory_usage) => {
                return result.map(NextClientMessage::Message);
            }
            _ = server.server_readable() => {
                if server.check_server_alive() {
                    continue; // spurious readiness, keep waiting
                }
                return Ok(NextClientMessage::ServerDead);
            }
        }
    }
}
```

When PostgreSQL kills a backend (idle_in_transaction_session_timeout, pg_terminate_backend), pg_doorman immediately detects it and releases the pool slot.
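On the caller side, the transaction loop branches on the returned enum. A simplified, self-contained sketch of that dispatch (everything beyond `NextClientMessage` is a hypothetical stand-in for the real code in transaction.rs):

```rust
enum NextClientMessage {
    Message(Vec<u8>),
    ServerDead,
}

#[derive(Debug, PartialEq)]
enum LoopAction {
    Process(usize), // message length, for illustration only
    ReleasePoolSlot,
    Disconnect,
}

// Hypothetical dispatch mirroring the caller-side match: a dead server
// now releases the pool slot instead of leaving it occupied until
// QueryWaitTimeout.
fn dispatch(next: Result<NextClientMessage, &'static str>) -> LoopAction {
    match next {
        Ok(NextClientMessage::Message(msg)) => LoopAction::Process(msg.len()),
        Ok(NextClientMessage::ServerDead) => LoopAction::ReleasePoolSlot,
        Err(_client_read_error) => LoopAction::Disconnect,
    }
}
```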

2. Server liveness check encapsulated in Server

New methods on Server:

  • server_readable() — async future for select!, waits until socket becomes readable
  • check_server_alive() — synchronous probe: try_read returning WouldBlock means alive, anything else (EOF/error) means dead

New methods on StreamInner:

  • readable() — cancel-safe readiness notification on the underlying socket
  • try_read() — non-blocking read for spurious readiness verification

No more server.stream.get_ref().try_read() leaking from transaction.rs — the liveness logic is fully encapsulated.

3. RAII guard for CLIENTS_IN_TRANSACTIONS counter

The old code managed CLIENTS_IN_TRANSACTIONS with manual fetch_add/fetch_sub calls separated by ~200 lines and multiple early return paths. This leaked the counter on several error paths.

Fix: TransactionGuard — increments on creation, decrements on drop:

```rust
struct TransactionGuard;

impl TransactionGuard {
    fn new() -> Self {
        CLIENTS_IN_TRANSACTIONS.fetch_add(1, Ordering::Relaxed);
        Self
    }
}

impl Drop for TransactionGuard {
    fn drop(&mut self) {
        CLIENTS_IN_TRANSACTIONS.fetch_sub(1, Ordering::Relaxed);
    }
}
```

Counter leaks fixed for free on these paths:

  • sync_parameters().await? error
  • Deferred BEGIN send/recv errors
  • write_all_flush error after transaction
  • Terminate (X) handler (existing bug — was missing decrement)
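The key property is that `Drop` runs on every exit path, including `?` early returns. A self-contained sketch (the guard definition repeats the one above for completeness; `transaction_scope`, its error string, and the `AtomicI64` counter type are hypothetical stand-ins for the real loop):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

static CLIENTS_IN_TRANSACTIONS: AtomicI64 = AtomicI64::new(0);

struct TransactionGuard;

impl TransactionGuard {
    fn new() -> Self {
        CLIENTS_IN_TRANSACTIONS.fetch_add(1, Ordering::Relaxed);
        Self
    }
}

impl Drop for TransactionGuard {
    fn drop(&mut self) {
        CLIENTS_IN_TRANSACTIONS.fetch_sub(1, Ordering::Relaxed);
    }
}

// Hypothetical transaction body: every exit path, including the early
// return, runs the guard's Drop and decrements the counter.
fn transaction_scope(fail_early: bool) -> Result<(), &'static str> {
    let _guard = TransactionGuard::new(); // increments
    if fail_early {
        return Err("sync_parameters failed"); // guard still decrements
    }
    Ok(())
} // guard decrements here on the happy path
```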

4. Spurious readiness handling

server_readable() can fire spuriously due to stale epoll readiness flags from BufStream operations. check_server_alive() handles this: try_read returning WouldBlock clears the stale flag and the next readable() poll blocks correctly.

Performance Analysis

The tokio::select! is on the idle path, not the hot path.

| Aspect | Impact |
| --- | --- |
| Hot path (query execution) | Zero: `select!` not involved |
| Idle path (waiting for client) | ~nanoseconds of poll overhead on an I/O-blocked path |
| Spurious readiness verification | One non-blocking `read()` syscall per query roundtrip |
| Memory | Zero additional allocations |

Cancel safety

All futures in the select! are cancel-safe:

  • read_message() — blocked on read_u8() (first byte), no bytes consumed until that returns
  • server_readable() — readiness notification only, no data consumed

BDD Test

tests/bdd/features/stale-server-detection.feature:

@stale-server-pg-terminate-backend — backend killed via pg_terminate_backend() while client holds transaction with pool_size = 1. pg_doorman detects dead server, releases pool slot, new client successfully gets connection.

Test plan

  • cargo build compiles
  • cargo clippy — no warnings
  • cargo test --lib — 259 passed
  • BDD @stale-server-detection — 16 steps passed
  • grep CLIENTS_IN_TRANSACTIONS.fetch_sub — only in TransactionGuard::drop
  • grep server.stream.get_ref() in transaction.rs — zero occurrences

🤖 Generated with Claude Code

Add server socket monitoring and client idle timeout to prevent pool slot
exhaustion when clients abandon transactions or servers terminate backends.

Three protection mechanisms:

1. Server socket monitoring via tokio::select! in transaction loop:
   when PostgreSQL kills a backend (idle_in_transaction_session_timeout,
   pg_terminate_backend), pg_doorman detects it immediately through
   server_readable() and releases the pool slot.

2. New config option client_idle_in_transaction_timeout (default: 0/disabled):
   if a client holds a server connection in a transaction without sending
   data for longer than this timeout, pg_doorman closes the connection
   and frees the slot. Uses 3-branch select! only when timeout > 0,
   2-branch select! otherwise (no timer wheel overhead in default config).

3. Fix CLIENTS_IN_TRANSACTIONS counter leak on early returns from
   transaction loop (Terminate handler, client read errors, new select
   branches) — all paths now properly decrement the counter.

Performance: select! runs only when client is idle between queries in
a transaction (already waiting on I/O), so overhead is negligible.
biased select ensures client data is always checked first.
dmitrivasilyev added 4 commits March 3, 2026 15:30

After execute_server_roundtrip, BufStream's BufReader may have drained
all protocol data without reading the underlying socket until WouldBlock.
This leaves a stale readiness flag on the raw socket, causing
server_readable() to fire immediately in the select!, falsely detecting
a "dead server" and resetting the client connection.

Fix: when server_readable() fires, verify with try_read() on the raw
socket. If WouldBlock, the readiness was spurious; continue the loop.
If EOF/data/error, it is a genuine server event; handle as before.

Add StreamInner::try_read() for non-blocking socket verification.

Keep only server socket monitoring via tokio::select! + server_readable().
The idle timeout feature can be added separately if needed.
The pg_terminate_backend scenario covers the same pg_doorman behavior
more directly and without the 3s wait.
…essage

Replace manual CLIENTS_IN_TRANSACTIONS fetch_add/fetch_sub (separated by
~200 lines with multiple early return paths) with TransactionGuard that
increments on creation and decrements on drop. Fixes counter leaks on
early returns from sync_parameters, deferred BEGIN, and write_all_flush.

Extract inlined select! into wait_for_next_message method returning
NextClientMessage enum. Encapsulate server liveness verification in
Server::check_server_alive() — no more server.stream.get_ref().try_read()
from transaction.rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Check failure

Code scanning / CodeQL

Cleartext logging of sensitive information (High)

This operation writes wrong_password(...) to a log file.

(The same alert repeats for further call sites logging read_password(...), wrong_password(...), and plain_password_challenge(...).)

