Critical database connection issues under high alert load
Description
Keep backend has multiple critical database connection issues that make it unusable in high-throughput production environments (1000+ alerts/minute). These issues occur both with direct PostgreSQL connections AND through PgBouncer.
Environment
- Keep version: 0.48.0
- PostgreSQL: 15.x (direct connection, VM)
- Python: 3.13
- psycopg2: latest
- SQLAlchemy: 2.0.36
- Deployment: 1 pod with 4 gunicorn workers + 4 ARQ workers
Configuration tested
```yaml
DATABASE_POOL_SIZE: "20"
DATABASE_MAX_OVERFLOW: "40"
KEEP_DB_PRE_PING_ENABLED: "true"
```
Issue 1: Connection pool exhaustion (Direct PostgreSQL)
Symptoms
Under moderate load (~50 alerts/second), the connection pool becomes exhausted within minutes:
```
sqlalchemy.exc.TimeoutError: QueuePool limit of size 20 overflow 40 reached,
connection timed out, timeout 30.00
```
Analysis
Connections are acquired but never returned to the pool. Even with pool_pre_ping=True, dead connections accumulate. PostgreSQL shows connections in idle state that are never reused by the application.
Expected: 60 connections (20 + 40 overflow) should be recycled
Actual: All 60 connections become "stuck" and new requests time out
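For comparison, the sketch below shows the checkout pattern that keeps a QueuePool healthy under this configuration. It is illustrative only (the DSN and `handle_alert()` are placeholders, not Keep's actual code): a connection is returned to the pool only when the session is closed, so any code path that opens a session and never reaches `close()` on an exception pins the connection until the worker dies.

```python
# Minimal sketch, assuming a SQLAlchemy 2.x engine built with the pool settings
# from this report. The DSN and handle_alert() are illustrative placeholders.
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine(
    "postgresql+psycopg2://keep:keep@db:5432/keep",  # placeholder DSN
    pool_size=20,
    max_overflow=40,
    pool_timeout=30,
    pool_pre_ping=True,
)

def handle_alert(payload: dict) -> None:
    # The context managers guarantee close()/rollback() even when the insert
    # raises, which is what actually returns the connection to the QueuePool.
    # A session that is opened but never closed keeps its connection checked
    # out forever -- the "stuck" connections described above.
    with Session(engine) as session, session.begin():
        session.execute(text("SELECT 1"))  # stand-in for the real alert write
```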
Issue 2: Deferred loader failure after INSERT
Error
```
KeyError: "Deferred loader for attribute 'id' failed to populate correctly"
File "keep/api/core/db.py", line 2716, in create_deduplication_event
"deduplication_event_id": deduplication_event.id,
```
Root cause
In db.py:2716, code accesses deduplication_event.id immediately after session.add() but before session.flush() or session.commit(). The ORM hasn't populated the auto-generated ID yet.
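The ordering problem is easy to demonstrate outside Keep with a throwaway model (the class below is a hypothetical stand-in, not Keep's schema): the auto-generated primary key only exists once the pending INSERT has been flushed.

```python
# Hypothetical stand-in model, not Keep's schema, showing that an autoincrement
# primary key is only populated after the pending INSERT is flushed.
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class DedupEventDemo(Base):
    __tablename__ = "dedup_event_demo"
    id: Mapped[int] = mapped_column(primary_key=True)
    fingerprint: Mapped[str]

engine = create_engine("sqlite://")  # in-memory DB, just for the demonstration
Base.metadata.create_all(engine)

with Session(engine) as session:
    event = DedupEventDemo(fingerprint="abc")
    session.add(event)
    print(event.id)   # None -- no INSERT has been emitted yet
    session.flush()   # emits the INSERT and loads the generated id back
    print(event.id)   # 1
    session.commit()
```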
Fix required
```python
# Current (broken):
session.add(deduplication_event)
return {"deduplication_event_id": deduplication_event.id}  # ID not populated!

# Should be:
session.add(deduplication_event)
session.flush()  # Force INSERT to get ID
return {"deduplication_event_id": deduplication_event.id}
```
Issue 3: PgBouncer incompatibility
When using PgBouncer (even in session mode), psycopg2 produces protocol-level errors:
```
psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq
sqlalchemy.exc.NoSuchColumnError: Could not locate column in row for column 'workflow.id'
IndexError: tuple index out of range
```
PgBouncer config tested
```ini
pool_mode = session
server_reset_query = DISCARD ALL
ignore_startup_parameters = extra_float_digits
```
Result: Still fails. psycopg2's binary protocol has a fundamental incompatibility with PgBouncer.
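One workaround often suggested for SQLAlchemy behind an external pooler, though not verified against Keep here, is to hand pooling entirely to PgBouncer and disable SQLAlchemy's own QueuePool:

```python
# Sketch only: let PgBouncer own connection pooling and disable SQLAlchemy's
# QueuePool via NullPool. The DSN is a placeholder pointing at PgBouncer.
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

engine = create_engine(
    "postgresql+psycopg2://keep:keep@pgbouncer:6432/keep",
    poolclass=NullPool,  # each checkout opens a fresh connection that PgBouncer multiplexes
)
```

Whether this avoids the PGRES_TUPLES_OK errors above would need to be confirmed, but it at least removes the double layer of pooling.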
Issue 4: Stale connections not detected
```
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
```
PostgreSQL closes idle connections (via idle_session_timeout), but the SQLAlchemy pool does not detect this despite pool_pre_ping=True.
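If the server-side timeout cannot be raised, two engine-level mitigations are commonly combined, sketched below with illustrative values: recycling pooled connections before idle_session_timeout fires, and enabling libpq TCP keepalives through psycopg2's connect arguments.

```python
# Sketch with illustrative values: recycle pooled connections well before the
# server's idle_session_timeout, and pass standard libpq keepalive parameters
# through psycopg2 so half-dead TCP connections are noticed earlier.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://keep:keep@db:5432/keep",  # placeholder DSN
    pool_pre_ping=True,
    pool_recycle=300,  # seconds; keep this below idle_session_timeout
    connect_args={
        "keepalives": 1,
        "keepalives_idle": 30,
        "keepalives_interval": 10,
        "keepalives_count": 3,
    },
)
```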
Summary table
| Setup | Result | Primary error |
|---|---|---|
| Direct PostgreSQL | ❌ Fails | Pool exhaustion + deferred loader |
| Direct + pre_ping | ❌ Fails | Same issues |
| PgBouncer session mode | ❌ Fails | Protocol corruption |
| PgBouncer + DISCARD ALL | ❌ Fails | Same protocol errors |
This blocks enterprise adoption where connection pooling is mandatory.