[πŸ› Bug]: Critical database connection issues under high alert load #5496

@Viste

Description

The Keep backend has multiple critical database connection issues that make it unusable in high-throughput production environments (1000+ alerts/minute). These issues occur both with direct PostgreSQL connections and through PgBouncer.

Environment

  • Keep version: 0.48.0
  • PostgreSQL: 15.x (direct connection, VM)
  • Python: 3.13
  • psycopg2: latest
  • SQLAlchemy: 2.0.36
  • Deployment: 1 pods, 4 gunicorn workers + 4 ARQ workers each

Configuration tested

DATABASE_POOL_SIZE: "20"
DATABASE_MAX_OVERFLOW: "40"
KEEP_DB_PRE_PING_ENABLED: "true"
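
For orientation, these variables presumably map onto the SQLAlchemy engine roughly as follows. This is a minimal sketch for illustration only; the connection-string variable name and the exact wiring inside keep/api/core/db.py are assumptions, not Keep's actual code.

import os
from sqlalchemy import create_engine

# Sketch: how the tested settings would land on a SQLAlchemy engine.
engine = create_engine(
    os.environ["DATABASE_CONNECTION_STRING"],  # hypothetical variable name
    pool_size=int(os.getenv("DATABASE_POOL_SIZE", "20")),
    max_overflow=int(os.getenv("DATABASE_MAX_OVERFLOW", "40")),
    pool_pre_ping=os.getenv("KEEP_DB_PRE_PING_ENABLED", "false") == "true",
    pool_timeout=30,  # matches the "timeout 30.00" in the error below
)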

Issue 1: Connection pool exhaustion (Direct PostgreSQL)

Symptoms

Under moderate load (~50 alerts/second), the connection pool becomes exhausted within minutes:

sqlalchemy.exc.TimeoutError: QueuePool limit of size 20 overflow 40 reached, 
connection timed out, timeout 30.00

Analysis

Connections are acquired but never returned to the pool. Even with pool_pre_ping=True, dead connections accumulate, and PostgreSQL shows connections sitting in the idle state that the application never reuses.

Expected: the 60 connections (20 pool + 40 overflow) should be returned and recycled
Actual: all 60 connections become "stuck" and new requests time out
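
One way to confirm this from the database side is to count the backend's connections by state: the "idle" bucket grows while the application still reports pool exhaustion. A minimal sketch using plain pg_stat_activity (placeholder DSN, nothing Keep-specific):

from sqlalchemy import create_engine, text

# Sketch: group this database's connections by state to show idle sessions piling up.
engine = create_engine("postgresql+psycopg2://keep:***@db:5432/keep")  # placeholder DSN
with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT state, count(*) FROM pg_stat_activity "
        "WHERE datname = current_database() GROUP BY state"
    ))
    for state, count in rows:
        print(state, count)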


Issue 2: Deferred loader failure after INSERT

Error

KeyError: "Deferred loader for attribute 'id' failed to populate correctly"

File "keep/api/core/db.py", line 2716, in create_deduplication_event
    "deduplication_event_id": deduplication_event.id,

Root cause

In db.py:2716, the code accesses deduplication_event.id immediately after session.add() but before session.flush() or session.commit(), so the ORM has not yet populated the auto-generated ID.

Fix required

# Current (broken):
session.add(deduplication_event)
return {"deduplication_event_id": deduplication_event.id}  # ID not populated!

# Should be:
session.add(deduplication_event)
session.flush()  # Force INSERT to get ID
return {"deduplication_event_id": deduplication_event.id}

Issue 3: PgBouncer incompatibility

When using PgBouncer (even in session mode), psycopg2 produces protocol-level errors:

psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq
sqlalchemy.exc.NoSuchColumnError: Could not locate column in row for column 'workflow.id'
IndexError: tuple index out of range

PgBouncer config tested

pool_mode = session
server_reset_query = DISCARD ALL
ignore_startup_parameters = extra_float_digits

Result: still fails. The psycopg2 binary protocol appears to be fundamentally incompatible with PgBouncer in this setup.
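
For comparison, the configuration usually recommended when an external pooler owns the connections is to disable SQLAlchemy's own pooling entirely. This is a sketch only and is not claimed to resolve the protocol errors above; the DSN is a placeholder:

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Sketch: let PgBouncer do the pooling and have SQLAlchemy open/close a real
# connection per checkout, so the two poolers don't fight over connection state.
engine = create_engine(
    "postgresql+psycopg2://keep:***@pgbouncer:6432/keep",  # placeholder DSN
    poolclass=NullPool,
)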


Issue 4: Stale connections not detected

psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.

PostgreSQL closes idle connections (via idle_session_timeout), but the SQLAlchemy pool does not detect this despite pool_pre_ping=True.
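
A common mitigation is to set pool_recycle below the server's idle_session_timeout so the pool proactively replaces aging connections instead of relying on pre-ping alone. A sketch, assuming a 300-second server-side timeout; the DSN is a placeholder:

from sqlalchemy import create_engine

# Sketch: recycle pooled connections before the (assumed) 300s idle_session_timeout
# can kill them server-side, in addition to the pre-ping liveness check.
engine = create_engine(
    "postgresql+psycopg2://keep:***@db:5432/keep",  # placeholder DSN
    pool_pre_ping=True,
    pool_recycle=240,  # seconds; keep this below idle_session_timeout
)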


Summary table

| Setup | Result | Primary error |
| --- | --- | --- |
| Direct PostgreSQL | ❌ Fails | Pool exhaustion + deferred loader |
| Direct + pre_ping | ❌ Fails | Same issues |
| PgBouncer session mode | ❌ Fails | Protocol corruption |
| PgBouncer + DISCARD ALL | ❌ Fails | Same protocol errors |

This blocks enterprise adoption where connection pooling is mandatory.
