[πŸ› Bug]: Critical database connection issues under high alert load #5496

@Viste

Description

The Keep backend has multiple critical database connection issues that make it unusable in high-throughput production environments (1000+ alerts/minute). These issues occur both with direct PostgreSQL connections and through PgBouncer.

Environment

  • Keep version: 0.48.0
  • PostgreSQL: 15.x (direct connection, VM)
  • Python: 3.13
  • psycopg2: latest
  • SQLAlchemy: 2.0.36
  • Deployment: 1 pods, 4 gunicorn workers + 4 ARQ workers each

Configuration tested

DATABASE_POOL_SIZE: "20"
DATABASE_MAX_OVERFLOW: "40"
KEEP_DB_PRE_PING_ENABLED: "true"
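
For orientation, these variables presumably map onto the SQLAlchemy engine roughly as follows. This is a minimal sketch for illustration only; the connection-string variable name and the exact wiring inside keep/api/core/db.py are assumptions, not Keep's actual code.

import os
from sqlalchemy import create_engine

# Sketch: how the tested settings would land on a SQLAlchemy engine.
engine = create_engine(
    os.environ["DATABASE_CONNECTION_STRING"],  # hypothetical variable name
    pool_size=int(os.getenv("DATABASE_POOL_SIZE", "20")),
    max_overflow=int(os.getenv("DATABASE_MAX_OVERFLOW", "40")),
    pool_pre_ping=os.getenv("KEEP_DB_PRE_PING_ENABLED", "false") == "true",
    pool_timeout=30,  # matches the "timeout 30.00" in the error below
)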

Issue 1: Connection pool exhaustion (Direct PostgreSQL)

Symptoms

Under moderate load (~50 alerts/second), the connection pool becomes exhausted within minutes:

sqlalchemy.exc.TimeoutError: QueuePool limit of size 20 overflow 40 reached, 
connection timed out, timeout 30.00

Analysis

Connections are acquired but never returned to the pool. Even with pool_pre_ping=True, dead connections accumulate, and PostgreSQL shows connections sitting in the idle state that the application never reuses.

Expected: the 60 connections (20 pool + 40 overflow) should be returned and recycled
Actual: all 60 connections become "stuck" and new requests time out
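
One way to confirm this from the database side is to count the backend's connections by state: the "idle" bucket grows while the application still reports pool exhaustion. A minimal sketch using plain pg_stat_activity (placeholder DSN, nothing Keep-specific):

from sqlalchemy import create_engine, text

# Sketch: group this database's connections by state to show idle sessions piling up.
engine = create_engine("postgresql+psycopg2://keep:***@db:5432/keep")  # placeholder DSN
with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT state, count(*) FROM pg_stat_activity "
        "WHERE datname = current_database() GROUP BY state"
    ))
    for state, count in rows:
        print(state, count)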


Issue 2: Deferred loader failure after INSERT

Error

KeyError: "Deferred loader for attribute 'id' failed to populate correctly"

File "keep/api/core/db.py", line 2716, in create_deduplication_event
    "deduplication_event_id": deduplication_event.id,

Root cause

In db.py:2716, the code accesses deduplication_event.id immediately after session.add() but before session.flush() or session.commit(), so the ORM has not yet populated the auto-generated ID.

Fix required

# Current (broken):
session.add(deduplication_event)
return {"deduplication_event_id": deduplication_event.id}  # ID not populated!

# Should be:
session.add(deduplication_event)
session.flush()  # Force INSERT to get ID
return {"deduplication_event_id": deduplication_event.id}

Issue 3: PgBouncer incompatibility

When using PgBouncer (even in session mode), psycopg2 produces protocol-level errors:

psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq
sqlalchemy.exc.NoSuchColumnError: Could not locate column in row for column 'workflow.id'
IndexError: tuple index out of range

PgBouncer config tested

pool_mode = session
server_reset_query = DISCARD ALL
ignore_startup_parameters = extra_float_digits

Result: still fails. The psycopg2 binary protocol appears to be fundamentally incompatible with PgBouncer in this setup.
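
For comparison, the configuration usually recommended when an external pooler owns the connections is to disable SQLAlchemy's own pooling entirely. This is a sketch only and is not claimed to resolve the protocol errors above; the DSN is a placeholder:

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Sketch: let PgBouncer do the pooling and have SQLAlchemy open/close a real
# connection per checkout, so the two poolers don't fight over connection state.
engine = create_engine(
    "postgresql+psycopg2://keep:***@pgbouncer:6432/keep",  # placeholder DSN
    poolclass=NullPool,
)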


Issue 4: Stale connections not detected

psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.

PostgreSQL closes idle connections (via idle_session_timeout), but the SQLAlchemy pool does not detect this despite pool_pre_ping=True.
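
A common mitigation is to set pool_recycle below the server's idle_session_timeout so the pool proactively replaces aging connections instead of relying on pre-ping alone. A sketch, assuming a 300-second server-side timeout; the DSN is a placeholder:

from sqlalchemy import create_engine

# Sketch: recycle pooled connections before the (assumed) 300s idle_session_timeout
# can kill them server-side, in addition to the pre-ping liveness check.
engine = create_engine(
    "postgresql+psycopg2://keep:***@db:5432/keep",  # placeholder DSN
    pool_pre_ping=True,
    pool_recycle=240,  # seconds; keep this below idle_session_timeout
)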


Summary table

| Setup | Result | Primary error |
| --- | --- | --- |
| Direct PostgreSQL | ❌ Fails | Pool exhaustion + deferred loader |
| Direct + pre_ping | ❌ Fails | Same issues |
| PgBouncer session mode | ❌ Fails | Protocol corruption |
| PgBouncer + DISCARD ALL | ❌ Fails | Same protocol errors |

This blocks enterprise adoption where connection pooling is mandatory.
