Description
Context - why is this issue relevant?
When running the API with multiple Gunicorn workers (--workers N), each worker independently runs the FastAPI lifespan() context manager on startup. This means bootstrap_default_admin() is called N times simultaneously. All workers query has_admin_user() at roughly the same time, all get False, and then all race to insert the default admin role and user. Only one succeeds, the rest hit a PostgreSQL IntegrityError which propagates as a RuntimeError, crashing the worker. In practice, starting the API in multi-worker mode is broken.
The root cause is a TOCTOU (Time-of-Check to Time-of-Use) race condition: the check and the write are not atomic, and each worker is a fully isolated OS process with no awareness of the others (see https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use).
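The race can be reproduced deterministically in a few lines. The in-memory "database" and helper names below are simplified stand-ins for the real tables and use case, not the project's API:

```python
# Deterministic simulation of the TOCTOU race: both "workers" pass the
# check before either one writes, so the second insert violates uniqueness.
admin_users: set[str] = set()

def has_admin_user() -> bool:
    return bool(admin_users)

def insert_admin(name: str) -> None:
    if name in admin_users:  # stand-in for the UNIQUE constraint
        raise RuntimeError("IntegrityError: duplicate admin")
    admin_users.add(name)

# The race window: both workers check before either writes.
worker_a_sees = has_admin_user()  # False
worker_b_sees = has_admin_user()  # False

insert_admin("admin")             # worker A wins
try:
    insert_admin("admin")         # worker B hits the IntegrityError
except RuntimeError as exc:
    print(exc)                    # in the real app this crashes the worker
```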
Objective – what is the aim of this issue?
Ensure bootstrap_default_admin() runs exactly once at startup, regardless of the number of workers configured, so that the API starts reliably in multi-worker mode, or make this process non-blocking.
Expected outcomes – what do we expect at the end of this issue (concrete outcomes)?
Outcomes
- The API starts successfully and without errors when configured with multiple Gunicorn workers.
- The default admin (role and user) is created exactly once, even under concurrent startup.
- If the admin already exists (e.g. container restart), bootstrap is skipped gracefully.
Acceptance criteria
- Starting the API with `--workers 5` (or any `N > 1`) creates exactly one default admin user and role, with no errors or crashes.
- Starting the API with `--workers 1` or without the flag still works correctly.
- If the admin already exists at startup, bootstrap is skipped with a log message and no error.
Solutions to explore
Option A: Absorb conflicts silently in lifespan.py
In lifespan.py, RoleAlreadyExistsError and UserAlreadyExistsError currently raise a RuntimeError. Since these errors during bootstrap mean another worker already succeeded, they could instead be treated as BootstrapAdminUseCaseSkipped and logged as a warning.
- Pros: One-line change, no new dependencies, no architectural changes.
- Cons: Does not prevent the race, only survives it. All workers still attempt creation and only one fully succeeds. Somewhat noisy (multiple workers erroring and recovering is not clean). Does not address the root cause.
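A minimal sketch of this option. The exception classes and bootstrap coroutine are stand-ins named after those mentioned in the issue; the real lifespan wiring is assumed:

```python
# Option A sketch: treat "already exists" errors during bootstrap as a
# benign skip, since they mean another worker already succeeded.
import logging

logger = logging.getLogger(__name__)

class RoleAlreadyExistsError(Exception): ...
class UserAlreadyExistsError(Exception): ...

async def bootstrap_default_admin() -> None:
    # Stand-in: the real use case inserts the default role and user.
    raise UserAlreadyExistsError("default admin already created")

async def safe_bootstrap() -> None:
    try:
        await bootstrap_default_admin()
    except (RoleAlreadyExistsError, UserAlreadyExistsError) as exc:
        # Another worker won the race: log and continue instead of
        # re-raising as RuntimeError and crashing the worker.
        logger.warning("Bootstrap skipped, admin already exists: %s", exc)
```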
Option B: when_ready Gunicorn hook
Gunicorn has a when_ready(server) hook that runs in the master process after it has fully initialized but before any worker is forked. scripts/gunicorn.conf.py already uses the child_exit hook. Bootstrap could be added to when_ready, and removed from lifespan().
- Pros: Runs exactly once by design. No locking mechanism needed. Uses existing Gunicorn infrastructure. Clean separation: bootstrap is a deployment concern, not a per-worker runtime concern.
- Cons: The hook is synchronous: `bootstrap_default_admin()` is async, so it requires wrapping in `asyncio.run()`. Only works when Gunicorn is the process manager (not uvicorn standalone, not tests). The master process does not share any initialized state with workers. This logic is not handled directly in the app logic/codebase.
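A sketch of what this could look like in scripts/gunicorn.conf.py. The bootstrap coroutine here is a stand-in; the real hook would import the actual use case from the application:

```python
# Option B sketch for a Gunicorn config file. BOOTSTRAPPED and the
# stand-in coroutine exist only to make the structure visible.
import asyncio

BOOTSTRAPPED = {"done": False}

async def bootstrap_default_admin() -> None:
    # Stand-in for the real async use case; replace with the actual import.
    BOOTSTRAPPED["done"] = True

def when_ready(server):
    """Gunicorn hook: runs once in the master, before any worker is forked."""
    # The hook is synchronous, so the async bootstrap must be driven here:
    asyncio.run(bootstrap_default_admin())
```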
Option C: Pre-flight script in startup_api.sh
Bootstrap could be extracted into a standalone Python script and called in startup_api.sh before gunicorn is launched, similarly to how `alembic upgrade head` is already run:

```sh
python -m alembic -c api/alembic.ini upgrade head
python -m api.scripts.bootstrap_admin  # new
exec gunicorn ...
```
- Pros: Completely decoupled from the application runtime. Runs in a single process with no concurrency. Easy to understand and test in isolation. Works regardless of how the app is later served (Gunicorn, uvicorn, etc.).
- Cons: Requires creating a new script and modifying the startup script. Bootstrap is no longer self-contained within the application. Operators and developers deploying differently (e.g. via uvicorn directly) must remember to run it separately.
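A hypothetical shape for the new script (the module path api/scripts/bootstrap_admin.py and the stand-in coroutine are assumptions; the real module would import the use case and its database session setup from the application):

```python
# Option C sketch: a standalone entrypoint that runs bootstrap once,
# before any server process exists.
import asyncio
import sys

async def bootstrap_default_admin() -> None:
    # Stand-in: the real function creates the default role and user,
    # skipping gracefully if they already exist.
    print("default admin ensured")

def main() -> int:
    try:
        asyncio.run(bootstrap_default_admin())
    except Exception as exc:  # non-zero exit aborts the container start
        print(f"bootstrap failed: {exc}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```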
Option D: PostgreSQL advisory lock
PostgreSQL supports session-level advisory locks (pg_try_advisory_lock) that are atomic and scoped to the database connection. The first worker to acquire the lock runs bootstrap; the others detect the lock is taken and skip. The lock is automatically released when the session closes, even on crash.
- Pros: Fully atomic at the DB level. No external dependencies beyond PostgreSQL, which is already required. Works regardless of how many workers or replicas are running. An info log can be added when detected.
- Cons: Adds non-trivial plumbing to `bootstrap_default_admin()`. Workers that do not acquire the lock skip immediately: they do not wait, so there is a small window where a worker starts serving traffic before bootstrap is fully complete (though in practice this is negligible since the lifespan blocks the server from accepting requests).
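A sketch of this option, assuming an asyncpg-style async connection (`fetchval`/`execute`); the lock key is an arbitrary constant that all workers must share:

```python
# Option D sketch: only the worker that wins the advisory lock runs
# bootstrap; the others skip immediately.
BOOTSTRAP_LOCK_KEY = 724_001  # any stable application-chosen 64-bit key

async def bootstrap_with_advisory_lock(conn, bootstrap) -> bool:
    """Run `bootstrap` only if this process wins the advisory lock.

    Returns True if bootstrap ran here, False if another worker held the lock.
    """
    got_lock = await conn.fetchval(
        "SELECT pg_try_advisory_lock($1)", BOOTSTRAP_LOCK_KEY
    )
    if not got_lock:
        return False  # another worker is (or was) bootstrapping; skip
    try:
        await bootstrap()
        return True
    finally:
        # Release explicitly; closing the session would also release it,
        # even if the worker crashes mid-bootstrap.
        await conn.execute("SELECT pg_advisory_unlock($1)", BOOTSTRAP_LOCK_KEY)
```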
Option E: Redis distributed lock
Redis is already a required dependency. A distributed lock with a TTL (e.g. `SET bootstrap_lock 1 NX EX 30`) ensures only one worker runs bootstrap. The TTL prevents deadlocks if a worker crashes while holding the lock.
- Pros: Clean distributed lock with automatic expiry. Redis is already initialized at this point in the lifespan. Works across multiple replicas/pods (unlike Gunicorn hooks which are single-host).
- Cons: Slightly more complex than an advisory lock. Requires passing the Redis pool into `bootstrap_default_admin()`. Introduces a dependency on Redis being reachable before bootstrap can run (already the case in practice).
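A sketch of this option, assuming a redis.asyncio-style client; the key name and TTL are illustrative:

```python
# Option E sketch: SET ... NX EX is atomic, so exactly one worker gets
# the lock; the TTL guarantees eventual release even if that worker dies.
BOOTSTRAP_LOCK_KEY = "bootstrap_lock"
BOOTSTRAP_LOCK_TTL = 30  # seconds

async def bootstrap_with_redis_lock(redis, bootstrap) -> bool:
    """Run `bootstrap` only if this process acquires the Redis lock."""
    acquired = await redis.set(
        BOOTSTRAP_LOCK_KEY, "1", nx=True, ex=BOOTSTRAP_LOCK_TTL
    )
    if not acquired:
        return False  # another worker holds the lock; skip
    await bootstrap()
    return True
```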
Option F: Write a flag file
After a successful bootstrap, write a marker file (e.g. /tmp/bootstrap_done). Workers check for the file before attempting bootstrap.
- Pros: No external dependencies. Simple to implement.
- Cons: The file is local to the host: it does not work across multiple replicas/pods. The race condition still exists between the file check and the write (two workers can both miss the file and both attempt creation). The file persists across container restarts only if it is on a mounted volume; otherwise it disappears and bootstrap re-runs on every restart (which is fine since `has_admin_user()` guards it, but makes the flag pointless). Fragile in general, and not compatible with a future Kubernetes deployment.
Option G: Environment variable
Set an environment variable (e.g. BOOTSTRAP_DONE=1) after the first worker completes bootstrap. Subsequent workers check for it before attempting.
- Pros: Zero dependencies, trivially simple.
- Cons: Environment variables cannot be set in a parent process from a child process. A worker cannot modify the environment of other workers or the master. This fundamentally does not work in a multi-process model. Could work if set externally (e.g. in the startup script after a pre-flight bootstrap script), but then it's just a worse version of Option C.