Description
Context - why is this issue relevant?
When running the API with multiple Gunicorn workers (--workers N), each worker independently runs the FastAPI lifespan() context manager on startup. This means bootstrap_default_admin() is called N times simultaneously. All workers query has_admin_user() at roughly the same time, all get False, and then all race to insert the default admin role and user. Only one succeeds, the rest hit a PostgreSQL IntegrityError which propagates as a RuntimeError, crashing the worker. In practice, starting the API in multi-worker mode is broken.
The root cause is a TOCTOU (Time-of-Check to Time-of-Use) race condition: the check and the write are not atomic, and each worker is a fully isolated OS process with no awareness of the others (see https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use).
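The race can be reproduced deterministically in a few lines. The in-memory "database" and helper names below are simplified stand-ins for the real tables and use case, not the project's API:

```python
# Deterministic simulation of the TOCTOU race: both "workers" pass the
# check before either one writes, so the second insert violates uniqueness.
admin_users: set[str] = set()

def has_admin_user() -> bool:
    return bool(admin_users)

def insert_admin(name: str) -> None:
    if name in admin_users:  # stand-in for the UNIQUE constraint
        raise RuntimeError("IntegrityError: duplicate admin")
    admin_users.add(name)

# The race window: both workers check before either writes.
worker_a_sees = has_admin_user()  # False
worker_b_sees = has_admin_user()  # False

insert_admin("admin")             # worker A wins
try:
    insert_admin("admin")         # worker B hits the IntegrityError
except RuntimeError as exc:
    print(exc)                    # in the real app this crashes the worker
```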
Objective – what is the aim of this issue?
Ensure bootstrap_default_admin() runs exactly once at startup, regardless of the number of workers configured, so that the API starts reliably in multi-worker mode, or make this process non-blocking.
Expected outcomes – what do we expect at the end of this issue (concrete outcomes)?
Outcomes
- The API starts successfully and without errors when configured with multiple Gunicorn workers.
- The default admin (role and user) is created exactly once, even under concurrent startup.
- If the admin already exists (e.g. container restart), bootstrap is skipped gracefully.
Acceptance criteria
- Starting the API with `--workers 5` (or any `N > 1`) creates exactly one default admin user and role, with no errors or crashes.
- Starting the API with `--workers 1` or without the flag still works correctly.
- If the admin already exists at startup, bootstrap is skipped with a log message and no error.
Solutions to explore
Option A: Absorb conflicts silently in lifespan.py
In lifespan.py, RoleAlreadyExistsError and UserAlreadyExistsError currently raise a RuntimeError. Since these errors during bootstrap mean another worker already succeeded, they could instead be treated as BootstrapAdminUseCaseSkipped and logged as a warning.
- Pros: One-line change, no new dependencies, no architectural changes.
- Cons: Does not prevent the race, only survives it. All workers still attempt creation and only one fully succeeds. Somewhat noisy (multiple workers erroring and recovering is not clean). Does not address the root cause.
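A minimal sketch of this option. The exception classes and bootstrap coroutine are stand-ins named after those mentioned in the issue; the real lifespan wiring is assumed:

```python
# Option A sketch: treat "already exists" errors during bootstrap as a
# benign skip, since they mean another worker already succeeded.
import logging

logger = logging.getLogger(__name__)

class RoleAlreadyExistsError(Exception): ...
class UserAlreadyExistsError(Exception): ...

async def bootstrap_default_admin() -> None:
    # Stand-in: the real use case inserts the default role and user.
    raise UserAlreadyExistsError("default admin already created")

async def safe_bootstrap() -> None:
    try:
        await bootstrap_default_admin()
    except (RoleAlreadyExistsError, UserAlreadyExistsError) as exc:
        # Another worker won the race: log and continue instead of
        # re-raising as RuntimeError and crashing the worker.
        logger.warning("Bootstrap skipped, admin already exists: %s", exc)
```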
Option B: when_ready Gunicorn hook
Gunicorn has a when_ready(server) hook that runs in the master process after it has fully initialized but before any worker is forked. scripts/gunicorn.conf.py already uses the child_exit hook. Bootstrap could be added to when_ready, and removed from lifespan().
- Pros: Runs exactly once by design. No locking mechanism needed. Uses existing Gunicorn infrastructure. Clean separation: bootstrap is a deployment concern, not a per-worker runtime concern.
- Cons: The hook is synchronous: `bootstrap_default_admin()` is async, so it requires wrapping in `asyncio.run()`. Only works when Gunicorn is the process manager (not uvicorn standalone, not tests). The master process does not share any initialized state with workers. This logic is not handled directly in the app logic/codebase.
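A sketch of what this could look like in scripts/gunicorn.conf.py. The bootstrap coroutine here is a stand-in; the real hook would import the actual use case from the application:

```python
# Option B sketch for a Gunicorn config file. BOOTSTRAPPED and the
# stand-in coroutine exist only to make the structure visible.
import asyncio

BOOTSTRAPPED = {"done": False}

async def bootstrap_default_admin() -> None:
    # Stand-in for the real async use case; replace with the actual import.
    BOOTSTRAPPED["done"] = True

def when_ready(server):
    """Gunicorn hook: runs once in the master, before any worker is forked."""
    # The hook is synchronous, so the async bootstrap must be driven here:
    asyncio.run(bootstrap_default_admin())
```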
Option C: Pre-flight script in startup_api.sh
Bootstrap could be extracted into a standalone Python script and called in startup_api.sh before gunicorn is launched, similarly to how `alembic upgrade head` is already run:

```sh
python -m alembic -c api/alembic.ini upgrade head
python -m api.scripts.bootstrap_admin  # new
exec gunicorn ...
```
- Pros: Completely decoupled from the application runtime. Runs in a single process with no concurrency. Easy to understand and test in isolation. Works regardless of how the app is later served (Gunicorn, uvicorn, etc.).
- Cons: Requires creating a new script and modifying the startup script. Bootstrap is no longer self-contained within the application. Operators and developers deploying differently (e.g. via uvicorn directly) must remember to run it separately.
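A hypothetical shape for the new script (the module path api/scripts/bootstrap_admin.py and the stand-in coroutine are assumptions; the real module would import the use case and its database session setup from the application):

```python
# Option C sketch: a standalone entrypoint that runs bootstrap once,
# before any server process exists.
import asyncio
import sys

async def bootstrap_default_admin() -> None:
    # Stand-in: the real function creates the default role and user,
    # skipping gracefully if they already exist.
    print("default admin ensured")

def main() -> int:
    try:
        asyncio.run(bootstrap_default_admin())
    except Exception as exc:  # non-zero exit aborts the container start
        print(f"bootstrap failed: {exc}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```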
Option D: PostgreSQL advisory lock
PostgreSQL supports session-level advisory locks (pg_try_advisory_lock) that are atomic and scoped to the database connection. The first worker to acquire the lock runs bootstrap; the others detect the lock is taken and skip. The lock is automatically released when the session closes, even on crash.
- Pros: Fully atomic at the DB level. No external dependencies beyond PostgreSQL, which is already required. Works regardless of how many workers or replicas are running. An info log can be added when detected.
- Cons: Adds non-trivial plumbing to `bootstrap_default_admin()`. Workers that do not acquire the lock skip immediately: they do not wait, so there is a small window where a worker starts serving traffic before bootstrap is fully complete (though in practice this is negligible since the lifespan blocks the server from accepting requests).
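A sketch of this option, assuming an asyncpg-style async connection (`fetchval`/`execute`); the lock key is an arbitrary constant that all workers must share:

```python
# Option D sketch: only the worker that wins the advisory lock runs
# bootstrap; the others skip immediately.
BOOTSTRAP_LOCK_KEY = 724_001  # any stable application-chosen 64-bit key

async def bootstrap_with_advisory_lock(conn, bootstrap) -> bool:
    """Run `bootstrap` only if this process wins the advisory lock.

    Returns True if bootstrap ran here, False if another worker held the lock.
    """
    got_lock = await conn.fetchval(
        "SELECT pg_try_advisory_lock($1)", BOOTSTRAP_LOCK_KEY
    )
    if not got_lock:
        return False  # another worker is (or was) bootstrapping; skip
    try:
        await bootstrap()
        return True
    finally:
        # Release explicitly; closing the session would also release it,
        # even if the worker crashes mid-bootstrap.
        await conn.execute("SELECT pg_advisory_unlock($1)", BOOTSTRAP_LOCK_KEY)
```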
Option E: Redis distributed lock
Redis is already a required dependency. A distributed lock with a TTL (e.g. `SET bootstrap_lock 1 NX EX 30`) ensures only one worker runs bootstrap. The TTL prevents deadlocks if a worker crashes while holding the lock.
- Pros: Clean distributed lock with automatic expiry. Redis is already initialized at this point in the lifespan. Works across multiple replicas/pods (unlike Gunicorn hooks which are single-host).
- Cons: Slightly more complex than an advisory lock. Requires passing the Redis pool into `bootstrap_default_admin()`. Introduces a dependency on Redis being reachable before bootstrap can run (already the case in practice).
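A sketch of this option, assuming a redis.asyncio-style client; the key name and TTL are illustrative:

```python
# Option E sketch: SET ... NX EX is atomic, so exactly one worker gets
# the lock; the TTL guarantees eventual release even if that worker dies.
BOOTSTRAP_LOCK_KEY = "bootstrap_lock"
BOOTSTRAP_LOCK_TTL = 30  # seconds

async def bootstrap_with_redis_lock(redis, bootstrap) -> bool:
    """Run `bootstrap` only if this process acquires the Redis lock."""
    acquired = await redis.set(
        BOOTSTRAP_LOCK_KEY, "1", nx=True, ex=BOOTSTRAP_LOCK_TTL
    )
    if not acquired:
        return False  # another worker holds the lock; skip
    await bootstrap()
    return True
```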
Option F: Write a flag file
After a successful bootstrap, write a marker file (e.g. /tmp/bootstrap_done). Workers check for the file before attempting bootstrap.
- Pros: No external dependencies. Simple to implement.
- Cons: The file is local to the host: it does not work across multiple replicas/pods. The race condition still exists between the file check and the write (two workers can both miss the file and both attempt creation). The file persists across container restarts only if it is on a mounted volume; otherwise it disappears and bootstrap re-runs on every restart (which is fine since `has_admin_user()` guards it, but makes the flag pointless). Fragile in general, and not compatible with a future Kubernetes deployment.
Option G: Environment variable
Set an environment variable (e.g. BOOTSTRAP_DONE=1) after the first worker completes bootstrap. Subsequent workers check for it before attempting.
- Pros: Zero dependencies, trivially simple.
- Cons: Environment variables cannot be set in a parent process from a child process. A worker cannot modify the environment of other workers or the master. This fundamentally does not work in a multi-process model. Could work if set externally (e.g. in the startup script after a pre-flight bootstrap script), but then it's just a worse version of Option C.