Fix TOCTOU race in template deletion causing FK violation#2231
Fix TOCTOU race in template deletion causing FK violation#2231
Conversation
Wrap the delete template handler's guard checks (ExistsTemplateSnapshots) and deletion (DeleteTemplate) in a single DB transaction with SELECT ... FOR UPDATE on the env row. This prevents a concurrent UpsertSnapshot from inserting a snapshot with base_env_id referencing the env being deleted. On the snapshot creation side, handle FK violations gracefully by returning a typed BaseTemplateDeletedError instead of propagating a raw constraint violation when the base template was deleted between sandbox creation and pause.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 349764164b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if dberrors.IsForeignKeyViolation(err) { | ||
| telemetry.ReportError(ctx, "base template was deleted, cannot create snapshot", | ||
| err, telemetry.WithTemplateID(sbx.BaseTemplateID)) | ||
|
|
||
| return BaseTemplateDeletedError{BaseTemplateID: sbx.BaseTemplateID} |
There was a problem hiding this comment.
Preserve original FK error for pause fallback
Returning BaseTemplateDeletedError here discards the original PostgreSQL error, but the only caller (removeSandboxFromNode) still relies on dberrors.IsForeignKeyViolation(err) to trigger the kill-sandbox fallback when pause cannot snapshot due to a deleted base template. In that race, this branch now makes the caller miss the FK case, skip cleanup, and return a generic auto-pause failure instead of executing the intended fallback path.
Useful? React with 👍 / 👎.
| if err != nil { | ||
| // Check if the error is an FK violation on base_env_id, which means | ||
| // the base template was deleted between sandbox creation and pause. | ||
| if dberrors.IsForeignKeyViolation(err) { |
There was a problem hiding this comment.
dberrors.IsForeignKeyViolation(err) matches any FK violation from UpsertSnapshot, not just the base_env_id one. That query has several other FK constraints — snapshots.team_id, the env_build_assignments.env_id / build_id edges, and env_builds.cluster_node_id (if it has a FK). A violation on any of those would be silently reclassified as a non-critical BaseTemplateDeletedError instead of propagating as a critical error, masking real bugs.
Use pgconn.PgError.ConstraintName to narrow the check to the specific constraint, e.g.:
var pgErr *pgconn.PgError
if errors.As(err, &pgErr) && pgErr.Code == "23503" && pgErr.ConstraintName == "snapshots_base_env_id_fkey" {(adjust the constraint name to match the actual schema)
Bug
There is a TOCTOU (Time-of-Check-Time-of-Use) race condition in the
DELETE /templates/{templateID}handler. When aSandbox.create(snapshotId)runs concurrently with template deletion, the following sequence causes an FK constraint violation (snapshots_envs_base_env_id):ExistsTemplateSnapshots()— no snapshots foundSandbox.create(snapshotId)resolves the template env, settingBaseTemplateIDDeleteTemplate()— deletes the env rowUpsertSnapshottries to INSERT withbase_env_idpointing to the deleted env — FK violationFix
Two-pronged approach:
1. Atomic delete with row locking (
template_delete.go)ExistsTemplateSnapshots+DeleteTemplatein a single DB transaction (following the existingWithTxpattern fromtemplate_tags.go)SELECT ... FOR UPDATEon the env row at the start of the transaction, which blocks concurrentUpsertSnapshotFK checks until the transaction commits2. Graceful FK violation handling (
pause_instance.go)UpsertSnapshotfails with an FK violation (base template was deleted after the sandbox was created but before it paused), return a typedBaseTemplateDeletedErrorinstead of propagating a raw constraint violationFiles changed
packages/api/internal/handlers/template_delete.goSELECT ... FOR UPDATEon the env rowpackages/api/internal/orchestrator/pause_instance.goBaseTemplateDeletedErrortype; catch FK violations fromUpsertSnapshotand return descriptive errorpackages/db/queries/templates/lock_env_for_update.sqlSELECT id FROM envs WHERE id = @env_id FOR UPDATEpackages/db/queries/lock_env_for_update.sql.goLockEnvForUpdatequeryReproduction
Racing
Sandbox.create(snapshotId)againstSandbox.deleteSnapshot(snapshotId)reproduces the FK violation on nearly every first attempt.Test plan
Sandbox.create(snapshotId)+ template deletion no longer causes FK violationBaseTemplateDeletedError(not raw FK violation) when the base template was deletedmake generate/dbto confirm the generated SQL code matches