Fix TOCTOU race in template deletion causing FK violation #2231

Open
beran-t wants to merge 1 commit into main from fix/toctou-template-delete-fk-violation

beran-t commented Mar 25, 2026

Bug

The DELETE /templates/{templateID} handler has a TOCTOU (time-of-check to time-of-use) race condition. When Sandbox.create(snapshotId) runs concurrently with template deletion, the following sequence causes an FK constraint violation (snapshots_envs_base_env_id):

  1. Delete handler calls ExistsTemplateSnapshots() — no snapshots found
  2. Concurrently, Sandbox.create(snapshotId) resolves the template env, setting BaseTemplateID
  3. Delete handler calls DeleteTemplate() — deletes the env row
  4. The new sandbox later pauses and UpsertSnapshot tries to INSERT with base_env_id pointing to the deleted env — FK violation

Fix

Two-pronged approach:

1. Atomic delete with row locking (template_delete.go)

  • Wrap ExistsTemplateSnapshots + DeleteTemplate in a single DB transaction (following the existing WithTx pattern from template_tags.go)
  • Acquire SELECT ... FOR UPDATE on the env row at the start of the transaction, which blocks concurrent UpsertSnapshot FK checks until the transaction commits
  • This eliminates the TOCTOU window between the snapshot check and the delete

2. Graceful FK violation handling (pause_instance.go)

  • If UpsertSnapshot fails with an FK violation (base template was deleted after the sandbox was created but before it paused), return a typed BaseTemplateDeletedError instead of propagating a raw constraint violation
  • Reported as a non-critical telemetry error since this is a legitimate race outcome

Files changed

| File | Change |
| --- | --- |
| packages/api/internal/handlers/template_delete.go | Wrap guard checks + deletion in a DB transaction with SELECT ... FOR UPDATE on the env row |
| packages/api/internal/orchestrator/pause_instance.go | Add BaseTemplateDeletedError type; catch FK violations from UpsertSnapshot and return descriptive error |
| packages/db/queries/templates/lock_env_for_update.sql | New SQL query: SELECT id FROM envs WHERE id = @env_id FOR UPDATE |
| packages/db/queries/lock_env_for_update.sql.go | Generated Go code for the LockEnvForUpdate query |

Reproduction

Racing Sandbox.create(snapshotId) against Sandbox.deleteSnapshot(snapshotId) reproduces the FK violation on nearly every first attempt.

Test plan

  • Verify template deletion still works correctly when no sandboxes reference the template
  • Verify template deletion returns 400 when paused sandboxes exist
  • Verify concurrent Sandbox.create(snapshotId) + template deletion no longer causes FK violation
  • Verify sandbox pause returns BaseTemplateDeletedError (not raw FK violation) when the base template was deleted
  • Run make generate/db to confirm the generated SQL code matches

Wrap the delete template handler's guard checks (ExistsTemplateSnapshots)
and deletion (DeleteTemplate) in a single DB transaction with SELECT ...
FOR UPDATE on the env row. This prevents a concurrent UpsertSnapshot from
inserting a snapshot with base_env_id referencing the env being deleted.

On the snapshot creation side, handle FK violations gracefully by returning
a typed BaseTemplateDeletedError instead of propagating a raw constraint
violation when the base template was deleted between sandbox creation and
pause.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 349764164b

Comment on lines +49 to +53
    if dberrors.IsForeignKeyViolation(err) {
        telemetry.ReportError(ctx, "base template was deleted, cannot create snapshot",
            err, telemetry.WithTemplateID(sbx.BaseTemplateID))

        return BaseTemplateDeletedError{BaseTemplateID: sbx.BaseTemplateID}

P1: Preserve original FK error for pause fallback

Returning BaseTemplateDeletedError here discards the original PostgreSQL error, but the only caller (removeSandboxFromNode) still relies on dberrors.IsForeignKeyViolation(err) to trigger the kill-sandbox fallback when pause cannot snapshot due to a deleted base template. In that race, this branch now makes the caller miss the FK case, skip cleanup, and return a generic auto-pause failure instead of executing the intended fallback path.
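
One possible way to address this is to keep the original error reachable through Unwrap. The sketch below assumes dberrors.IsForeignKeyViolation walks the error chain, which is not confirmed by this PR. The Cause field, errForeignKey, and the lowercase isForeignKeyViolation stand-in are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errForeignKey stands in for the driver's FK violation error (hypothetical).
var errForeignKey = errors.New("foreign key violation")

// isForeignKeyViolation is a stand-in for dberrors.IsForeignKeyViolation.
// It assumes the real helper also inspects wrapped errors.
func isForeignKeyViolation(err error) bool {
	return errors.Is(err, errForeignKey)
}

// BaseTemplateDeletedError carries the original error in Cause, a
// hypothetical field that is not part of the PR's diff.
type BaseTemplateDeletedError struct {
	BaseTemplateID string
	Cause          error
}

func (e BaseTemplateDeletedError) Error() string {
	return fmt.Sprintf("base template %q was deleted", e.BaseTemplateID)
}

// Unwrap exposes the original error so chain-walking checks still match.
func (e BaseTemplateDeletedError) Unwrap() error { return e.Cause }

func main() {
	err := BaseTemplateDeletedError{BaseTemplateID: "env-1", Cause: errForeignKey}
	fmt.Println(isForeignKeyViolation(err))
}
```

If the real helper does a direct type assertion instead of walking the chain, the caller would need an explicit errors.As check for BaseTemplateDeletedError.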

    if err != nil {
        // Check if the error is an FK violation on base_env_id, which means
        // the base template was deleted between sandbox creation and pause.
        if dberrors.IsForeignKeyViolation(err) {

dberrors.IsForeignKeyViolation(err) matches any FK violation from UpsertSnapshot, not just the base_env_id one. That query has several other FK constraints — snapshots.team_id, the env_build_assignments.env_id / build_id edges, and env_builds.cluster_node_id (if it has a FK). A violation on any of those would be silently reclassified as a non-critical BaseTemplateDeletedError instead of propagating as a critical error, masking real bugs.

Use pgconn.PgError.ConstraintName to narrow the check to the specific constraint, e.g.:

var pgErr *pgconn.PgError
if errors.As(err, &pgErr) && pgErr.Code == "23503" && pgErr.ConstraintName == "snapshots_base_env_id_fkey" {

(adjust the constraint name to match the actual schema)

ValentaTomas removed their request for review March 25, 2026 16:30