feat(scheduler): recover cutover-ready applies after restart#218

Open
aparajon wants to merge 23 commits into main from armand/cutover-recovery-state

Collaborator

@aparajon aparajon commented Jun 2, 2026

Why

A MySQL/Spirit apply can restart after reaching deferred cutover readiness, but before SchemaBot has safely reattached to Spirit's checkpoint state. Operators need recovery to fail closed while SchemaBot proves the engine state, without allowing durable cutover-ready storage to move backward.

What

  • Add a generic recovering apply/task/proto state for restart recovery
  • Block cutover while recovery is unresolved; row-copy progress can be displayed during recovery, but durable storage does not move backward to running
  • Return recovery to waiting_for_cutover only after Spirit proves cutover readiness again
  • Preserve durable cutover requests through recovery and row-copy reporting, sending cutover only after the apply is cutover-ready again
  • Reconcile the manual-sentinel-drop case by re-planning against live schema: complete if the desired schema is already present, otherwise fail closed for manual reconciliation
  • Update CLI/TUI/PR comment rendering, apply logs, state docs, and gRPC edge-case docs for the recovery behavior
storage: waiting_for_cutover
          │
          ▼
   restart recovery
          │
          ▼
   check Spirit sentinel
      ┌───┴────┐
      ▼        ▼
 sentinel   sentinel
 exists     absent
      │        │
      ▼        ▼
 recovering  live-schema reconcile
      │        │
      │        ├─ desired schema present → completed
      │        └─ desired schema missing → failed, manual reconciliation
      │
      ├─ Spirit reports row copy → stay recovering, show row-copy detail
      └─ Spirit reports waiting_for_cutover → waiting_for_cutover
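
The decision flow in the diagram above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Probe` struct, its fields, and `nextState` are hypothetical names invented for the example; only the state names (waiting_for_cutover, recovering, completed, failed) come from the description.

```go
package main

import "fmt"

// State names mirror the PR description; the Go identifiers are hypothetical.
type State string

const (
	WaitingForCutover State = "waiting_for_cutover"
	Recovering        State = "recovering"
	Completed         State = "completed"
	Failed            State = "failed"
)

// Probe is a hypothetical snapshot of what is observed after a restart for an
// apply whose durable state was waiting_for_cutover.
type Probe struct {
	SentinelExists       bool // the Spirit sentinel is still present
	DesiredSchemaPresent bool // re-planning against the live schema finds nothing left to do
	SpiritCutoverReady   bool // Spirit reports waiting_for_cutover again
}

// nextState returns the state to persist, following the diagram.
func nextState(p Probe) State {
	if !p.SentinelExists {
		// Sentinel absent: reconcile against the live schema.
		if p.DesiredSchemaPresent {
			return Completed
		}
		return Failed // fail closed; an operator reconciles manually
	}
	// Sentinel present: stay in recovery until Spirit proves readiness again.
	if p.SpiritCutoverReady {
		return WaitingForCutover
	}
	return Recovering // row-copy detail may be displayed, but storage stays here
}

func main() {
	fmt.Println(nextState(Probe{SentinelExists: true}))
	fmt.Println(nextState(Probe{SentinelExists: true, SpiritCutoverReady: true}))
	fmt.Println(nextState(Probe{DesiredSchemaPresent: true}))
	fmt.Println(nextState(Probe{}))
}
```

Note that no branch returns a running state: in this sketch, as in the description, row-copy progress is display detail only and never becomes the durable state.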

Risk Assessment

Medium — this touches scheduler recovery, control readiness, proto state conversion, and user-facing renderers. The behavior is scoped to restart recovery and fails closed by rejecting or holding cutover until recovery proves readiness.
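
To make "holding cutover until recovery proves readiness" concrete, here is a small Go sketch of how a durable cutover request might be preserved while the stored state refuses to move backward. All identifiers (`Apply`, `observe`, `shouldSendCutover`, the `Detail` field) are assumptions made for illustration and are not taken from the PR's code.

```go
package main

import "fmt"

// State names follow the PR description; identifiers are hypothetical.
type State string

const (
	Running           State = "running"
	Recovering        State = "recovering"
	WaitingForCutover State = "waiting_for_cutover"
)

// Apply is a hypothetical durable record for one apply.
type Apply struct {
	State            State
	CutoverRequested bool   // durable operator request, preserved through recovery
	Detail           string // display-only progress detail
}

// observe folds an engine report into a recovering apply. A row-copy report
// only updates the display detail; the stored state never becomes running.
func observe(a Apply, engine State) Apply {
	if a.State != Recovering {
		return a
	}
	switch engine {
	case WaitingForCutover:
		a.State = WaitingForCutover
		a.Detail = ""
	case Running:
		a.Detail = "row copy in progress"
	}
	return a
}

// shouldSendCutover reports whether a held request may be sent to the engine.
func shouldSendCutover(a Apply) bool {
	return a.CutoverRequested && a.State == WaitingForCutover
}

func main() {
	a := Apply{State: Recovering, CutoverRequested: true}
	a = observe(a, Running)
	fmt.Println(a.State, shouldSendCutover(a)) // recovering false
	a = observe(a, WaitingForCutover)
	fmt.Println(a.State, shouldSendCutover(a)) // waiting_for_cutover true
}
```

The request flag survives both reports unchanged, so the cutover is sent once the apply is cutover-ready again, without the operator having to repeat the request.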

Generated with Amp

Copilot AI review requested due to automatic review settings June 2, 2026 20:06

Copilot AI left a comment

Pull request overview

This PR introduces a new explicit recovery phase for MySQL/Spirit deferred cutover applies (recovering_cutover) so that, after a restart, SchemaBot visibly “fails closed” and rejects cutover until it has safely reattached to Spirit’s checkpoint/sentinel-wait state (or reconciled completion when the sentinel is already absent).

Changes:

  • Add recovering_cutover as a first-class apply/task/proto state and ensure state derivation / conversions handle it end-to-end.
  • Implement scheduler-side deferred-cutover recovery behavior (sentinel present → enter recovery + block cutover; sentinel absent → re-plan against live schema and complete if already applied).
  • Update user-facing renderers (CLI/TUI/PR comments/logs) and documentation to reflect the new recovery edge cases, plus add targeted tests (including integration coverage).

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/webhook/templates/apply.go | Render recovering_cutover in PR comment header, progress summary, table rows, and footer messaging. |
| pkg/webhook/templates/apply_test.go | Add PR comment rendering test coverage for recovering_cutover. |
| pkg/ui/format.go | Ensure table priority ordering treats recovering_cutover like other cutover-wait phases. |
| pkg/ui/format_test.go | Add unit coverage for recovering_cutover table priority. |
| pkg/tern/state_converters.go | Map recovering state across task→apply derivation and storage↔proto conversions; add helper to extract task states. |
| pkg/tern/local_control.go | Reject cutover requests while an apply/task is in recovering_cutover. |
| pkg/tern/local_control_resume.go | Add deferred-cutover recovery flow, including marking apply/tasks recovering and resuming from checkpoint when sentinel exists. |
| pkg/tern/local_client.go | Add sentinel existence probe; include recovering_cutover in progress selection and state-guard logic. |
| pkg/tern/local_client_test.go | Add unit test verifying proto conversion for the new recovery state. |
| pkg/tern/local_client_integration_test.go | Add integration scenarios for sentinel-present recovery (block cutover until ready) and sentinel-absent reconciliation to completion. |
| pkg/tern/local_apply_grouped.go | Derive apply state from persisted task states and prevent backward progress during atomic progress sync. |
| pkg/storage/mysqlstore/applies.go | Allow scheduler claiming of applies in recovering_cutover. |
| pkg/state/task.go | Define and normalize the new task state string recovering_cutover. |
| pkg/state/README.md | Document the new state and update state diagrams / derivation ordering. |
| pkg/state/metadata.go | Add label metadata for the apply recovery state. |
| pkg/state/apply.go | Define the new apply state and update derivation/normalization to include it. |
| pkg/proto/ternv1/tern.pb.go | Regenerate/update protobuf bindings to include STATE_RECOVERING_CUTOVER. |
| pkg/proto/tern.proto | Add STATE_RECOVERING_CUTOVER to the public gRPC enum. |
| pkg/cmd/internal/templates/progress.go | Render recovering_cutover in CLI progress views and status label/color handling. |
| pkg/cmd/internal/templates/progress_states_test.go | Add CLI template test coverage for recovering_cutover. |
| pkg/cmd/commands/watch_tui_view.go | Show recovery messaging and suppress cutover prompt in the TUI while recovering. |
| pkg/cmd/commands/watch_tui_test.go | Add TUI test coverage for recovery messaging / blocked cutover. |
| pkg/cmd/commands/apply.go | Emit apply/table log messages for the recovery phase. |
| pkg/cmd/commands/apply_log_test.go | Add log emitter + “active status” coverage for the recovery state. |
| pkg/api/handlers_test.go | Add API-level tests rejecting cutover while local/remote apply state is recovering. |
| pkg/api/control_handlers.go | Extend cutover readiness gating to treat recovering_cutover as a reject condition (local + remote). |
| docs/grpc-control-edge-cases.md | Document the sentinel-present/absent recovery edge cases and expected user-visible outcomes. |
| docs/architecture.md | Document the new recovery phase and how sentinel presence/absence drives behavior after restart. |

Files not reviewed (1)
  • pkg/proto/ternv1/tern.pb.go: Language not supported


@aparajon aparajon marked this pull request as ready for review June 3, 2026 20:10
@aparajon aparajon requested review from Kiran01bm and morgo as code owners June 3, 2026 20:10
Comment thread pkg/tern/local_control.go
@aparajon aparajon force-pushed the armand/cutover-recovery-state branch from 1f3cc42 to a32d64c June 4, 2026 17:23
@aparajon aparajon force-pushed the armand/cutover-recovery-state branch from 78a6a95 to 909ac4c June 4, 2026 21:36
@aparajon aparajon changed the title from "feat(scheduler): recover deferred cutover state" to "feat(scheduler): recover cutover-ready applies after restart" Jun 4, 2026
3 participants