feat(scheduler): recover cutover-ready applies after restart#218
Open
aparajon wants to merge 23 commits into
Pull request overview
This PR introduces a new explicit recovery phase for MySQL/Spirit deferred cutover applies (recovering_cutover) so that, after a restart, SchemaBot visibly “fails closed” and rejects cutover until it has safely reattached to Spirit’s checkpoint/sentinel-wait state (or reconciled completion when the sentinel is already absent).
Changes:
- Add recovering_cutover as a first-class apply/task/proto state and ensure state derivation / conversions handle it end-to-end.
- Implement scheduler-side deferred-cutover recovery behavior (sentinel present → enter recovery + block cutover; sentinel absent → re-plan against live schema and complete if already applied).
- Update user-facing renderers (CLI/TUI/PR comments/logs) and documentation to reflect the new recovery edge cases, plus add targeted tests (including integration coverage).
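The sentinel-driven branching described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/tern/local_control_resume.go: the type `RecoveryAction`, the function `decideRecovery`, and the action names are hypothetical, and the third branch (sentinel absent, change not yet applied) is an assumption because the overview does not say what happens in that case.

```go
package main

import "fmt"

// RecoveryAction is a hypothetical name for the outcome of the
// post-restart check; it does not appear in the PR.
type RecoveryAction string

const (
	// Sentinel present: enter recovering_cutover and block cutover.
	ActionEnterRecovery RecoveryAction = "enter_recovering_cutover"
	// Sentinel absent and the live schema already matches: complete.
	ActionMarkCompleted RecoveryAction = "mark_completed"
	// Sentinel absent and the change is not applied. The PR overview does
	// not describe this case; this value is a placeholder assumption.
	ActionNeedsFurtherHandling RecoveryAction = "needs_further_handling"
)

// decideRecovery maps the two observations named in the overview
// (sentinel presence, and whether a re-plan against the live schema
// shows the change already applied) to a recovery action.
func decideRecovery(sentinelPresent, alreadyApplied bool) RecoveryAction {
	if sentinelPresent {
		return ActionEnterRecovery
	}
	if alreadyApplied {
		return ActionMarkCompleted
	}
	return ActionNeedsFurtherHandling
}

func main() {
	fmt.Println(decideRecovery(true, false))
	fmt.Println(decideRecovery(false, true))
	fmt.Println(decideRecovery(false, false))
}
```

Checking the sentinel first means its presence always wins, which matches the fail-closed intent: while the sentinel exists, the apply is held in recovery regardless of what a re-plan might report.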
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/webhook/templates/apply.go | Render recovering_cutover in PR comment header, progress summary, table rows, and footer messaging. |
| pkg/webhook/templates/apply_test.go | Add PR comment rendering test coverage for recovering_cutover. |
| pkg/ui/format.go | Ensure table priority ordering treats recovering_cutover like other cutover-wait phases. |
| pkg/ui/format_test.go | Add unit coverage for recovering_cutover table priority. |
| pkg/tern/state_converters.go | Map recovering state across task→apply derivation and storage↔proto conversions; add helper to extract task states. |
| pkg/tern/local_control.go | Reject cutover requests while an apply/task is in recovering_cutover. |
| pkg/tern/local_control_resume.go | Add deferred-cutover recovery flow, including marking apply/tasks recovering and resuming from checkpoint when sentinel exists. |
| pkg/tern/local_client.go | Add sentinel existence probe; include recovering_cutover in progress selection and state-guard logic. |
| pkg/tern/local_client_test.go | Add unit test verifying proto conversion for the new recovery state. |
| pkg/tern/local_client_integration_test.go | Add integration scenarios for sentinel-present recovery (block cutover until ready) and sentinel-absent reconciliation to completion. |
| pkg/tern/local_apply_grouped.go | Derive apply state from persisted task states and prevent backward progress during atomic progress sync. |
| pkg/storage/mysqlstore/applies.go | Allow scheduler claiming of applies in recovering_cutover. |
| pkg/state/task.go | Define and normalize the new task state string recovering_cutover. |
| pkg/state/README.md | Document the new state and update state diagrams / derivation ordering. |
| pkg/state/metadata.go | Add label metadata for the apply recovery state. |
| pkg/state/apply.go | Define the new apply state and update derivation/normalization to include it. |
| pkg/proto/ternv1/tern.pb.go | Regenerate/update protobuf bindings to include STATE_RECOVERING_CUTOVER. |
| pkg/proto/tern.proto | Add STATE_RECOVERING_CUTOVER to the public gRPC enum. |
| pkg/cmd/internal/templates/progress.go | Render recovering_cutover in CLI progress views and status label/color handling. |
| pkg/cmd/internal/templates/progress_states_test.go | Add CLI template test coverage for recovering_cutover. |
| pkg/cmd/commands/watch_tui_view.go | Show recovery messaging and suppress cutover prompt in the TUI while recovering. |
| pkg/cmd/commands/watch_tui_test.go | Add TUI test coverage for recovery messaging / blocked cutover. |
| pkg/cmd/commands/apply.go | Emit apply/table log messages for the recovery phase. |
| pkg/cmd/commands/apply_log_test.go | Add log emitter + “active status” coverage for the recovery state. |
| pkg/api/handlers_test.go | Add API-level tests rejecting cutover while local/remote apply state is recovering. |
| pkg/api/control_handlers.go | Extend cutover readiness gating to treat recovering_cutover as a reject condition (local + remote). |
| docs/grpc-control-edge-cases.md | Document the sentinel-present/absent recovery edge cases and expected user-visible outcomes. |
| docs/architecture.md | Document the new recovery phase and how sentinel presence/absence drives behavior after restart. |
Files not reviewed (1)
- pkg/proto/ternv1/tern.pb.go: Language not supported
Kiran01bm
reviewed
Jun 3, 2026
1f3cc42 to
a32d64c
Compare
78a6a95 to
909ac4c
Compare
Why
A MySQL/Spirit apply can restart after reaching deferred cutover readiness, but before SchemaBot has safely reattached to Spirit's checkpoint state. Operators need recovery to fail closed while SchemaBot proves the engine state, without allowing durable cutover-ready storage to move backward.
What
- recovering apply/task/proto state for restart recovery
- running waiting_for_cutover only after Spirit proves cutover readiness again
Risk Assessment
Medium — this touches scheduler recovery, control readiness, proto state conversion, and user-facing renderers. The behavior is scoped to restart recovery and fails closed by rejecting or holding cutover until recovery proves readiness.
Generated with Amp