-
Notifications
You must be signed in to change notification settings - Fork 1.3k
feat: Add MigrateSchedule RPC for V1 to V2 scheduler migration #9261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add the infrastructure for migrating schedules from the workflow-backed scheduler (V1) to the CHASM-backed scheduler (V2): - Add MigrateSchedule RPC to CHASM scheduler service proto - Add MigrateScheduleRequest/Response messages with migration state - Implement AdminHandler.MigrateSchedule to signal V1 workflow - Add migrate signal handler in V1 scheduler workflow - Add MigrateSchedule activity to call CHASM scheduler service - Update migration function to accept proto types directly - Wire up SchedulerServiceClient in worker service fx module
Add handler and logic in chasm/lib/scheduler to create a CHASM scheduler from migrated V1 state: - CreateSchedulerFromMigration initializes scheduler with migrated state - MigrateSchedule handler uses StartExecution with reject duplicate policy - Tests for migration functionality
| if errors.As(err, &alreadyStartedErr) { | ||
| return nil, serviceerror.NewWorkflowExecutionAlreadyStarted( | ||
| "CHASM schedule already exists", | ||
| "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we include a this info n the request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will circle back to this.
…migration test LegacyToSchedulerMigrationState was returning *SchedulerMigrationState but the MigrateSchedule activity expects *MigrateScheduleRequest. Rename to LegacyToMigrateScheduleRequest and return the full request with NamespaceId populated. Also fix the migrate signal channel (was incorrectly using SignalNameForceCAN instead of SignalNameMigrate), add TestScheduleMigrationV1ToV2 integration test, expose SchedulerClient from test cluster, and fix staticcheck SA4006 lint errors in scheduler_test.go.
service/worker/scheduler/workflow.go
Outdated
| // inc the sequence number to prevent to invalidate signals to | ||
| // this workflow after the migration has started. | ||
| // they should target the chasm scheduler after this point | ||
| s.incSeqNo() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the conflict token is not checked until a signal is processed. So this likely has no benefit.
| "namespace", s.State.Namespace, | ||
| "schedule-id", s.State.ScheduleId, | ||
| ) | ||
| return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything else that should be done before closing the workflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will drop any signals received during the migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will drop any signals received during the migration.
yep, regrettably true, though updates are already accepted (or rejected) asynchronously, so it's not new behavior.
Add activity-level tests for MigrateSchedule covering success, already-exists (idempotent), and error paths. Add workflow-level tests for migrate signal handling: success terminates workflow, failure continues, and signals are still processed after a failed migration. Cap migration local activity to 1 attempt with 60s schedule-to-close timeout instead of inheriting the default 1h with unlimited retries. Remove unnecessary incSeqNo() before migration -- the conflict token change is never visible externally since it's in-memory only, and queued signals are dropped on workflow termination regardless.
What changed
MigrateScheduleRPC to admin handler and CHASM scheduler servicemigratesignal, runs a local activity that callsMigrateScheduleto create the CHASM schedule from V1 stateCreateSchedulerFromMigrationinitializes a full CHASM scheduler tree (generator, invoker, backfillers, visibility) from the migrated V1 state, preserving the conflict token for client compatibilityLegacyToMigrateScheduleRequestconverts V1InternalState+ScheduleInfointo the migration request format, including running/completed workflows as buffered starts and ongoing backfillsWhy
Support migrating from workflow-backed (V1) schedulers to CHASM (V2) schedulers. The admin API (
MigrateSchedule) signals the V1 workflow, which snapshots its state and creates the V2 schedule in a single local activity.Signals during migration
Signals received while the migration local activity is executing are dropped if the migration succeeds (the workflow terminates without consuming them).
Migration activity retry policy
The migration local activity uses a restricted retry policy (1 attempt, 60s schedule-to-close) rather than the default (unlimited retries, 1h). A persistent failure should fail fast and let the workflow continue, rather than blocking for up to an hour.
Follow-up PRs
tdbgcommand for triggering migration