Skip to content

Comments

fix: delay resuming workflows after restart#41

Merged
jason-lynch merged 2 commits intomainfrom
fix/PLAT-99/workflow-resume-delay
Jun 3, 2025
Merged

fix: delay resuming workflows after restart#41
jason-lynch merged 2 commits intomainfrom
fix/PLAT-99/workflow-resume-delay

Conversation

@jason-lynch
Copy link
Member

The go-workflows backend locks workflows and activities while they're being executed. In the event of a crash or restart, the lock would prevent the worker from resuming the workflow. This fix makes a few changes to resolve this issue:

  • Uses a stable ID for the worker ID
  • Adds a random "worker instance" ID
  • Adds the new IDs to the lock entities
  • Incorporates the new IDs into the lock checks

If the worker ID matches the lock's worker ID but the worker instance ID does not match, we know that the lock belonged to an old instance of the worker and that the lock can be reassigned to the new instance.

PLAT-99

@jason-lynch jason-lynch requested review from mmols and tsivaprasad June 2, 2025 15:48
@jason-lynch jason-lynch force-pushed the fix/PLAT-99/workflow-resume-delay branch from 9166097 to 53e1663 Compare June 2, 2025 16:23
The `go-workflows` backend locks workflows and activities while they're
being executed. In the event of a crash or restart, the lock would
prevent the worker from resuming the workflow. This fix makes a few
changes to resolve this issue:

- Uses a stable ID for the worker ID
- Adds a random "worker instance" ID
- Adds the new IDs to the lock entities
- Incorporates the new IDs into the lock checks

If the worker ID matches the lock's worker ID but the worker instance ID
does not match, we know that the lock belonged to an old instance of the
worker and that the lock can be reassigned to the new instance.

PLAT-99
@jason-lynch jason-lynch force-pushed the fix/PLAT-99/workflow-resume-delay branch from 53e1663 to 2710268 Compare June 2, 2025 20:53
Create(&workflow_instance_lock.Value{
WorkflowInstanceID: instanceID,
WorkflowExecutionID: executionID,
CreatedAt: time.Now(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're calling time.Now() multiple times. I think we can optimize it by calling it once—now := time.Now()—and reuse. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed in 514b7d0

@@ -252,15 +254,34 @@ func (b *Backend) GetWorkflowTask(ctx context.Context, queues []workflow.Queue)
instanceID := item.WorkflowInstance.InstanceID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GetWorkflowTask method looks a bit large and complex with 3 nested loops. It might be a good idea to refactor. What do you think?

Copy link
Member Author

@jason-lynch jason-lynch Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this package could absolutely be refactored. But, I would prefer to do that as its own ticket, because changes to this package need to be made very carefully. This ticket is just focused on fixing a bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True

@jason-lynch jason-lynch requested a review from tsivaprasad June 3, 2025 17:52
@jason-lynch jason-lynch merged commit 167cbe9 into main Jun 3, 2025
2 checks passed
@jason-lynch jason-lynch deleted the fix/PLAT-99/workflow-resume-delay branch June 4, 2025 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants