[🐛 BUG]: Workflow events are lost after restarting the PHP worker #668

@roxblnfk

Description

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

Workflow events are lost immediately after restarting the PHP worker.

The test for reproduction is in PR #667. In the RoadRunner tests, I see this error:

{"message":"static_pool_exec: Workers watcher stopped:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker","source":"GoSDK","stackTrace":"process event for default [panic]:\ngithub.com/temporalio/roadrunner-temporal/v5/aggregatedpool.(*Workflow).OnWorkflowTaskStarted(0xc0006a40e0, 0x3?)\n\td:/git/temporal/temporalio-roadrunner-temporal/aggregatedpool/workflow.go:312 +0x3e5\ngo.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc000b02798, 0xc000022b40, 0x3?, 0x0)\n\tC:/Users/roxbl/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_event_handlers.go:1204 +0x30a\ngo.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc000c19c20, 0xc00079be00)\n\tC:/Users/roxbl/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_task_handlers.go:1197 +0x1a8a\ngo.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc00034e900, 0xc00079be00, 0xc000c19c20, 0xc00102efc0)\n\tC:/Users/roxbl/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_task_handlers.go:934 +0x59e\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc0002ce000, 0xc00079be00)\n\tC:/Users/roxbl/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_task_pollers.go:402 +0x3db\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc0002ce000, {0x17b04a0, 0xc00079be00})\n\tC:/Users/roxbl/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_task_pollers.go:350 +0x205\ngo.temporal.io/sdk/internal.(*baseWorker).processTaskAsync.func1()\n\tC:/Users/roxbl/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_worker_base.go:440 +0x12f\ncreated by go.temporal.io/sdk/internal.(*baseWorker).processTaskAsync in goroutine 228\n\tC:/Users/roxbl/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_worker_base.go:419 +0x8c","applicationFailureInfo":{"type":"PanicError","nonRetryable":true}}

In the PHP tests, the error is slightly different, but events are also lost:

{
  "message": "BadCancelTimerAttributes: invalid history builder state for action: add-timer-canceled-event, TimerID: 5",
  "serverFailureInfo": {}
}

After some tests, I found that:

  • The PHP Worker restarts.
  • The new Worker receives events that were sent to the old Worker.
  • The PHP SDK responds with a Failure: Workflow with the specified run identifier "4bc0a3d6-9ea2-4fc5-b232-823611320f4e" not found (RR_ID is used).
  • The plugin records it as just a Workflow Task Completed event.

During the next replay of the Workflow, the PHP SDK does not send a Failure. Instead, it sends the command that a correct execution of the Workflow Task would have produced (such as Cancel Timer or Side Effect). That command does not match the recorded history, which causes a determinism error.

Version

RR 2025.1.2
PHP SDK 2.15.0

Relevant log output
