[Bug] [Master] If a task fails during initialization, it will neither be dispatched by the Master nor can it be properly killed. #17758

@njnu-seafish

Description

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

1. Import a workflow definition.
2. Run the workflow. The task cannot be dispatched.
Relevant error output from dolphinscheduler-master.log:

[WI-8][TI-0] - 2025-12-01 15:58:06.007 INFO [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskDispatchLifecycleEvent{task=sh01}
[WI-8][TI-0] - 2025-12-01 15:58:06.007 INFO [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskStartLifecycleEvent{task=sh01} with state SUBMITTED_SUCCESS
[WI-8][TI-0] - 2025-12-01 15:58:06.010 ERROR [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.WorkflowEventBusFireWorker:[88] - Fire event failed for WorkflowExecuteRunnable: flow_condition_import_20251201155754777-20251201155805585
org.apache.dolphinscheduler.server.master.engine.exceptions.WorkflowEventFireException: Failed to fire event: TaskDispatchLifecycleEvent{task=sh01}
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleWorkflowEventBus(WorkflowEventBusFireWorker.java:133)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.fireAllRegisteredEvent(WorkflowEventBusFireWorker.java:86)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: Cannot find the environment: 144873539254368
at org.apache.dolphinscheduler.server.master.runner.TaskExecutionContextFactory.getEnvironmentConfigFromDB(TaskExecutionContextFactory.java:217)
at org.apache.dolphinscheduler.server.master.runner.TaskExecutionContextFactory.createTaskExecutionContext(TaskExecutionContextFactory.java:102)
at org.apache.dolphinscheduler.server.master.engine.task.runnable.TaskExecutionRunnable.initializeTaskExecutionContext(TaskExecutionRunnable.java:148)
at org.apache.dolphinscheduler.server.master.engine.task.statemachine.TaskSubmittedStateAction.onDispatchEvent(TaskSubmittedStateAction.java:112)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.TaskDispatchLifecycleEventHandler.handle(TaskDispatchLifecycleEventHandler.java:40)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.TaskDispatchLifecycleEventHandler.handle(TaskDispatchLifecycleEventHandler.java:31)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.AbstractTaskLifecycleEventHandler.handle(AbstractTaskLifecycleEventHandler.java:46)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.AbstractTaskLifecycleEventHandler.handle(AbstractTaskLifecycleEventHandler.java:32)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleEvent(WorkflowEventBusFireWorker.java:144)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleWorkflowEventBus(WorkflowEventBusFireWorker.java:122)

3. Manually kill the workflow. It remains stuck in the "waiting to be killed" state.
Relevant output from dolphinscheduler-master.log:

[WI-0][TI-0] - 2025-12-01 18:54:51.752 INFO [MasterRpcServer-methodInvoker-15] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: WorkflowStopLifecycleEvent{workflow=flow_condition_import_20251201155754777-20251201155805585}
[WI-8][TI-0] - 2025-12-01 18:54:51.851 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[47] - Begin fire workflow flow_condition_import_20251201155754777-20251201155805585 LifecycleEvent[WorkflowStopLifecycleEvent{workflow=flow_condition_import_20251201155754777-20251201155805585}] with state: RUNNING_EXECUTION
[WI-8][TI-0] - 2025-12-01 18:54:51.863 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.w.s.AbstractWorkflowStateAction:[161] - Success set WorkflowExecuteRunnable: flow_condition_import_20251201155754777-20251201155805585 state from: RUNNING_EXECUTION to READY_STOP
[WI-8][TI-0] - 2025-12-01 18:54:51.864 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskKillLifecycleEvent{task=sh01, delayTime=0}
[WI-0][TI-0] - 2025-12-01 18:54:51.865 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[52] - Fired workflow flow_condition_import_20251201155754777-20251201155805585 LifecycleEvent[WorkflowStopLifecycleEvent{workflow=flow_condition_import_20251201155754777-20251201155805585}] with state: READY_STOP
[WI-0][TI-0] - 2025-12-01 18:54:51.865 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.t.d.WorkerGroupDispatcherCoordinator:[74] - Failed to remove Task[id=14] from WorkerGroupDispatcher[name=default], this task has been dispatched
[WI-0][TI-0] - 2025-12-01 18:54:51.866 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.t.s.TaskSubmittedStateAction:[158] - The task[id=14] is submitted and already dispatched, cannot kill, will kill it after 5s
[WI-0][TI-0] - 2025-12-01 18:54:51.866 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskKillLifecycleEvent{task=sh01, delayTime=5000}
[WI-0][TI-0] - 2025-12-01 18:54:51.866 INFO [ds-workflow-eventbus-worker-0] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskKillLifecycleEvent{task=sh01, delayTime=0} with state SUBMITTED_SUCCESS
[WI-8][TI-0] - 2025-12-01 18:54:56.878 INFO [ds-workflow-eventbus-worker-16] o.a.d.s.m.e.t.d.WorkerGroupDispatcherCoordinator:[74] - Failed to remove Task[id=14] from WorkerGroupDispatcher[name=default], this task has been dispatched
[WI-8][TI-0] - 2025-12-01 18:54:56.878 INFO [ds-workflow-eventbus-worker-16] o.a.d.s.m.e.t.s.TaskSubmittedStateAction:[158] - The task[id=14] is submitted and already dispatched, cannot kill, will kill it after 5s
[WI-8][TI-0] - 2025-12-01 18:54:56.878 INFO [ds-workflow-eventbus-worker-16] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskKillLifecycleEvent{task=sh01, delayTime=5000}
[WI-8][TI-0] - 2025-12-01 18:54:56.878 INFO [ds-workflow-eventbus-worker-16] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskKillLifecycleEvent{task=sh01, delayTime=5000} with state SUBMITTED_SUCCESS
[WI-8][TI-0] - 2025-12-01 18:55:01.889 INFO [ds-workflow-eventbus-worker-13] o.a.d.s.m.e.t.d.WorkerGroupDispatcherCoordinator:[74] - Failed to remove Task[id=14] from WorkerGroupDispatcher[name=default], this task has been dispatched
[WI-8][TI-0] - 2025-12-01 18:55:01.890 INFO [ds-workflow-eventbus-worker-13] o.a.d.s.m.e.t.s.TaskSubmittedStateAction:[158] - The task[id=14] is submitted and already dispatched, cannot kill, will kill it after 5s
[WI-8][TI-0] - 2025-12-01 18:55:01.890 INFO [ds-workflow-eventbus-worker-13] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskKillLifecycleEvent{task=sh01, delayTime=5000}
[WI-8][TI-0] - 2025-12-01 18:55:01.890 INFO [ds-workflow-eventbus-worker-13] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskKillLifecycleEvent{task=sh01, delayTime=5000} with state SUBMITTED_SUCCESS
[WI-8][TI-0] - 2025-12-01 18:55:06.900 INFO [ds-workflow-eventbus-worker-1] o.a.d.s.m.e.t.d.WorkerGroupDispatcherCoordinator:[74] - Failed to remove Task[id=14] from WorkerGroupDispatcher[name=default], this task has been dispatched
[WI-8][TI-0] - 2025-12-01 18:55:06.901 INFO [ds-workflow-eventbus-worker-1] o.a.d.s.m.e.t.s.TaskSubmittedStateAction:[158] - The task[id=14] is submitted and already dispatched, cannot kill, will kill it after 5s
[WI-8][TI-0] - 2025-12-01 18:55:06.901 INFO [ds-workflow-eventbus-worker-1] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskKillLifecycleEvent{task=sh01, delayTime=5000}
[WI-8][TI-0] - 2025-12-01 18:55:06.901 INFO [ds-workflow-eventbus-worker-1] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskKillLifecycleEvent{task=sh01, delayTime=5000} with state SUBMITTED_SUCCESS
[WI-8][TI-0] - 2025-12-01 18:55:11.912 INFO [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.t.d.WorkerGroupDispatcherCoordinator:[74] - Failed to remove Task[id=14] from WorkerGroupDispatcher[name=default], this task has been dispatched
[WI-8][TI-0] - 2025-12-01 18:55:11.912 INFO [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.t.s.TaskSubmittedStateAction:[158] - The task[id=14] is submitted and already dispatched, cannot kill, will kill it after 5s
[WI-8][TI-0] - 2025-12-01 18:55:11.912 INFO [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskKillLifecycleEvent{task=sh01, delayTime=5000}
[WI-8][TI-0] - 2025-12-01 18:55:11.913 INFO [ds-workflow-eventbus-worker-14] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskKillLifecycleEvent{task=sh01, delayTime=5000} with state SUBMITTED_SUCCESS
[WI-8][TI-0] - 2025-12-01 18:55:16.923 INFO [ds-workflow-eventbus-worker-13] o.a.d.s.m.e.t.d.WorkerGroupDispatcherCoordinator:[74] - Failed to remove Task[id=14] from WorkerGroupDispatcher[name=default], this task has been dispatched
[WI-8][TI-0] - 2025-12-01 18:55:16.923 INFO [ds-workflow-eventbus-worker-13] o.a.d.s.m.e.t.s.TaskSubmittedStateAction:[158] - The task[id=14] is submitted and already dispatched, cannot kill, will kill it after 5s
[WI-8][TI-0] - 2025-12-01 18:55:16.923 INFO [ds-workflow-eventbus-worker-13] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskKillLifecycleEvent{task=sh01, delayTime=5000}

What you expected to happen

A workflow whose task failed during initialization should still be killable in the normal way.
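
To make the expected behavior concrete, here is a minimal, self-contained sketch of the failure mode seen in the logs. It is a simplified model, not DolphinScheduler code: the names DispatchInitSketch, Task, initializeContext, onDispatch, onKill and the waitingInDispatchQueue flag are all hypothetical. The assumption is that the kill handler treats a SUBMITTED_SUCCESS task that is not in the dispatch queue as "already dispatched" and retries later, which matches the repeated "will kill it after 5s" messages. In the sketch, the dispatch handler catches the initialization error and moves the task to a terminal state, so a later kill no longer has to retry.

```java
// Hypothetical, simplified model. None of these names are DolphinScheduler APIs.
class DispatchInitSketch {

    enum TaskState { SUBMITTED_SUCCESS, FAILURE, KILL }

    static final class Task {
        final String name;
        TaskState state = TaskState.SUBMITTED_SUCCESS;
        // True only while the task waits in the (modelled) dispatch queue.
        boolean waitingInDispatchQueue = false;

        Task(String name) {
            this.name = name;
        }
    }

    /** Stand-in for context initialization; throws when the environment is missing. */
    static void initializeContext(Task task, boolean environmentExists) {
        if (!environmentExists) {
            throw new IllegalArgumentException("Cannot find the environment for task " + task.name);
        }
    }

    /** Dispatch handler that turns an initialization failure into a terminal state. */
    static void onDispatch(Task task, boolean environmentExists) {
        try {
            initializeContext(task, environmentExists);
            task.waitingInDispatchQueue = true;
        } catch (IllegalArgumentException e) {
            // Without this catch the task would stay in SUBMITTED_SUCCESS
            // while never having entered the dispatch queue.
            task.state = TaskState.FAILURE;
        }
    }

    /** Kill handler: returns true when no retry is needed, false when the kill must be retried. */
    static boolean onKill(Task task) {
        if (task.state != TaskState.SUBMITTED_SUCCESS) {
            return true; // already in a terminal state, nothing left to kill
        }
        if (task.waitingInDispatchQueue) {
            task.waitingInDispatchQueue = false;
            task.state = TaskState.KILL;
            return true;
        }
        return false; // assumed to be in flight to a worker, so retry later
    }

    public static void main(String[] args) {
        Task task = new Task("sh01");
        onDispatch(task, false); // the environment does not exist
        System.out.println(task.state + " " + onKill(task));
    }
}
```

If the catch block in onDispatch is removed, the exception escapes and the task stays in SUBMITTED_SUCCESS with waitingInDispatchQueue still false. Every later onKill call then returns false, which is the model's equivalent of the endless kill retries above. Whether the real fix belongs in the dispatch handler or in the kill handler is a design decision for the maintainers.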

How to reproduce

1. Import a workflow definition whose environment code does not exist.
2. Run the workflow. The task cannot be dispatched.
3. Manually kill the workflow. It remains stuck in the "waiting to be killed" state.

Anything else

No response

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

backend, bug
