await-agent-restart workflow action can hang when used in a sub workflow #3848

@reubenmiller

Describe the bug

The await-agent-restart workflow action behaves differently depending on whether its workflow is triggered directly or is called from another workflow as a sub-operation (i.e. sub-workflow).

The difference in behaviour is only observable when the await-agent-restart action doesn't use an intermediate state, and is being called from another workflow.

For instance, the example below shows a workflow that is called from another workflow. When its [restarting] state uses on_success = "successful", the calling workflow never completes. When it goes through an intermediate state instead, on_success = "restarted", the calling workflow completes successfully.

The two workflows are shown below. The restart-tedge-agent-wrapper workflow calls the restart-tedge-agent-internal workflow.

file: restart-tedge-agent-internal.toml

operation = "restart-tedge-agent-internal"

[init]
action = "proceed"
on_success = "restart"

[restart]
background_script = "sudo systemctl restart tedge-agent"
on_exec = "restarting"

[restarting]
action = "await-agent-restart"
# on_success = "restarted"  # <=== Result: PASS
on_success = "successful"   # <=== Result: FAIL
timeout_second = 30
on_timeout = "failed"

[restarted]
action = "proceed"
on_success = "successful"

[successful]
action = "cleanup"

[failed]
action = "cleanup"

file: restart-tedge-agent-wrapper.toml

operation = "restart-tedge-agent-wrapper"

[init]
action = "proceed"
on_success = "restart"

[restart]
operation = "restart-tedge-agent-internal"
on_exec = "restarting"

[restarting]
action = "await-operation-completion"
on_success = "successful"

[successful]
action = "cleanup"

[failed]
action = "cleanup"

Symptoms

  • The restart-tedge-agent-internal workflow completes successfully when called directly (i.e. not from another workflow).
  • The restart-tedge-agent-wrapper workflow hangs and never finishes if the restart-tedge-agent-internal workflow uses on_success = "successful" in the [restarting] state.
  • The restart-tedge-agent-wrapper workflow completes if the restart-tedge-agent-internal workflow uses on_success = "restarted" in the [restarting] state.
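
Until the behaviour is fixed, the passing variant can serve as a workaround. The fragment below is a minimal sketch that reuses only the states already defined in restart-tedge-agent-internal.toml. That it avoids the hang is based on the PASS result observed in this report, not on an analysis of the workflow engine.

```toml
# Workaround sketch for restart-tedge-agent-internal.toml:
# route [restarting] through the intermediate [restarted] state
# instead of transitioning directly to "successful".
[restarting]
action = "await-agent-restart"
on_success = "restarted"    # intermediate state (observed result: PASS)
timeout_second = 30
on_timeout = "failed"

[restarted]
action = "proceed"
on_success = "successful"
```

The remaining states ([init], [restart], [successful], [failed]) stay unchanged.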

The following snippet from the workflow log shows the state in which the wrapper workflow hangs:

----------------------[ restart-tedge-agent-wrapper @ restarting | time=2025-11-04T02:18:42.68615348Z ]----------------------

State:    {"@version":"b7e6501165817cde457d08806c7702994c15b65edafb6e574b9224d829ee8e6b","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","status":"restarting"}

Action:   await sub-operation completion

=> restart-tedge-agent-internal sub-operation is still running

To Reproduce

Reproducing the bug is slightly complicated, so a system test was created to demonstrate the bug.

Expected behavior

The await-agent-restart action should not require an intermediate state, and it should behave the same whether its workflow is called directly or from another workflow.

Screenshots

Environment (please complete the following information):

| Property | Value |
|---|---|
| OS [incl. version] | Debian GNU/Linux 12 (bookworm) |
| Hardware [incl. revision] | unknown |
| System-Architecture | Linux 5dc411da8849 6.8.0-64-generic #67-Ubuntu SMP PREEMPT_DYNAMIC Sun Jun 15 20:23:40 UTC 2025 aarch64 GNU/Linux |
| thin-edge.io version | tedge 1.6.2~275+g7689e03 |

Additional context

Workflow log

==> /var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log <==

==================================================================
Triggered restart-tedge-agent-wrapper workflow
==================================================================

topic:     te/device/main///cmd/restart-tedge-agent-wrapper/robot-1
operation: restart-tedge-agent-wrapper
cmd_id:    robot-1
time:      2025-11-04T02:17:42.470574114Z

==================================================================

----------------------[ restart-tedge-agent-wrapper @ init | time=2025-11-04T02:17:42.471326527Z ]----------------------

State:    {"@version":"b7e6501165817cde457d08806c7702994c15b65edafb6e574b9224d829ee8e6b","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","status":"init"}

Action:   move to restart state

=> moving to restart-tedge-agent-wrapper @ restart

----------------------[ restart-tedge-agent-wrapper @ restart | time=2025-11-04T02:17:42.476852168Z ]----------------------

State:    {"@version":"b7e6501165817cde457d08806c7702994c15b65edafb6e574b9224d829ee8e6b","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","status":"restart"}

Action:   execute restart-tedge-agent-internal as sub-operation

=> moving to restart-tedge-agent-wrapper @ restarting

----------------------[ restart-tedge-agent-wrapper @ restarting | time=2025-11-04T02:17:42.48417155Z ]----------------------

State:    {"@version":"b7e6501165817cde457d08806c7702994c15b65edafb6e574b9224d829ee8e6b","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","status":"restarting"}

Action:   await sub-operation completion


----------------------[ restart-tedge-agent-wrapper > restart-tedge-agent-internal @ init | time=2025-11-04T02:17:42.5072824Z ]----------------------

State:    {"@version":"2225b1c86aeb227c52a25413683692fb3a09fc6d85ce8c59dc17e1333be08533","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","status":"init"}

Action:   move to restart state

=> moving to restart-tedge-agent-wrapper > restart-tedge-agent-internal @ restart

----------------------[ restart-tedge-agent-wrapper > restart-tedge-agent-internal @ restart | time=2025-11-04T02:17:42.511401381Z ]----------------------

State:    {"@version":"2225b1c86aeb227c52a25413683692fb3a09fc6d85ce8c59dc17e1333be08533","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","status":"restart"}

Action:   sudo systemctl restart tedge-agent

=> moving to restart-tedge-agent-wrapper > restart-tedge-agent-internal @ restarting
Killed by signal: 15

stderr (EMPTY)

stdout (EMPTY)

----------------------[ restart-tedge-agent-wrapper @ restarting | time=2025-11-04T02:18:42.673433099Z ]----------------------

State:    {"@version":"b7e6501165817cde457d08806c7702994c15b65edafb6e574b9224d829ee8e6b","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","resumed_at":"1762222662.0","status":"restarting"}

Action:   await sub-operation completion

=> restart-tedge-agent-internal sub-operation is still running

----------------------[ restart-tedge-agent-wrapper > restart-tedge-agent-internal @ successful | time=2025-11-04T02:18:42.675430302Z ]----------------------

State:    {"@version":"2225b1c86aeb227c52a25413683692fb3a09fc6d85ce8c59dc17e1333be08533","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","resumed_at":"1762222662.0","status":"successful"}

Action:   wait for the requester to finalize the command

Resuming invoking command te/device/main///cmd/restart-tedge-agent-wrapper/robot-1

----------------------[ restart-tedge-agent-wrapper @ restarting | time=2025-11-04T02:18:42.68615348Z ]----------------------

State:    {"@version":"b7e6501165817cde457d08806c7702994c15b65edafb6e574b9224d829ee8e6b","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-wrapper-robot-1.log","status":"restarting"}

Action:   await sub-operation completion

=> restart-tedge-agent-internal sub-operation is still running

Labels: bug, theme:workflows