[BUG] Workloads stuck in "RUNNING" state, no way to cancel or delete them #2573

@fourslashw

Describe the issue
While under relatively high load, 4 workflows got stuck in a glitched "RUNNING" state. Pressing "cancel" does nothing other than update the "Finished at" field and the duration, while still leaving the workflow as "running".

Those stuck workflows create some kind of endless background activity on the worker side and leak memory, which inevitably leads to the workers crashing. We had to temporarily create a new tenant and move everything there.

Environment

  • SDK: @hatchet-dev/typescript-sdk: "1.6.3"
  • Engine: Self-hosted v0.73.34, using RabbitMQ as the message queue and an Amazon RDS m8g.xlarge PostgreSQL instance as the database.

Expected behavior
Being able to cancel the workflow, or at least to delete it somehow.

Code to Reproduce, Logs, or Screenshots

Logs from incident time grepped by "error":
hatchet_logs_errors.csv
A lot of errors with SQLSTATE 57014 (PostgreSQL `query_canceled`)
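
For anyone triaging a similar export, here is a minimal sketch of how the error lines could be tallied by SQLSTATE code, to see how much of the noise is 57014 compared with other codes. The function name `countSqlStates` is hypothetical and not part of the Hatchet SDK. It assumes each log line carries the code as `SQLSTATE <code>`, as in the errors above.

```typescript
// Hypothetical helper (not part of Hatchet): tally SQLSTATE codes in raw log lines.
// Assumes each error message embeds its code as "SQLSTATE <5-char code>".
function countSqlStates(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    // SQLSTATE codes are five characters drawn from digits and uppercase letters.
    // A fresh regex per line keeps the global-match position from leaking across lines.
    const pattern = /SQLSTATE\s+([0-9A-Z]{5})/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(line)) !== null) {
      const code = match[1];
      counts[code] = (counts[code] || 0) + 1;
    }
  }
  return counts;
}
```

Feeding it the lines of the exported CSV gives one counter per code. 57014 is what PostgreSQL reports when a statement is cancelled, for example by a statement timeout or by a client-side cancel request.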

Log snippet from engine and dashboard while trying to cancel task:
hatchet_cancel_task_logs.csv
The only anomaly I can see there is this:

"2025-11-26T10:25:12.000+01:00","jobs.prod.our.website","2025-11-26T09:25:12.82Z ERR API error=""code=404, message=Task not found"" host=hatchet.prod.internal.ournetwork latency=63.446482 method=GET remote_ip=79.125.5.174 service=server status=500 uri=/api/v1/stable/tasks/e332fd2a-156d-44e9-94c4-5c46e3605a6d user_agent=""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36""","/hatchet-hatchet-dashboard-1","1","docker.hatchet.dashboard"

It says that task e332fd2a-156d-44e9-94c4-5c46e3605a6d was not found, even though I am on the page
https://hatchet.prod.internal.ourwebsite/tenants/75309d5c-e674-44cc-9a65-992b0c9fc405/runs/e332fd2a-156d-44e9-94c4-5c46e3605a6d?createdAfter=2025-11-25T09%3A32%3A57.171Z&pageIndex=0 where this task is open:

Image Image Image

Pressing "cancel" only updates the duration and run time:
Image

Database load at approximately the time the issue appeared:
Image

Image

Additional context
Seems related to #1996
