-
Notifications
You must be signed in to change notification settings - Fork 299
Description
Describe the issue
While under relatively high load, 4 workflows got stuck in glitched "RUNNING" state. Pressing "cancel" does nothing other then updating "Finished at" field and duration, while still leaving workflow as "running".
Those stuck workflows are creating endless background activity of some kind on worker side while leaking memory which inevitably leads to workers crashing. We had to temporary create a new tenant and move everything there.
Environment
- SDK: @hatchet-dev/typescript-sdk: "1.6.3"
- Engine: Self hosted v0.73.34, using rabbitmq as message queue and amazon RDS m8g.xlarge postgresql as database.
Expected behavior
To cancel workflow or being able to delete it somehow.
Code to Reproduce, Logs, or Screenshots
Logs from incident time grepped by "error":
hatchet_logs_errors.csv
A lot of errors with SQLSTATE 57014
Log snippet from engine and dashboard while trying to cancel task:
hatchet_cancel_task_logs.csv
The only anomaly I can see there is this
"2025-11-26T10:25:12.000+01:00","jobs.prod.our.website","[90m2025-11-26T09:25:12.82Z�[0m �[31mERR�[0m �[1mAPI�[0m �[36merror=�[0m�[31m�[1m""code=404, message=Task not found""�[0m�[0m �[36mhost=�[0mhatchet.prod.internal.ournetwork �[36mlatency=�[0m63.446482 �[36mmethod=�[0mGET �[36mremote_ip=�[0m79.125.5.174 �[36mservice=�[0mserver �[36mstatus=�[0m500 �[36muri=�[0m/api/v1/stable/tasks/e332fd2a-156d-44e9-94c4-5c46e3605a6d �[36muser_agent=�[0m""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36""","/hatchet-hatchet-dashboard-1","1","docker.hatchet.dashboard"
Which saying task e332fd2a-156d-44e9-94c4-5c46e3605a6d is not found while being on page
https://hatchet.prod.internal.ourwebsite/tenants/75309d5c-e674-44cc-9a65-992b0c9fc405/runs/e332fd2a-156d-44e9-94c4-5c46e3605a6d?createdAfter=2025-11-25T09%3A32%3A57.171Z&pageIndex=0 where this task is opened:
Pressing "cancel" leads to updating duration and run time:

Database load at the approximate time of issue appearing:

Additional context
seems relevant to #1996