
Files: Files worker can get into a semi-stuck state #194

@dvernon-tacc

Jan 30, 2026 - The files worker got stuck on a transfer task that had 500+ parents, each with a single child. When the task started it used up all of the worker threads, but every connection was timing out (30 seconds each), plus retries. Because a lot of work was already pre-loaded and couldn't be wrapped up quickly, the "fair scheduling" never got a chance to kick in. It would have eventually, but there were too many timeouts to wait for it. I cancelled the transfer task via the database, started a second worker, and everything went through fine.
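To give a rough sense of why waiting it out was not practical, here is a back-of-the-envelope estimate. Only the 500 parents and the 30-second timeout come from the report above; the thread count and the number of attempts per item are made-up values for illustration, and the function name is hypothetical. It also ignores any backoff between retries, so treat it as a sketch, not a measurement.

```python
import math


def worst_case_drain_seconds(queued_items, worker_threads, timeout_s, attempts_per_item):
    """Rough upper bound on the time to clear pre-loaded items when every attempt times out.

    Assumes each worker thread handles one item at a time and that every
    attempt blocks for the full timeout before failing.
    """
    if worker_threads <= 0:
        raise ValueError("worker_threads must be positive")
    # Items are processed in "waves" of at most worker_threads items each.
    waves = math.ceil(queued_items / worker_threads)
    return waves * timeout_s * attempts_per_item


# 500 items and the 30 s timeout are from the report;
# 20 threads and 3 attempts per item are assumptions.
estimate = worst_case_drain_seconds(500, 20, 30, 3)
print(estimate / 60)  # 37.5 (minutes)
```

With those assumed numbers, nothing else would get a worker thread for over half an hour, which matches the "it would have eventually" observation.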

  1. Check whether there is some way to detect this situation and re-assign the work.
  2. Check whether anything is blocked in the SSH pool while this is going on.
  3. Implement the "try again after time" logic that is in the archive worker.
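One possible shape for the "try again after time" idea in item 3, as a minimal sketch. This is not the archive worker's actual code, which is not shown here; `DeferredRetryQueue`, its methods, and the delay and attempt values are all hypothetical. The point is only that a failed item goes back on the queue with a "not before" time instead of being retried inline, so the worker thread is free to pick up other work in the meantime.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    not_before: float                      # earliest time the item may run again
    seq: int                               # tie-breaker that keeps FIFO order
    task_id: str = field(compare=False)
    attempts: int = field(compare=False, default=0)


class DeferredRetryQueue:
    """Hypothetical queue: failed items are parked until a retry time
    instead of holding a worker thread through inline retries."""

    def __init__(self, base_delay_s=30.0, max_delay_s=600.0, max_attempts=5):
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.max_attempts = max_attempts
        self._heap = []
        self._seq = itertools.count()

    def submit(self, task_id, now):
        """Queue a new item that is eligible to run immediately."""
        heapq.heappush(self._heap, _Entry(now, next(self._seq), task_id))

    def pop_ready(self, now):
        """Return the next item whose retry time has passed, or None."""
        if self._heap and self._heap[0].not_before <= now:
            return heapq.heappop(self._heap)
        return None

    def report_failure(self, entry, now):
        """Re-queue a failed item with exponential backoff.

        Returns False when the item has used up its attempts and is dropped.
        """
        attempts = entry.attempts + 1
        if attempts >= self.max_attempts:
            return False
        delay = min(self.base_delay_s * 2 ** (attempts - 1), self.max_delay_s)
        heapq.heappush(
            self._heap,
            _Entry(now + delay, next(self._seq), entry.task_id, attempts),
        )
        return True


# Usage: the failing item is deferred, so the healthy item runs right away.
queue = DeferredRetryQueue(base_delay_s=30, max_delay_s=600, max_attempts=3)
queue.submit("unreachable-host-item", now=0)
queue.submit("healthy-item", now=0)

first = queue.pop_ready(now=0)
queue.report_failure(first, now=0)         # parked until t=30
second = queue.pop_ready(now=0)
print(first.task_id, second.task_id)
```

Times are passed in explicitly so the behavior is easy to test; a real worker would presumably use a monotonic clock and persist the retry time alongside the task.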

Metadata

Assignees

Labels

bug: Something isn't working

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
