timeout Sisyphus storage ops and Docker pulls #26

Sisyphus GCS downloads got stuck in #24 due to a bug: a missing Google client library. The worker would get stuck, never complete, and never shut itself down. [Why didn't the missing library throw an exception?]

That bug is fixed, but we ought to bound how long file downloads, file uploads, and Docker image pulls can take, e.g. in case the remote server is slow or unresponsive.

Triggers (a combined check is sketched after this list):

  1. A request from Gaia (via Kafka) to terminate the current task.
  2. A timer expires.
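
As a rough illustration, here is a minimal sketch of how the two triggers could be combined into one check. It's written in Python purely for illustration, and the names (`AbortTriggers`, `terminate_requested`, `should_abort`) are hypothetical, not existing Sisyphus code:

```python
import time

class AbortTriggers:
    """Tracks both abort triggers: a Gaia terminate request and an expired timer."""

    def __init__(self, timeout_seconds):
        # Set to True by the Kafka message handler when Gaia asks to terminate the task.
        self.terminate_requested = False
        # Absolute deadline for the task's storage ops and Docker pulls.
        self.deadline = time.monotonic() + timeout_seconds

    def should_abort(self):
        """Return True if the task should stop transferring files or pulling images."""
        return self.terminate_requested or time.monotonic() > self.deadline
```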

Approaches, from easiest to most robust:

  1. In Sisyphus, check these triggers before each file transfer or Docker image pull. This is straightforward apart from picking the timeout duration and deciding whether it applies per file or to the total. It would handle most cases but not the bug behind "Sisyphus gets stuck pulling an input file" #24.
  2. In Sisyphus, run the file transfers and Docker pulls in separate threads and be prepared to kill them when either trigger fires (a sketch follows this list). The file-cleanup code might need to be more careful.
  3. Make Gaia able to delete a stuck worker node, especially once it becomes responsible for starting and stopping Sisyphus workers.
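
To make approach 2 concrete, here is a hedged sketch (again Python, with hypothetical names), using a subprocess rather than a thread so the operation can actually be killed. A GCS transfer could be wrapped the same way, or could poll the trigger check between files:

```python
import subprocess

def pull_image_with_timeout(image, timeout_seconds=600):
    """Pull a Docker image, killing the pull if it exceeds the timeout.

    subprocess.run() kills the child process when the timeout expires and then
    raises TimeoutExpired, so a hung pull can't wedge the worker indefinitely.
    """
    try:
        subprocess.run(['docker', 'pull', image], check=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Report the failure back to Gaia, clean up any partial state, and let
        # the worker shut itself down rather than hanging forever.
        raise
```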
