User-initiated job cancellation improvements  #444

@tazlin

Description

While considering issue #443, I identified that job cancellations, although a corner case in normal operations with well-intentioned users, also represent a potential Denial of Service (DoS) attack vector and are an actual, non-trivial source of wasted GPU cycles. This issue is distinct from the bug identified in #443, which pertains specifically to workers submitting completed jobs. To address these broader concerns, I propose the following improvements to how the worker job dispatch system handles canceled jobs.

Proposed Changes:

  1. Job Cancellation Handling:

    • Introduce a new field jobs_cancelled in the job pop responses. This field will list job ids that were assigned to the worker but have since been canceled by the requesting user.
  2. New Worker Notification Endpoint:

    • Create a new POST endpoint for worker notifications:
      • The endpoint will always respond with the jobs_cancelled field, providing a list of canceled job ids.
      • It will not assign new jobs to the worker in this response.
      • The worker can send a payload containing the jobs_cancelled field to acknowledge that they have stopped working on the canceled job(s).
  3. Prorated Kudos for Canceled Jobs:

    • Implement a prorated kudos system where the amount of kudos awarded decreases based on how much time has elapsed before the worker acknowledges the job cancellation. This incentivizes workers to abandon canceled jobs quickly, thereby saving GPU cycles.
  4. Abuse Prevention Measures:

    • Recognize the potential for abuse and introduce mechanisms to mitigate it:
      • Flagging High Cancellation Pairs: Monitor and flag user/worker pairs that have a high frequency of job cancellations for review.
      • Statistical Anomalies: Identify and flag workers with abnormal or statistically unlikely cancellation rates.
      • Targeted Cancellations: Pay extra attention to cancellations of jobs that were specifically targeted at a worker using the workers field.
      • Untrusted workers: Workers who are not yet trusted should trigger additional scrutiny when high volumes of cancellations occur for jobs they have been assigned.
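
To make items 1 through 3 more concrete, here is a minimal in-memory sketch of how cancellations, the notification response, and prorated kudos could fit together. Only the jobs_cancelled field name comes from this proposal. The class and function names, the linear decay, the 60-second window, and the single base_kudos value are all assumptions for illustration, not existing horde code.

```python
def prorated_kudos(base_kudos, cancelled_at, acknowledged_at, window_seconds=60.0):
    """Award less kudos the longer the worker takes to acknowledge.

    Assumption: linear decay from base_kudos down to 0 over window_seconds.
    The proposal does not specify the decay curve or the window.
    """
    delay = max(0.0, acknowledged_at - cancelled_at)
    remaining = max(0.0, 1.0 - delay / window_seconds)
    return base_kudos * remaining


class CancellationTracker:
    """Hypothetical server-side bookkeeping for canceled jobs per worker."""

    def __init__(self):
        # worker_id -> {job_id: time the requesting user canceled the job}
        self._pending = {}

    def cancel(self, worker_id, job_id, now):
        """Record that the user canceled a job already assigned to a worker."""
        self._pending.setdefault(worker_id, {})[job_id] = now

    def notification_response(self, worker_id):
        """Body for both the job pop response and the notification endpoint.

        Always includes jobs_cancelled, even when the list is empty.
        """
        return {"jobs_cancelled": sorted(self._pending.get(worker_id, {}))}

    def acknowledge(self, worker_id, payload, now, base_kudos):
        """Process a worker's acknowledgement payload and return awards.

        Job ids that are not pending for this worker are ignored.
        """
        pending = self._pending.get(worker_id, {})
        awards = {}
        for job_id in payload.get("jobs_cancelled", []):
            cancelled_at = pending.pop(job_id, None)
            if cancelled_at is None:
                continue
            awards[job_id] = prorated_kudos(base_kudos, cancelled_at, now)
        return awards
```

A worker would read jobs_cancelled from either response, stop the listed jobs, and post the same field back. Acknowledged jobs drop out of later responses, so a slow worker keeps seeing the job id until it acknowledges.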
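
The "Flagging High Cancellation Pairs" measure could start as a simple rate check over recent job records. The sketch below assumes a hypothetical event tuple of (user_id, worker_id, was_cancelled), and the min_jobs and max_cancel_rate thresholds are placeholder values that would need tuning against real traffic.

```python
from collections import Counter


def flag_high_cancellation_pairs(events, min_jobs=20, max_cancel_rate=0.5):
    """Return (pair, rate) for user/worker pairs with a high cancellation rate.

    events: iterable of (user_id, worker_id, was_cancelled) tuples.
    Pairs with fewer than min_jobs jobs are skipped to avoid flagging noise.
    Results are sorted with the highest cancellation rate first.
    """
    totals = Counter()
    cancelled = Counter()
    for user_id, worker_id, was_cancelled in events:
        pair = (user_id, worker_id)
        totals[pair] += 1
        if was_cancelled:
            cancelled[pair] += 1

    flagged = []
    for pair, total in totals.items():
        if total < min_jobs:
            continue
        rate = cancelled[pair] / total
        if rate > max_cancel_rate:
            flagged.append((pair, rate))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

The output is meant to feed a manual review queue rather than trigger automatic penalties, in line with the "for review" wording above.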
