While considering issue #443, I identified that job cancellations, although a corner case in normal operations with well-intentioned users, also represent a potential Denial of Service (DoS) attack vector and are an actual, non-trivial source of wasted GPU cycles. This issue is distinct from the bug identified in #443, which pertains specifically to the submission of completed jobs by workers. To address my other concerns, I propose the following improvements to the handling of canceled jobs within the worker job dispatch system.
Proposed Changes:
- Job Cancellation Handling:
  - Introduce a new field `jobs_cancelled` in the job pop responses. This field will list job ids that were assigned to the worker but have since been canceled by the requesting user.
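
A minimal sketch of how a pop response might carry the new field. Only the `jobs_cancelled` name comes from this proposal; the `build_pop_response` helper, its arguments, and the `job` key are hypothetical and do not reflect the existing codebase.

```python
def build_pop_response(new_job, assigned_job_ids, cancelled_job_ids):
    """Return a pop response dict that also reports canceled assignments.

    new_job: dict describing the next job, or None if nothing is queued (assumed shape).
    assigned_job_ids: ids currently assigned to the polling worker.
    cancelled_job_ids: ids that requesting users have canceled.
    """
    cancelled = set(cancelled_job_ids)
    # Keep assignment order so the worker sees a stable list.
    jobs_cancelled = [job_id for job_id in assigned_job_ids if job_id in cancelled]
    return {"job": new_job, "jobs_cancelled": jobs_cancelled}


if __name__ == "__main__":
    response = build_pop_response({"id": "j4"}, ["j1", "j2", "j3"], {"j1", "j3"})
    print(response["jobs_cancelled"])
```

Filtering against the worker's own assignments keeps the response from leaking cancellations that belong to other workers.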
- New Worker Notification Endpoint:
  - Create a new `POST` endpoint for worker notifications:
    - The endpoint will always respond with the `jobs_cancelled` field, providing a list of canceled job ids.
    - It will not assign new jobs to the worker in this response.
    - The worker can send a payload containing the `jobs_cancelled` field to acknowledge that they have stopped working on the canceled job(s).
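
The request/response behaviour of the notification endpoint could be sketched as follows, using an in-memory stand-in for the server state. The `CancellationTracker` class and its methods are hypothetical. Dropping acknowledged jobs from later responses is also an assumption, since the proposal does not say whether they should keep being reported.

```python
class CancellationTracker:
    """In-memory stand-in for the server-side state (hypothetical)."""

    def __init__(self):
        self.assigned = {}         # worker_id -> list of assigned job ids
        self.cancelled = set()     # job ids canceled by their requesting user
        self.acknowledged = set()  # (worker_id, job_id) pairs the worker confirmed

    def assign(self, worker_id, job_id):
        self.assigned.setdefault(worker_id, []).append(job_id)

    def cancel(self, job_id):
        self.cancelled.add(job_id)

    def notify(self, worker_id, body):
        """Handle the POST body of a worker notification; never assigns new jobs."""
        mine = self.assigned.get(worker_id, [])
        for job_id in body.get("jobs_cancelled", []):
            # Accept an acknowledgement only for a job that is assigned to
            # this worker and has actually been canceled.
            if job_id in mine and job_id in self.cancelled:
                self.acknowledged.add((worker_id, job_id))
        pending = [
            job_id
            for job_id in mine
            if job_id in self.cancelled and (worker_id, job_id) not in self.acknowledged
        ]
        # The field is always present, even when the list is empty.
        return {"jobs_cancelled": pending}
```

A worker would poll with an empty payload to learn about cancellations, then post the ids back once it has stopped working on them.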
- Prorated Kudos for Canceled Jobs:
  - Implement a prorated kudos system where the amount of kudos awarded decreases based on how much time has elapsed before the worker acknowledges the job cancellation. This incentivizes workers to abandon canceled jobs quickly, thereby saving GPU cycles.
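
One possible shape for the proration is a linear decay, sketched below. The proposal does not specify a curve, so the linear form, the `prorated_kudos` name, and the 60-second cutoff are all illustrative assumptions.

```python
def prorated_kudos(base_kudos, ack_delay_seconds, cutoff_seconds=60.0):
    """Scale kudos down linearly with the time taken to acknowledge a cancellation.

    base_kudos: kudos the worker would get for acknowledging immediately (assumed input).
    ack_delay_seconds: time between the cancellation and the worker's acknowledgement.
    cutoff_seconds: delay at or beyond which no kudos are awarded (assumed value).
    """
    if cutoff_seconds <= 0:
        raise ValueError("cutoff_seconds must be positive")
    delay = max(0.0, ack_delay_seconds)  # treat clock skew as zero delay
    remaining_fraction = max(0.0, 1.0 - delay / cutoff_seconds)
    return base_kudos * remaining_fraction
```

With these assumed defaults, an acknowledgement after 30 seconds earns half of the base amount, and anything at or past 60 seconds earns nothing.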
- Abuse Prevention Measures:
  - Recognize the potential for abuse and introduce mechanisms to mitigate it:
    - Flagging High Cancellation Pairs: Monitor and flag user/worker pairs that have a high frequency of job cancellations for review.
    - Statistical Anomalies: Identify and flag workers with abnormal or statistically unlikely cancellation rates.
    - Targeted Cancellations: Pay extra attention to cancellations of jobs that were specifically targeted to a worker using the `workers` field.
    - Untrusted workers: Workers who are not yet trusted should trigger additional scrutiny when high volumes of cancellations occur for jobs they have been assigned.
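
As a rough illustration of the first measure, user/worker pairs could be flagged from a log of job outcomes. The event tuple layout, the `flag_high_cancellation_pairs` name, and both thresholds are hypothetical and would need tuning against real traffic.

```python
from collections import Counter


def flag_high_cancellation_pairs(events, min_jobs=20, max_cancel_rate=0.5):
    """Return user/worker pairs whose cancellation rate looks suspicious.

    events: iterable of (user_id, worker_id, was_cancelled) tuples (assumed log format).
    min_jobs: ignore pairs with fewer jobs than this, to avoid noisy flags.
    max_cancel_rate: flag pairs whose canceled fraction exceeds this value.
    """
    totals = Counter()
    cancels = Counter()
    for user_id, worker_id, was_cancelled in events:
        pair = (user_id, worker_id)
        totals[pair] += 1
        if was_cancelled:
            cancels[pair] += 1
    return sorted(
        pair
        for pair, count in totals.items()
        if count >= min_jobs and cancels[pair] / count > max_cancel_rate
    )
```

The flagged pairs are only candidates for manual review, in line with the proposal; this sketch does not cover the statistical anomaly or trust checks.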