Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several improvement points in the current job-data deletion flow:
- Duplication: The executor has `clean_all_shuffle_data` alongside other ad-hoc removal logic. These overlap in functionality, making the code harder to maintain and reason about.
- Push-based broadcast: When the scheduler initiates cleanup, it currently notifies all executors. This is inefficient because only a subset of executors actually hold the job's data.
- Per-job deletion tasks: In `clean_up_successful_job` / `clean_up_failed_job`, the scheduler spawns a separate delayed task (sleep) for each job and calls `state.remove_job(job_id)` individually. This results in many small tasks and RPCs, which could be batched more efficiently.
Describe the solution you'd like
Unify cleanup behind a single, testable “deletion facility”:
- Deduplicate logic with `clean_all_shuffle_data`; extract/keep a shared async remover (e.g., `remove_job_dir`) with safety checks.
- Targeted push: notify only executors that actually hold the job's data (no broadcast).
- Batching: we already dispatch periodically; change each tick to send one batched `remove_jobs(Vec<JobId>)` for all pending IDs rather than spawning per-job sleeps and individual removals.
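To make the targeted push and batching points concrete, here is a rough sketch of the bookkeeping the scheduler could keep between ticks. Everything in it is an assumption for illustration: `PendingCleanup`, `enqueue`, `drain_batches`, and the `JobId` / `ExecutorId` aliases are hypothetical names, not existing code, and the actual RPC (`remove_jobs`) is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical aliases for illustration; the real scheduler types may differ.
type JobId = String;
type ExecutorId = String;

/// Hypothetical queue of jobs awaiting cleanup, keyed by the executors
/// that are known to hold their data.
#[derive(Default)]
struct PendingCleanup {
    // executor -> job IDs whose data should be removed on that executor
    by_executor: BTreeMap<ExecutorId, BTreeSet<JobId>>,
}

impl PendingCleanup {
    /// Queue a finished job for cleanup, but only on the executors that
    /// actually hold its data (targeted push instead of broadcast).
    fn enqueue(&mut self, job_id: &str, holders: &[ExecutorId]) {
        for executor in holders {
            self.by_executor
                .entry(executor.clone())
                .or_default()
                .insert(job_id.to_string());
        }
    }

    /// Called once per periodic tick: drains everything pending and returns
    /// one batch per executor, so each executor would get a single
    /// `remove_jobs`-style request instead of one request per job.
    fn drain_batches(&mut self) -> Vec<(ExecutorId, Vec<JobId>)> {
        std::mem::take(&mut self.by_executor)
            .into_iter()
            .map(|(executor, jobs)| (executor, jobs.into_iter().collect()))
            .collect()
    }
}
```

With something like this, `clean_up_successful_job` / `clean_up_failed_job` would only call `enqueue`, and the existing periodic dispatch would call `drain_batches` and send one request per returned entry. How the scheduler learns which executors hold a job's data, and how the cleanup delay is enforced, are not covered here.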
Describe alternatives you've considered