[EPIC] Improve job data cleanup logic

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

The current job-data deletion flow has several areas for improvement:

1. *Duplication*  
   The executor has `clean_all_shuffle_data` alongside other ad-hoc removal logic. These overlap in functionality, making the code harder to maintain and reason about.

2. *Push-based broadcast*
   When the scheduler initiates cleanup, it currently notifies all executors. This is inefficient because only a subset of executors actually hold the job’s data.

3. *Per-job deletion tasks* 
   In `clean_up_successful_job` / `clean_up_failed_job`, the scheduler spawns a separate delayed task (`sleep`) for each job and calls `state.remove_job(job_id)` individually. This results in many small tasks and RPCs, which could be batched more efficiently.

**Describe the solution you'd like**
Unify cleanup behind a single, testable “deletion facility”:

1. *Deduplicate* logic with `clean_all_shuffle_data`; extract/keep a shared async remover (e.g., `remove_job_dir`) with safety checks.
2. *Targeted push*: notify only executors that actually hold the job’s data (no broadcast).
3. *Batching*: the scheduler already dispatches cleanup periodically; on each tick, send one batched `remove_jobs(Vec<JobId>)` covering all pending job IDs instead of spawning a per-job sleep and an individual removal.
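
A rough sketch of how points 1 to 3 could fit together, using only the standard library. All names here (`job_dir`, `plan_removals`, the `locations` map, the `JobId`/`ExecutorId` aliases) are hypothetical and not taken from the existing code; the real `remove_job_dir` would be async and would perform the actual deletion. `job_dir` illustrates one possible safety check (the job ID must be a single plain path segment under the work dir), and `plan_removals` illustrates grouping pending IDs by the executors that hold them, so that one batched request goes to each relevant executor per tick.

```rust
use std::collections::{HashMap, HashSet};
use std::path::{Component, Path, PathBuf};

// Hypothetical aliases for illustration only.
type JobId = String;
type ExecutorId = String;

/// Resolve a job's directory under `work_dir`, rejecting IDs that could
/// escape it (empty IDs, "..", absolute paths, nested paths).
fn job_dir(work_dir: &Path, job_id: &str) -> Option<PathBuf> {
    let mut parts = Path::new(job_id).components();
    match (parts.next(), parts.next()) {
        // Exactly one normal path segment is accepted.
        (Some(Component::Normal(name)), None) => Some(work_dir.join(name)),
        _ => None,
    }
}

/// Group pending job IDs by the executors known to hold their data.
/// Jobs with no known location produce no request at all.
fn plan_removals(
    pending: &[JobId],
    locations: &HashMap<JobId, HashSet<ExecutorId>>,
) -> HashMap<ExecutorId, Vec<JobId>> {
    let mut plan: HashMap<ExecutorId, Vec<JobId>> = HashMap::new();
    for job_id in pending {
        if let Some(executors) = locations.get(job_id) {
            for executor in executors {
                plan.entry(executor.clone())
                    .or_default()
                    .push(job_id.clone());
            }
        }
    }
    plan
}
```

Each entry of the returned map would correspond to one `remove_jobs(Vec<JobId>)` call for that executor; executors that hold none of the pending jobs never appear in the plan, which is what replaces the broadcast.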

**Describe alternatives you've considered**


**Additional context**
Related: #1219, #1314


[EPIC] Improve job data cleanup logic #1316
