[SPARK-53575][CORE] Retry entire consumer stages when checksum mismatch detected for a retried shuffle map task
### What changes were proposed in this pull request?
This PR proposes retrying all tasks of the consumer stages when checksum mismatches are detected on their producer stages. If a consumer stage cannot be rolled back so that all of its tasks are retried, the stage (and therefore the job) has to be aborted.
How nondeterminism was detected and handled before:
- Stages are labeled as indeterminate at planning time, prior to query execution.
- When a task completes and `FetchFailed` is detected, we abort all succeeding stages of the map stage that cannot be rolled back, and resubmit the failed stages.
- In `submitMissingTasks()`, if a stage itself is indeterminate (`isIndeterminate`), we call `unregisterAllMapAndMergeOutput()` and retry all tasks of the stage.
How nondeterminism is detected and handled now:
- During query execution, we keep track of the checksums produced by each map task.
- When a task completes and a checksum mismatch is detected, we abort the succeeding stages of the mismatched stage that cannot be rolled back. Resubmission of failed stages still happens in the same places as before.
- In `submitMissingTasks()`, if a parent of a stage has checksum mismatches, we call `unregisterAllMapAndMergeOutput()` and retry all tasks of the stage.
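
The detection and decision steps above can be sketched as a toy model. This is not Spark's `DAGScheduler` code: `ChecksumTracker`, `ConsumerPolicy`, and the `Action` values are hypothetical names introduced only for illustration, and the sketch collapses into one function decisions that the scheduler makes at different points (abort on task completion, full retry in `submitMissingTasks()`).

```scala
import scala.collection.mutable

// Hypothetical helper: remembers the checksum reported by each map partition.
final class ChecksumTracker {
  // partition id -> checksum reported by the latest completed attempt
  private val checksums = mutable.Map.empty[Int, Long]
  private var mismatch = false

  /** Records a completed map task's checksum.
    * Returns true if it differs from an earlier attempt of the same partition. */
  def record(partition: Int, checksum: Long): Boolean = {
    val differs = checksums.get(partition).exists(_ != checksum)
    if (differs) mismatch = true
    checksums(partition) = checksum
    differs
  }

  def hasMismatch: Boolean = mismatch
}

// Hypothetical outcomes for a consumer stage.
sealed trait Action
case object RetryMissingTasksOnly extends Action
case object RetryAllTasks extends Action
case object AbortStage extends Action

object ConsumerPolicy {
  // Assumption: `canRollback` stands in for whatever check decides
  // whether all tasks of the consumer stage can be rolled back and rerun.
  def decide(parentHasMismatch: Boolean, canRollback: Boolean): Action =
    if (!parentHasMismatch) RetryMissingTasksOnly
    else if (canRollback) RetryAllTasks
    else AbortStage
}
```

In this model, a retried map task that reports a different checksum for the same partition flips the producer's mismatch flag, and every consumer of that producer is then either fully retried or aborted.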
Note that (1) if a stage `isReliablyCheckpointed`, its consumer stages do not need a whole-stage retry, and (2) when mismatches are detected for a stage in a chain (e.g., the first stage in stage_i -> stage_i+1 -> stage_i+2 -> ...), the direct consumer of that stage (e.g., stage_i+1) gets a whole-stage retry, while an indirect consumer (e.g., stage_i+2) gets a whole-stage retry only when its own parent detects checksum mismatches.
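
A minimal sketch of the two notes above, under the assumption that each stage carries a mismatch flag and a checkpoint flag. `StageInfo` and `WholeStageRetry` are hypothetical names for illustration, not Spark classes.

```scala
// Hypothetical, simplified view of a stage for illustration only.
final case class StageInfo(
    id: Int,
    parents: Seq[Int],
    hasChecksumMismatch: Boolean = false,
    isReliablyCheckpointed: Boolean = false)

object WholeStageRetry {
  // A stage needs a whole-stage retry only if one of its DIRECT parents
  // has checksum mismatches and is not reliably checkpointed.
  def needed(stage: StageInfo, byId: Map[Int, StageInfo]): Boolean =
    stage.parents
      .flatMap(byId.get)
      .exists(p => p.hasChecksumMismatch && !p.isReliablyCheckpointed)
}
```

For a chain 0 -> 1 -> 2 where only stage 0 has a mismatch, `needed` is true for stage 1 and false for stage 2. Stage 2 becomes eligible only once stage 1 itself reports a mismatch.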
### Why are the changes needed?
Handle nondeterminism issues caused by retries of shuffle map tasks.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests added.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #52336 from ivoson/SPARK-53575.
Authored-by: Tengfei Huang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>