fix: treat OCI runtime create failed errors as transient #15183
Closed
Summary
This PR adds support for detecting OCI runtime container creation errors as transient errors, addressing issue #12793.
Problem
When using projected volumes (secrets, configmaps) with subPath mounts, there's a known race condition in Kubernetes (kubernetes/kubernetes#63726, kubernetes/kubernetes#68211) where the kubelet may not have finished projecting the volume contents when the container runtime tries to create the container.
This results in errors like:
Kubernetes automatically retries the container and subsequent attempts usually succeed. However, Argo Workflows marks the workflow node as failed before Kubernetes has a chance to retry, even when using the OnTransientError retry policy.
Real-world scenario we encountered
We're using the JuiceFS CSI driver, which injects init containers that mount a check_mount.sh script from a secret volume using subPath. Due to the Kubernetes race condition, the init container occasionally fails on the first attempt with:
Kubernetes retries and the pod eventually succeeds, but Argo Workflows has already marked the workflow as failed. This happens ~10-20% of the time under load, causing significant pipeline disruption.
Solution
Add an isTransientContainerRuntimeErr() function in util/errors/errors.go that detects:
These are now recognized as transient errors, allowing workflows with retryPolicy: OnTransientError to properly retry affected steps.
Changes
util/errors/errors.go: Added the isTransientContainerRuntimeErr() function and integrated it into isTransientErr()
util/errors/errors_test.go: Added comprehensive tests for the new error detection
Impact
This fix benefits:
Testing
All existing tests pass, plus new tests added:
TestIsTransientErr/OCIRuntimeCreateFailedErr ✅
TestIsTransientErr/MountErrorNoSuchFileErr ✅
TestIsTransientErr/NonTransientOCIErr ✅ (ensures we don't over-match)
Fixes #12793