fix: treat OCI runtime create failed errors as transient #15183

fabioaraujopt · 2025-12-16T19:42:25Z

Summary

This PR adds support for detecting OCI runtime container creation errors as transient errors, addressing issue #12793.

Problem

When using projected volumes (secrets, configmaps) with subPath mounts, there's a known race condition in Kubernetes (kubernetes/kubernetes#63726, kubernetes/kubernetes#68211) where the kubelet may not have finished projecting the volume contents when the container runtime tries to create the container.

This results in errors like:

OCI runtime create failed: error mounting "/var/lib/kubelet/pods/.../volume-subpaths/jfs-check-mount/jfs-mount/3" to rootfs at "/check_mount.sh": mount src=..., dst=/check_mount.sh: no such file or directory

Kubernetes automatically retries the container, and subsequent attempts usually succeed. However, Argo Workflows marks the workflow node as failed before Kubernetes has a chance to retry, even when the retry strategy uses the OnTransientError retry policy.

Real-world scenario we encountered

We're using the JuiceFS CSI driver, which injects init containers that mount a check_mount.sh script from a secret volume using subPath. Due to the Kubernetes race condition, the init container occasionally fails on the first attempt with:

Warning  Failed  Error: failed to create containerd task: OCI runtime create failed: error mounting "..." to rootfs at "/check_mount.sh": no such file or directory

Kubernetes retries and the pod eventually succeeds, but Argo Workflows has already marked the workflow as failed. This happens ~10-20% of the time under load, causing significant pipeline disruption.

Solution

Add an isTransientContainerRuntimeErr() function in util/errors/errors.go that detects:

"OCI runtime create failed" errors - The specific error pattern from containerd/runc when volume mounts fail
Mount errors with "no such file or directory" - Generic pattern covering similar race conditions

These are now recognized as transient errors, allowing workflows with retryPolicy: OnTransientError to properly retry affected steps.
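The two checks listed above can be sketched roughly as follows. This is an illustration of the idea, not the code from this PR: the helper isTransientRuntimeMessage, the sampleErr variable, and the choice of "error mounting" as the marker for a mount error are all assumptions based on the sample messages quoted in the Problem section.

```go
package main

import (
	"errors"
	"strings"
)

// sampleErr mimics the shape of the error quoted in the Problem section.
// The elided mount path ("...") is kept elided on purpose.
var sampleErr = errors.New(`OCI runtime create failed: error mounting "..." to rootfs at "/check_mount.sh": no such file or directory`)

// isTransientRuntimeMessage is a hypothetical helper that applies the two
// substring checks described in the Solution section to a raw message.
func isTransientRuntimeMessage(msg string) bool {
	// Pattern 1: containerd/runc failed to create the container.
	if strings.Contains(msg, "OCI runtime create failed") {
		return true
	}
	// Pattern 2: a mount failed because the source path did not exist yet.
	// Requiring "error mounting" (an assumed marker) keeps an unrelated
	// "no such file or directory" error from matching.
	return strings.Contains(msg, "error mounting") &&
		strings.Contains(msg, "no such file or directory")
}

// isTransientContainerRuntimeErr reports whether err looks like one of the
// container runtime failures above. A nil error is never transient.
func isTransientContainerRuntimeErr(err error) bool {
	if err == nil {
		return false
	}
	return isTransientRuntimeMessage(err.Error())
}
```

With this sketch, isTransientContainerRuntimeErr(sampleErr) reports true, while a plain "open /tmp/x: no such file or directory" message does not match because it carries neither marker. Matching on message substrings is fragile if the runtime changes its wording, so any real implementation would need to be checked against the error text actually surfaced in the cluster.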

Changes

util/errors/errors.go: Added isTransientContainerRuntimeErr() function and integrated it into isTransientErr()
util/errors/errors_test.go: Added comprehensive tests for the new error detection

Impact

This fix benefits:

Workflows using the JuiceFS CSI driver with its check_mount.sh script
Workflows using other CSI drivers that mount secrets/configmaps with subPath
Any workflow encountering projected volume race conditions

Testing

All existing tests pass, plus new tests added:

TestIsTransientErr/OCIRuntimeCreateFailedErr ✅
TestIsTransientErr/MountErrorNoSuchFileErr ✅
TestIsTransientErr/NonTransientOCIErr ✅ (ensures we don't over-match)

Fixes #12793

Joibel · 2025-12-17T10:52:32Z

Does this actually work in the cluster? See the discussion in #12572

github-actions · 2026-01-02T02:36:57Z

This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

github-actions · 2026-01-16T02:37:27Z

This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.

fabioaraujopt mentioned this pull request Dec 16, 2025

Cannot handle OCI runtime create failed as OnTransientError. #12793

Open

4 tasks

Joibel added the problem/more information needed label Dec 18, 2025

github-actions bot added the problem/stale label Jan 2, 2026

github-actions bot closed this Jan 16, 2026
