@fabioaraujopt

Summary

This PR adds support for detecting OCI runtime container creation errors as transient errors, addressing issue #12793.

Problem

When using projected volumes (secrets, configmaps) with subPath mounts, there's a known race condition in Kubernetes (kubernetes/kubernetes#63726, kubernetes/kubernetes#68211) where the kubelet may not have finished projecting the volume contents when the container runtime tries to create the container.

This results in errors like:

OCI runtime create failed: error mounting "/var/lib/kubelet/pods/.../volume-subpaths/jfs-check-mount/jfs-mount/3" to rootfs at "/check_mount.sh": mount src=..., dst=/check_mount.sh: no such file or directory

Kubernetes automatically retries the container, and subsequent attempts usually succeed. However, Argo Workflows marks the workflow node as failed before Kubernetes has a chance to retry, even when the workflow uses the OnTransientError retry policy.

Real-world scenario we encountered

We use the JuiceFS CSI driver, which injects init containers that mount a check_mount.sh script from a secret volume using subPath. Because of the Kubernetes race condition, the init container occasionally fails on the first attempt with:

Warning  Failed  Error: failed to create containerd task: OCI runtime create failed: error mounting "..." to rootfs at "/check_mount.sh": no such file or directory

Kubernetes retries and the pod eventually succeeds, but Argo Workflows has already marked the workflow as failed. This happens ~10-20% of the time under load, causing significant pipeline disruption.

Solution

Add an isTransientContainerRuntimeErr() function in util/errors/errors.go that detects:

  1. "OCI runtime create failed" errors - The specific error pattern from containerd/runc when volume mounts fail
  2. Mount errors with "no such file or directory" - Generic pattern covering similar race conditions

These are now recognized as transient errors, allowing workflows with retryPolicy: OnTransientError to properly retry affected steps.

Changes

  • util/errors/errors.go: Added isTransientContainerRuntimeErr() function and integrated it into isTransientErr()
  • util/errors/errors_test.go: Added comprehensive tests for the new error detection

Impact

This fix benefits:

  • Workflows using the JuiceFS CSI driver with its check_mount.sh script
  • Workflows using other CSI drivers that mount secrets/configmaps with subPath
  • Any workflow encountering projected volume race conditions

Testing

All existing tests pass, and the following tests were added:

  • TestIsTransientErr/OCIRuntimeCreateFailedErr
  • TestIsTransientErr/MountErrorNoSuchFileErr
  • TestIsTransientErr/NonTransientOCIErr ✅ (ensures we don't over-match)
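
As a rough illustration of what these three cases exercise, here is a table-driven sketch. The case names mirror the tests listed above, but the error messages, the expected values, and the matcher `matchesTransientRuntimeErr` are hypothetical stand-ins and not the code from this PR. In particular, the "OCI runtime exec failed" message is only an assumed example of an OCI error that should not be treated as transient.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// matchesTransientRuntimeErr is a hypothetical stand-in for the PR's
// isTransientContainerRuntimeErr; its matching rules are assumed.
func matchesTransientRuntimeErr(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "OCI runtime create failed") ||
		(strings.Contains(msg, "error mounting") &&
			strings.Contains(msg, "no such file or directory"))
}

// transientCase pairs an error message with its expected classification.
type transientCase struct {
	name string
	msg  string
	want bool
}

// The messages below are illustrative, not copied from the PR's test file.
var transientCases = []transientCase{
	{"OCIRuntimeCreateFailedErr", `OCI runtime create failed: error mounting "..." to rootfs at "/check_mount.sh"`, true},
	{"MountErrorNoSuchFileErr", `error mounting "..." to rootfs at "/check_mount.sh": no such file or directory`, true},
	{"NonTransientOCIErr", "OCI runtime exec failed: exec failed: permission denied", false},
}

// failedCases returns the names of cases whose result differs from want.
func failedCases() []string {
	var failed []string
	for _, c := range transientCases {
		if got := matchesTransientRuntimeErr(errors.New(c.msg)); got != c.want {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failedCases())
}
```

In a Go test file the same table would typically be driven through `t.Run` subtests, which is what the `TestIsTransientErr/...` names above suggest.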

Fixes #12793

This PR adds support for detecting OCI runtime container creation errors
as transient errors. This addresses issue argoproj#12793 where workflows using
retryStrategy with OnTransientError policy fail instead of retrying when
init containers encounter race conditions with projected volume mounts.

## Problem

When using projected volumes (secrets, configmaps) with subPath mounts,
there's a known race condition in Kubernetes (kubernetes/kubernetes#63726,
kubernetes/kubernetes#68211) where the kubelet may not have finished
projecting the volume contents when the container runtime tries to create
the container. This results in errors like:

```
OCI runtime create failed: error mounting "/var/lib/kubelet/pods/.../
volume-subpaths/jfs-check-mount/jfs-mount/3" to rootfs at "/check_mount.sh":
mount src=..., dst=/check_mount.sh: no such file or directory
```

Kubernetes automatically retries the container and subsequent attempts
usually succeed. However, Argo Workflows marks the workflow node as failed
before Kubernetes has a chance to retry, even when using OnTransientError
retry policy.

## Solution

Add `isTransientContainerRuntimeErr()` function that detects:
1. "OCI runtime create failed" errors
2. Mount errors with "no such file or directory"

These are now recognized as transient errors, allowing workflows with
`retryPolicy: OnTransientError` to properly retry affected steps.

## Impact

- Workflows using JuiceFS CSI driver with check_mount.sh script
- Workflows using other CSI drivers that mount secrets with subPath
- Any workflow encountering projected volume race conditions

Fixes argoproj#12793

Signed-off-by: Fabio Araujo <[email protected]>
@Joibel
Member

Joibel commented Dec 17, 2025

Does this actually work in the cluster? See the discussion in #12572

@Joibel Joibel added the problem/more information needed Not enough information has been provide to diagnose this issue. label Dec 18, 2025
@github-actions
Contributor

github-actions bot commented Jan 2, 2026

This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale This has not had a response in some time label Jan 2, 2026
@github-actions
Contributor

This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.

@github-actions github-actions bot closed this Jan 16, 2026