Description
Pre-requisites
- I have double-checked my configuration
- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
- I have searched existing issues and could not find a match for this bug
- I'd like to contribute the fix myself (see contributing guide)
What happened? What did you expect to happen?
The workflow executor classifies HTTP 503 (Service Unavailable) from S3 as a non-transient error, bypassing the built-in retry strategy. 503 is a standard transient/retryable status code per both HTTP semantics and AWS S3 documentation.
Environment
Argo Workflows version: v3.7.4
S3-compatible storage: AWS S3
Platform: linux/amd64
Observed Behavior
During the artifact saving phase, the executor successfully uploads 2 out of 3 artifacts. The third upload receives a 503 Service Unavailable from S3. The executor logs it as non-transient and fails immediately without retrying, despite having a retry strategy configured (Duration=1s Factor=1.6 Jitter=0.5 Steps=5):
level=warning msg="Non-transient error: 503 Service Unavailable"
level=info msg="Save artifact" artifactName=jobKilledFile duration=64.037342ms error="failed to put file: 503 Service Unavailable"
level=error msg="executor error: failed to put file: 503 Service Unavailable; "
This causes the entire step to fail despite the main container completing successfully.
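For scale, the retry budget implied by the logged strategy (Duration=1s Factor=1.6 Jitter=0.5 Steps=5) can be worked out with a short Python sketch. It assumes Kubernetes-style backoff semantics, where Steps counts attempts (so five attempts mean four sleeps) and jitter stretches each sleep by up to Jitter times its length. Both points are assumptions about how the executor applies these values, and the function names are illustrative only.

```python
def backoff_delays(duration=1.0, factor=1.6, steps=5):
    """Nominal sleep before each retry; `steps` attempts means `steps - 1` sleeps (assumed)."""
    return [duration * factor**i for i in range(steps - 1)]


def wait_bounds(delays, jitter=0.5):
    """Total sleep with no jitter (lower bound) and with maximum jitter (upper bound)."""
    low = sum(delays)
    return low, low * (1 + jitter)


delays = backoff_delays()
low, high = wait_bounds(delays)
print([round(d, 3) for d in delays])  # [1.0, 1.6, 2.56, 4.096]
print(f"total wait between {low:.3f}s and {high:.3f}s")
```

Under those assumptions, exhausting every retry on the failed upload would add roughly 9 to 14 seconds of waiting.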
Expected Behavior
HTTP 503 should be classified as a transient error and retried according to the configured retry strategy, consistent with:
- AWS S3 error-handling guidance, which lists 503 (Slow Down / Service Unavailable) as retryable
- HTTP 503 semantics, which indicate a temporary condition
- AWS SDK default retry behavior, which retries 503 responses
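The classification being requested can be sketched as follows. This is illustrative Python, not the executor's Go code: the function name and both sets are hypothetical, and the exact membership of each set is a judgment call rather than Argo's actual list. The S3 codes shown are ones documented in the S3 error reference.

```python
# Hypothetical set of HTTP statuses to treat as transient (retryable).
TRANSIENT_HTTP_STATUS = frozenset({
    429,  # Too Many Requests
    500,  # Internal Server Error
    502,  # Bad Gateway
    503,  # Service Unavailable (the case reported in this issue)
    504,  # Gateway Timeout
})

# Hypothetical set of S3 error codes to treat as transient.
TRANSIENT_S3_CODES = frozenset({
    "SlowDown",
    "ServiceUnavailable",
    "InternalError",
    "RequestTimeout",
})


def is_transient(status: int, s3_code: str = "") -> bool:
    """Return True if either the HTTP status or the S3 error code looks retryable.

    Checking the status as well as the code covers responses that carry
    no parseable S3 error code at all.
    """
    return status in TRANSIENT_HTTP_STATUS or s3_code in TRANSIENT_S3_CODES


print(is_transient(503))                  # True
print(is_transient(403, "AccessDenied"))  # False
```

Falling back to the HTTP status is the design point here: a 503 would be retried even when the response carries no recognized error code.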
Impact
- Steps that completed successfully are marked as failed
- retryStrategy on the template re-runs the entire step including the main container, which may be expensive
- No way to retry only the artifact upload without re-running the whole step
- Particularly problematic for long-running or non-idempotent workloads
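The alternative to re-running the whole step is to retry only the failed upload. A minimal Python sketch of that idea follows; UploadError, save_artifact_with_retry, and the retryable set are hypothetical names for illustration, and the default delays simply mirror Duration=1s Factor=1.6 Steps=5 without jitter.

```python
import time

RETRYABLE_STATUS = {500, 502, 503, 504}  # hypothetical; 503 is the case reported here


class UploadError(Exception):
    """Hypothetical error carrying the HTTP status of a failed PUT."""

    def __init__(self, status):
        super().__init__(f"failed to put file: HTTP {status}")
        self.status = status


def save_artifact_with_retry(upload, delays=(1.0, 1.6, 2.56, 4.096), sleep=time.sleep):
    """Call upload(); on a retryable status, wait and try again.

    `delays` holds the pause before each retry, so len(delays) + 1 attempts
    are made in total. Non-retryable errors propagate immediately.
    """
    for delay in (*delays, None):  # None marks the final attempt
        try:
            return upload()
        except UploadError as err:
            if delay is None or err.status not in RETRYABLE_STATUS:
                raise
            sleep(delay)


# Simulated upload that returns 503 twice before succeeding.
attempts = []


def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise UploadError(503)
    return "saved"


print(save_artifact_with_retry(flaky_upload, sleep=lambda s: None))  # saved
```

Only the upload callable is repeated, so the main container's work is never redone.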
Version(s)
v3.7.4
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-upload
  namespace: myargo
spec:
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  ttlStrategy:
    secondsAfterCompletion: 604800
    secondsAfterSuccess: 604800
    secondsAfterFailure: 604800
  activeDeadlineSeconds: 32400
  entrypoint: main
  volumes:
    - name: shared
      emptyDir: { }
  templates:
    - name: main
      inputs:
      outputs:
        artifacts:
          - name: emailOutput
            path: "/tmp/hello_world.txt"
            archive:
              none: {}
            s3:
              endpoint: s3.amazonaws.com
              bucket: "my-bucket"
              key: "hello_world.txt"
              useSDKCreds: true
      container:
        image: busybox
        command: [ sh, -c ]
        args: [ "echo hello world | tee /tmp/hello_world.txt" ]
        volumeMounts:
          - mountPath: /tmp
            name: shared
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
Logs from your workflow's wait container
time="2026-02-13T16:28:39 UTC" level=info msg="Starting Workflow Executor" version=v3.7.4
time="2026-02-13T16:28:39 UTC" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2026-02-13T16:28:39 UTC" level=info msg="Executor initialized" deadline="2026-02-15 04:28:10 +0000 UTC" includeScriptOutput=false namespace=mynamespace podName=mypodname templateName=run-kill-list-filter version="&Version{Version:v3.7.4,BuildDate:2025-11-13T14:24:21Z,GitCommit:9b9649b0af3d5006f3b6688cb2881db4fb324a96,GitTag:v3.7.4,GitTreeState:clean,GoVersion:go1.24.10,Compiler:gc,Platform:linux/amd64,}"
time="2026-02-13T16:28:39 UTC" level=info msg="Starting deadline monitor"
time="2026-02-13T16:29:00 UTC" level=info msg="Main container completed" error="<nil>"
time="2026-02-13T16:29:00 UTC" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2026-02-13T16:29:00 UTC" level=info msg="No output parameters"
time="2026-02-13T16:29:00 UTC" level=info msg="Saving output artifacts"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging artifact: jobOutput"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging /tmp/output.jsonl from mirrored volume mount /mainctrfs/tmp/output.jsonl"
time="2026-02-13T16:29:00 UTC" level=info msg="No compression strategy needed. Staging skipped"
time="2026-02-13T16:29:00 UTC" level=info msg="S3 Save path: /mainctrfs/tmp/output.jsonl, key: some/path/in/aws/kill-list-filtered-output.jsonl"
time="2026-02-13T16:29:00 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:00 UTC" level=info msg="Saving file to s3" bucket=mybacket endpoint=s3.amazonaws.com key=some/path/in/aws/kill-list-filtered-output.jsonl path=/mainctrfs/tmp/output.jsonl
time="2026-02-13T16:29:00 UTC" level=info msg="Save artifact" artifactName=jobOutput duration=289.23958ms error="<nil>" key=some/path/in/aws/kill-list-filtered-output.jsonl
time="2026-02-13T16:29:00 UTC" level=info msg="not deleting local artifact" localArtPath=/mainctrfs/tmp/output.jsonl
time="2026-02-13T16:29:00 UTC" level=info msg="Successfully saved file: /mainctrfs/tmp/output.jsonl"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging artifact: jobSummaryFile"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging /tmp/summary.json from mirrored volume mount /mainctrfs/tmp/summary.json"
time="2026-02-13T16:29:00 UTC" level=info msg="No compression strategy needed. Staging skipped"
time="2026-02-13T16:29:00 UTC" level=info msg="S3 Save path: /mainctrfs/tmp/summary.json, key: some/path/in/aws/kill-list-summary-output.json"
time="2026-02-13T16:29:00 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:00 UTC" level=info msg="Saving file to s3" bucket=mybacket endpoint=s3.amazonaws.com key=some/path/in/aws/kill-list-summary-output.json path=/mainctrfs/tmp/summary.json
time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=jobSummaryFile duration=147.609854ms error="<nil>" key=some/path/in/aws/kill-list-summary-output.json
time="2026-02-13T16:29:01 UTC" level=info msg="not deleting local artifact" localArtPath=/mainctrfs/tmp/summary.json
time="2026-02-13T16:29:01 UTC" level=info msg="Successfully saved file: /mainctrfs/tmp/summary.json"
time="2026-02-13T16:29:01 UTC" level=info msg="Staging artifact: jobKilledFile"
time="2026-02-13T16:29:01 UTC" level=info msg="Staging /tmp/output_killed.jsonl from mirrored volume mount /mainctrfs/tmp/output_killed.jsonl"
time="2026-02-13T16:29:01 UTC" level=info msg="No compression strategy needed. Staging skipped"
time="2026-02-13T16:29:01 UTC" level=info msg="S3 Save path: /mainctrfs/tmp/output_killed.jsonl, key: some/path/in/aws/kill-list-rejected-output.jsonl"
time="2026-02-13T16:29:01 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:01 UTC" level=info msg="Saving file to s3" bucket=mybacket endpoint=s3.amazonaws.com key=some/path/in/aws/kill-list-rejected-output.jsonl path=/mainctrfs/tmp/output_killed.jsonl
time="2026-02-13T16:29:01 UTC" level=warning msg="Non-transient error: 503 Service Unavailable"
time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=jobKilledFile duration=64.037342ms error="failed to put file: 503 Service Unavailable" key=some/path/in/aws/kill-list-rejected-output.jsonl
time="2026-02-13T16:29:01 UTC" level=error msg="executor error: failed to put file: 503 Service Unavailable; "
time="2026-02-13T16:29:01 UTC" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: nucleus/artifacts/2026/02/13/amp-upload-template-stxxj/mypodname/main.log"
time="2026-02-13T16:29:01 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:01 UTC" level=info msg="Saving file to s3" bucket=argo-bucket endpoint=s3.amazonaws.com key=nucleus/artifacts/2026/02/13/amp-upload-template-stxxj/mypodname/main.log path=/tmp/argo/outputs/logs/main.log
time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=main-logs duration=110.976312ms error="<nil>" key=nucleus/artifacts/2026/02/13/amp-upload-template-stxxj/mypodname/main.log
time="2026-02-13T16:29:01 UTC" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2026-02-13T16:29:01 UTC" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2026-02-13T16:29:01 UTC" level=info msg="Alloc=10997 TotalAlloc=27279 Sys=33877 NumGC=6 Goroutines=25"
Error: failed to put file: 503 Service Unavailable;
failed to put file: 503 Service Unavailable;
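To locate the failing upload in a long wait-container log like the one above, the "Save artifact" entries can be filtered by their error field. The Python sketch below assumes the logfmt-style key=value layout seen in these logs; the helper names are illustrative, and the sample lines are copied from the log above.

```python
import shlex


def parse_logfmt(line):
    """Parse key=value and key="quoted value" pairs; return {} for malformed lines."""
    try:
        tokens = shlex.split(line)
    except ValueError:  # e.g. unbalanced quotes
        return {}
    return dict(t.split("=", 1) for t in tokens if "=" in t)


def failed_artifact_saves(lines):
    """Yield (artifactName, error) for each 'Save artifact' entry whose error is not <nil>."""
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "Save artifact" and fields.get("error", "<nil>") != "<nil>":
            yield fields.get("artifactName", ""), fields["error"]


sample = [
    'time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=jobSummaryFile '
    'duration=147.609854ms error="<nil>" key=some/path/in/aws/kill-list-summary-output.json',
    'time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=jobKilledFile '
    'duration=64.037342ms error="failed to put file: 503 Service Unavailable" '
    'key=some/path/in/aws/kill-list-rejected-output.jsonl',
]
print(list(failed_artifact_saves(sample)))
# [('jobKilledFile', 'failed to put file: 503 Service Unavailable')]
```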