
S3 artifact upload: 503 Service Unavailable incorrectly classified as non-transient error, skipping retries #15565

@IRus

Description


Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

The workflow executor classifies HTTP 503 (Service Unavailable) from S3 as a non-transient error, bypassing the built-in retry strategy. 503 is a standard transient/retryable status code per both HTTP semantics and AWS S3 documentation.
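To make the classification concrete, here is a minimal Go sketch of a status-code based check. The helper name isTransientS3Err, the use of minio-go's ToErrorResponse, and the exact status set are illustrative assumptions for this example, not the executor's actual code:

// Hypothetical sketch only: classify an S3 upload error by HTTP status code.
package main

import (
    "fmt"
    "net/http"

    "github.com/minio/minio-go/v7"
)

// isTransientS3Err reports whether an S3 error is worth retrying.
// 500, 502, 503, and 504 are the usual retryable server-side statuses;
// per this report, 503 is currently not treated as transient by the executor.
func isTransientS3Err(err error) bool {
    if err == nil {
        return false
    }
    resp := minio.ToErrorResponse(err) // zero value if err is not a minio error
    switch resp.StatusCode {
    case http.StatusInternalServerError, // 500
        http.StatusBadGateway,           // 502
        http.StatusServiceUnavailable,   // 503 (Slow Down / Service Unavailable)
        http.StatusGatewayTimeout:       // 504
        return true
    }
    return false
}

func main() {
    err := minio.ErrorResponse{StatusCode: 503, Message: "Service Unavailable"}
    fmt.Println(isTransientS3Err(err)) // true: a 503 should be retried
}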

Environment
Argo Workflows version: v3.7.4
S3-compatible storage: AWS S3
Platform: linux/amd64

Observed Behavior

During the artifact saving phase, the executor successfully uploads 2 out of 3 artifacts. The third upload receives a 503 Service Unavailable from S3. The executor logs it as non-transient and fails immediately without retrying, despite having a retry strategy configured (Duration=1s Factor=1.6 Jitter=0.5 Steps=5):

level=warning msg="Non-transient error: 503 Service Unavailable"
level=info msg="Save artifact" artifactName=jobKilledFile duration=64.037342ms error="failed to put file: 503 Service Unavailable"
level=error msg="executor error: failed to put file: 503 Service Unavailable; "

This causes the entire step to fail despite the main container completing successfully.
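For context, the logged parameters (Duration=1s Factor=1.6 Jitter=0.5 Steps=5) correspond to a Kubernetes-style exponential backoff. Below is a hedged sketch of how such a retry loop aborts as soon as an error is classified as non-transient; saveWithRetry, putFile, and isTransient are illustrative names, not the executor's API:

// Hypothetical sketch of an upload retry loop driven by the logged backoff
// parameters; it aborts immediately when the error is deemed non-transient.
package main

import (
    "errors"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func saveWithRetry(putFile func() error, isTransient func(error) bool) error {
    backoff := wait.Backoff{
        Duration: 1 * time.Second, // matches "Duration=1s" in the executor log
        Factor:   1.6,             // "Factor=1.6"
        Jitter:   0.5,             // "Jitter=0.5"
        Steps:    5,               // "Steps=5"
    }
    var lastErr error
    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        lastErr = putFile()
        if lastErr == nil {
            return true, nil // success: stop retrying
        }
        if !isTransient(lastErr) {
            // Classified as non-transient: the loop exits on the spot,
            // which matches the single failed attempt in the log above.
            return false, lastErr
        }
        return false, nil // transient: back off and try again
    })
    if err != nil {
        return fmt.Errorf("failed to put file: %w", lastErr)
    }
    return nil
}

func main() {
    err := saveWithRetry(
        func() error { return errors.New("503 Service Unavailable") },
        func(error) bool { return false }, // 503 mis-classified as non-transient
    )
    fmt.Println(err) // fails on the first attempt, no retries
}

With these parameters, a transient classification would allow up to five attempts with waits growing from roughly 1s by a factor of 1.6 per attempt (plus jitter); a non-transient classification gives up after the first attempt.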

Expected Behavior

HTTP 503 should be classified as a transient error and retried according to the configured retry strategy, consistent with both HTTP semantics (503 signals a temporary condition) and AWS's S3 documentation, which treats 503 responses as retryable.

Impact

  • Steps that completed successfully are marked as failed
  • retryStrategy on the template re-runs the entire step including the main container, which may be expensive
  • No way to retry only the artifact upload without re-running the whole step
  • Particularly problematic for long-running or non-idempotent workloads

Version(s)

v3.7.4

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-upload-
  namespace: myargo
spec:
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  ttlStrategy:
    secondsAfterCompletion: 604800
    secondsAfterSuccess: 604800
    secondsAfterFailure: 604800
  activeDeadlineSeconds: 32400
  entrypoint: main
  volumes:
    - name: shared
      emptyDir: { }
  templates:
    - name: main
      outputs:
        artifacts:
          - name: emailOutput
            path: "/tmp/hello_world.txt"
            archive:
              none: {}
            s3:
              endpoint: s3.amazonaws.com
              bucket: "my-bucket"
              key: "hello_world.txt"
              useSDKCreds: true
      container:
        image: busybox
        command: [ sh, -c ]
        args: [ "echo hello world | tee /tmp/hello_world.txt" ]
        volumeMounts:
          - mountPath: /tmp
            name: shared

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

time="2026-02-13T16:28:39 UTC" level=info msg="Starting Workflow Executor" version=v3.7.4
time="2026-02-13T16:28:39 UTC" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2026-02-13T16:28:39 UTC" level=info msg="Executor initialized" deadline="2026-02-15 04:28:10 +0000 UTC" includeScriptOutput=false namespace=mynamespace podName=mypodname templateName=run-kill-list-filter version="&Version{Version:v3.7.4,BuildDate:2025-11-13T14:24:21Z,GitCommit:9b9649b0af3d5006f3b6688cb2881db4fb324a96,GitTag:v3.7.4,GitTreeState:clean,GoVersion:go1.24.10,Compiler:gc,Platform:linux/amd64,}"
time="2026-02-13T16:28:39 UTC" level=info msg="Starting deadline monitor"
time="2026-02-13T16:29:00 UTC" level=info msg="Main container completed" error="<nil>"
time="2026-02-13T16:29:00 UTC" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2026-02-13T16:29:00 UTC" level=info msg="No output parameters"
time="2026-02-13T16:29:00 UTC" level=info msg="Saving output artifacts"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging artifact: jobOutput"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging /tmp/output.jsonl from mirrored volume mount /mainctrfs/tmp/output.jsonl"
time="2026-02-13T16:29:00 UTC" level=info msg="No compression strategy needed. Staging skipped"
time="2026-02-13T16:29:00 UTC" level=info msg="S3 Save path: /mainctrfs/tmp/output.jsonl, key: some/path/in/aws/kill-list-filtered-output.jsonl"
time="2026-02-13T16:29:00 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:00 UTC" level=info msg="Saving file to s3" bucket=mybacket endpoint=s3.amazonaws.com key=some/path/in/aws/kill-list-filtered-output.jsonl path=/mainctrfs/tmp/output.jsonl
time="2026-02-13T16:29:00 UTC" level=info msg="Save artifact" artifactName=jobOutput duration=289.23958ms error="<nil>" key=some/path/in/aws/kill-list-filtered-output.jsonl
time="2026-02-13T16:29:00 UTC" level=info msg="not deleting local artifact" localArtPath=/mainctrfs/tmp/output.jsonl
time="2026-02-13T16:29:00 UTC" level=info msg="Successfully saved file: /mainctrfs/tmp/output.jsonl"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging artifact: jobSummaryFile"
time="2026-02-13T16:29:00 UTC" level=info msg="Staging /tmp/summary.json from mirrored volume mount /mainctrfs/tmp/summary.json"
time="2026-02-13T16:29:00 UTC" level=info msg="No compression strategy needed. Staging skipped"
time="2026-02-13T16:29:00 UTC" level=info msg="S3 Save path: /mainctrfs/tmp/summary.json, key: some/path/in/aws/kill-list-summary-output.json"
time="2026-02-13T16:29:00 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:00 UTC" level=info msg="Saving file to s3" bucket=mybacket endpoint=s3.amazonaws.com key=some/path/in/aws/kill-list-summary-output.json path=/mainctrfs/tmp/summary.json
time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=jobSummaryFile duration=147.609854ms error="<nil>" key=some/path/in/aws/kill-list-summary-output.json
time="2026-02-13T16:29:01 UTC" level=info msg="not deleting local artifact" localArtPath=/mainctrfs/tmp/summary.json
time="2026-02-13T16:29:01 UTC" level=info msg="Successfully saved file: /mainctrfs/tmp/summary.json"
time="2026-02-13T16:29:01 UTC" level=info msg="Staging artifact: jobKilledFile"
time="2026-02-13T16:29:01 UTC" level=info msg="Staging /tmp/output_killed.jsonl from mirrored volume mount /mainctrfs/tmp/output_killed.jsonl"
time="2026-02-13T16:29:01 UTC" level=info msg="No compression strategy needed. Staging skipped"
time="2026-02-13T16:29:01 UTC" level=info msg="S3 Save path: /mainctrfs/tmp/output_killed.jsonl, key: some/path/in/aws/kill-list-rejected-output.jsonl"
time="2026-02-13T16:29:01 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:01 UTC" level=info msg="Saving file to s3" bucket=mybacket endpoint=s3.amazonaws.com key=some/path/in/aws/kill-list-rejected-output.jsonl path=/mainctrfs/tmp/output_killed.jsonl
time="2026-02-13T16:29:01 UTC" level=warning msg="Non-transient error: 503 Service Unavailable"
time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=jobKilledFile duration=64.037342ms error="failed to put file: 503 Service Unavailable" key=some/path/in/aws/kill-list-rejected-output.jsonl
time="2026-02-13T16:29:01 UTC" level=error msg="executor error: failed to put file: 503 Service Unavailable; "
time="2026-02-13T16:29:01 UTC" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: nucleus/artifacts/2026/02/13/amp-upload-template-stxxj/mypodname/main.log"
time="2026-02-13T16:29:01 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2026-02-13T16:29:01 UTC" level=info msg="Saving file to s3" bucket=argo-bucket endpoint=s3.amazonaws.com key=nucleus/artifacts/2026/02/13/amp-upload-template-stxxj/mypodname/main.log path=/tmp/argo/outputs/logs/main.log
time="2026-02-13T16:29:01 UTC" level=info msg="Save artifact" artifactName=main-logs duration=110.976312ms error="<nil>" key=nucleus/artifacts/2026/02/13/amp-upload-template-stxxj/mypodname/main.log
time="2026-02-13T16:29:01 UTC" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2026-02-13T16:29:01 UTC" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2026-02-13T16:29:01 UTC" level=info msg="Alloc=10997 TotalAlloc=27279 Sys=33877 NumGC=6 Goroutines=25"
Error: failed to put file: 503 Service Unavailable; 
