
[bug] SeaweedFS S3 Authentication Race Condition in CI #12771

@hbelmiro

Description

Problem

E2E tests intermittently fail with the error:

failed to put file: Signed request requires setting up SeaweedFS S3 authentication

This happens because of a race between the SeaweedFS S3 authentication setup and test execution.

Root Cause Analysis

  1. PR fix(CI): ensure SeaweedFS S3 auth is set up before tests  #12322 (7e64da9a9) added a wait_for_seaweedfs_init() function that waits for an init-seaweedfs Job to complete before tests start.

  2. PR Add pod postStart lifecycle for SeaweedFS and remove Job initializer #12387 (caf854eed) removed the init-seaweedfs Job and switched to a postStart lifecycle hook for SeaweedFS configuration. However, the wait_for_seaweedfs_init() function was not updated to reflect this change.
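For context, the postStart approach introduced in #12387 amounts to a container lifecycle hook roughly along these lines (a paraphrased sketch; the real manifest's command and `s3.configure` flags may differ):

```yaml
# Illustrative shape of a postStart hook that configures S3 auth via
# `weed shell`; the actual command contents live in the SeaweedFS manifest.
lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        - echo "s3.configure -apply ..." | weed shell
```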

  3. The current wait_for_seaweedfs_init() function is essentially a no-op:

wait_for_seaweedfs_init () {
    local namespace="$1"
    local timeout="$2"
    # This condition is NEVER true because the Job was removed in PR #12387
    if kubectl -n "$namespace" get job init-seaweedfs > /dev/null 2>&1; then
        if ! kubectl -n "$namespace" wait --for=condition=complete --timeout="$timeout" job/init-seaweedfs; then
            return 1
        fi
    fi
    # Falls through immediately - no waiting happens!
}
  4. The race condition unfolds as follows:
    • SeaweedFS pod starts
    • Readiness probe passes (checks /status endpoint)
    • wait_for_pods() sees all pods are Ready and returns
    • wait_for_seaweedfs_init() returns immediately (Job doesn't exist)
    • Tests start running
    • Meanwhile, the postStart lifecycle hook is still configuring S3 auth via the s3.configure command
    • Tests fail because S3 auth isn't configured yet
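The last two steps are the crux: a Ready pod does not imply a configured S3 gateway. A minimal sketch of a check that separates the two, assuming the hook creates identities via `weed shell`'s `s3.configure` (the deployment name, namespace, and grep target are assumptions, not existing CI code):

```shell
# Returns 0 only once SeaweedFS reports at least one S3 identity, i.e.
# after the postStart hook's s3.configure has actually run. Pod readiness
# alone says nothing about this.
s3_auth_configured() {
    local namespace="$1"
    kubectl -n "$namespace" exec deploy/seaweedfs -- \
        /bin/sh -c "echo 's3.configure' | weed shell 2>/dev/null" \
        | grep -q identities
}
```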

Affected Files

  • .github/resources/scripts/helper-functions.sh - Contains the broken wait_for_seaweedfs_init() function
  • .github/resources/scripts/deploy-kfp.sh - Calls wait_for_seaweedfs_init() for SeaweedFS storage backend

Proposed Solution

Update the wait_for_seaweedfs_init() function to verify that SeaweedFS S3 authentication is actually configured by testing S3 connectivity with credentials.

wait_for_seaweedfs_init () {
    local namespace="$1"
    local timeout="$2"
    local start_time=$(date +%s)
    local timeout_seconds=${timeout%s}  # Remove 's' suffix
    
    echo "Waiting for SeaweedFS S3 authentication to be configured..."
    
    # No credentials are needed here: the check below asks SeaweedFS
    # directly through `weed shell` rather than making a signed S3 request.
    
    # Test S3 authentication by listing the bucket
    while true; do
        local current_time=$(date +%s)
        local elapsed=$((current_time - start_time))
        
        if [ "$elapsed" -ge "$timeout_seconds" ]; then
            echo "ERROR: Timeout waiting for SeaweedFS S3 authentication"
            return 1
        fi
        
        # Ask SeaweedFS itself whether configuration has completed: the
        # mlpipeline bucket only becomes visible after the postStart hook runs
        if kubectl -n "$namespace" exec deploy/seaweedfs -- \
            /bin/sh -c "echo 's3.bucket.list' | /usr/bin/weed shell 2>/dev/null | grep -q mlpipeline"; then
            echo "SeaweedFS S3 authentication is configured"
            return 0
        fi
        
        echo "Waiting for SeaweedFS S3 auth... (${elapsed}s elapsed)"
        sleep 5
    done
}
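The cluster-facing parts of the loop above are hard to exercise locally, but the timeout bookkeeping is plain shell. A stub for the auth check makes it testable in isolation (everything here is illustrative):

```shell
# Re-creation of the wait loop's timeout arithmetic, with the kubectl
# probe replaced by a stub that succeeds on the third poll.
attempts=0
check_s3_auth() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

wait_with_timeout() {
    local timeout_seconds=${1%s}    # "30s" -> 30, same parsing as above
    local start_time=$(date +%s)
    while true; do
        local elapsed=$(( $(date +%s) - start_time ))
        if [ "$elapsed" -ge "$timeout_seconds" ]; then
            return 1
        fi
        if check_s3_auth; then
            return 0
        fi
        sleep 1
    done
}

wait_with_timeout "30s" && echo "ready after ${attempts} polls"
# prints "ready after 3 polls"
```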

Alternative Solutions

  1. Create a marker ConfigMap: Have the postStart hook create a ConfigMap after S3 configuration completes, then wait for that ConfigMap.

  2. Add startup probe: Configure a startup probe that only passes after S3 auth is configured (would require modifying the SeaweedFS deployment).

  3. Synchronous S3 configuration: Move S3 configuration from postStart to an init container that blocks until complete.
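Of these, the marker-ConfigMap option is the least invasive to the SeaweedFS deployment itself. A sketch of both halves, with all names (the marker ConfigMap and both functions) being assumptions rather than existing CI code:

```shell
# postStart hook side: publish completion after s3.configure succeeds.
mark_s3_auth_ready() {
    kubectl -n "$1" create configmap seaweedfs-s3-auth-ready \
        --from-literal=ready=true
}

# helper-functions.sh side: block until the marker exists or time runs out.
wait_for_s3_auth_marker() {
    local namespace="$1"
    local deadline=$(( $(date +%s) + ${2%s} ))
    until kubectl -n "$namespace" get configmap seaweedfs-s3-auth-ready \
            >/dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 2
    done
}
```

Note that the hook container would need kubectl and RBAC permission to create the ConfigMap, which is part of why the init-container variant may end up simpler in practice.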

Impact

  • Affects all CI jobs using SeaweedFS storage backend
  • Causes intermittent/flaky test failures
  • Can block PRs from merging due to false-positive test failures

Reproduction Steps

  1. Run E2E tests with storage=seaweedfs configuration
  2. The test may fail intermittently with S3 authentication errors
  3. The failure is more likely when:
    • SeaweedFS pod takes longer to configure S3 auth
    • Tests start immediately after pods become Ready

Labels

/area testing


