Commit 3117865

chore(ci): use rootful podman for quay.io login in build-notebooks workflow
## Root Cause Analysis

The failing job is `build-aipcc (cuda-jupyter-minimal-ubi9-python-3.12, ...)`. This is the `build-aipcc` job defined in `build-notebooks-push.yaml`, which builds with `konflux: true` and `subscription: true`.

With `konflux: true`, the build uses `build-args/konflux.cuda.conf`:

```5:6:jupyter/minimal/ubi9-python-3.12/build-args/konflux.cuda.conf
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
```

This is a **private** image on `quay.io/aipcc`, requiring authentication.

## The Authentication Path (and where it breaks)

There are **two separate auth setups** that run before the build, and a mismatch between client/server podman:

### 1. The rootful podman architecture

The workflow sets `CONTAINER_HOST: unix:///var/run/podman/podman.sock` (line 59), meaning all `podman` commands run as a **remote client** talking to a **rootful podman daemon** via socket. This is critical because the client and server have **separate auth stores**.

### 2. Auth setup step order

**Step A** (lines 134-135): the subscription step copies a pull-secret to the client-side auth.json:

```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
```

This file is created as **root-owned** (because of `sudo cp`).

**Step B** (lines 207-212): login to quay.io/aipcc:

```bash
echo "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}" | \
  podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin
```

Running as the `runner` user, this tries to merge aipcc credentials into the auth.json. But the auth.json is **root-owned** from Step A, so `podman login` may fail or silently write to a fallback location (`$XDG_RUNTIME_DIR/containers/auth.json`).

### 3. The server-side gap

When `podman build` runs (line 261: `make ${{ inputs.target }}`), the build happens on the **rootful podman server**. For the server to pull `quay.io/aipcc/...`, either:

- The client must forward credentials (via the `X-Registry-Config` header in the API call), or
- The server must have its own auth.json with the right credentials

Neither is reliably guaranteed here:

- Credential forwarding in podman remote mode is version-dependent and has had bugs
- The server's auth store (`/run/containers/0/auth.json` or similar) was never explicitly set up

## Why "some succeed, some fail"

The most likely explanation for the intermittent nature is **quay.io rate limiting / throttling**. When the workflow matrix spawns ~20+ parallel jobs, they all simultaneously:

1. Authenticate with the same robot account credentials against quay.io
2. Try to pull the same base images

Quay.io is known to return **401 Unauthorized instead of 429 Too Many Requests** when rate-limiting. This manifests as the "unauthorized" error you see, even though the credentials are correct. The jobs that happen to execute their pull before the rate limit kicks in succeed; the rest get rejected.

## Potential Fixes

1. **Add retry logic to the build step**, e.g., retry the `make` target 2-3 times with a backoff. This is the simplest mitigation for registry rate limiting.

2. **Fix auth.json file ownership**: change the subscription step to not use `sudo`, or `chown` the file afterward:

   ```bash
   sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
   sudo chown $(id -u):$(id -g) $HOME/.config/containers/auth.json
   ```

3. **Pre-pull the base image explicitly** before `make`, with retry logic:

   ```bash
   for i in 1 2 3; do
     podman pull quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345 && break
     sleep $((i * 30))
   done
   ```

4. **Stagger the matrix jobs** to avoid all hitting quay.io simultaneously (e.g., using `max-parallel` in the strategy).

5. **Set up server-side auth explicitly** so the rootful podman daemon has credentials:

   ```bash
   sudo podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin <<< "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}"
   ```

   This would bypass the client/server auth split entirely by logging in directly on the server side.
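
As a rough illustration of the first fix (retrying the build step with backoff), a generic wrapper could look like the sketch below. The function name `retry`, the `RETRY_DELAY` variable, and the `TARGET` placeholder are assumptions made for this example and are not part of the existing workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to N times with linear backoff.
# RETRY_DELAY (seconds, default 30) is an illustrative knob, not a workflow setting.
retry() {
  local attempts=$1
  shift
  local delay=${RETRY_DELAY:-30}
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Stop as soon as the wrapped command succeeds.
    "$@" && return 0
    if ((i < attempts)); then
      echo "attempt ${i}/${attempts} failed; sleeping $((i * delay))s" >&2
      sleep "$((i * delay))"
    fi
  done
  # All attempts failed.
  return 1
}

# Possible usage in the build step (TARGET stands in for the workflow input):
# retry 3 make "${TARGET}"
```

A retry only helps if the failures really are transient (for example, registry throttling). If the cause is the auth.json ownership problem described in Step A, every attempt would fail the same way.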
1 parent df6e42d commit 3117865

1 file changed (+5, -1)

.github/workflows/build-notebooks-TEMPLATE.yaml

Lines changed: 5 additions & 1 deletion
@@ -132,7 +132,11 @@ jobs:
 printf "${PWD}/entitlement:/etc/pki/entitlement\n${PWD}/consumer:/etc/pki/consumer\n" | sudo tee /usr/share/containers/mounts.conf
 
 mkdir -p $HOME/.config/containers/
-sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
+
+# Don't use sudo here.
+# With CONTAINER_HOST set, podman is a remote client that forwards auth to the
+# rootful server during builds. Credentials must be readable by the runner user.
+cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
 env:
 SUBSCRIPTION_ORG: ${{ secrets.SUBSCRIPTION_ORG }}
 SUBSCRIPTION_ACTIVATION_KEY: ${{ secrets.SUBSCRIPTION_ACTIVATION_KEY }}
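
The change above relies on auth.json being owned by, and accessible to, the runner user. One way to make that assumption explicit is a fail-fast check before the login step. This is only a sketch: `check_auth_file` is a hypothetical helper name, and the default path is the one used in the diff.

```shell
#!/usr/bin/env bash
# Sketch: verify the client-side auth file is usable by the current user.
# check_auth_file is a hypothetical helper, not part of the workflow.
check_auth_file() {
  local f=${1:-$HOME/.config/containers/auth.json}
  if [ ! -f "$f" ]; then
    echo "missing: $f" >&2
    return 1
  fi
  # -O is true when the file is owned by the effective user ID.
  if [ ! -O "$f" ]; then
    echo "not owned by $(id -un): $f" >&2
    return 1
  fi
  # podman login needs to read and update the file.
  if [ ! -r "$f" ] || [ ! -w "$f" ]; then
    echo "not readable and writable: $f" >&2
    return 1
  fi
  echo "ok: $f"
}

# Possible usage right after the copy step:
# check_auth_file || exit 1
```

With the earlier `sudo cp`, this check would report the file as not owned by the runner user, which is the condition the commit message identifies in Step A.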

0 commit comments
