Commit 473796d

fix(ci): set correct permissions on podman's credentials file for red hat subscription and quay.io/aipcc login in build-notebooks workflow
## Root Cause Analysis

The failing job is `build-aipcc (cuda-jupyter-minimal-ubi9-python-3.12, ...)`. This is the `build-aipcc` job defined in `build-notebooks-push.yaml`, which builds with `konflux: true` and `subscription: true`.

With `konflux: true`, the build uses `build-args/konflux.cuda.conf` (lines 5-6 of `jupyter/minimal/ubi9-python-3.12/build-args/konflux.cuda.conf`):

```
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
```

This is a **private** image on `quay.io/aipcc`, requiring authentication.

## The Authentication Path (and where it breaks)

There are **two separate auth setups** that run before the build, and a mismatch between client/server podman:

### 1. The rootful podman architecture

The workflow sets `CONTAINER_HOST: unix:///var/run/podman/podman.sock` (line 59), meaning all `podman` commands run as a **remote client** talking to a **rootful podman daemon** via socket. This is critical because the client and server have **separate auth stores**.

### 2. Auth setup step order

**Step A** (lines 134-135): the subscription step copies a pull-secret to the client-side auth.json:

```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
```

This file is created as **root-owned** (because of `sudo cp`).

**Step B** (lines 207-212): login to quay.io/aipcc:

```bash
echo "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}" | podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin
```

Running as the `runner` user, this tries to merge aipcc credentials into the auth.json. But the auth.json is **root-owned** from Step A, so `podman login` may fail or silently write to a fallback location (`$XDG_RUNTIME_DIR/containers/auth.json`).

### 3. The server-side gap

When `podman build` runs (line 261: `make ${{ inputs.target }}`), the build happens on the **rootful podman server**. For the server to pull `quay.io/aipcc/...`, either:

- The client must forward credentials (via the `X-Registry-Config` header in the API call), or
- The server must have its own auth.json with the right credentials

Neither is reliably guaranteed here:

- Credential forwarding in podman remote mode is version-dependent and has had bugs
- The server's auth store (`/run/containers/0/auth.json` or similar) was never explicitly set up

## Why "some succeed, some fail"

The most likely explanation for the intermittent nature is **quay.io rate limiting / throttling**. When the workflow matrix spawns ~20+ parallel jobs, they all simultaneously:

1. Authenticate with the same robot account credentials against quay.io
2. Try to pull the same base images

Quay.io is known to return **401 Unauthorized instead of 429 Too Many Requests** when rate-limiting. This manifests as the "unauthorized" error you see, even though the credentials are correct. The jobs that happen to execute their pull before the rate limit kicks in succeed; the rest get rejected.

## Potential Fixes

1. **Add retry logic to the build step**: e.g., retry the `make` target 2-3 times with a backoff. This is the simplest mitigation for registry rate limiting.

2. **Fix auth.json file ownership**: change the subscription step to not use `sudo`, or `chown` the file afterward:

   ```bash
   sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
   sudo chown $(id -u):$(id -g) $HOME/.config/containers/auth.json
   ```

3. **Pre-pull the base image explicitly** before `make`, with retry logic:

   ```bash
   for i in 1 2 3; do
     podman pull quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345 && break
     sleep $((i * 30))
   done
   ```

4. **Stagger the matrix jobs** to avoid all hitting quay.io simultaneously (e.g., using `max-parallel` in the strategy).

5. **Set up server-side auth explicitly** so the rootful podman daemon has credentials:

   ```bash
   sudo podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin <<< "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}"
   ```

   This would bypass the client/server auth split entirely by logging in directly on the server side.

---

Good find on the actual root cause being missing repository-level access on the robot account. That explains the "some succeed, some fail" pattern perfectly: it wasn't rate limiting or auth setup at all, just different repositories having different permissions.

**Yes, `994be26aa` (the `sudo cp` -> `cp` change) is still valuable.** It fixes a real latent bug:

- `sudo cp` creates a **root-owned** auth.json at `$HOME/.config/containers/auth.json`
- When `podman login quay.io/aipcc` later runs as the runner user, it **can't write** to a root-owned file
- So podman login silently writes to the fallback `$XDG_RUNTIME_DIR/containers/auth.json` instead
- Now credentials are **split across two files**: pull-secret in one, aipcc creds in another
- Podman reads from the first file it finds in priority order, potentially missing credentials from the other

Using `cp` (no sudo) keeps everything owned by the runner user so `podman login` can merge into the same file. It's a correctness fix independent of the repository permissions issue.

**However**, the additional changes we made in this conversation (consolidated login step, `REGISTRY_AUTH_FILE`, server-side copy to `/root/.config/`) were all attempts to work around what turned out to be the wrong root cause. Those are defensive but add complexity. You may want to revert them and keep only the `sudo cp` -> `cp` fix from the commit, now that the robot account will have proper access.
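
To check for the "credentials split across two files" situation described above, a diagnostic step along the following lines might help. This is only a sketch: the helper name `files_with_registry` is hypothetical, and it assumes that a plain substring match on the registry name is good enough for a quick CI check (it does not parse the JSON).

```bash
#!/usr/bin/env bash
# Sketch: report which candidate auth files mention a given registry.
# Assumption: substring matching is sufficient for a diagnostic; no JSON parsing.

# files_with_registry REGISTRY FILE...
# Prints every existing FILE whose contents mention REGISTRY, one per line.
files_with_registry() {
  local registry="$1"
  shift
  local f
  for f in "$@"; do
    if [ -f "$f" ] && grep -qF -- "$registry" "$f"; then
      printf '%s\n' "$f"
    fi
  done
}

# Hypothetical usage with the two locations named in the analysis:
# files_with_registry quay.io/aipcc \
#   "${XDG_RUNTIME_DIR:-/nonexistent}/containers/auth.json" \
#   "$HOME/.config/containers/auth.json"
```

If the pull-secret registry and `quay.io/aipcc` show up in different files, the login was not merged into the file copied by the subscription step.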
1 parent 382e0bf commit 473796d

1 file changed, +5 -1 lines changed

.github/workflows/build-notebooks-TEMPLATE.yaml

@@ -132,7 +132,11 @@ jobs:
 printf "${PWD}/entitlement:/etc/pki/entitlement\n${PWD}/consumer:/etc/pki/consumer\n" | sudo tee /usr/share/containers/mounts.conf

 mkdir -p $HOME/.config/containers/
-sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
+
+# Don't use sudo here.
+# With CONTAINER_HOST set, podman is a remote client that forwards auth to the
+# rootful server during builds. Credentials must be readable by the runner user.
+cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
 env:
 SUBSCRIPTION_ORG: ${{ secrets.SUBSCRIPTION_ORG }}
 SUBSCRIPTION_ACTIVATION_KEY: ${{ secrets.SUBSCRIPTION_ACTIVATION_KEY }}
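
The comment in this hunk depends on the credentials file belonging to the runner user. A minimal sketch of a guard that a later step could run before `podman login` is shown here. The function name `auth_file_usable` is hypothetical and not part of the workflow. It relies only on bash's built-in file tests (`-f`, `-O`, `-w`).

```bash
#!/usr/bin/env bash
# Sketch: fail early if the auth file is not owned and writable by the current user.

# auth_file_usable FILE
# Succeeds only if FILE is a regular file that the current user owns and can write.
auth_file_usable() {
  local f="$1"
  [ -f "$f" ] && [ -O "$f" ] && [ -w "$f" ]
}

# Hypothetical guard before `podman login`:
# auth_file_usable "$HOME/.config/containers/auth.json" \
#   || { echo "auth.json is missing or not owned by $(id -un)" >&2; exit 1; }
```

With the old `sudo cp`, the `-O` test would fail for the runner user, surfacing the ownership problem before the login step instead of at pull time.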
