Commit 18da822
fix(ci): set correct permissions on podman's credentials file for red hat subscription and quay.io/aipcc login in build-notebooks workflow
## Root Cause Analysis
The failing job is `build-aipcc (cuda-jupyter-minimal-ubi9-python-3.12, ...)`. This is the `build-aipcc` job defined in `build-notebooks-push.yaml`, which builds with `konflux: true` and `subscription: true`. With `konflux: true`, the build uses `build-args/konflux.cuda.conf`:
Lines 5-6 of `jupyter/minimal/ubi9-python-3.12/build-args/konflux.cuda.conf`:
```
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
```
This is a **private** image on `quay.io/aipcc`, requiring authentication.
## The Authentication Path (and where it breaks)
There are **two separate auth setups** that run before the build, and the podman client and the podman server do not share an auth store:
### 1. The rootful podman architecture
The workflow sets `CONTAINER_HOST: unix:///var/run/podman/podman.sock` (line 59), meaning all `podman` commands run as a **remote client** talking to a **rootful podman daemon** via socket. This is critical because the client and server have **separate auth stores**.
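As a rough illustration of the client/server split, the sketch below classifies where `podman` commands would run for a given `CONTAINER_HOST` value. The function name `describe_podman_target` is hypothetical, and the sketch assumes `CONTAINER_HOST` is the only remote-selection mechanism in play (it ignores `--remote`, `--url`, and configured system connections).

```shell
# describe_podman_target HOST: classify where podman would run for a given
# CONTAINER_HOST value. Hypothetical helper, not part of the workflow.
describe_podman_target() {
  local host="${1:-}"
  case "$host" in
    "")       echo "local" ;;                              # no remote host configured
    unix://*) echo "remote socket: ${host#unix://}" ;;     # strip the scheme, keep the path
    *)        echo "remote: $host" ;;                      # e.g. ssh:// or tcp:// URLs
  esac
}

# Report the target for the current environment.
describe_podman_target "${CONTAINER_HOST:-}"
```

With the value from the workflow, the helper reports a remote socket, which is the case where client-side and server-side credentials can diverge.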
### 2. Auth setup step order
**Step A** (lines 134-135) — subscription step copies a pull-secret to the client-side auth.json:
```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
```
This file is **root-owned** because it is created with `sudo cp`.
**Step B** (lines 207-212) — login to quay.io/aipcc:
```bash
echo "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}" | podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin
```
Running as the `runner` user, this tries to merge the aipcc credentials into the auth.json. But the auth.json is **root-owned** from Step A, so `podman login` may fail or silently write to a different location (`$XDG_RUNTIME_DIR/containers/auth.json`, which is podman's default auth file on Linux when neither `--authfile` nor `REGISTRY_AUTH_FILE` is set).
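A quick way to confirm the ownership problem is to check, as the `runner` user, whether the auth file can be updated before `podman login` runs. This is a minimal sketch; `check_auth_file` is a hypothetical helper and the path is the one used in Step A.

```shell
# check_auth_file PATH: report whether the current user can update PATH.
# Hypothetical helper for a debugging step; prints one of:
#   missing | writable | not-writable
check_auth_file() {
  local path="$1"
  if [ ! -e "$path" ]; then
    echo "missing"
  elif [ -w "$path" ]; then
    echo "writable"
  else
    echo "not-writable"
  fi
}

# Check the client-side auth file written by the subscription step.
check_auth_file "${HOME:-}/.config/containers/auth.json"
```

A `not-writable` result after Step A would be consistent with the file having been created by `sudo cp`.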
### 3. The server-side gap
When `podman build` runs (line 261: `make ${{ inputs.target }}`), the build happens on the **rootful podman server**. For the server to pull `quay.io/aipcc/...`, either:
- The client must forward credentials (via the `X-Registry-Config` header in the API call), or
- The server must have its own auth.json with the right credentials
Neither is reliably guaranteed here:
- Credential forwarding in podman remote mode is version-dependent and has had bugs
- The server's auth store (`/run/containers/0/auth.json` or similar) was never explicitly set up
## Why "some succeed, some fail"
The most likely explanation for the intermittent nature is **quay.io rate limiting / throttling**. When the workflow matrix spawns ~20+ parallel jobs, they all simultaneously:
1. Authenticate with the same robot account credentials against quay.io
2. Try to pull the same base images
If quay.io answers throttled requests with **401 Unauthorized** rather than 429 Too Many Requests, the result would be the "unauthorized" error you see, even though the credentials are correct. The jobs that happen to execute their pull before the rate limit kicks in succeed; the rest get rejected.
## Potential Fixes
1. **Add retry logic to the build step** — e.g., retry the `make` target 2-3 times with a backoff. This is the simplest mitigation for registry rate limiting.
2. **Fix auth.json file ownership** — change the subscription step to not use `sudo`, or `chown` the file afterward:
```bash
# Copy the pull secret, then hand ownership back to the current (runner) user.
sudo cp "${PWD}/ci/secrets/pull-secret.json" "$HOME/.config/containers/auth.json"
sudo chown "$(id -u):$(id -g)" "$HOME/.config/containers/auth.json"
```
3. **Pre-pull the base image explicitly** before `make`, with retry logic:
```bash
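# Retry the pull up to 3 times, waiting 30s and then 60s between attempts.
pulled=false
for i in 1 2 3; do
  podman pull quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345 && { pulled=true; break; }
  if [ "$i" -lt 3 ]; then sleep $((i * 30)); fi
done
# Fail the step explicitly instead of falling through to the build.
"$pulled" || { echo "base image pull failed after 3 attempts" >&2; exit 1; }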
```
4. **Stagger the matrix jobs** to avoid all hitting quay.io simultaneously (e.g., using `max-parallel` in the strategy).
5. **Set up server-side auth explicitly** so the rootful podman daemon has credentials:
```bash
sudo podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin <<< "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}"
```
This would bypass the client/server auth split entirely by logging in directly on the server side.
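The retry idea in fix 1 can be sketched as a small reusable wrapper. This is only an illustration under assumptions: the function name `retry`, the doubling backoff, and the argument order are all choices made here, not something the workflow already defines.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS times.
# Sleeps DELAY seconds after the first failure and doubles the delay each time.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}
```

In the build step this might be invoked as `retry 3 30 make "$TARGET"`, where `TARGET` stands in for the workflow's `inputs.target` value. Retrying only helps with transient failures; it does not help when the credentials genuinely lack access.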
---
Good find on the actual root cause being missing repository-level access on the robot account. That explains the "some succeed, some fail" pattern perfectly — it wasn't rate limiting or auth setup at all, just different repositories having different permissions.
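A pre-flight check along these lines could surface per-repository permission gaps before the build starts. It is a sketch under assumptions: `check_repo_access` and `probe_with_skopeo` are hypothetical names, and it assumes `skopeo` is installed on the runner and already has the robot account's credentials available.

```shell
# check_repo_access PROBE IMAGE...: run PROBE on each image reference and
# print "ok IMAGE" or "DENIED IMAGE". Returns 1 if any probe failed.
check_repo_access() {
  local probe="$1"
  shift
  local image status=0
  for image in "$@"; do
    if "$probe" "$image" >/dev/null 2>&1; then
      echo "ok $image"
    else
      echo "DENIED $image"
      status=1
    fi
  done
  return "$status"
}

# Probe that asks the registry for image metadata without pulling layers.
# Assumes skopeo is available and logged in.
probe_with_skopeo() {
  skopeo inspect "docker://$1"
}

# Example (not executed here):
#   check_repo_access probe_with_skopeo \
#     quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
```

Listing every base image used by the matrix would show which repositories the robot account cannot read, rather than discovering it job by job.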
**Yes, `994be26aa` (the `sudo cp` -> `cp` change) is still valuable.** It fixes a real latent bug:
- `sudo cp` creates a **root-owned** auth.json at `$HOME/.config/containers/auth.json`
- When `podman login quay.io/aipcc` later runs as the runner user, it **can't write** to a root-owned file
- So podman login silently writes to the fallback `$XDG_RUNTIME_DIR/containers/auth.json` instead
- Now credentials are **split across two files**: pull-secret in one, aipcc creds in another
- Podman reads from the first file it finds in priority order, potentially missing credentials from the other
Using `cp` (no sudo) keeps everything owned by the runner user so `podman login` can merge into the same file. It's a correctness fix independent of the repository permissions issue.
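To see whether credentials have ended up split across files, a debugging step could list which candidate auth files mention a given registry. This is a minimal sketch: `find_registry_creds` is a hypothetical helper, it does a plain text match rather than parsing JSON, and the caller decides which paths to check.

```shell
# find_registry_creds REGISTRY FILE...: print each FILE that contains an
# entry key for REGISTRY. Missing or unreadable files are skipped silently.
find_registry_creds() {
  local registry="$1"
  shift
  local f
  for f in "$@"; do
    # -F: fixed string, -q: no output, -s: suppress errors for missing files.
    if grep -Fqs "\"$registry\"" "$f"; then
      printf '%s\n' "$f"
    fi
  done
}

# Example (not executed here), using the two paths discussed above:
#   find_registry_creds quay.io/aipcc \
#     "$XDG_RUNTIME_DIR/containers/auth.json" \
#     "$HOME/.config/containers/auth.json"
```

If the pull secret and the aipcc entry show up in different files, the credentials are split in the way described above.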
**However**, the additional changes we made in this conversation (consolidated login step, `REGISTRY_AUTH_FILE`, server-side copy to `/root/.config/`) were all attempts to work around what turned out to be the wrong root cause. They are defensive but add complexity. Now that the robot account will have proper access, you may want to revert them and keep only the `sudo cp` -> `cp` fix from the commit.
1 parent 8648b25 commit 18da822
1 file changed, +5 -1 lines changed: original line 135 is replaced by new lines 135-139.