chore(ci, build): add support for Dockerfile.konflux.* builds in Makefile and tests (#2933)
Conversation
Skipping CI for Draft Pull Request.
📝 **Walkthrough**

This PR introduces Konflux build support across GitHub Actions workflows and the Makefile build system. It adds a new `KONFLUX` flag to conditionally build from `Dockerfile.konflux.*` variants, updates workflow inputs and job configurations to enable this feature, modifies the Makefile build pipeline to support variant-specific Dockerfile and build-args selection, and updates corresponding tests to validate the new behavior.
Force-pushed from dbbdb6e to 37e4942.
…ture when we introduced base-images Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 37e4942 to 473796d.
…file and tests

- Introduced `KONFLUX` flag in Makefile for building images with `Dockerfile.konflux.*`.
- Added `TestMakefile` class to verify `KONFLUX`-specific configurations.
- Updated helper methods for handling new makefile logic and assertions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…bases in red-hat-data-services

* add new checkboxes to let the user choose `Dockerfile.konflux.*` in on-demand builds
* fixup: `env` is not allowed in a composite action's workflow YAML, so an input parameter has to be used instead
… hat subscription and quay.io/aipcc login in build-notebooks workflow
## Root Cause Analysis
The failing job is `build-aipcc (cuda-jupyter-minimal-ubi9-python-3.12, ...)`. This is the `build-aipcc` job defined in `build-notebooks-push.yaml`, which builds with `konflux: true` and `subscription: true`. With `konflux: true`, the build uses `build-args/konflux.cuda.conf`:
```conf
# jupyter/minimal/ubi9-python-3.12/build-args/konflux.cuda.conf, lines 5-6
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
```
This is a **private** image on `quay.io/aipcc`, requiring authentication.
## The Authentication Path (and where it breaks)
There are **two separate auth setups** that run before the build, plus a client/server podman split that can leave them out of sync:
### 1. The rootful podman architecture
The workflow sets `CONTAINER_HOST: unix:///var/run/podman/podman.sock` (line 59), meaning all `podman` commands run as a **remote client** talking to a **rootful podman daemon** via socket. This is critical because the client and server have **separate auth stores**.
### 2. Auth setup step order
**Step A** (lines 134-135) — subscription step copies a pull-secret to the client-side auth.json:
```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
```
This file is created as **root-owned** (because `sudo cp`).
**Step B** (lines 207-212) — login to quay.io/aipcc:
```bash
echo "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}" | podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin
```
Running as the `runner` user, this tries to merge aipcc credentials into the auth.json. But the auth.json is **root-owned** from Step A, so `podman login` may fail or silently write to a fallback location (`$XDG_RUNTIME_DIR/containers/auth.json`).
### 3. The server-side gap
When `podman build` runs (line 261: `make ${{ inputs.target }}`), the build happens on the **rootful podman server**. For the server to pull `quay.io/aipcc/...`, either:
- The client must forward credentials (via the `X-Registry-Config` header in the API call), or
- The server must have its own auth.json with the right credentials
Neither is reliably guaranteed here:
- Credential forwarding in podman remote mode is version-dependent and has had bugs
- The server's auth store (`/run/containers/0/auth.json` or similar) was never explicitly set up
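The split is easier to see by enumerating the auth stores involved. The paths below are the usual podman defaults and an assumption about this runner image, not verified from the workflow:

```shell
# Client-side stores consulted by the podman remote client (runner user):
client_runtime="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/containers/auth.json"
client_home="$HOME/.config/containers/auth.json"
# Server-side store for the rootful daemon:
server_auth="/run/containers/0/auth.json"

# Report which of the three stores actually exist on this machine.
for f in "$client_runtime" "$client_home" "$server_auth"; do
  if [ -e "$f" ]; then state="present"; else state="absent"; fi
  echo "$f: $state"
done
```

A credential written to one of these files is invisible to a process reading another, which is exactly the gap between Step B's `podman login` and the server-side pull.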
## Why "some succeed, some fail"
The most likely explanation for the intermittent nature is **quay.io rate limiting / throttling**. When the workflow matrix spawns ~20+ parallel jobs, they all simultaneously:
1. Authenticate with the same robot account credentials against quay.io
2. Try to pull the same base images
Quay.io is known to return **401 Unauthorized instead of 429 Too Many Requests** when rate-limiting. This manifests as the "unauthorized" error you see, even though the credentials are correct. The jobs that happen to execute their pull before the rate limit kicks in succeed; the rest get rejected.
## Potential Fixes
1. **Add retry logic to the build step** — e.g., retry the `make` target 2-3 times with a backoff. This is the simplest mitigation for registry rate limiting.
2. **Fix auth.json file ownership** — change the subscription step to not use `sudo`, or `chown` the file afterward:
```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
sudo chown $(id -u):$(id -g) $HOME/.config/containers/auth.json
```
3. **Pre-pull the base image explicitly** before `make`, with retry logic:
```bash
for i in 1 2 3; do
podman pull quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345 && break
sleep $((i * 30))
done
```
4. **Stagger the matrix jobs** to avoid all hitting quay.io simultaneously (e.g., using `max-parallel` in the strategy).
5. **Set up server-side auth explicitly** so the rootful podman daemon has credentials:
```bash
sudo podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin <<< "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}"
```
This would bypass the client/server auth split entirely by logging in directly on the server side.
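Fix 1 could be implemented as a small generic wrapper; the function name and backoff values below are illustrative, not taken from the workflow:

```shell
# retry N CMD... : run CMD up to N times, pausing longer between each try.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$i"   # CI would use a larger backoff
    i=$((i + 1))
  done
  return 1
}

# In the workflow this would wrap the build step, e.g.:
#   retry 3 make "${TARGET}"
retry 2 true && echo "ok on first try"
```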
---
Good find on the actual root cause being missing repository-level access on the robot account. That explains the "some succeed, some fail" pattern perfectly — it wasn't rate limiting or auth setup at all, just different repositories having different permissions.
**Yes, `994be26aa` (the `sudo cp` -> `cp` change) is still valuable.** It fixes a real latent bug:
- `sudo cp` creates a **root-owned** auth.json at `$HOME/.config/containers/auth.json`
- When `podman login quay.io/aipcc` later runs as the runner user, it **can't write** to a root-owned file
- So podman login silently writes to the fallback `$XDG_RUNTIME_DIR/containers/auth.json` instead
- Now credentials are **split across two files**: pull-secret in one, aipcc creds in another
- Podman reads from the first file it finds in priority order, potentially missing credentials from the other
Using `cp` (no sudo) keeps everything owned by the runner user so `podman login` can merge into the same file. It's a correctness fix independent of the repository permissions issue.
**However**, the additional changes we made in this conversation (consolidated login step, `REGISTRY_AUTH_FILE`, server-side copy to `/root/.config/`) were all attempts to work around what turned out to be the wrong root cause. Those are defensive but add complexity. You may want to revert them and keep only the `sudo cp` -> `cp` fix from the commit, now that the robot account will have proper access.
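The "credentials split across two files" failure mode boils down to a first-match lookup. Here is a toy model of it; the ordering is an assumption about podman's behavior, not a verified spec:

```shell
# Return the first existing file from an ordered candidate list.
first_existing() {
  for f in "$@"; do
    if [ -e "$f" ]; then printf '%s\n' "$f"; return 0; fi
  done
  return 1
}

tmp=$(mktemp -d)
# Stand-ins for the two real files (hypothetical contents):
printf '{"auths":{"quay.io/aipcc":{}}}' > "$tmp/runtime-auth.json"   # podman login fallback
printf '{"auths":{"registry.redhat.io":{}}}' > "$tmp/home-auth.json" # copied pull-secret
# Only the first match is consulted, so entries in the second file
# are never seen by a reader that stops at the first hit.
first_existing "$tmp/runtime-auth.json" "$tmp/home-auth.json"
rm -rf "$tmp"
```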
Force-pushed from 473796d to f5c417f.
@jiridanek: The following test failed.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ysok.
The PR looks clean and well-structured. Here's my AI's review:
Overall: Looks good
4 commits, each logically scoped. The cleanup removed all the over-engineered auth fixes from our debugging session and kept only what matters.
Per-file review
**`Makefile` — KONFLUX Dockerfile selection**

The `image` function refactoring is clean. The key logic correctly maps:

- `KONFLUX=no` → `Dockerfile.cpu` + `build-args/cpu.conf`
- `KONFLUX=yes` → `Dockerfile.konflux.cpu` + `build-args/konflux.cpu.conf`

The refactoring to pass `CONF_FILE` as a 3rd arg to `build_image` was necessary — the old `cut -d. -f2` on `Dockerfile.konflux.cpu` would have extracted `konflux` instead of `cpu`.

**`build-notebooks-push.yaml` — `build`/`build-aipcc` split**

The `if:` conditions are correct for all event types; whether `build` and `build-aipcc` run depends on the `odh` and `rhds` checkboxes. One minor observation: if someone triggers `workflow_dispatch` with neither `odh` nor `rhds` checked, nothing builds. That's a valid user choice but could be surprising. You might consider making one default to `true`, but it's not a blocker.

**`build-notebooks-TEMPLATE.yaml` — auth fix**

The `sudo cp` → `cp` change is the right minimal fix. Good comment explaining why.

**`makefile_helper.py` — new test**

The `test_makefile__build_image__konflux` test is a nice addition that validates the KONFLUX Makefile logic end-to-end via dry-run. The `_assert_subdict` helper is clean. One nit: `_extract_assignments` only matches `:=` assignments, which is correct for Make's `--print-data-base` output for immediately-expanded variables.

**`gen_gha_matrix_jobs.py` — test fix**

The changed file path was updated from `cuda/ubi9-python-3.12/NGC-DL-CONTAINER-LICENSE` (which no longer exists after the base-images restructuring) to `jupyter/utils/addons/dist/pf.css`. The expected targets are updated accordingly. Straightforward fix.

**No issues found.** The PR is ready as-is.
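The flavor-extraction pitfall called out in the Makefile review can be reproduced in plain shell. This is illustrative only; the real logic lives in the Makefile:

```shell
# Old approach: take the 2nd dot-separated field of the Dockerfile name.
old_flavor() { printf '%s\n' "$1" | cut -d. -f2; }
# Safer: take everything after the last dot.
new_flavor() { printf '%s\n' "${1##*.}"; }

old_flavor Dockerfile.cpu           # cpu
old_flavor Dockerfile.konflux.cpu   # konflux (wrong flavor!)
new_flavor Dockerfile.konflux.cpu   # cpu
```

Passing `CONF_FILE` explicitly, as the PR does, sidesteps the parsing question entirely.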
How Has This Been Tested?
Self checklist (all need to be checked):
- Run `make test` (`gmake` on macOS) before asking for review.
- Changes to `Dockerfile.konflux` files should be done in `odh/notebooks` and automatically synced to `rhds/notebooks`. For Konflux-specific changes, modify `Dockerfile.konflux` files directly in `rhds/notebooks`, as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria: