chore(ci, build): add support for KONFLUX Dockerfile builds in Makefile and tests #1879 (Draft)
Conversation
…pdate-20260204-0008 Update lock files
Externalize indexes outside pylock_generator script for better maintenance and flexibility
Remove fallback logic to cpu index if some packages are not included in the corresponding cuda or rocm
Update minimal image to new cuda 13
Final fixes for codeserver: ppc and s390x platform build failures, plus version updates and manifest annotations
…dependencies (opendatahub-io#2917)

* chore(buildinputs): `go mod tidy`: update Go to 1.25 and bump module dependencies. The Platform field moved from sourceresolver.Opt to sourceresolver.Opt.ImageOpt.Platform in buildkit v0.27.x.

* This error occurs when upgrading OpenTelemetry SDK dependencies: internal packages were reorganized in v1.40.0. The issue is a version compatibility problem between OpenTelemetry packages. When `go get -u all` upgrades `go.opentelemetry.io/otel/sdk` from v1.38.0 to v1.40.0, it creates an incompatibility with the contrib instrumentation packages (v0.63.0), which still reference internal packages that were reorganized in v1.40.0. `go get -u all` is not recommended precisely because it can cause these kinds of version mismatches. The options are:

**Option 1: Upgrade contrib packages together with the SDK (recommended)**

Run these commands to upgrade both the SDK and the contrib packages to compatible versions:

```
cd scripts/buildinputs
go get go.opentelemetry.io/otel@latest
go get go.opentelemetry.io/otel/sdk@latest
go get go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@latest
go get go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace@latest
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@latest
go mod tidy
```

**Option 2: Reset and let moby/buildkit determine compatible versions**

Since these are indirect dependencies pulled in by `moby/buildkit`, you can let the buildkit dependency dictate the versions:

```
cd scripts/buildinputs
git checkout go.mod go.sum
go get github.com/moby/buildkit@latest
go mod tidy
```

**Option 3: Upgrade specific packages instead of all**

Avoid `go get -u all` and upgrade only what you need:

```
cd scripts/buildinputs
go get github.com/moby/buildkit@latest
go mod tidy
```

* update to new release https://github.com/openshift/check-payload/releases/tag/0.3.12
```
git ls-remote --tags https://github.com/openshift/check-payload.git | grep -E '0\.3\.12|refs/tags/$' | head -20
# 92e2a43f840ef4aa471baa42e392d803f816aa58  refs/tags/0.3.12
go get github.com/openshift/check-payload@92e2a43f840ef4aa471baa42e392d803f816aa58
go mod tidy
```
…ex) 3.4-EA1 version, update feast back to 0.59 (opendatahub-io#2922)
* update `INDEX_URL` and `CPU_INDEX_URL` to AIPCC 3.4-EA1 across all runtime and Jupyter build configurations
* update URLs in `pylock.cuda.toml` to use the `packages.redhat.com` endpoint instead of `console.redhat.com`
…, and pnpm lockfile with latest dependency versions (opendatahub-io#2926)
…nd default test image (opendatahub-io#2926)
- Introduced a reusable GitHub Action for Playwright tests to simplify workflow configurations.
- Centralized DEFAULT_TEST_IMAGE in `playwright.config.ts` for easier maintenance.
- Updated workflows to use the new action, removing duplicate logic and enabling artifact uploads for CI.
…stopped including 1.x versions
…pdate-20260206-1029 Update lock files
…-io#2929)
- Uncomment codeflare-sdk in jupyter/datascience and runtimes/datascience
- Add Codeflare-SDK to jupyter-datascience-notebook imagestream manifest
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver for each of the affected files; approvers can indicate their approval in a comment.
Force-pushed from 6442cef to 3117865.
…ss all Dockerfile.konflux.*
…ture when we introduced base-images. Co-authored-by: Cursor <cursoragent@cursor.com>
…file and tests
- Introduced `KONFLUX` flag in Makefile for building images with `Dockerfile.konflux.*`.
- Added `TestMakefile` class to verify `KONFLUX`-specific configurations.
- Updated helper methods for handling new makefile logic and assertions.
Co-authored-by: Cursor <cursoragent@cursor.com>
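The selection logic such a flag implies can be sketched in shell (a hypothetical illustration; the actual Makefile variable names and flavor handling may differ):

```shell
# Hypothetical sketch of the Dockerfile selection a KONFLUX flag implies;
# names and flavors are illustrative, not the Makefile's actual code.
select_dockerfile() {
  # $1: KONFLUX flag (yes/no), $2: image flavor (e.g. cuda, rocm, cpu)
  if [ "$1" = "yes" ]; then
    echo "Dockerfile.konflux.$2"
  else
    echo "Dockerfile.$2"
  fi
}

select_dockerfile yes cuda   # prints Dockerfile.konflux.cuda
select_dockerfile no cpu     # prints Dockerfile.cpu
```

Keeping the choice behind a single flag means the rest of the build targets stay identical between upstream and Konflux builds.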
…bases in red-hat-data-services
* add new checkboxes to let the user choose `Dockerfile.konflux.*` in the on-demand build
* fixup: `env` is not allowed in a composite action's workflow YAML, so an input parameter is used instead
… hat subscription and quay.io/aipcc login in build-notebooks workflow
## Root Cause Analysis
The failing job is `build-aipcc (cuda-jupyter-minimal-ubi9-python-3.12, ...)`. This is the `build-aipcc` job defined in `build-notebooks-push.yaml`, which builds with `konflux: true` and `subscription: true`. With `konflux: true`, the build uses `build-args/konflux.cuda.conf`:
From `jupyter/minimal/ubi9-python-3.12/build-args/konflux.cuda.conf` (lines 5-6):
```
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
```
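As an aside, a KEY=VALUE conf file in this form can be expanded into `--build-arg` flags with a small shell loop (a hypothetical sketch; the repository's Makefile may consume these files differently):

```shell
# Hypothetical sketch: expand a KEY=VALUE build-args conf file into
# podman --build-arg flags. The real Makefile may do this differently.
conf=$(mktemp)
cat > "$conf" <<'EOF'
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
EOF

args=""
while IFS= read -r line; do
  # skip blank lines and comments
  case "$line" in ''|'#'*) continue ;; esac
  args="$args --build-arg $line"
done < "$conf"
echo "podman build$args -f Dockerfile.konflux.cuda ."
```

The loop passes keys through verbatim, so new build args added to a conf file need no Makefile changes.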
This is a **private** image on `quay.io/aipcc`, requiring authentication.
## The Authentication Path (and where it breaks)
There are **two separate auth setups** that run before the build, and a mismatch between client/server podman:
### 1. The rootful podman architecture
The workflow sets `CONTAINER_HOST: unix:///var/run/podman/podman.sock` (line 59), meaning all `podman` commands run as a **remote client** talking to a **rootful podman daemon** via socket. This is critical because the client and server have **separate auth stores**.
### 2. Auth setup step order
**Step A** (lines 134-135) — subscription step copies a pull-secret to the client-side auth.json:
```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
```
This file is created as **root-owned** (because `sudo cp`).
**Step B** (lines 207-212) — login to quay.io/aipcc:
```bash
echo "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}" | podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin
```
Running as the `runner` user, this tries to merge aipcc credentials into the auth.json. But the auth.json is **root-owned** from Step A, so `podman login` may fail or silently write to a fallback location (`$XDG_RUNTIME_DIR/containers/auth.json`).
### 3. The server-side gap
When `podman build` runs (line 261: `make ${{ inputs.target }}`), the build happens on the **rootful podman server**. For the server to pull `quay.io/aipcc/...`, either:
- The client must forward credentials (via the `X-Registry-Config` header in the API call), or
- The server must have its own auth.json with the right credentials
Neither is reliably guaranteed here:
- Credential forwarding in podman remote mode is version-dependent and has had bugs
- The server's auth store (`/run/containers/0/auth.json` or similar) was never explicitly set up
## Why "some succeed, some fail"
The most likely explanation for the intermittent nature is **quay.io rate limiting / throttling**. When the workflow matrix spawns ~20+ parallel jobs, they all simultaneously:
1. Authenticate with the same robot account credentials against quay.io
2. Try to pull the same base images
Quay.io is known to return **401 Unauthorized instead of 429 Too Many Requests** when rate-limiting. This manifests as the "unauthorized" error you see, even though the credentials are correct. The jobs that happen to execute their pull before the rate limit kicks in succeed; the rest get rejected.
## Potential Fixes
1. **Add retry logic to the build step** — e.g., retry the `make` target 2-3 times with a backoff. This is the simplest mitigation for registry rate limiting.
2. **Fix auth.json file ownership** — change the subscription step to not use `sudo`, or `chown` the file afterward:
```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
sudo chown $(id -u):$(id -g) $HOME/.config/containers/auth.json
```
3. **Pre-pull the base image explicitly** before `make`, with retry logic:
```bash
for i in 1 2 3; do
  podman pull quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345 && break
  [ "$i" -eq 3 ] && exit 1  # fail the job once all attempts are exhausted
  sleep $((i * 30))
done
```
4. **Stagger the matrix jobs** to avoid all hitting quay.io simultaneously (e.g., using `max-parallel` in the strategy).
5. **Set up server-side auth explicitly** so the rootful podman daemon has credentials:
```bash
sudo podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin <<< "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}"
```
This would bypass the client/server auth split entirely by logging in directly on the server side.
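The retry logic from fix 1 can be sketched generically (illustrative code, not the workflow's; the backoff is seconds-scale here for the demo, where a real CI retry would use much longer delays):

```shell
# Generic retry-with-backoff helper (sketch); not the workflow's actual code.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # back off before the next attempt, skipping the sleep after the last try
    [ "$i" -lt "$attempts" ] && sleep "$i"
    i=$((i + 1))
  done
  return 1
}

# Demo: a command that succeeds on its third invocation.
counter=$(mktemp)
echo 0 > "$counter"
flaky() {
  n=$(($(cat "$counter") + 1))
  echo "$n" > "$counter"
  [ "$n" -ge 3 ]
}
retry 5 flaky && echo "succeeded after $(cat "$counter") attempts"
# prints: succeeded after 3 attempts
```

Wrapping `make ${{ inputs.target }}` (or the pre-pull) in such a helper would absorb transient 401/429 responses without masking persistent auth failures, since the helper still exits nonzero after the last attempt.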
---
Good find on the actual root cause being missing repository-level access on the robot account. That explains the "some succeed, some fail" pattern perfectly — it wasn't rate limiting or auth setup at all, just different repositories having different permissions.
**Yes, `994be26aa` (the `sudo cp` -> `cp` change) is still valuable.** It fixes a real latent bug:
- `sudo cp` creates a **root-owned** auth.json at `$HOME/.config/containers/auth.json`
- When `podman login quay.io/aipcc` later runs as the runner user, it **can't write** to a root-owned file
- So podman login silently writes to the fallback `$XDG_RUNTIME_DIR/containers/auth.json` instead
- Now credentials are **split across two files**: pull-secret in one, aipcc creds in another
- Podman reads from the first file it finds in priority order, potentially missing credentials from the other
Using `cp` (no sudo) keeps everything owned by the runner user so `podman login` can merge into the same file. It's a correctness fix independent of the repository permissions issue.
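That priority-order lookup can be sketched roughly as follows (a simplified, assumed order; the real containers/image logic has additional fallbacks such as Docker's `config.json`):

```shell
# Simplified sketch of podman's auth-file lookup order (assumed; the real
# containers/image code also falls back to Docker's config.json, etc.).
find_auth() {
  for f in "${REGISTRY_AUTH_FILE:-}" \
           "${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/containers/auth.json" \
           "$HOME/.config/containers/auth.json"; do
    if [ -n "$f" ] && [ -r "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  return 1
}

find_auth || echo "no auth file found"
```

With credentials split across the second and third locations, only the first readable file wins, which is exactly why merging everything into one user-owned auth.json (or pinning `REGISTRY_AUTH_FILE`) matters.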
**However**, the additional changes we made in this conversation (consolidated login step, `REGISTRY_AUTH_FILE`, server-side copy to `/root/.config/`) were all attempts to work around what turned out to be the wrong root cause. Those are defensive but add complexity. You may want to revert them and keep only the `sudo cp` -> `cp` fix from the commit, now that the robot account will have proper access.
Force-pushed from fece15a to 4d5e61a.
Dockerfile.konflux.* builds in Makefile and tests (opendatahub-io/notebooks#2933)

Description

How Has This Been Tested?

Self checklist (all need to be checked):
- Run `make test` (`gmake` on macOS) before asking for review.
- Changes to `Dockerfile.konflux` files should be done in `odh/notebooks` and automatically synced to `rhds/notebooks`. For Konflux-specific changes, modify `Dockerfile.konflux` files directly in `rhds/notebooks`, as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria: