
chore(ci, build): add support for KONFLUX Dockerfile builds in Makefile and tests #1879

Draft
jiridanek wants to merge 34 commits into main from jd/26/02/makefile_dockerfile_konflux_rhds

Conversation

@jiridanek

Description

How Has This Been Tested?

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

github-actions bot and others added 28 commits February 4, 2026 00:08
Externalize indexes outside pylock_generator script for better maintenance and flexibility
Remove fallback logic to cpu index if some packages are not include i…
Final fixes for codeserver: ppc and s390x platform build failures, version updates, and manifest annotations
…dependencies (opendatahub-io#2917)

* chore(buildinputs): `go mod tidy`: update Go to 1.25 and bump module dependencies

The Platform field moved from sourceresolver.Opt to sourceresolver.Opt.ImageOpt.Platform in buildkit v0.27.x.

* This error occurs when upgrading OpenTelemetry SDK dependencies: internal packages were reorganized in v1.40.0.

The issue is a version compatibility problem between OpenTelemetry packages. When `go get -u all` tries to upgrade `go.opentelemetry.io/otel/sdk` from v1.38.0 to v1.40.0, it creates an incompatibility with the contrib instrumentation packages (v0.63.0) that still reference internal packages that were reorganized in v1.40.0.

The `go get -u all` command is not recommended because it can cause these kinds of version mismatches. Here are your options:

**Option 1: Upgrade contrib packages together with SDK (recommended)**

Run these commands to upgrade both the SDK and contrib packages to compatible versions:

```
cd scripts/buildinputs
go get go.opentelemetry.io/otel@latest
go get go.opentelemetry.io/otel/sdk@latest
go get go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@latest
go get go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace@latest
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@latest
go mod tidy
```

**Option 2: Reset and let moby/buildkit determine compatible versions**

Since these are indirect dependencies pulled in by `moby/buildkit`, you can let the buildkit dependency dictate the versions:

```
cd scripts/buildinputs
git checkout go.mod go.sum
go get github.com/moby/buildkit@latest
go mod tidy
```

**Option 3: Upgrade specific packages instead of all**

Avoid `go get -u all` and instead upgrade only what you need:

```
cd scripts/buildinputs
go get github.com/moby/buildkit@latest
go mod tidy
```

* update to new release https://github.com/openshift/check-payload/releases/tag/0.3.12

```
git ls-remote --tags https://github.com/openshift/check-payload.git | grep -E '0\.3\.12|refs/tags/$' | head -20
92e2a43f840ef4aa471baa42e392d803f816aa58	refs/tags/0.3.12

go get github.com/openshift/check-payload@92e2a43f840ef4aa471baa42e392d803f816aa58
go mod tidy
```
…ex) 3.4-EA1 version, update feast back to 0.59 (opendatahub-io#2922)

* update `INDEX_URL` and `CPU_INDEX_URL` to AIPCC 3.4-EA1 across all runtime and Jupyter build configurations
* update URLs in `pylock.cuda.toml` to use `packages.redhat.com` endpoint instead of `console.redhat.com`.
…nd default test image (opendatahub-io#2926)

- Introduced a reusable GitHub Action for Playwright tests to simplify workflow configurations.
- Centralized DEFAULT_TEST_IMAGE in `playwright.config.ts` for easier maintenance.
- Updated workflows to use the new action, removing duplicate logic and enabling artifact uploads for CI.
…-io#2929)

- Uncomment codeflare-sdk in jupyter/datascience and runtimes/datascience
- Add Codeflare-SDK to jupyter-datascience-notebook imagestream manifest
@openshift-ci

openshift-ci bot commented Feb 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci bot commented Feb 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ysok for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jiridanek force-pushed the jd/26/02/makefile_dockerfile_konflux_rhds branch 3 times, most recently from 6442cef to 3117865 on February 6, 2026 21:15
jiridanek and others added 6 commits February 7, 2026 03:32
…ture when we introduced base-images

Co-authored-by: Cursor <cursoragent@cursor.com>
…file and tests

- Introduced `KONFLUX` flag in Makefile for building images with `Dockerfile.konflux.*`.
- Added `TestMakefile` class to verify `KONFLUX`-specific configurations.
- Updated helper methods for handling new makefile logic and assertions.

Co-authored-by: Cursor <cursoragent@cursor.com>
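
A minimal sketch of what such a flag could look like (variable names here are illustrative, not the actual Makefile contents):

```
# Hypothetical sketch: KONFLUX=yes switches builds to the Dockerfile.konflux.* variants
KONFLUX ?= no
ifeq ($(KONFLUX),yes)
  DOCKERFILE_NAME := Dockerfile.konflux
else
  DOCKERFILE_NAME := Dockerfile
endif

# A build rule would then reference $(DOCKERFILE_NAME).cpu, $(DOCKERFILE_NAME).cuda, etc.
```

Keeping the switch in a single variable means every image target picks up the Konflux variant without per-target changes.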
…bases in red-hat-data-services

* add new checkboxes to let user choose Dockerfile.konflux.* in on-demand build

* fixup, env is not allowed in workflow YAML composite action, have to use input parameter
… hat subscription and quay.io/aipcc login in build-notebooks workflow

## Root Cause Analysis

The failing job is `build-aipcc (cuda-jupyter-minimal-ubi9-python-3.12, ...)`. This is the `build-aipcc` job defined in `build-notebooks-push.yaml`, which builds with `konflux: true` and `subscription: true`. With `konflux: true`, the build uses `build-args/konflux.cuda.conf`:

```
# jupyter/minimal/ubi9-python-3.12/build-args/konflux.cuda.conf, lines 5-6
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
```

This is a **private** image on `quay.io/aipcc`, requiring authentication.

## The Authentication Path (and where it breaks)

There are **two separate auth setups** that run before the build, and a mismatch between client/server podman:

### 1. The rootful podman architecture

The workflow sets `CONTAINER_HOST: unix:///var/run/podman/podman.sock` (line 59), meaning all `podman` commands run as a **remote client** talking to a **rootful podman daemon** via socket. This is critical because the client and server have **separate auth stores**.
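
To make the split concrete, here is a hedged sketch of the default locations involved (paths assume stock podman conventions and can vary by version and packaging; `REGISTRY_AUTH_FILE`, if set, overrides the client path):

```shell
#!/usr/bin/env bash
# Sketch: where credentials land on each side of a remote podman setup.
# A rootless client `podman login` writes to the client-side store, while
# the rootful server daemon reads its own, separate store when pulling.
client_auth="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/containers/auth.json"
server_auth="/run/containers/0/auth.json"   # rootful daemon's store
echo "client login writes: $client_auth"
echo "server pull reads:   $server_auth"
```

Nothing synchronizes these two files automatically, which is exactly the gap described below.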

### 2. Auth setup step order

**Step A** (lines 134-135) — subscription step copies a pull-secret to the client-side auth.json:
```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
```
This file is created as **root-owned** (because `sudo cp`).

**Step B** (lines 207-212) — login to quay.io/aipcc:
```bash
echo "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}" | podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin
```
Running as the `runner` user, this tries to merge aipcc credentials into the auth.json. But the auth.json is **root-owned** from Step A, so `podman login` may fail or silently write to a fallback location (`$XDG_RUNTIME_DIR/containers/auth.json`).

### 3. The server-side gap

When `podman build` runs (line 261: `make ${{ inputs.target }}`), the build happens on the **rootful podman server**. For the server to pull `quay.io/aipcc/...`, either:
- The client must forward credentials (via the `X-Registry-Config` header in the API call), or
- The server must have its own auth.json with the right credentials

Neither is reliably guaranteed here:
- Credential forwarding in podman remote mode is version-dependent and has had bugs
- The server's auth store (`/run/containers/0/auth.json` or similar) was never explicitly set up

## Why "some succeed, some fail"

The most likely explanation for the intermittent nature is **quay.io rate limiting / throttling**. When the workflow matrix spawns ~20+ parallel jobs, they all simultaneously:
1. Authenticate with the same robot account credentials against quay.io
2. Try to pull the same base images

Quay.io is known to return **401 Unauthorized instead of 429 Too Many Requests** when rate-limiting. This manifests as the "unauthorized" error you see, even though the credentials are correct. The jobs that happen to execute their pull before the rate limit kicks in succeed; the rest get rejected.

## Potential Fixes

1. **Add retry logic to the build step** — e.g., retry the `make` target 2-3 times with a backoff. This is the simplest mitigation for registry rate limiting.

2. **Fix auth.json file ownership** — change the subscription step to not use `sudo`, or `chown` the file afterward:

```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
sudo chown $(id -u):$(id -g) $HOME/.config/containers/auth.json
```

3. **Pre-pull the base image explicitly** before `make`, with retry logic:

```bash
for i in 1 2 3; do
  podman pull quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345 && break
  sleep $((i * 30))
done
```

4. **Stagger the matrix jobs** to avoid all hitting quay.io simultaneously (e.g., using `max-parallel` in the strategy).
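
For option 4, a hedged sketch of the strategy stanza (the `max-parallel` value and matrix entry are placeholders, not the actual workflow contents):

```
strategy:
  fail-fast: false
  max-parallel: 5   # limit how many jobs hit quay.io at once
  matrix:
    target: [jupyter-minimal-ubi9-python-3.12]  # placeholder entry
```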

5. **Set up server-side auth explicitly** so the rootful podman daemon has credentials:

```bash
sudo podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin <<< "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}"
```

This would bypass the client/server auth split entirely by logging in directly on the server side.

---

Good find on the actual root cause being missing repository-level access on the robot account. That explains the "some succeed, some fail" pattern perfectly — it wasn't rate limiting or auth setup at all, just different repositories having different permissions.

**Yes, `994be26aa` (the `sudo cp` -> `cp` change) is still valuable.** It fixes a real latent bug:

- `sudo cp` creates a **root-owned** auth.json at `$HOME/.config/containers/auth.json`
- When `podman login quay.io/aipcc` later runs as the runner user, it **can't write** to a root-owned file
- So podman login silently writes to the fallback `$XDG_RUNTIME_DIR/containers/auth.json` instead
- Now credentials are **split across two files**: pull-secret in one, aipcc creds in another
- Podman reads from the first file it finds in priority order, potentially missing credentials from the other

Using `cp` (no sudo) keeps everything owned by the runner user so `podman login` can merge into the same file. It's a correctness fix independent of the repository permissions issue.
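
Another way to avoid the split-file problem entirely is to merge the two auth files into one explicitly. A sketch, assuming `jq` is available and using hypothetical file names:

```shell
#!/usr/bin/env bash
# Sketch: deep-merge two container auth files so all credentials live in a
# single auth.json (file names and auth values are illustrative).
set -euo pipefail
cat > pull-secret.json <<'EOF'
{"auths": {"registry.redhat.io": {"auth": "cHVsbDpzZWNyZXQ="}}}
EOF
cat > aipcc-auth.json <<'EOF'
{"auths": {"quay.io/aipcc": {"auth": "Ym90OnBhc3M="}}}
EOF
# jq's `*` operator merges objects recursively, so both "auths" maps combine.
jq -s '.[0] * .[1]' pull-secret.json aipcc-auth.json > merged-auth.json
```

Pointing `REGISTRY_AUTH_FILE` at the merged file would then give podman one place to look.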

**However**, the additional changes we made in this conversation (consolidated login step, `REGISTRY_AUTH_FILE`, server-side copy to `/root/.config/`) were all attempts to work around what turned out to be the wrong root cause. Those are defensive but add complexity. You may want to revert them and keep only the `sudo cp` -> `cp` fix from the commit, now that the robot account will have proper access.
@jiridanek force-pushed the jd/26/02/makefile_dockerfile_konflux_rhds branch from fece15a to 4d5e61a on February 7, 2026 12:02


3 participants