Skip to content

chore(ci, build): add support for Dockerfile.konflux.* builds in Makefile and tests#2933

Merged
jiridanek merged 4 commits intoopendatahub-io:mainfrom
jiridanek:jd/26/02/makefile_dockerfile_konflux
Feb 9, 2026
Merged

chore(ci, build): add support for Dockerfile.konflux.* builds in Makefile and tests#2933
jiridanek merged 4 commits intoopendatahub-io:mainfrom
jiridanek:jd/26/02/makefile_dockerfile_konflux

Conversation

@jiridanek
Copy link
Member

@jiridanek jiridanek commented Feb 6, 2026

The PR looks clean and well-structured. Here's my AI's review:

Overall: Looks good

4 commits, each logically scoped. The cleanup removed all the over-engineered auth fixes from our debugging session and kept only what matters.

Per-file review

Makefile — KONFLUX Dockerfile selection

The image function refactoring is clean. The key logic:

$(eval DOCKERFILE := $(BUILD_DIRECTORY)/Dockerfile$(if $(KONFLUX:no=),.konflux,$()).$(VARIANT))
$(eval CONF_FILE := $(BUILD_DIRECTORY)/build-args/$(if $(KONFLUX:no=),konflux.,$())$(VARIANT).conf)

This correctly maps:

  • KONFLUX=noDockerfile.cpu + build-args/cpu.conf
  • KONFLUX=yesDockerfile.konflux.cpu + build-args/konflux.cpu.conf

The refactoring to pass CONF_FILE as a 3rd arg to build_image was necessary — the old cut -d. -f2 on Dockerfile.konflux.cpu would have extracted konflux instead of cpu.

build-notebooks-push.yamlbuild / build-aipcc split

The if: conditions are correct for all event types:

Event Repo build runs? build-aipcc runs?
push/schedule opendatahub-io yes no
push/schedule red-hat-data-services no yes
workflow_dispatch any if odh checked if rhds checked

One minor observation: if someone triggers workflow_dispatch with neither odh nor rhds checked, nothing builds. That's a valid user choice but could be surprising. You might consider making one default to true, but it's not a blocker.

build-notebooks-TEMPLATE.yaml — auth fix

The sudo cpcp change is the right minimal fix. Good comment explaining why.

makefile_helper.py — new test

The test_makefile__build_image__konflux test is a nice addition that validates the KONFLUX Makefile logic end-to-end via dry-run. The _assert_subdict helper is clean.

One nit: _extract_assignments only matches := assignments, which is correct for Make's --print-data-base output for immediately-expanded variables.

gen_gha_matrix_jobs.py — test fix

The changed file path was updated from cuda/ubi9-python-3.12/NGC-DL-CONTAINER-LICENSE (which no longer exists after the base-images restructuring) to jupyter/utils/addons/dist/pf.css. The expected targets are updated accordingly. Straightforward fix.

No issues found

The PR is ready as-is.

How Has This Been Tested?

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Added Konflux build support with configurable flag for alternative Dockerfile variants.
    • Added manual workflow dispatch inputs to control build paths (odh, rhds).
    • Introduced new build-aipcc job for dedicated build pipeline handling.
  • Chores

    • Updated build workflow configurations and Makefile build system structure.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@github-actions github-actions bot added the review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel label Feb 6, 2026
@openshift-ci openshift-ci bot added the size/l label Feb 6, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

This PR introduces Konflux build support across GitHub Actions workflows and the Makefile build system. It adds a new KONFLUX flag to conditionally build from Dockerfile.konflux.* variants, updates workflow inputs and job configurations to enable this feature, modifies the Makefile build pipeline to support variant-specific Dockerfile and build-args selection, and updates corresponding tests to validate the new behavior.

Changes

Cohort / File(s) Summary
GitHub Actions Workflows
.github/workflows/build-notebooks-TEMPLATE.yaml, .github/workflows/build-notebooks-pr-aipcc.yaml, .github/workflows/build-notebooks-push.yaml
Added konflux input flag (boolean, default false) and propagated it through workflow calls. Replaced privileged file copy with non-privileged alternative. Split build-notebooks-push workflow into separate odh and rhds paths with independent job conditions and konflux/subscription configurations.
Makefile Build System
Makefile
Introduced KONFLUX flag (default no) to conditionally select Dockerfile.konflux.* variants. Updated build_image function signature to accept three parameters (image name, Dockerfile path, build-args config) instead of two. Modified image target to compute Dockerfile path and CONF_FILE based on KONFLUX flag and pass them explicitly to build_image.
Build Infrastructure Tests
ci/cached-builds/gen_gha_matrix_jobs.py, ci/cached-builds/makefile_helper.py
Updated test expectations in matrix job generator for changed-files filtering. Changed exec_makefile signature to accept Sequence[str] instead of list[str]. Added TestMakefile class with tests validating dry-run behavior with and without KONFLUX environment variable, including assertions on Makefile assignments and docker build arguments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description fails to properly follow the required template structure. Critical sections are incomplete or missing checkboxes. Complete all required sections: check all self-checklist items, explicitly verify testing was performed, check all merge criteria boxes, and add clear testing instructions in the PR body for non-obvious changes.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding KONFLUX support for Dockerfile builds in Makefile and CI/test infrastructure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…ture when we introduced base-images

Co-authored-by: Cursor <cursoragent@cursor.com>
@jiridanek jiridanek force-pushed the jd/26/02/makefile_dockerfile_konflux branch from 37e4942 to 473796d Compare February 7, 2026 12:03
@openshift-ci openshift-ci bot added size/l and removed size/l labels Feb 7, 2026
jiridanek and others added 3 commits February 7, 2026 13:06
…file and tests

- Introduced `KONFLUX` flag in Makefile for building images with `Dockerfile.konflux.*`.
- Added `TestMakefile` class to verify `KONFLUX`-specific configurations.
- Updated helper methods for handling new makefile logic and assertions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…bases in red-hat-data-services

* add new checkboxes to let user choose Dockerfile.konflux.* in on-demand build

* fixup, env is not allowed in workflow YAML composite action, have to use input parameter
… hat subscription and quay.io/aipcc login in build-notebooks workflow

## Root Cause Analysis

The failing job is `build-aipcc (cuda-jupyter-minimal-ubi9-python-3.12, ...)`. This is the `build-aipcc` job defined in `build-notebooks-push.yaml`, which builds with `konflux: true` and `subscription: true`. With `konflux: true`, the build uses `build-args/konflux.cuda.conf`:

```5:6:jupyter/minimal/ubi9-python-3.12/build-args/konflux.cuda.conf
BASE_IMAGE=quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345
PYLOCK_FLAVOR=cuda
```

This is a **private** image on `quay.io/aipcc`, requiring authentication.

## The Authentication Path (and where it breaks)

There are **two separate auth setups** that run before the build, and a mismatch between client/server podman:

### 1. The rootful podman architecture

The workflow sets `CONTAINER_HOST: unix:///var/run/podman/podman.sock` (line 59), meaning all `podman` commands run as a **remote client** talking to a **rootful podman daemon** via socket. This is critical because the client and server have **separate auth stores**.

### 2. Auth setup step order

**Step A** (lines 134-135) — subscription step copies a pull-secret to the client-side auth.json:
```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
```
This file is created as **root-owned** (because `sudo cp`).

**Step B** (lines 207-212) — login to quay.io/aipcc:
```bash
echo "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}" | podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin
```
Running as the `runner` user, this tries to merge aipcc credentials into the auth.json. But the auth.json is **root-owned** from Step A, so `podman login` may fail or silently write to a fallback location (`$XDG_RUNTIME_DIR/containers/auth.json`).

### 3. The server-side gap

When `podman build` runs (line 261: `make ${{ inputs.target }}`), the build happens on the **rootful podman server**. For the server to pull `quay.io/aipcc/...`, either:
- The client must forward credentials (via the `X-Registry-Config` header in the API call), or
- The server must have its own auth.json with the right credentials

Neither is reliably guaranteed here:
- Credential forwarding in podman remote mode is version-dependent and has had bugs
- The server's auth store (`/run/containers/0/auth.json` or similar) was never explicitly set up

## Why "some succeed, some fail"

The most likely explanation for the intermittent nature is **quay.io rate limiting / throttling**. When the workflow matrix spawns ~20+ parallel jobs, they all simultaneously:
1. Authenticate with the same robot account credentials against quay.io
2. Try to pull the same base images

Quay.io is known to return **401 Unauthorized instead of 429 Too Many Requests** when rate-limiting. This manifests as the "unauthorized" error you see, even though the credentials are correct. The jobs that happen to execute their pull before the rate limit kicks in succeed; the rest get rejected.

## Potential Fixes

1. **Add retry logic to the build step** — e.g., retry the `make` target 2-3 times with a backoff. This is the simplest mitigation for registry rate limiting.

2. **Fix auth.json file ownership** — change the subscription step to not use `sudo`, or `chown` the file afterward:

```bash
sudo cp ${PWD}/ci/secrets/pull-secret.json $HOME/.config/containers/auth.json
sudo chown $(id -u):$(id -g) $HOME/.config/containers/auth.json
```

3. **Pre-pull the base image explicitly** before `make`, with retry logic:

```bash
for i in 1 2 3; do
  podman pull quay.io/aipcc/base-images/cuda-12.9-el9.6:3.3.0-1768412345 && break
  sleep $((i * 30))
done
```

4. **Stagger the matrix jobs** to avoid all hitting quay.io simultaneously (e.g., using `max-parallel` in the strategy).

5. **Set up server-side auth explicitly** so the rootful podman daemon has credentials:

```bash
sudo podman login quay.io/aipcc -u "${{ secrets.AIPCC_QUAY_BOT_USERNAME }}" --password-stdin <<< "${{ secrets.AIPCC_QUAY_BOT_PASSWORD }}"
```

This would bypass the client/server auth split entirely by logging in directly on the server side.

---

Good find on the actual root cause being missing repository-level access on the robot account. That explains the "some succeed, some fail" pattern perfectly — it wasn't rate limiting or auth setup at all, just different repositories having different permissions.

**Yes, `994be26aa` (the `sudo cp` -> `cp` change) is still valuable.** It fixes a real latent bug:

- `sudo cp` creates a **root-owned** auth.json at `$HOME/.config/containers/auth.json`
- When `podman login quay.io/aipcc` later runs as the runner user, it **can't write** to a root-owned file
- So podman login silently writes to the fallback `$XDG_RUNTIME_DIR/containers/auth.json` instead
- Now credentials are **split across two files**: pull-secret in one, aipcc creds in another
- Podman reads from the first file it finds in priority order, potentially missing credentials from the other

Using `cp` (no sudo) keeps everything owned by the runner user so `podman login` can merge into the same file. It's a correctness fix independent of the repository permissions issue.

**However**, the additional changes we made in this conversation (consolidated login step, `REGISTRY_AUTH_FILE`, server-side copy to `/root/.config/`) were all attempts to work around what turned out to be the wrong root cause. Those are defensive but add complexity. You may want to revert them and keep only the `sudo cp` -> `cp` fix from the commit, now that the robot account will have proper access.
@jiridanek jiridanek force-pushed the jd/26/02/makefile_dockerfile_konflux branch from 473796d to f5c417f Compare February 8, 2026 14:29
@openshift-ci openshift-ci bot added size/l and removed size/l labels Feb 8, 2026
@jiridanek jiridanek marked this pull request as ready for review February 8, 2026 14:35
@openshift-ci openshift-ci bot requested review from atheo89 and daniellutz February 8, 2026 14:35
@openshift-ci openshift-ci bot added size/l and removed size/l labels Feb 8, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 8, 2026

@jiridanek: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images f5c417f link true /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added size/l and removed size/l labels Feb 8, 2026
@jiridanek jiridanek changed the title chore(ci, build): add support for KONFLUX Dockerfile builds in Makefile and tests chore(ci, build): add support for Dockerfile.konflux.* builds in Makefile and tests Feb 8, 2026
@openshift-ci openshift-ci bot added size/l and removed size/l labels Feb 8, 2026
@openshift-ci openshift-ci bot added the lgtm label Feb 9, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ysok

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 9, 2026
@jiridanek jiridanek merged commit 18da822 into opendatahub-io:main Feb 9, 2026
15 of 18 checks passed
@jiridanek jiridanek deleted the jd/26/02/makefile_dockerfile_konflux branch February 9, 2026 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel size/l

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants