@MStokluska (Contributor) commented Nov 17, 2025

Description

Fix for the universal image Konflux builds.
The Dockerfile is based on the current state of #502.

Merge criteria:

  • Konflux successfully completes

  • The commits are squashed in a cohesive manner and have meaningful messages.

  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).

  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Infrastructure

    • Added enhanced multi-platform build and security scanning pipelines for automated testing and deployment workflows.
  • Chores

    • Updated training container image with improved build tooling and dependency updates, including newer versions of key packages (PyTorch transformers, tokenizers, numba) and additional runtime components for enhanced functionality.

@red-hat-konflux
Caution

There are some errors in your PipelineRun template.

PipelineRun Error
odh-training-th03-cuda128-torch28-py312-rhel9-on-pull-request CEL expression evaluation error: failed to parse expression "event == \"pull_request\" && target_branch == \"main\" && (\"images/universal/training/th03-cuda128-torch280-py312/Dockerfile\".pathChanged() || images/universal/training/th03-cuda128-torch280-py312/entrypoint-universal.sh\".pathChanged() || \".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml\".pathChanged())": ERROR: <input>:1:217: Syntax error: mismatched input '".pathChanged() || "' expecting ')' | event == "pull_request" && target_branch == "main" && ("images/universal/training/th03-cuda128-torch280-py312/Dockerfile".pathChanged() || images/universal/training/th03-cuda128-torch280-py312/entrypoint-universal.sh".pathChanged() || ".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml".pathChanged()) | ........................................................................................................................................................................................................................^ ERROR: <input>:1:308: Syntax error: token recognition error at: '".pathChanged())' | event == "pull_request" && target_branch == "main" && ("images/universal/training/th03-cuda128-torch280-py312/Dockerfile".pathChanged() || images/universal/training/th03-cuda128-torch280-py312/entrypoint-universal.sh".pathChanged() || ".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml".pathChanged()) | ...................................................................................................................................................................................................................................................................................................................^
odh-training-th03-cuda128-torch28-py312-rhel9-on-push CEL expression evaluation error: failed to parse expression "event == \"push\" && target_branch == \"main\" && (\"images/universal/training/th03-cuda128-torch280-py312/Dockerfile\".pathChanged() || images/universal/training/th03-cuda128-torch280-py312/entrypoint-universal.sh\".pathChanged() || \".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml\".pathChanged())": ERROR: <input>:1:209: Syntax error: mismatched input '".pathChanged() || "' expecting ')' | event == "push" && target_branch == "main" && ("images/universal/training/th03-cuda128-torch280-py312/Dockerfile".pathChanged() || images/universal/training/th03-cuda128-torch280-py312/entrypoint-universal.sh".pathChanged() || ".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml".pathChanged()) | ................................................................................................................................................................................................................^ ERROR: <input>:1:292: Syntax error: token recognition error at: '".pathChanged())' | event == "push" && target_branch == "main" && ("images/universal/training/th03-cuda128-torch280-py312/Dockerfile".pathChanged() || images/universal/training/th03-cuda128-torch280-py312/entrypoint-universal.sh".pathChanged() || ".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml".pathChanged()) | ...................................................................................................................................................................................................................................................................................................^
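
Both parse errors point at the same defect: the second path literal (the one ending in entrypoint-universal.sh) has a closing double quote but no opening one, so the CEL parser reads the path as bare tokens. One way to avoid this class of mistake is to generate the expression from a list of paths instead of typing it by hand. The following is a minimal sketch under that assumption; `build_on_cel_expression` is a hypothetical helper, not part of Konflux or Pipelines as Code, and it relies on JSON string quoting being acceptable as CEL string quoting for plain ASCII paths.

```python
import json


def build_on_cel_expression(event: str, target_branch: str, paths: list[str]) -> str:
    """Assemble a trigger expression in which every string literal is quoted."""
    # json.dumps emits a complete double-quoted literal, so a path cannot
    # end up with only one of its two quotes.
    watched = " || ".join(f"{json.dumps(path)}.pathChanged()" for path in paths)
    return (
        f"event == {json.dumps(event)} && "
        f"target_branch == {json.dumps(target_branch)} && ({watched})"
    )


# Paths taken from the error message above.
IMAGE_DIR = "images/universal/training/th03-cuda128-torch280-py312"
expression = build_on_cel_expression(
    "pull_request",
    "main",
    [
        f"{IMAGE_DIR}/Dockerfile",
        f"{IMAGE_DIR}/entrypoint-universal.sh",
        ".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml",
    ],
)
print(expression)
```

The printed string can be pasted into the PipelineRun annotation; the same call with "push" and the push manifest path covers the second pipeline.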

coderabbitai bot commented Nov 17, 2025

Walkthrough

Two new Tekton CI/CD pipeline configurations for a training image (th03-cuda128-torch28-py312-rhel9) are introduced for pull request and push workflows. The accompanying Dockerfile is updated with dependency version bumps, additional runtime packages, build tooling additions, and modified source references.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Tekton Pipeline Configurations: .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml, .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml | New PipelineRun manifests defining multi-platform build orchestration with init, clone, dependency prefetch, matrix-based image builds, indexing, source image generation, security scanning (clair-scan, sast-*, clamav, coverity), tagging, and artifact publishing; uses Konflux Tekton catalog bundles with conditional task execution via when clauses. |
| Training Image Dockerfile: images/universal/training/th03-cuda128-torch280-py312/Dockerfile | Dependency updates (transformers 4.55.2→4.57.1, tokenizers 0.21.4→0.22.1, numba 0.61.2→0.62.1, liger-kernel→0.6.2, training-hub→0.3.0); added packages (deprecated, typer, einops, kernels, instructlab-training, rhai-innovation-mini-trainer); build tooling additions (pip, setuptools, wheel, ninja, cmake); deterministic two-step installs for causal-conv1d and mamba-ssm; SDK branch reference updated; entrypoint path adjusted; ninja runtime installation removed. |

Sequence Diagram

sequenceDiagram
    participant User as Developer
    participant Git as Git
    participant Tekton as Tekton Pipeline
    participant Build as Build Task
    participant Scan as Scan Tasks
    participant Registry as Image Registry
    
    User->>Git: Push/PR Code
    Git->>Tekton: Trigger Pipeline
    Tekton->>Tekton: Init & Clone
    Tekton->>Build: Prefetch Dependencies
    Tekton->>Build: Build Images (Multi-platform)
    Build->>Build: Build Image Index
    Tekton->>Scan: Run Security Checks<br/>(clair-scan, sast, clamav)
    Scan->>Scan: Deprecated Base Image Check
    Scan->>Scan: Ecosystem Cert Preflight
    Tekton->>Registry: Apply Tags & Push
    Tekton->>Registry: Push Dockerfile
    Registry->>User: Artifact Published

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Tekton files (pull-request.yaml, push.yaml): Large manifests (300+ lines each) with multiple task definitions, conditional logic, and matrix-based parallelization requiring verification of parameter flow, result passing, workspace configuration, and resolver references
  • Dockerfile changes: Multiple heterogeneous modifications spanning dependency version pins, package additions, build tool configuration, and source path references that require individual validation
  • Cross-file consistency: Verify alignment between two similar but distinct Tekton pipeline definitions (PR vs push workflows)
  • Dependency compatibility: Validate that new/updated packages (transformers, tokenizers, liger-kernel, etc.) are compatible with CUDA 12.8 and PyTorch 2.8
  • Build tool implications: Assess impact of ninja removal from runtime vs. addition in build helpers stage

Poem

🐰 A training image grows in might,
With multi-platform builds taking flight,
Dependencies dance in versions new,
Tekton pipelines pull them through—
Ninja cmake, torch and more,
Our scanner's knocking at the door! 🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'fix the universal image build' is vague and generic, using non-descriptive language that doesn't convey specific information about what was fixed or how. | Use a more specific title that describes the actual changes, such as 'Update universal image Dockerfile with dependency upgrades and SDK reference changes' or 'Fix universal image build with Dockerfile and Tekton pipeline updates'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
@MStokluska force-pushed the universal_image_multiplatform branch from 79abe4d to 409b486 on November 17, 2025 08:40

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
images/universal/training/th03-cuda128-torch280-py312/Dockerfile (1)

119-157: Address Hadolint warning: multiple redirections.

Hadolint (SC2261) flagged potential multiple redirections in this RUN block. While the code appears syntactically correct, ensure there are no unintended shell redirections. The multi-line continuation is valid, but verify the intent is correct.

If this is a false positive from Hadolint, we can document it. If there is a shell issue, I can help refactor to resolve it.
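
The RUN block at lines 119-157 is not shown in this thread, so the cause of the warning is not known here. One common trigger for SC2261 in Dockerfiles is an unquoted version specifier such as pkg>=1.0, where the shell treats > as a redirection; that this applies to this Dockerfile is an assumption. The sketch below is a heuristic, not a shell parser: the hypothetical helper `unquoted_redirects` reports every > that sits outside single or double quotes, and the package names in the demonstration are illustrative.

```python
def unquoted_redirects(command: str) -> list[int]:
    """Return the indices of '>' characters that are not inside quotes."""
    positions = []
    quote = None  # the quote character currently open, if any
    i = 0
    while i < len(command):
        ch = command[i]
        if ch == "\\" and quote != "'":
            i += 2  # a backslash escapes the next character, skip both
            continue
        if quote is None:
            if ch in "'\"":
                quote = ch
            elif ch == ">":
                positions.append(i)
        elif ch == quote:
            quote = None
        i += 1
    return positions


# Illustrative commands, not copied from the Dockerfile under review.
unquoted = "pip install example-pkg>=1.0 other-pkg>=2.0"
quoted = 'pip install "example-pkg>=1.0" "other-pkg>=2.0"'

print(len(unquoted_redirects(unquoted)))  # 2
print(len(unquoted_redirects(quoted)))    # 0
```

Two or more hits in one command are the pattern SC2261 describes. Legitimate redirections such as 2>&1 are counted as well, so each hit still needs a manual look.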

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 942459e and 409b486.

📒 Files selected for processing (3)
  • .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (1 hunks)
  • .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (1 hunks)
  • images/universal/training/th03-cuda128-torch280-py312/Dockerfile (3 hunks)
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/th03-cuda128-torch280-py312/Dockerfile

[error] 119-119: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.

(SC2261)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Red Hat Konflux / odh-training-th03-cuda128-torch28-py312-rhel9-on-pull-request
🔇 Additional comments (6)
images/universal/training/th03-cuda128-torch280-py312/Dockerfile (3)

111-111: Verify the SDK git branch reference.

The git URL now references add-training-hub branch instead of a stable ref. For production builds, confirm this branch is stable and the intent is not to use main or a tagged release.


167-169: Approve the deterministic dependency install sequence.

The 2-step no-build-isolation install for causal-conv1d and mamba-ssm followed by fix-permissions is a good practice for ensuring reproducible builds and handling build-time dependencies correctly.


172-172: No action required—entrypoint file location is correct.

The search confirms that entrypoint-universal.sh exists at images/universal/training/th03-cuda128-torch280-py312/entrypoint-universal.sh, which is in the same directory as the Dockerfile. The COPY command on line 172 correctly references the file from the build context root and will succeed without modification.

.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (2)

1-50: Push workflow configuration is appropriate for release builds.

Key differences from PR pipeline are correct:

  • Output tag is latest (no revision suffix)
  • No image expiration (permanent release)
  • cancel-in-progress: false (allow concurrent pushes)
  • Trigger on push to main vs pull_request

Configuration aligns with release workflow expectations.


10-11: Verify file path watchers in on-push trigger.

The pathChanged() expressions monitor Dockerfile, entrypoint, and pipeline files. Confirm these paths match your repository structure. If the entrypoint file is not at the expected location (see Dockerfile review comment), update the path here as well.

Also applies to: 12-12
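
The path check requested above can be scripted. This is a minimal sketch, assuming the watched paths are kept in a plain list and the check runs from a local checkout; `missing_watched_paths` is a hypothetical helper, and the demonstration uses a temporary directory as a stand-in for the repository.

```python
import tempfile
from pathlib import Path


def missing_watched_paths(repo_root: Path, watched: list[str]) -> list[str]:
    """Return the watched paths that are not regular files under repo_root."""
    return [path for path in watched if not (repo_root / path).is_file()]


IMAGE_DIR = "images/universal/training/th03-cuda128-torch280-py312"
WATCHED = [
    f"{IMAGE_DIR}/Dockerfile",
    f"{IMAGE_DIR}/entrypoint-universal.sh",
    ".tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml",
]

# Throwaway directory standing in for a checkout; only the Dockerfile exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / IMAGE_DIR).mkdir(parents=True)
    (root / IMAGE_DIR / "Dockerfile").write_text("FROM scratch\n")
    missing = missing_watched_paths(root, WATCHED)

print(missing)
```

Against a real checkout, pass the repository root instead of the temporary directory; an empty result means every pathChanged() target exists.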

.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (1)

1-50: Bundle verification requires live Konflux environment access.

The verification script could not reach the quay.io registry due to sandbox environment limitations. The task bundle references appear syntactically valid with correct SHA256 digests, but this does not confirm their availability or compatibility with your actual Konflux instance. Run the verification script in an environment with network access to quay.io and appropriate credentials to confirm bundle accessibility and compatibility before merging.
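
For context, a syntactic check of a digest-pinned reference only confirms its shape. The sketch below illustrates that limit under stated assumptions: the reference string is a made-up placeholder rather than a real Konflux task bundle, and `is_digest_pinned` is a hypothetical helper. It does not contact any registry, so availability still has to be confirmed separately.

```python
import re

# A reference of the form <repository>[:tag]@sha256:<64 lowercase hex chars>.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(reference: str) -> bool:
    """Check only the shape of the reference; no registry lookup is made."""
    return DIGEST_REF.match(reference) is not None


# Placeholder values for illustration only.
well_formed = "quay.io/example-org/example-task:0.1@sha256:" + "a" * 64
truncated = "quay.io/example-org/example-task:0.1@sha256:" + "a" * 63

print(is_digest_pinned(well_formed))  # True
print(is_digest_pinned(truncated))    # False
```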


  # Provide ninja via pip (RHEL/UBI repo ninja-build may be unavailable)
- RUN pip install --retries 5 --timeout 300 --no-cache-dir ninja
+ # RUN pip install --retries 5 --timeout 300 --no-cache-dir ninja
🛠️ Refactor suggestion | 🟠 Major

Remove redundant ninja installation.

The ninja install on line 114 is commented out, but ninja is installed again on line 161 via pip install -U pip setuptools wheel ninja cmake. Consolidate by deleting the commented-out line 114 and its explanatory comment (ninja is still installed via pip on line 161), or clarify the intent.

Apply this diff:

- # Provide ninja via pip (RHEL/UBI repo ninja-build may be unavailable)
- # RUN pip install --retries 5 --timeout 300 --no-cache-dir ninja

  # Install remaining runtime packages (resolved from default PyPI), including FlashAttention

Also applies to: 161-161

🤖 Prompt for AI Agents
In images/universal/training/th03-cuda128-torch280-py312/Dockerfile around lines
114 and 161, the Dockerfile redundantly references ninja: line 114 has a
commented pip install of ninja while line 161 installs ninja via a combined pip
command; remove the commented-out ninja line (or if intent was to pre-install
ninja earlier, instead uncomment and remove it from the later combined pip
install) so ninja is installed exactly once and the intent is clear—update the
Dockerfile to either delete line 114 or adjust line 161 accordingly and keep a
single pip install for ninja.
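
To confirm that ninja is installed exactly once, the active pip install commands can be counted while ignoring commented-out lines. This is a rough sketch, not a Dockerfile parser: `count_pip_installs` is a hypothetical helper, the list of value-taking flags is an assumption covering only the flags seen in this thread plus index URLs, and the two sample snippets are reconstructed from the lines quoted above.

```python
import re
from collections import Counter

# pip flags that consume the following token (assumed, not exhaustive).
VALUE_FLAGS = {"--retries", "--timeout", "--index-url", "--extra-index-url"}


def count_pip_installs(dockerfile_text: str) -> Counter:
    """Count package names in active (non-comment) pip install commands."""
    counts = Counter()
    # Join backslash continuations so each instruction is one logical line.
    joined = re.sub(r"\\\n", " ", dockerfile_text)
    for line in joined.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # a commented-out instruction installs nothing
        for match in re.finditer(r"pip install\s+([^&;|]+)", stripped):
            skip_next = False
            for token in match.group(1).split():
                if skip_next:
                    skip_next = False
                elif token in VALUE_FLAGS:
                    skip_next = True
                elif not token.startswith("-"):
                    # Drop quotes and any version specifier or extras.
                    name = re.split(r"[=<>!~\[]", token.strip("'\""), maxsplit=1)[0]
                    counts[name.lower()] += 1
    return counts


BEFORE = """\
RUN pip install --retries 5 --timeout 300 --no-cache-dir ninja
RUN pip install -U pip setuptools wheel ninja cmake
"""
AFTER = """\
# RUN pip install --retries 5 --timeout 300 --no-cache-dir ninja
RUN pip install -U pip setuptools wheel ninja cmake
"""

print(count_pip_installs(BEFORE)["ninja"])  # 2
print(count_pip_installs(AFTER)["ninja"])   # 1
```

With the line commented out, ninja is already installed only once, so the suggestion is about removing dead text rather than changing what the image contains.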

openshift-ci bot commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: briangallagher

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit ebf7bc5 into opendatahub-io:main Nov 17, 2025
6 checks passed