Skip to content

Conversation

@MStokluska
Copy link
Contributor

@MStokluska MStokluska commented Nov 13, 2025

Description

Konflux universal image must leverage a higher spec platform, which is only available on multiplaform pipelines.
Also, dockerfile update to match whats in #502 to properly test it.

Summary by CodeRabbit

  • New Features

    • Pipeline now builds multi-architecture images with OCI artifact-based workflows and parallelized matrix builds.
  • Chores

    • CI pipeline metadata and service account updated to the new training pipeline.
    • Expanded security/quality checks and artifact propagation across the pipeline.
  • Dependencies

    • Updated training image dependencies (transformers, numba, bitsandbytes, liger-kernel, tokenizers) and added new runtime packages to enhance training capabilities.

@openshift-ci
Copy link

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign laurafitzgerald for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Nov 13, 2025

Walkthrough

Replace single-image Docker-build Tekton PipelineRuns with multi-platform, OCI-artifact-driven pipelines (PR and push triggers), rename component/pipeline identifiers, swap many tasks to OCI-ta variants and matrixed execution, and update the training Dockerfile with dependency bumps and build tweaks.

Changes

Cohort / File(s) Summary
Tekton PipelineRun — pull-request
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml
Renamed pipeline/component metadata; switched to multi-platform OCI-artifact flow; added dockerfile and path-context params; replaced many taskRef and bundle images with OCI-ta variants; introduced matrix-based build and scan tasks; added SOURCE_ARTIFACT/CACHI2_ARTIFACT/ociStorage params; removed several PVC/workspace usages; updated serviceAccountName.
Tekton PipelineRun — push
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml
Mirrored push-trigger variant of the PR pipeline: new metadata/name/component, multi-arch/matrix execution, OCI-ta task swaps, artifact propagation (SOURCE_ARTIFACT, IMAGE_REF, ociArtifactExpiresAfter), matrix-driven build-images and scans, updated task bundles and serviceAccountName, removed legacy workspace sections.
Training Dockerfile
images/universal/training/py312-cuda128-torch280/Dockerfile
Bumped runtime packages (e.g., transformers→4.57.1, numba→0.62.1, bitsandbytes→0.48.1, liger-kernel→0.6.2), added packages (training-hub, instructlab-training, rhai-innovation-mini-trainer, einops, kernels), introduced two-step deterministic sub-dependency installs for causal-conv1d and mamba-ssm, adjusted SDK install target and entrypoint copy path, commented/unified some install lines.

Sequence Diagram(s)

sequenceDiagram
  participant PR as PipelineRun (PR/Push)
  participant Git as git-clone-oci-ta
  participant Prefetch as prefetch-dependencies-oci-ta
  participant BuildMatrix as build-images (matrix)
  participant OCI as OCI storage / artifact registry
  participant Index as build-image-index
  participant Scans as scans (clair, SAST, clamav, etc.)
  rect rgb(230,245,255)
    PR->>Git: fetch source (git-url, revision)
    Git-->>PR: SOURCE_ARTIFACT
    PR->>Prefetch: prefetch deps (uses SOURCE_ARTIFACT)
    Prefetch-->>OCI: push CACHI2_ARTIFACT
  end
  rect rgb(245,255,230)
    PR->>BuildMatrix: start matrix builds (per PLATFORM)
    BuildMatrix->>OCI: publish per-platform OCI artifacts
    BuildMatrix-->>PR: IMAGE_REF(s), IMAGE_DIGEST(s)
  end
  rect rgb(255,245,230)
    PR->>Scans: run parallel scans (consume IMAGE_REF / artifacts)
    Scans-->>PR: scan results
    PR->>Index: aggregate IMAGE_REF and artifacts
    Index-->>OCI: push image index / manifest list
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Pay special attention to:
    • Correct propagation and naming of artifact results (SOURCE_ARTIFACT, CACHI2_ARTIFACT, IMAGE_REF, IMAGE_DIGEST).
    • Matrix definitions and matrix-to-task param mappings (platform/arch combinations).
    • All renamed taskRef/bundle image replacements to OCI-ta variants.
    • Service account and removed workspace changes affecting TaskRunTemplate behavior.
    • Dockerfile dependency bumps and new two-step install steps for reproducibility.

Poem

🐰 A tiny hop for CI, a matrix in the glen,
Artifacts travel OCI trails again.
Builds split like carrots, one per platform bright,
Scans nibble each byte, signatures snug and tight.
I tuck versions in pockets and whisper: ship it right. 🚀

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes: moving the Konflux universal pipeline to multiplatform and updating the Docker image with dependency bumps.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)

22-23: TODO: Complete license file addition.

The license file COPY directive is commented out with a TODO marker. This should be addressed in a follow-up to properly include license compliance artifacts in the image.

Would you like me to help generate a script to identify the appropriate license file(s) and add them to this section?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54395f4 and 0a0838b.

📒 Files selected for processing (3)
  • .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (1 hunks)
  • .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (1 hunks)
  • images/universal/training/py312-cuda128-torch280/Dockerfile (3 hunks)
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile

[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.

(SC2261)

🔇 Additional comments (7)
images/universal/training/py312-cuda128-torch280/Dockerfile (5)

112-112: Verify SDK source branch selection is correct.

The SDK installation now targets the add-training-hub branch instead of the previous branch. Confirm this branch exists, contains the required add-training-hub install target, and has been tested against the other changes in this Dockerfile.

As per PR objectives mentioning alignment with PR #502, this branch selection should be verified to ensure compatibility with the updated dependencies and build environment.


115-115: Ninja installation refactored: verify build ordering is correct.

Ninja has been moved from an early pip install (line 115, commented out) to a consolidated build-helpers group at line 162. This changes when ninja becomes available relative to other packages. Confirm the build order and dependency resolution for the PyTorch extensions (causal-conv1d, mamba-ssm at lines 168-170) still work correctly with this new sequencing.

Test that the Docker image builds successfully with this new ordering, particularly validating that the deterministic 2-step build at lines 168-170 completes without ninja-related issues.

Also applies to: 162-162


120-157: Verify compatibility of new packages and version updates.

The Dockerfile introduces several new packages (training-hub, instructlab-training, rhai-innovation-mini-trainer, deprecated, typer) and updates multiple dependencies. While these additions align with the PR's goal of updating to match PR #502, confirm:

  1. All new packages are compatible with the PyTorch 2.8.0 + CUDA 12.8 stack
  2. instructlab-training and training-hub don't introduce conflicting transitive dependencies
  3. The change from bitsandbytes >=0.45.3 to exact 0.48.1 works correctly with the CUDA 12.8 environment

Build and test the Docker image to verify all package imports succeed and there are no runtime conflicts between the training-focused packages.


160-170: LGTM: Deterministic build pattern for PyTorch extensions.

The new deterministic 2-step installation pattern for causal-conv1d and mamba-ssm is a solid engineering practice. Explicit versioning, the --no-deps flag, and proper permission handling ensure reproducible builds for these C/CUDA-compiled extensions.


173-173: This review comment should be ignored—the flagged line was not modified in this PR.

The line at 173 in the Dockerfile contains code that was not changed by this pull request. Git diff confirms zero modifications to this file, meaning the entrypoint reference has not been altered as part of these changes. Review this comment against only the actual modifications introduced by the PR.

Likely an incorrect or invalid review comment.

.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (1)

1-631: LGTM: Well-structured Tekton PipelineRun for multi-platform PR builds.

This new PipelineRun follows Konflux best practices with:

  • Clear PipelinesAsCode trigger conditions (lines 11-12) for PR automation
  • Proper task sequencing and dependencies using runAfter
  • Matrix-driven multi-platform build execution
  • Comprehensive security scanning and quality gates (clair, SAST tools, RPM signature scan)
  • Modern OCI artifact-based data passing between tasks

The pipeline structure correctly aligns with the pipeline-docker-build-multi-platform-oci-ta pattern referenced in the description.

Verify that all task bundle references from quay.io/konflux-ci/tekton-catalog are accessible and use the correct versions before merging.

.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (1)

1-631: LGTM: Push pipeline correctly mirrors PR pipeline with appropriate production settings.

The push-triggered PipelineRun is well-configured with:

  • Conservative settings for production (cancel-in-progress: false)
  • Appropriate :latest image tagging for push-to-main scenarios
  • Correct PipelinesAsCode trigger for push events to main branch (lines 10-11)
  • Identical comprehensive task structure to the PR pipeline variant

Verify that both pull-request and push pipelines remain in sync across future updates (they appear to be template-duplicated, so consider tooling for single-source-of-truth if this pattern repeats).

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a0838b and a0bae6b.

📒 Files selected for processing (2)
  • .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (19 hunks)
  • .tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (19 hunks)
🔇 Additional comments (10)
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (5)

1-62: Systematic OCI-TA migration is well-structured.

The migration from single-image docker-build to multiplatform OCI-artifact-driven pipeline is comprehensive and internally consistent. Task bundles are updated to OCI-TA variants, artifact parameters (SOURCE_ARTIFACT, CACHI2_ARTIFACT) are correctly wired between dependent tasks, and the pipeline description reflects the architectural shift to trusted artifacts and the multi-platform-controller requirement.


630-630: Manual verification required: Service account does not exist in repository.

The service account build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9 is referenced at line 630 but is not defined anywhere in this repository. No ServiceAccount resource, RBAC bindings (Role, RoleBinding, ClusterRole, or ClusterRoleBinding), or setup documentation for this service account exists in the codebase.

This service account must be created externally in the open-data-hub-tenant namespace with appropriate permissions for OCI-artifact tasks and multi-platform-controller interaction. Verify in your target Kubernetes cluster that this service account exists and has the required permissions before deployment.


30-36: Platform value override requires verification with your Konflux deployment.

Confirmed: lines 31–32 set build-platforms: [linux-extra-fast/amd64], which overrides the pipeline parameter default at lines 114–115 (linux/x86_64).

This override is valid Tekton syntax, but linux-extra-fast/amd64 does not appear elsewhere in the codebase. Verify that this platform name is registered in your multi-platform-controller configuration. If it's a custom platform name, document it; if it's a typo, correct it to match an available platform (e.g., linux/x86_64).


202-206: Verify task versions support the matrix parameter names used.

The matrix parameter names are correct but only for specific task versions. Confirm the taskRef versions in the pipeline match these requirements:

  • clair-scan: version 0.3+ required for image-platform parameter (versions 0.1-0.2 lack this)
  • clamav-scan: version 0.3+ required for image-arch parameter (versions 0.1-0.2 lack this)
  • ecosystem-cert-preflight-checks: version 0.2+ required for platform parameter (version 0.1 lacks this)
  • build-images: verify PLATFORM parameter name in referenced task version

If the taskRef versions don't match the above, the matrix will silently fail to pass platform information to the tasks.


147-175: All OCI-TA bundle references are syntactically correct and properly formatted.

Verification confirmed that all 17 bundle references, including the three OCI-TA variants mentioned (git-clone-oci-ta:0.1, prefetch-dependencies-oci-ta:0.2, buildah-remote-oci-ta:0.6), have valid SHA256 digest formats with no typos or malformed hashes. Registry accessibility will be verified automatically when the Tekton pipeline attempts to resolve these bundles at runtime, at which point any digest mismatches or unavailable images will surface with clear error messages.

.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (5)

27-29: Verify build-platforms default and parameter consistency.

The build-platforms parameter is defined with two different defaults: linux-extra-fast/amd64 (line 28) and linux/x86_64 (line 108). While the spec-level value takes precedence, the inconsistency is confusing. Additionally, verify that linux-extra-fast/amd64 is a valid platform identifier supported by your multi-platform-controller configuration.

Also applies to: 107-112


270-279: Clarify build-image-index task selection.

Most tasks have been transitioned to -oci-ta variants (git-clone-oci-ta, buildah-remote-oci-ta, push-dockerfile-oci-ta, etc.), but build-image-index remains on the non-oci-ta version (line 273). Verify this is intentional or if it should also be build-image-index-oci-ta to align with the OCI-artifact-driven pipeline design.


336-362: Clarify intent of matrix parameters in scanning tasks.

The clair-scan (lines 336-340), ecosystem-cert-preflight-checks (lines 363-367), and clamav-scan (lines 414-418) tasks are now matrix-driven with platform-specific parameter names (image-platform, platform, image-arch), but they all consume the same IMAGE_URL and IMAGE_DIGEST from build-image-index. Clarify whether the matrix is intended for:

  • Platform-specific validation: Each task should receive a platform-specific image reference (e.g., from build-images.results indexed by platform).
  • Parallel execution: The platform parameters are cosmetic and all matrix instances validate the same final image.

If the latter, the matrix parameters may be unnecessary or misleading.


199-252: I'll search for the Tekton task definitions to verify the result names.

Based on web search results, I found that standard Tekton tasks and documentation focus on result naming conventions, but I could not locate definitive documentation for the specific buildah-remote-oci-ta task used in your YAML. Standard buildah tasks produce an IMAGE_DIGEST result, not IMAGE_REF. However, the Konflux-specific buildah-remote-oci-ta variant may use different result names.

Verify the exact result names produced by buildah-remote-oci-ta and git-clone-oci-ta tasks.

The review comment flags a critical dependency: the build-image-index task (line 265) expects $(tasks.build-images.results.IMAGE_REF[*]) from a matrix task, but I cannot confirm that buildah-remote-oci-ta produces a result named IMAGE_REF rather than IMAGE_DIGEST or another identifier. Additionally, prefetch-dependencies-oci-ta (lines 177–182) must produce SOURCE_ARTIFACT as consumed by downstream tasks. Access the task bundle definitions to confirm all result names match what the pipeline expects.


599-615: Verify rpms-signature-scan task compatibility in your Konflux pipeline.

The task bundle reference has legitimately migrated from the generic tekton-catalog to Konflux-CI's own container images in quay.io/konflux-ci/konflux-vanguard. The registry is reachable (HTTP 401 response confirms valid path), and the bundle is pinned to a specific digest for reproducibility. However, you should manually verify that this task version (0.2) functions correctly with your multi-platform pipeline configuration and that your CI environment has appropriate access to the Konflux-vanguard registry.

optional: true
taskRunTemplate:
serviceAccountName: build-pipeline-universal-image-py312-cuda128-torch280
serviceAccountName: build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify service account provisioning.

The serviceAccountName (line 627) is set to build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9. Ensure this service account is provisioned in the open-data-hub-tenant namespace (line 18) before this PipelineRun executes. The workspaces (lines 621-625) are marked optional, confirming they are not required for basic execution.

Also applies to: 621-625


🏁 Script executed:

#!/bin/bash
# Description: Verify service account exists in target namespace

kubectl get serviceaccount build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9 \
  -n open-data-hub-tenant 2>/dev/null && echo "Service account found" || echo "Service account NOT found"

Length of output: 25


Service account does not exist—PipelineRun will fail.

The service account build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9 referenced at line 627 is not provisioned in the open-data-hub-tenant namespace. Either:

  1. Provision this service account in the namespace before executing this PipelineRun, or
  2. Change the reference to use an existing service account that has the necessary permissions for this pipeline.

Without this fix, the PipelineRun will fail with authentication errors.

@MStokluska
Copy link
Contributor Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant