-
Notifications
You must be signed in to change notification settings - Fork 70
move konflux universal to multiplaform and update docker image #512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
move konflux universal to multiplaform and update docker image #512
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughReplace single-image Docker-build Tekton PipelineRuns with multi-platform, OCI-artifact-driven pipelines (PR and push triggers), rename component/pipeline identifiers, swap many tasks to OCI-ta variants and matrixed execution, and update the training Dockerfile with dependency bumps and build tweaks. Changes
Sequence Diagram(s)sequenceDiagram
participant PR as PipelineRun (PR/Push)
participant Git as git-clone-oci-ta
participant Prefetch as prefetch-dependencies-oci-ta
participant BuildMatrix as build-images (matrix)
participant OCI as OCI storage / artifact registry
participant Index as build-image-index
participant Scans as scans (clair, SAST, clamav, etc.)
rect rgb(230,245,255)
PR->>Git: fetch source (git-url, revision)
Git-->>PR: SOURCE_ARTIFACT
PR->>Prefetch: prefetch deps (uses SOURCE_ARTIFACT)
Prefetch-->>OCI: push CACHI2_ARTIFACT
end
rect rgb(245,255,230)
PR->>BuildMatrix: start matrix builds (per PLATFORM)
BuildMatrix->>OCI: publish per-platform OCI artifacts
BuildMatrix-->>PR: IMAGE_REF(s), IMAGE_DIGEST(s)
end
rect rgb(255,245,230)
PR->>Scans: run parallel scans (consume IMAGE_REF / artifacts)
Scans-->>PR: scan results
PR->>Index: aggregate IMAGE_REF and artifacts
Index-->>OCI: push image index / manifest list
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
22-23: TODO: Complete license file addition.The license file COPY directive is commented out with a TODO marker. This should be addressed in a follow-up to properly include license compliance artifacts in the image.
Would you like me to help generate a script to identify the appropriate license file(s) and add them to this section?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml(1 hunks).tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml(1 hunks)images/universal/training/py312-cuda128-torch280/Dockerfile(3 hunks)
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (7)
images/universal/training/py312-cuda128-torch280/Dockerfile (5)
112-112: Verify SDK source branch selection is correct.The SDK installation now targets the
add-training-hubbranch instead of the previous branch. Confirm this branch exists, contains the requiredadd-training-hubinstall target, and has been tested against the other changes in this Dockerfile.As per PR objectives mentioning alignment with PR #502, this branch selection should be verified to ensure compatibility with the updated dependencies and build environment.
115-115: Ninja installation refactored: verify build ordering is correct.Ninja has been moved from an early pip install (line 115, commented out) to a consolidated build-helpers group at line 162. This changes when ninja becomes available relative to other packages. Confirm the build order and dependency resolution for the PyTorch extensions (causal-conv1d, mamba-ssm at lines 168-170) still work correctly with this new sequencing.
Test that the Docker image builds successfully with this new ordering, particularly validating that the deterministic 2-step build at lines 168-170 completes without ninja-related issues.
Also applies to: 162-162
120-157: Verify compatibility of new packages and version updates.The Dockerfile introduces several new packages (training-hub, instructlab-training, rhai-innovation-mini-trainer, deprecated, typer) and updates multiple dependencies. While these additions align with the PR's goal of updating to match PR #502, confirm:
- All new packages are compatible with the PyTorch 2.8.0 + CUDA 12.8 stack
- instructlab-training and training-hub don't introduce conflicting transitive dependencies
- The change from bitsandbytes
>=0.45.3to exact0.48.1works correctly with the CUDA 12.8 environmentBuild and test the Docker image to verify all package imports succeed and there are no runtime conflicts between the training-focused packages.
160-170: LGTM: Deterministic build pattern for PyTorch extensions.The new deterministic 2-step installation pattern for causal-conv1d and mamba-ssm is a solid engineering practice. Explicit versioning, the
--no-depsflag, and proper permission handling ensure reproducible builds for these C/CUDA-compiled extensions.
173-173: This review comment should be ignored—the flagged line was not modified in this PR.The line at 173 in the Dockerfile contains code that was not changed by this pull request. Git diff confirms zero modifications to this file, meaning the entrypoint reference has not been altered as part of these changes. Review this comment against only the actual modifications introduced by the PR.
Likely an incorrect or invalid review comment.
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (1)
1-631: LGTM: Well-structured Tekton PipelineRun for multi-platform PR builds.This new PipelineRun follows Konflux best practices with:
- Clear PipelinesAsCode trigger conditions (lines 11-12) for PR automation
- Proper task sequencing and dependencies using
runAfter- Matrix-driven multi-platform build execution
- Comprehensive security scanning and quality gates (clair, SAST tools, RPM signature scan)
- Modern OCI artifact-based data passing between tasks
The pipeline structure correctly aligns with the
pipeline-docker-build-multi-platform-oci-tapattern referenced in the description.Verify that all task bundle references from
quay.io/konflux-ci/tekton-catalogare accessible and use the correct versions before merging..tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (1)
1-631: LGTM: Push pipeline correctly mirrors PR pipeline with appropriate production settings.The push-triggered PipelineRun is well-configured with:
- Conservative settings for production (cancel-in-progress: false)
- Appropriate
:latestimage tagging for push-to-main scenarios- Correct PipelinesAsCode trigger for push events to main branch (lines 10-11)
- Identical comprehensive task structure to the PR pipeline variant
Verify that both pull-request and push pipelines remain in sync across future updates (they appear to be template-duplicated, so consider tooling for single-source-of-truth if this pattern repeats).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml(19 hunks).tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml(19 hunks)
🔇 Additional comments (10)
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-pull-request.yaml (5)
1-62: Systematic OCI-TA migration is well-structured.The migration from single-image docker-build to multiplatform OCI-artifact-driven pipeline is comprehensive and internally consistent. Task bundles are updated to OCI-TA variants, artifact parameters (
SOURCE_ARTIFACT,CACHI2_ARTIFACT) are correctly wired between dependent tasks, and the pipeline description reflects the architectural shift to trusted artifacts and the multi-platform-controller requirement.
630-630: Manual verification required: Service account does not exist in repository.The service account
build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9is referenced at line 630 but is not defined anywhere in this repository. No ServiceAccount resource, RBAC bindings (Role, RoleBinding, ClusterRole, or ClusterRoleBinding), or setup documentation for this service account exists in the codebase.This service account must be created externally in the
open-data-hub-tenantnamespace with appropriate permissions for OCI-artifact tasks and multi-platform-controller interaction. Verify in your target Kubernetes cluster that this service account exists and has the required permissions before deployment.
30-36: Platform value override requires verification with your Konflux deployment.Confirmed: lines 31–32 set
build-platforms: [linux-extra-fast/amd64], which overrides the pipeline parameter default at lines 114–115 (linux/x86_64).This override is valid Tekton syntax, but
linux-extra-fast/amd64does not appear elsewhere in the codebase. Verify that this platform name is registered in your multi-platform-controller configuration. If it's a custom platform name, document it; if it's a typo, correct it to match an available platform (e.g.,linux/x86_64).
202-206: Verify task versions support the matrix parameter names used.The matrix parameter names are correct but only for specific task versions. Confirm the taskRef versions in the pipeline match these requirements:
clair-scan: version 0.3+ required forimage-platformparameter (versions 0.1-0.2 lack this)clamav-scan: version 0.3+ required forimage-archparameter (versions 0.1-0.2 lack this)ecosystem-cert-preflight-checks: version 0.2+ required forplatformparameter (version 0.1 lacks this)build-images: verifyPLATFORMparameter name in referenced task versionIf the taskRef versions don't match the above, the matrix will silently fail to pass platform information to the tasks.
147-175: All OCI-TA bundle references are syntactically correct and properly formatted.Verification confirmed that all 17 bundle references, including the three OCI-TA variants mentioned (git-clone-oci-ta:0.1, prefetch-dependencies-oci-ta:0.2, buildah-remote-oci-ta:0.6), have valid SHA256 digest formats with no typos or malformed hashes. Registry accessibility will be verified automatically when the Tekton pipeline attempts to resolve these bundles at runtime, at which point any digest mismatches or unavailable images will surface with clear error messages.
.tekton/odh-training-th03-cuda128-torch28-py312-rhel9-push.yaml (5)
27-29: Verify build-platforms default and parameter consistency.The
build-platformsparameter is defined with two different defaults:linux-extra-fast/amd64(line 28) andlinux/x86_64(line 108). While the spec-level value takes precedence, the inconsistency is confusing. Additionally, verify thatlinux-extra-fast/amd64is a valid platform identifier supported by your multi-platform-controller configuration.Also applies to: 107-112
270-279: Clarify build-image-index task selection.Most tasks have been transitioned to
-oci-tavariants (git-clone-oci-ta, buildah-remote-oci-ta, push-dockerfile-oci-ta, etc.), butbuild-image-indexremains on the non-oci-ta version (line 273). Verify this is intentional or if it should also bebuild-image-index-oci-tato align with the OCI-artifact-driven pipeline design.
336-362: Clarify intent of matrix parameters in scanning tasks.The
clair-scan(lines 336-340),ecosystem-cert-preflight-checks(lines 363-367), andclamav-scan(lines 414-418) tasks are now matrix-driven with platform-specific parameter names (image-platform,platform,image-arch), but they all consume the sameIMAGE_URLandIMAGE_DIGESTfrombuild-image-index. Clarify whether the matrix is intended for:
- Platform-specific validation: Each task should receive a platform-specific image reference (e.g., from
build-images.resultsindexed by platform).- Parallel execution: The platform parameters are cosmetic and all matrix instances validate the same final image.
If the latter, the matrix parameters may be unnecessary or misleading.
199-252: I'll search for the Tekton task definitions to verify the result names.Based on web search results, I found that standard Tekton tasks and documentation focus on result naming conventions, but I could not locate definitive documentation for the specific
buildah-remote-oci-tatask used in your YAML. Standard buildah tasks produce anIMAGE_DIGESTresult, notIMAGE_REF. However, the Konflux-specificbuildah-remote-oci-tavariant may use different result names.Verify the exact result names produced by buildah-remote-oci-ta and git-clone-oci-ta tasks.
The review comment flags a critical dependency: the
build-image-indextask (line 265) expects$(tasks.build-images.results.IMAGE_REF[*])from a matrix task, but I cannot confirm thatbuildah-remote-oci-taproduces a result namedIMAGE_REFrather thanIMAGE_DIGESTor another identifier. Additionally,prefetch-dependencies-oci-ta(lines 177–182) must produceSOURCE_ARTIFACTas consumed by downstream tasks. Access the task bundle definitions to confirm all result names match what the pipeline expects.
599-615: Verify rpms-signature-scan task compatibility in your Konflux pipeline.The task bundle reference has legitimately migrated from the generic
tekton-catalogto Konflux-CI's own container images inquay.io/konflux-ci/konflux-vanguard. The registry is reachable (HTTP 401 response confirms valid path), and the bundle is pinned to a specific digest for reproducibility. However, you should manually verify that this task version (0.2) functions correctly with your multi-platform pipeline configuration and that your CI environment has appropriate access to the Konflux-vanguard registry.
| optional: true | ||
| taskRunTemplate: | ||
| serviceAccountName: build-pipeline-universal-image-py312-cuda128-torch280 | ||
| serviceAccountName: build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify service account provisioning.
The serviceAccountName (line 627) is set to build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9. Ensure this service account is provisioned in the open-data-hub-tenant namespace (line 18) before this PipelineRun executes. The workspaces (lines 621-625) are marked optional, confirming they are not required for basic execution.
Also applies to: 621-625
🏁 Script executed:
#!/bin/bash
# Description: Verify service account exists in target namespace
kubectl get serviceaccount build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9 \
-n open-data-hub-tenant 2>/dev/null && echo "Service account found" || echo "Service account NOT found"Length of output: 25
Service account does not exist—PipelineRun will fail.
The service account build-pipeline-odh-training-th03-cuda128-torch28-py312-rhel9 referenced at line 627 is not provisioned in the open-data-hub-tenant namespace. Either:
- Provision this service account in the namespace before executing this PipelineRun, or
- Change the reference to use an existing service account that has the necessary permissions for this pipeline.
Without this fix, the PipelineRun will fail with authentication errors.
|
/retest |
Description
Konflux universal image must leverage a higher spec platform, which is only available on multiplaform pipelines.
Also, dockerfile update to match whats in #502 to properly test it.
Summary by CodeRabbit
New Features
Chores
Dependencies