
Commit f67684c

add instructlab and th
1 parent 0c9d6f4 commit f67684c

File tree

6 files changed: +292 additions, -46 deletions

.tekton/odh-training-rocm64-torch29-py312-rhel9-pull-request.yaml

Lines changed: 5 additions & 5 deletions
@@ -19,8 +19,8 @@ metadata:
   namespace: open-data-hub-tenant
 spec:
   timeouts:
-    pipeline: 24h
-    tasks: 20h
+    pipeline: 40h
+    tasks: 30h
   params:
     - name: git-url
       value: '{{source_url}}'
@@ -39,8 +39,8 @@ spec:
       value: images/universal/training/rocm64-torch290-py312
   pipelineSpec:
     timeouts:
-      pipeline: 24h
-      tasks: 20h
+      pipeline: 40h
+      tasks: 30h
     description: |
       This pipeline is ideal for building multi-arch container images from a Containerfile while maintaining trust after pipeline customization.
       _Uses `buildah` to create a multi-platform container image leveraging [trusted artifacts](https://konflux-ci.dev/architecture/ADR/0036-trusted-artifacts.html). It also optionally creates a source image and runs some build-time tests. This pipeline requires that the [multi platform controller](https://github.com/konflux-ci/multi-platform-controller) is deployed and configured on your Konflux instance. Information is shared between tasks using OCI artifacts instead of PVCs. EC will pass the [`trusted_task.trusted`](https://conforma.dev/docs/policy/packages/release_trusted_task.html#trusted_task__trusted) policy as long as all data used to build the artifact is generated from trusted tasks.
@@ -216,7 +216,7 @@ spec:
           value:
             - $(params.build-platforms)
       name: build-images
-      timeout: 20h
+      timeout: 30h
       params:
         - name: IMAGE
           value: $(params.output-image)
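For reference, the PipelineRun-level `timeouts` stanza being raised in both files has the shape below. This is a minimal sketch (not taken from this repo); per Tekton's documented validation, `tasks` plus any `finally` budget must fit within `pipeline`, which the new 30h/40h values satisfy.

```yaml
# Sketch of a Tekton PipelineRun timeouts stanza:
spec:
  timeouts:
    pipeline: 40h   # overall cap for the entire PipelineRun
    tasks: 30h      # cap on the cumulative time of non-finally tasks
    # finally: 1h   # optional; tasks + finally must not exceed pipeline
```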

.tekton/odh-training-rocm64-torch29-py312-rhel9-push.yaml

Lines changed: 5 additions & 5 deletions
@@ -18,8 +18,8 @@ metadata:
   namespace: open-data-hub-tenant
 spec:
   timeouts:
-    pipeline: 24h
-    tasks: 20h
+    pipeline: 40h
+    tasks: 30h
   params:
     - name: git-url
       value: '{{source_url}}'
@@ -36,8 +36,8 @@ spec:
       value: images/universal/training/rocm64-torch290-py312
   pipelineSpec:
     timeouts:
-      pipeline: 24h
-      tasks: 20h
+      pipeline: 40h
+      tasks: 30h
     description: |
       This pipeline is ideal for building multi-arch container images from a Containerfile while maintaining trust after pipeline customization.
       _Uses `buildah` to create a multi-platform container image leveraging [trusted artifacts](https://konflux-ci.dev/architecture/ADR/0036-trusted-artifacts.html). It also optionally creates a source image and runs some build-time tests. This pipeline requires that the [multi platform controller](https://github.com/konflux-ci/multi-platform-controller) is deployed and configured on your Konflux instance. Information is shared between tasks using OCI artifacts instead of PVCs. EC will pass the [`trusted_task.trusted`](https://conforma.dev/docs/policy/packages/release_trusted_task.html#trusted_task__trusted) policy as long as all data used to build the artifact is generated from trusted tasks.
@@ -213,7 +213,7 @@ spec:
           value:
             - $(params.build-platforms)
       name: build-images
-      timeout: 20h
+      timeout: 30h
      params:
         - name: IMAGE
           value: $(params.output-image)

images/universal/training/rocm64-torch290-py312/Dockerfile

Lines changed: 13 additions & 28 deletions
@@ -57,7 +57,7 @@ COPY mellanox.repo rocm.repo /etc/yum.repos.d/
 
 # Install ROCm development tools
 # Using individual packages instead of metapackages to avoid python3-wheel dependency issue
-# hipcc is the HIP compiler needed for flash-attention build
+# hipcc is the HIP compiler (may be needed for building ROCm packages)
 # rocm-device-libs provides the GPU device library required by clang for ROCm compilation
 RUN dnf install -y --setopt=install_weak_deps=False \
     hipcc \
@@ -131,40 +131,25 @@ WORKDIR /opt/app-root/src
 # This syncs the environment to match exactly what's in the lockfile
 # pylock.toml was compiled with --find-links=https://download.pytorch.org/whl/rocm6.4
 # so torch comes from ROCm index
-ENV UV_NO_CACHE=1
-RUN uv pip sync --python-platform=linux --python-version=3.12 /tmp/deps/pylock.toml
+#
+# flash-attn requires torch at build time and GPU architecture info, so we:
+# 1. First install torch from ROCm index
+# 2. Set GPU_ARCHS so flash-attn knows what to build for (no GPU needed at build time)
+# 3. Then sync all dependencies with --no-build-isolation
+ENV UV_NO_CACHE=1 \
+    GPU_ARCHS="gfx90a;gfx942" \
+    PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+RUN uv pip install --index-strategy=unsafe-best-match --index-url=https://download.pytorch.org/whl/rocm6.4 --extra-index-url=https://pypi.org/simple "torch==2.9.0+rocm6.4"
+RUN uv pip sync --python-platform=linux --python-version=3.12 --no-build-isolation /tmp/deps/pylock.toml
 ENV UV_NO_CACHE=
 
 # Install kubeflow-sdk from Git (not in pylock.toml or requirements-special.txt)
 # TODO: use aipcc index
 RUN pip install --retries 5 --timeout 300 --no-cache-dir \
     "git+https://github.com/opendatahub-io/kubeflow-sdk@main"
 
-# Install Flash Attention from ROCm fork with Triton AMD backend
-# This is faster to build and optimized for AMD GPUs
-USER 0
-
-# Set build parallelism environment variables
-# MAX_JOBS: Controls PyTorch extension build parallelism
-# CMAKE_BUILD_PARALLEL_LEVEL: Controls CMake parallelism
-# GPU_ARCHS: Target GPU architectures (gfx942=MI300, gfx90a=MI200/MI250)
-ENV GPU_ARCHS="gfx90a;gfx942" \
-    MAX_JOBS=12 \
-    CMAKE_BUILD_PARALLEL_LEVEL=12
-
-# Install Triton and ninja (required for ROCm flash-attention build)
-RUN /opt/app-root/bin/pip install --no-cache-dir triton==3.2.0 ninja
-
-# Enable Triton AMD backend for flash-attention
-ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
-
-RUN cd /tmp \
-    && git clone https://github.com/ROCm/flash-attention.git \
-    && cd flash-attention \
-    && git checkout main_perf \
-    && /opt/app-root/bin/python setup.py install \
-    && cd / && rm -rf /tmp/flash-attention
-
+# flash-attn is included as a transitive dependency from instructlab-training[rocm]
+# in pylock.toml (version 2.8.3), so no separate install needed
 
 # Fix permissions for OpenShift
 ARG PYTHON_VERSION
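The `GPU_ARCHS` and `PYTORCH_ROCM_ARCH` values set above are semicolon-separated lists of AMD GPU targets (gfx90a = MI200/MI250, gfx942 = MI300). A minimal sketch, with a hypothetical helper name, of how a build script might consume such a list:

```python
import os

def parse_gpu_archs(value: str) -> list[str]:
    """Split a semicolon-separated arch list like "gfx90a;gfx942",
    dropping empty entries left by trailing separators."""
    return [arch for arch in value.split(";") if arch]

# Default mirrors the value set in the Dockerfile's ENV instruction.
archs = parse_gpu_archs(os.environ.get("GPU_ARCHS", "gfx90a;gfx942"))
print(archs)
```

With `GPU_ARCHS` unset, this prints `['gfx90a', 'gfx942']`; setting the variable to a single target narrows the build accordingly.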
