Commit 8cfb923

Merge remote-tracking branch 'upstream/main' into rhoai-3.2
2 parents 11069d6 + 4ee6fc5 commit 8cfb923

File tree

22 files changed

+10968
-139
lines changed

images/universal/training/th03-cuda128-torch280-py312/Dockerfile

Lines changed: 225 additions & 126 deletions
Large diffs are not rendered by default.
Lines changed: 205 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,207 @@
1-
# Universal Training Container Image
1+
# Universal ML Workbench Image
22

3-
CUDA enabled container image for Training Workbench and Training Runtime in OpenShift AI.
3+
A FIPS-friendly container image for machine learning workloads, built on top of OpenDataHub's CUDA-enabled Jupyter workbench base image.
44

5-
It includes the following layers:
6-
* UBI 9
7-
* Python 3.12
8-
* CUDA 12.8
9-
* PyTorch 2.8.0
5+
## Image Overview
6+
7+
**Key Features:**
8+
- ✅ FIPS-friendly multi-stage build (no build tools in runtime)
9+
- ✅ Python 3.12 with PyTorch 2.8.0 + CUDA 12.8
10+
- ✅ GPU-accelerated ML: flash-attention, Mamba SSM, Transformers
11+
- ✅ Dependency management via `uv` and `pylock.toml`
12+
- ✅ RDMA/InfiniBand support for distributed training
13+
- ✅ Reproducible builds with locked dependencies
14+
15+
**Installed ML Packages:**
16+
- PyTorch 2.8.0 (CUDA 12.8)
17+
- Transformers 4.57.1
18+
- Accelerate 1.10.0
19+
- flash-attn 2.8.3
20+
- mamba-ssm 2.2.6.post3
21+
- causal-conv1d 1.5.3.post1
22+
- vLLM, DeepSpeed, and more (287 total packages)
23+
24+
---
25+
26+
## Building the Image
27+
28+
### Prerequisites
29+
- Podman or Docker
30+
- 20GB+ disk space
31+
32+
### Build Command
33+
34+
```bash
35+
podman build -t universal-ml:latest .
36+
```
37+
38+
**Build arguments** (optional; the image reference below is an example):
39+
```bash
40+
podman build \
41+
--build-arg BASE_IMAGE=quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \
42+
--build-arg PYTHON_VERSION=3.12 \
43+
--build-arg CUDA_VERSION=12.8 \
44+
-t universal-ml:latest .
45+
```
46+
---
47+
48+
## Updating Dependencies
49+
50+
This image uses `uv` for Python dependency management with `pyproject.toml` and `pylock.toml` for reproducible builds.
51+
52+
### Understanding the Dependency Layers
53+
54+
The image has **two layers** of dependencies:
55+
56+
1. **Base Image Dependencies** (from `workbench-images`)
57+
- JupyterLab, notebook extensions, authentication
58+
- Pre-installed in the base image via its own `pylock.toml`
59+
60+
2. **ML Dependencies** (this image)
61+
- PyTorch, Transformers, flash-attn, Mamba
62+
- Defined in `pyproject.toml` and locked in `pylock.toml`
63+
64+
**Important:** `pyproject.toml` includes BOTH base and ML dependencies with exact versions. `uv pip sync` makes the environment match the lockfile exactly, so any base package missing from the lockfile would be uninstalled.
65+
66+
### Option 1: Add/Update ML Packages (Recommended)
67+
68+
If you want to add or update ML packages **without touching base packages**:
69+
70+
1. **Edit `pyproject.toml`** - Add/modify packages in the `dependencies` section:
71+
```toml
72+
dependencies = [
73+
# ... existing packages ...
74+
"new-package==1.2.3", # Add new package
75+
]
76+
```
77+
78+
2. **Regenerate the lockfile** (requires `uv` installed locally):
79+
```bash
80+
# Install uv if you don't have it
81+
curl -LsSf https://astral.sh/uv/install.sh | sh
82+
83+
# Regenerate pylock.toml
84+
uv pip compile pyproject.toml -o pylock.toml
85+
```
86+
87+
3. **Rebuild the image**:
88+
```bash
89+
podman build -t universal-ml:latest .
90+
```
91+
92+
### Option 2: Update Base Image
93+
94+
If the base image (`workbench-images`) has been updated with new versions:
95+
96+
1. **Extract base image dependencies:**
97+
```bash
98+
# Pull the latest base image (example)
99+
podman pull quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903
100+
101+
# Get the base image's exact installed versions
102+
podman run --rm --entrypoint pip \
103+
quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \
104+
list --format freeze > base-freeze.txt
105+
106+
# Get the versions of specific packages
107+
podman run --rm --entrypoint sh \
108+
quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \
109+
-lc 'python -m pip show jupyterlab jupyterlab-server jupyter-server jupyterlab-git jupyterlab-pygments | grep -E "^(Name|Version):"'
110+
```
111+
112+
2. **Update `pyproject.toml`** with new base package versions from `base-freeze.txt`
113+
114+
3. **Regenerate lockfile** and **rebuild** (same as Option 1)
115+
116+
### Option 3: Full Dependency Refresh
117+
118+
To update ALL packages to their latest compatible versions:
119+
120+
1. **Backup current versions:**
121+
```bash
122+
cp pyproject.toml pyproject.toml.backup
123+
cp pylock.toml pylock.toml.backup
124+
```
125+
126+
2. **Remove version pins** in `pyproject.toml` (change `==` to `>=` or remove version)
127+
128+
3. **Regenerate lockfile:**
129+
```bash
130+
uv pip compile pyproject.toml -o pylock.toml
131+
```
132+
133+
4. **Test thoroughly** - major version bumps can break compatibility
134+
135+
5. **Rebuild the image**
136+
137+
---
138+
139+
## Special Packages: flash-attn, mamba-ssm, causal-conv1d
140+
141+
These packages require special handling because they:
142+
- Have CUDA build-time dependencies
143+
- Cannot be resolved by `uv pip compile` (CUDA check fails)
144+
- Need `--no-build-isolation` flag during installation
145+
146+
**How they're handled:**
147+
1. Listed in `requirements-special.txt` (for version tracking)
148+
2. Excluded from `pyproject.toml` dependencies
149+
3. Installed separately in Dockerfile with `--no-build-isolation`
150+
151+
**To update these packages:**
152+
1. Edit `requirements-special.txt`:
153+
```
154+
flash-attn==2.9.0
155+
causal-conv1d==1.6.0
156+
mamba-ssm==2.3.0
157+
```
158+
159+
2. Rebuild the image (no lockfile regeneration needed)
160+
161+
---
162+
163+
## File Manifest
164+
165+
| File | Purpose |
166+
|------|---------|
167+
| `Dockerfile` | Multi-stage build definition (FIPS-friendly) |
168+
| `pyproject.toml` | Python project metadata and dependencies |
169+
| `pylock.toml` | Locked dependency versions (generated by `uv`) |
170+
| `requirements-special.txt` | CUDA-dependent packages (flash-attn, mamba-ssm, etc.) |
171+
| `cuda.repo` | NVIDIA CUDA repository configuration |
172+
| `mellanox.repo` | Mellanox OFED repository configuration |
173+
| `entrypoint-universal.sh` | Container entrypoint script |
174+
| `LICENSE.md` | License information |
175+
| `README.md` | This file |
176+
177+
---
178+
179+
## Build Architecture
180+
181+
The Dockerfile uses a **5-stage multi-stage build** for FIPS-friendliness:
182+
183+
1. **`builder`** - Installs `uv` tool (isolated from runtime)
184+
2. **`base`** - Base image with Python environment
185+
3. **`system-deps`** - Installs CUDA and RDMA system packages
186+
4. **`python-deps`** - Installs Python packages with `uv`, then removes build tools
187+
5. **`final`** - Clean runtime image with only necessary artifacts
188+
189+
---
190+
191+
## Pushing the Image
192+
193+
```bash
194+
# Tag for your registry
195+
podman tag universal-ml:latest quay.io/your-org/universal-ml:latest
196+
197+
# Login to registry
198+
podman login quay.io
199+
200+
# Push
201+
podman push quay.io/your-org/universal-ml:latest
202+
```
203+
---
204+
205+
## License
206+
207+
See `LICENSE.md` for CUDA and component licenses.
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
[cuda-rhel9-x86_64]
2+
name=CUDA Repository for RHEL9 x86_64
3+
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64
4+
enabled=1
5+
gpgcheck=0
6+
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
[mlnx_ofed_24.10-1.1.4.0_base]
2+
name=Mellanox OFED Repository 24.10-1.1.4.0
3+
baseurl=https://linux.mellanox.com/public/repo/mlnx_ofed/24.10-1.1.4.0/rhel9.5/x86_64
4+
enabled=1
5+
gpgcheck=0
6+
