|
1 | | -# Universal Training Container Image |
| 1 | +# Universal ML Workbench Image |
2 | 2 |
|
3 | | -CUDA enabled container image for Training Workbench and Training Runtime in OpenShift AI. |
| 3 | +A FIPS-friendly container image for machine learning workloads, built on top of OpenDataHub's CUDA-enabled Jupyter workbench base image. |
4 | 4 |
|
5 | | -It includes the following layers: |
6 | | -* UBI 9 |
7 | | -* Python 3.12 |
8 | | -* CUDA 12.8 |
9 | | -* PyTorch 2.8.0 |
| 5 | +## Image Overview |
| 6 | + |
| 7 | +**Key Features:** |
| 8 | +- ✅ FIPS-friendly multi-stage build (no build tools in runtime) |
| 9 | +- ✅ Python 3.12 with PyTorch 2.8.0 + CUDA 12.8 |
| 10 | +- ✅ GPU-accelerated ML: flash-attention, Mamba SSM, Transformers |
| 11 | +- ✅ Dependency management via `uv` and `pylock.toml` |
| 12 | +- ✅ RDMA/InfiniBand support for distributed training |
| 13 | +- ✅ Reproducible builds with locked dependencies |
| 14 | + |
| 15 | +**Installed ML Packages:** |
| 16 | +- PyTorch 2.8.0 (CUDA 12.8) |
| 17 | +- Transformers 4.57.1 |
| 18 | +- Accelerate 1.10.0 |
| 19 | +- flash-attn 2.8.3 |
| 20 | +- mamba-ssm 2.2.6.post3 |
| 21 | +- causal-conv1d 1.5.3.post1 |
| 22 | +- vLLM, DeepSpeed, and more (287 total packages) |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Building the Image |
| 27 | + |
| 28 | +### Prerequisites |
| 29 | +- Podman or Docker |
| 30 | +- 20GB+ disk space |
| 31 | + |
| 32 | +### Build Command |
| 33 | + |
| 34 | +```bash |
| 35 | +podman build -t universal-ml:latest . |
| 36 | +``` |
| 37 | + |
| 38 | +**Build arguments** (optional; the base image reference below is an example): |
| 39 | +```bash |
| 40 | +podman build \ |
| 41 | + --build-arg BASE_IMAGE=quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \ |
| 42 | + --build-arg PYTHON_VERSION=3.12 \ |
| 43 | + --build-arg CUDA_VERSION=12.8 \ |
| 44 | + -t universal-ml:latest . |
| 45 | +``` |
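Since the prerequisites allow either Podman or Docker, a small wrapper can pick whichever is installed. This is a minimal sketch assuming a POSIX shell. The function names `pick_container_tool` and `build_image` and the `CONTAINER_TOOL` override variable are hypothetical and not part of this repository.

```bash
# Print the container CLI to use: an explicit CONTAINER_TOOL override wins,
# otherwise the first of podman/docker found on PATH.
pick_container_tool() {
  if [ -n "${CONTAINER_TOOL:-}" ]; then
    echo "$CONTAINER_TOOL"
  elif command -v podman >/dev/null 2>&1; then
    echo podman
  elif command -v docker >/dev/null 2>&1; then
    echo docker
  else
    echo "neither podman nor docker found" >&2
    return 1
  fi
}

# Build the image from the current directory.
# $1 = image tag, $2 = optional BASE_IMAGE override (hypothetical convenience).
# The body runs in a subshell so its variables do not leak into the caller.
build_image() (
  tag="$1"
  base_image="${2:-}"
  tool="$(pick_container_tool)" || exit 1
  if [ -n "$base_image" ]; then
    "$tool" build --build-arg "BASE_IMAGE=$base_image" -t "$tag" .
  else
    "$tool" build -t "$tag" .
  fi
)
```

Usage: `build_image universal-ml:latest` builds with the Dockerfile defaults. Passing a second argument pins the base image reference.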
| 46 | +--- |
| 47 | + |
| 48 | +## Updating Dependencies |
| 49 | + |
| 50 | +This image uses `uv` for Python dependency management with `pyproject.toml` and `pylock.toml` for reproducible builds. |
| 51 | + |
| 52 | +### Understanding the Dependency Layers |
| 53 | + |
| 54 | +The image has **two layers** of dependencies: |
| 55 | + |
| 56 | +1. **Base Image Dependencies** (from `workbench-images`) |
| 57 | + - JupyterLab, notebook extensions, authentication |
| 58 | + - Pre-installed in the base image via its own `pylock.toml` |
| 59 | + |
| 60 | +2. **ML Dependencies** (this image) |
| 61 | + - PyTorch, Transformers, flash-attn, Mamba |
| 62 | + - Defined in `pyproject.toml` and locked in `pylock.toml` |
| 63 | + |
| 64 | +**Important:** `pyproject.toml` includes BOTH base and ML dependencies with exact versions to prevent `uv pip sync` from removing base packages. |
| 65 | + |
| 66 | +### Option 1: Add/Update ML Packages (Recommended) |
| 67 | + |
| 68 | +If you want to add or update ML packages **without touching base packages**: |
| 69 | + |
| 70 | +1. **Edit `pyproject.toml`** - Add/modify packages in the `dependencies` section: |
| 71 | + ```toml |
| 72 | + dependencies = [ |
| 73 | + # ... existing packages ... |
| 74 | + "new-package==1.2.3", # Add new package |
| 75 | + ] |
| 76 | + ``` |
| 77 | + |
| 78 | +2. **Regenerate the lockfile** (requires `uv` installed locally): |
| 79 | + ```bash |
| 80 | + # Install uv if you don't have it |
| 81 | + curl -LsSf https://astral.sh/uv/install.sh | sh |
| 82 | + |
| 83 | + # Regenerate pylock.toml |
| 84 | + uv pip compile pyproject.toml -o pylock.toml |
| 85 | + ``` |
| 86 | + |
| 87 | +3. **Rebuild the image**: |
| 88 | + ```bash |
| 89 | + podman build -t universal-ml:latest . |
| 90 | + ``` |
| 91 | + |
| 92 | +### Option 2: Update Base Image |
| 93 | + |
| 94 | +If the base image (`workbench-images`) has been updated with new versions: |
| 95 | + |
| 96 | +1. **Extract base image dependencies:** |
| 97 | + ```bash |
| 98 | + # Pull the latest base image (example) |
| 99 | + podman pull quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 |
| 100 | + |
| 101 | + # Get the base image's exact installed versions |
| 102 | + podman run --rm --entrypoint pip \ |
| 103 | + quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \ |
| 104 | + list --format freeze > base-freeze.txt |
| 105 | + |
| 106 | + # Get specific package versions |
| 107 | + podman run --rm --entrypoint sh \ |
| 108 | + quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \ |
| 109 | + -lc 'python -m pip show jupyterlab jupyterlab-server jupyter-server jupyterlab-git jupyterlab-pygments | grep -E "^(Name|Version):"' |
| 110 | + ``` |
| 111 | + |
| 112 | +2. **Update `pyproject.toml`** with the new base package versions from `base-freeze.txt` |
| 113 | + |
| 114 | +3. **Regenerate lockfile** and **rebuild** (same as Option 1) |
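Copying pins by hand from the `base-freeze.txt` file produced in step 1 is error-prone. The sketch below converts freeze output into quoted entries that can be pasted into the `dependencies` array. It assumes plain `name==version` lines; the function name `freeze_to_toml_deps` is hypothetical.

```bash
# Read `pip list --format freeze` output on stdin and print one quoted,
# comma-terminated TOML array entry per exact pin (name==version).
# Lines that are not plain pins (blank lines, comments, "name @ url") are skipped.
freeze_to_toml_deps() {
  grep -E '^[A-Za-z0-9._-]+==[^[:space:]]+$' | sed 's/.*/    "&",/'
}
```

Usage: `freeze_to_toml_deps < base-freeze.txt`. Review the output before pasting it, since the freeze list may include packages you do not want to pin.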
| 115 | + |
| 116 | +### Option 3: Full Dependency Refresh |
| 117 | + |
| 118 | +To update ALL packages to their latest compatible versions: |
| 119 | + |
| 120 | +1. **Backup current versions:** |
| 121 | + ```bash |
| 122 | + cp pyproject.toml pyproject.toml.backup |
| 123 | + cp pylock.toml pylock.toml.backup |
| 124 | + ``` |
| 125 | + |
| 126 | +2. **Remove version pins** in `pyproject.toml` (change `==` to `>=` or remove version) |
| 127 | + |
| 128 | +3. **Regenerate lockfile:** |
| 129 | + ```bash |
| 130 | + uv pip compile pyproject.toml -o pylock.toml |
| 131 | + ``` |
| 132 | + |
| 133 | +4. **Rebuild the image** |
| 134 | + |
| 135 | +5. **Test thoroughly** - major version bumps can break compatibility |
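The pin-relaxing step can be scripted rather than edited by hand. This is a minimal sketch, assuming each requirement sits on its own line as a quoted string (as in the `dependencies` example earlier). The function name `relax_pins` is hypothetical.

```bash
# Read pyproject.toml on stdin and rewrite exact pins in quoted requirement
# strings ("name==1.2.3") to lower bounds ("name>=1.2.3").
# Only lines that begin with a quoted requirement are changed; lines such as
# `version = "1.0"` pass through untouched.
relax_pins() {
  sed -E 's/^([[:space:]]*"[^"=<>~!]+)==/\1>=/'
}
```

Usage, after taking the backups from step 1: `relax_pins < pyproject.toml.backup > pyproject.toml`. This relaxes the base packages as well, which matches the "update ALL packages" intent of this option.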
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | +## Special Packages: flash-attn, mamba-ssm, causal-conv1d |
| 140 | + |
| 141 | +These packages require special handling because they: |
| 142 | +- Have CUDA build-time dependencies |
| 143 | +- Cannot be resolved by `uv pip compile` (CUDA check fails) |
| 144 | +- Need `--no-build-isolation` flag during installation |
| 145 | + |
| 146 | +**How they're handled:** |
| 147 | +1. Listed in `requirements-special.txt` (for version tracking) |
| 148 | +2. Excluded from `pyproject.toml` dependencies |
| 149 | +3. Installed separately in Dockerfile with `--no-build-isolation` |
| 150 | + |
| 151 | +**To update these packages:** |
| 152 | +1. Edit `requirements-special.txt`: |
| 153 | + ``` |
| 154 | + flash-attn==2.9.0 |
| 155 | + causal-conv1d==1.6.0 |
| 156 | + mamba-ssm==2.3.0 |
| 157 | + ``` |
| 158 | + |
| 159 | +2. Rebuild the image (no lockfile regeneration needed) |
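Because `requirements-special.txt` is not covered by the lockfile, its `==` pins are the only thing keeping these builds reproducible. A quick check before rebuilding can catch an accidental loose specifier. This is a sketch only; the function name `find_unpinned` is hypothetical.

```bash
# Read a requirements file on stdin and print every non-blank, non-comment
# line that is not an exact `name==version` pin.
# Returns non-zero if any such line is found.
find_unpinned() {
  unpinned="$(grep -Ev '^[[:space:]]*(#.*)?$' \
    | grep -Ev '^[A-Za-z0-9._-]+==[^[:space:]]+$' || true)"
  if [ -n "$unpinned" ]; then
    printf '%s\n' "$unpinned"
    return 1
  fi
}
```

Usage: `find_unpinned < requirements-special.txt || echo "fix the lines above before rebuilding"`.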
| 160 | + |
| 161 | +--- |
| 162 | + |
| 163 | +## File Manifest |
| 164 | + |
| 165 | +| File | Purpose | |
| 166 | +|------|---------| |
| 167 | +| `Dockerfile` | Multi-stage build definition (FIPS-friendly) | |
| 168 | +| `pyproject.toml` | Python project metadata and dependencies | |
| 169 | +| `pylock.toml` | Locked dependency versions (generated by `uv`) | |
| 170 | +| `requirements-special.txt` | CUDA-dependent packages (flash-attn, mamba-ssm, etc.) | |
| 171 | +| `cuda.repo` | NVIDIA CUDA repository configuration | |
| 172 | +| `mellanox.repo` | Mellanox OFED repository configuration | |
| 173 | +| `entrypoint-universal.sh` | Container entrypoint script | |
| 174 | +| `LICENSE.md` | License information | |
| 175 | +| `README.md` | This file | |
| 176 | + |
| 177 | +--- |
| 178 | + |
| 179 | +## Build Architecture |
| 180 | + |
| 181 | +The Dockerfile uses a **five-stage build** for FIPS-friendliness: |
| 182 | + |
| 183 | +1. **`builder`** - Installs `uv` tool (isolated from runtime) |
| 184 | +2. **`base`** - Base image with Python environment |
| 185 | +3. **`system-deps`** - Installs CUDA and RDMA system packages |
| 186 | +4. **`python-deps`** - Installs Python packages with `uv`, then removes build tools |
| 187 | +5. **`final`** - Clean runtime image with only necessary artifacts |
| 188 | + |
| 189 | +--- |
| 190 | + |
| 191 | +## Pushing the Image |
| 192 | + |
| 193 | +```bash |
| 194 | +# Tag for your registry |
| 195 | +podman tag universal-ml:latest quay.io/your-org/universal-ml:latest |
| 196 | + |
| 197 | +# Login to registry |
| 198 | +podman login quay.io |
| 199 | + |
| 200 | +# Push |
| 201 | +podman push quay.io/your-org/universal-ml:latest |
| 202 | +``` |
| 203 | +--- |
| 204 | + |
| 205 | +## License |
| 206 | + |
| 207 | +See `LICENSE.md` for CUDA and component licenses. |