red-hat-data-services
diff --git a/images/universal/training/th03-cuda128-torch280-py312/Dockerfile b/images/universal/training/th03-cuda128-torch280-py312/Dockerfile
Lines changed: 225 additions & 126 deletions
diff --git a/images/universal/training/th03-cuda128-torch280-py312/README.md b/images/universal/training/th03-cuda128-torch280-py312/README.md
Lines changed: 205 additions & 7 deletions
diff --git a/images/universal/training/th03-cuda128-torch280-py312/cuda.repo b/images/universal/training/th03-cuda128-torch280-py312/cuda.repo
Lines changed: 6 additions & 0 deletions
diff --git a/images/universal/training/th03-cuda128-torch280-py312/mellanox.repo b/images/universal/training/th03-cuda128-torch280-py312/mellanox.repo
Lines changed: 6 additions & 0 deletions
@@ -1,9 +1,207 @@
-# Universal Training Container Image
+# Universal ML Workbench Image
 
-CUDA enabled container image for Training Workbench and Training Runtime in OpenShift AI.
+A FIPS-friendly container image for machine learning workloads, built on top of OpenDataHub's CUDA-enabled Jupyter workbench base image.
 
-It includes the following layers:
-* UBI 9
-* Python 3.12
-* CUDA 12.8
-* PyTorch 2.8.0
+## Image Overview
+
+**Key Features:**
+- ✅ FIPS-friendly multi-stage build (no build tools in runtime)
+- ✅ Python 3.12 with PyTorch 2.8.0 + CUDA 12.8
+- ✅ GPU-accelerated ML: flash-attention, Mamba SSM, Transformers
+- ✅ Dependency management via `uv` and `pylock.toml`
+- ✅ RDMA/InfiniBand support for distributed training
+- ✅ Reproducible builds with locked dependencies
+
+**Installed ML Packages:**
+- PyTorch 2.8.0 (CUDA 12.8)
+- Transformers 4.57.1
+- Accelerate 1.10.0
+- flash-attn 2.8.3
+- mamba-ssm 2.2.6.post3
+- causal-conv1d 1.5.3.post1
+- vLLM, DeepSpeed, and more (287 total packages)
+
+---
+
+## Building the Image
+
+### Prerequisites
+- Podman or Docker
+- 20GB+ disk space
+
+### Build Command
+
+```bash
+podman build -t universal-ml:latest .
+```
+
+**Build arguments** (optional; the image reference below is an example):
+```bash
+podman build \
+  --build-arg BASE_IMAGE=quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \
+  --build-arg PYTHON_VERSION=3.12 \
+  --build-arg CUDA_VERSION=12.8 \
+  -t universal-ml:latest .
+```
+
+---
+
+## Updating Dependencies
+
+This image uses `uv` for Python dependency management with `pyproject.toml` and `pylock.toml` for reproducible builds.
+
+### Understanding the Dependency Layers
+
+The image has **two layers** of dependencies:
+
+1. **Base Image Dependencies** (from `workbench-images`)
+   - JupyterLab, notebook extensions, authentication
+   - Pre-installed in the base image via its own `pylock.toml`
+
+2. **ML Dependencies** (this image)
+   - PyTorch, Transformers, flash-attn, Mamba
+   - Defined in `pyproject.toml` and locked in `pylock.toml`
+
+**Important:** `pyproject.toml` includes BOTH base and ML dependencies with exact versions to prevent `uv pip sync` from removing base packages.
+
+### Option 1: Add/Update ML Packages (Recommended)
+
+If you want to add or update ML packages **without touching base packages**:
+
+1. **Edit `pyproject.toml`** - Add/modify packages in the `dependencies` section:
+   ```toml
+   dependencies = [
+       # ... existing packages ...
+       "new-package==1.2.3",  # Add new package
+   ]
+   ```
+
+2. **Regenerate the lockfile** (requires `uv` installed locally):
+   ```bash
+   # Install uv if you don't have it
+   curl -LsSf https://astral.sh/uv/install.sh | sh
+   
+   # Regenerate pylock.toml
+   uv pip compile pyproject.toml -o pylock.toml
+   ```
+
+3. **Rebuild the image**:
+   ```bash
+   podman build -t universal-ml:latest .
+   ```
+
+### Option 2: Update Base Image
+
+If the base image (`workbench-images`) has been updated with new versions:
+
+1. **Extract base image dependencies:**
+   ```bash
+   # Pull the latest base image (example)
+   podman pull quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903
+   
+   # Get the base image's exact installed versions
+   podman run --rm --entrypoint pip \
+   quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \
+   list --format freeze > base-freeze.txt
+
+   # Get the versions of specific packages
+   podman run --rm --entrypoint sh \
+   quay.io/opendatahub/workbench-images:cuda-jupyter-minimal-ubi9-python-3.12-2025a_20250903 \
+   -lc 'python -m pip show jupyterlab jupyterlab-server jupyter-server jupyterlab-git jupyterlab-pygments | grep -E "Name|Version"'
+   ```
+
+2. **Update `pyproject.toml`** with the new base package versions from `base-freeze.txt`
+
+3. **Regenerate lockfile** and **rebuild** (same as Option 1)
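
Step 2 of Option 2 can be checked mechanically. The following is a minimal sketch, not part of this PR: it assumes `base-freeze.txt` holds `name==version` lines (as `pip list --format freeze` produces) and that `pyproject.toml` writes each pin as a quoted `"name==version"` string. The function name `missing_base_pins` is hypothetical, and the comparison is a literal string match, so differences in name casing or normalization show up as mismatches.

```shell
#!/bin/sh
# missing_base_pins FREEZE_FILE PYPROJECT
# Prints every "name==version" line from FREEZE_FILE that does not appear
# as a quoted pin ("name==version") in PYPROJECT.
missing_base_pins() {
    freeze_file=$1
    pyproject=$2
    while IFS= read -r pin; do
        # Skip blank lines and anything that is not a plain name==version pin
        case $pin in
            *==*) ;;
            *) continue ;;
        esac
        # -F: fixed-string match, -q: quiet; look for the pin in double quotes
        if ! grep -Fq "\"$pin\"" "$pyproject"; then
            printf '%s\n' "$pin"
        fi
    done < "$freeze_file"
}
```

Running `missing_base_pins base-freeze.txt pyproject.toml` lists the base pins that still need to be copied into `pyproject.toml`; empty output means every frozen base version is already pinned.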
+
+### Option 3: Full Dependency Refresh
+
+To update ALL packages to their latest compatible versions:
+
+1. **Backup current versions:**
+   ```bash
+   cp pyproject.toml pyproject.toml.backup
+   cp pylock.toml pylock.toml.backup
+   ```
+
+2. **Remove version pins** in `pyproject.toml` (change `==` to `>=`, or drop the version specifier entirely)
+
+3. **Regenerate lockfile:**
+   ```bash
+   uv pip compile pyproject.toml -o pylock.toml
+   ```
+
+4. **Test thoroughly** - major version bumps can break compatibility
+
+5. **Rebuild the image**
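
Step 2 of Option 3 (changing `==` to `>=`) could be scripted rather than edited by hand. This is a sketch under assumptions, not something this PR ships: it assumes each requirement sits on its own line as a quoted string, optionally with extras such as `"pkg[extra]==1.0"`, and the function name `loosen_pins` is hypothetical. Only the first `==` after the package name is rewritten, so environment markers and a `requires-python` line are left alone.

```shell
#!/bin/sh
# loosen_pins PYPROJECT
# Writes PYPROJECT to stdout with every quoted "name==version" requirement
# rewritten to "name>=version". The input file itself is not modified.
loosen_pins() {
    # Anchor at line start: optional indent, opening quote, package name,
    # optional [extras], then the == that gets replaced by >=
    sed -E 's/^([[:space:]]*"[A-Za-z0-9._-]+(\[[^]]*\])?)==/\1>=/' "$1"
}
```

With the backup from step 1 in place, `loosen_pins pyproject.toml.backup > pyproject.toml` regenerates the loosened file without reading and writing the same path at once.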
+
+---
+
+## Special Packages: flash-attn, mamba-ssm, causal-conv1d
+
+These packages require special handling because they:
+- Have CUDA build-time dependencies
+- Cannot be resolved by `uv pip compile` (CUDA check fails)
+- Need `--no-build-isolation` flag during installation
+
+**How they're handled:**
+1. Listed in `requirements-special.txt` (for version tracking)
+2. Excluded from `pyproject.toml` dependencies
+3. Installed separately in Dockerfile with `--no-build-isolation`
+
+**To update these packages:**
+1. Edit `requirements-special.txt`:
+   ```
+   flash-attn==2.9.0
+   causal-conv1d==1.6.0
+   mamba-ssm==2.3.0
+   ```
+
+2. Rebuild the image (no lockfile regeneration needed)
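
The Dockerfile commands for these packages are not shown in this diff, so the following is only a sketch of what the separate install step might look like. It assumes `uv` is on the `PATH` and that `requirements-special.txt` lists one pinned requirement per line; the function name `install_special` and the `DRY_RUN` switch are hypothetical. Requirements are installed one at a time, in file order, so a package listed earlier (for example `causal-conv1d`) is already present when a later one (for example `mamba-ssm`) builds.

```shell
#!/bin/sh
# install_special REQUIREMENTS_FILE
# Installs each pinned requirement without build isolation, so the build can
# see the torch and CUDA toolchain already present in the environment.
# Set DRY_RUN=1 to print the commands instead of running them.
install_special() {
    # Drop comment and blank lines, keep one requirement per line
    grep -Ev '^[[:space:]]*(#|$)' "$1" | while IFS= read -r req; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf 'uv pip install --no-build-isolation %s\n' "$req"
        else
            # Stop at the first failed build
            uv pip install --no-build-isolation "$req" || exit 1
        fi
    done
}
```

Calling `DRY_RUN=1 install_special requirements-special.txt` prints the planned commands, which is a cheap way to confirm the file parses as expected before starting a long CUDA build.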
+
+---
+
+## File Manifest
+
+| File | Purpose |
+|------|---------|
+| `Dockerfile` | Multi-stage build definition (FIPS-friendly) |
+| `pyproject.toml` | Python project metadata and dependencies |
+| `pylock.toml` | Locked dependency versions (generated by `uv`) |
+| `requirements-special.txt` | CUDA-dependent packages (flash-attn, mamba-ssm, etc.) |
+| `cuda.repo` | NVIDIA CUDA repository configuration |
+| `mellanox.repo` | Mellanox OFED repository configuration |
+| `entrypoint-universal.sh` | Container entrypoint script |
+| `LICENSE.md` | License information |
+| `README.md` | This file |
+
+---
+
+## Build Architecture
+
+The Dockerfile uses a **five-stage build** for FIPS-friendliness:
+
+1. **`builder`** - Installs `uv` tool (isolated from runtime)
+2. **`base`** - Base image with Python environment
+3. **`system-deps`** - Installs CUDA and RDMA system packages
+4. **`python-deps`** - Installs Python packages with `uv`, then removes build tools
+5. **`final`** - Clean runtime image with only necessary artifacts
+
+---
+
+## Pushing the Image
+
+```bash
+# Tag for your registry
+podman tag universal-ml:latest quay.io/your-org/universal-ml:latest
+
+# Login to registry
+podman login quay.io
+
+# Push
+podman push quay.io/your-org/universal-ml:latest
+```
+
+---
+
+## License
+
+See `LICENSE.md` for CUDA and component licenses.
@@ -0,0 +1,6 @@
+[cuda-rhel9-x86_64]
+name=CUDA Repository for RHEL9 x86_64
+baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64
+enabled=1
+gpgcheck=0
+
@@ -0,0 +1,6 @@
+[mlnx_ofed_24.10-1.1.4.0_base]
+name=Mellanox OFED Repository 24.10-1.1.4.0
+baseurl=https://linux.mellanox.com/public/repo/mlnx_ofed/24.10-1.1.4.0/rhel9.5/x86_64
+enabled=1
+gpgcheck=0
+