
Conversation

@LilyLinh (Contributor) commented Dec 9, 2025

Jira

Description

Changes:

  • CUDA Dockerfile: Remove development tools and devel packages to reduce image size by ~2.6-4.7 GB

    • Removed: nvcc compiler, cuda-libraries-devel, cuda-minimal-build, nsight-compute, nvprof, cuDNN-devel
    • Kept: Runtime libraries (cuBLAS, cuDNN, NCCL) required for ML workloads
    • Added: Python cache cleanup and debug symbol stripping (~200-500 MB additional savings); a sketch of this cleanup layer follows the list
  • ROCm Dockerfile:

    • Removed rocm-developer-tools, rocm-opencl-sdk, rocm-openmp-sdk, rocm-utils
    • Kept only rocm-ml-sdk
    • Added the same Python cache cleanup and debug symbol stripping

  Detailed docs
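A minimal sketch of that cleanup layer as a single Dockerfile RUN instruction, assuming the UBI /opt/app-root layout and Python 3.11 used by these images (the exact step in the Dockerfiles may differ slightly):

# Remove Python bytecode caches and strip debug symbols from extension modules.
# Every command is guarded so a missing path or unstrippable file cannot fail the build.
RUN find /opt/app-root/lib/python3.11/site-packages -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true && \
    find /opt/app-root/lib/python3.11/site-packages -name "*.pyc" -delete 2>/dev/null || true && \
    find /opt/app-root/lib/python3.11/site-packages -name "*.pyo" -delete 2>/dev/null || true && \
    find /opt/app-root/lib/python3.11/site-packages -name "*.so" -exec strip --strip-debug {} \; 2>/dev/null || true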

Impact:

  • Faster image pulls in Kubernetes/OpenShift
  • Lower storage and bandwidth costs
  • Reduced attack surface
  • Zero functionality impact - all ML workloads continue to work

How Has This Been Tested?

Images still need to be built and tested.

Summary by CodeRabbit

  • Chores
    • Optimized Ray runtime container images for CUDA and ROCm by removing development tooling and unnecessary packages.
    • Implemented automated cleanup of Python cache files and debug symbols to reduce image sizes.
    • Streamlined runtime dependencies across multiple image variants (Python 3.11/3.12, CUDA 12.1/12.8, ROCm 6.1/6.2).


…mages for optimization[need testing]

Signed-off-by: lilylinh <[email protected]>
@openshift-ci openshift-ci bot requested review from kapil27 and kryanbeane December 9, 2025 11:32
@openshift-ci

openshift-ci bot commented Dec 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Dec 9, 2025

Walkthrough

These changes optimize Ray runtime Docker images by removing CUDA/ROCm development tooling and packages, replacing them with runtime-focused alternatives, and adding post-installation cleanup steps to remove Python caches and debug symbols from shared libraries across four Dockerfile variants.

Changes

  • CUDA runtime optimization (images/runtime/ray/cuda/2.52.1-py311-cu121/Dockerfile, images/runtime/ray/cuda/2.52.1-py312-cu128/Dockerfile):
    Removed CUDA devel packages, development environment variables, and the LIBRARY_PATH configuration. Simplified the cudnn installation to runtime-only. Added post-install cleanup: removal of Python caches (__pycache__, .pyc, .pyo) and stripping of debug symbols from .so files.
  • ROCm runtime optimization (images/runtime/ray/rocm/2.52.1-py311-rocm61/Dockerfile, images/runtime/ray/rocm/2.52.1-py312-rocm62/Dockerfile):
    Replaced the multi-package ROCm developer tooling with the streamlined rocm-ml-sdk runtime package. Added comprehensive post-install cleanup: removal of Python cache files and debug symbol stripping from .so files using guarded commands (2>/dev/null, || true).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Areas requiring attention: Verify that removal of development packages (e.g., CUDA devel components, rocm-developer-tools, rocm-opencl-sdk, rocm-openmp-sdk) does not break runtime functionality for Ray workloads; confirm cleanup commands execute without errors and correctly target Python site-packages and library directories; ensure rocm-ml-sdk and streamlined cudnn installations provide adequate runtime support. A quick size-verification sketch follows below.
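As a quick check of the claimed savings, image sizes can be compared before and after the change; a minimal sketch, where the ray-runtime:old and ray-runtime:new tags are hypothetical local builds:

# overall size of each local build
docker image ls --format '{{.Repository}}:{{.Tag}} {{.Size}}' | grep ray-runtime
# per-layer breakdown, to confirm the removed packages and the cleanup layer account for the savings
docker history --format '{{.Size}}\t{{.CreatedBy}}' ray-runtime:new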


Poem

🐰 Off with the toolkits, the debug symbols too,
Runtime images lighter—leaner, sleeker, new!
Cache files vanish, .so files stripped clean,
Ray runs swifter in containers lean.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the main objective of the PR: removing dependencies from CUDA and ROCm runtime Dockerfiles to optimize image size.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
images/runtime/ray/rocm/2.52.1-py311-rocm61/Dockerfile (1)

53-58: Verify debug symbol stripping won't break ROCm runtime functionality.

Stripping debug symbols from .so files (line 58) can impact:

  • Runtime symbol resolution if ROCm libraries have runtime dependencies on symbols
  • Diagnostics and debugging if traces or profiling tools depend on symbols
  • Correctness of AMDGPU runtime operations

Before merging, confirm that stripping debug symbols from ROCm shared objects is safe for the runtime workloads. Additionally, the hardcoded Python path /opt/app-root/lib/python3.11/site-packages assumes Python 3.11 will always be the base image version; if the base image Python version changes in the future, this cleanup will silently fail to run. Consider parameterizing this path or documenting the assumption.

The extensive use of || true suppresses all errors during cleanup. While cleanup failures shouldn't block container startup, consider whether some errors warrant logging rather than silent suppression.

Consider parameterizing the Python path:

-    find /opt/app-root/lib/python3.11/site-packages -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true && \
-    find /opt/app-root/lib/python3.11/site-packages -name "*.pyc" -delete 2>/dev/null || true && \
-    find /opt/app-root/lib/python3.11/site-packages -name "*.pyo" -delete 2>/dev/null || true && \
-    find /opt/app-root/lib/python3.11/site-packages -name "*.so" -exec strip --strip-debug {} \; 2>/dev/null || true
+    PYTHON_LIB_PATH=$(python3 -c "import site; print(site.getsitepackages()[0])") && \
+    find "$PYTHON_LIB_PATH" -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true && \
+    find "$PYTHON_LIB_PATH" -name "*.pyc" -delete 2>/dev/null || true && \
+    find "$PYTHON_LIB_PATH" -name "*.pyo" -delete 2>/dev/null || true && \
+    find "$PYTHON_LIB_PATH" -name "*.so" -exec strip --strip-debug {} \; 2>/dev/null || true

This makes the cleanup more resilient to Python version changes in the base image.
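To confirm that the resolved path matches the previously hardcoded one in the py311 image, a one-off build-time assertion could also be added (hypothetical; the warning text is illustrative and the step never fails the build):

RUN test "$(python3 -c 'import site; print(site.getsitepackages()[0])')" = "/opt/app-root/lib/python3.11/site-packages" || \
    echo "WARNING: site-packages path differs from the hardcoded cleanup path"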

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0f477d and a532e7a.

📒 Files selected for processing (4)
  • images/runtime/ray/cuda/2.52.1-py311-cu121/Dockerfile (2 hunks)
  • images/runtime/ray/cuda/2.52.1-py312-cu128/Dockerfile (2 hunks)
  • images/runtime/ray/rocm/2.52.1-py311-rocm61/Dockerfile (2 hunks)
  • images/runtime/ray/rocm/2.52.1-py312-rocm62/Dockerfile (2 hunks)
🔇 Additional comments (7)
images/runtime/ray/rocm/2.52.1-py311-rocm61/Dockerfile (1)

40-43: Verify that rocm-ml-sdk provides all required runtime libraries.

The change from the full ROCm installation to only rocm-ml-sdk is significant and directly impacts whether Ray/ML workloads will have access to necessary runtime components (e.g., cuBLAS, NCCL equivalents on ROCm). Ensure that testing (noted as still required in the PR) validates that rocm-ml-sdk alone provides sufficient runtime support for the expected workloads.

images/runtime/ray/rocm/2.52.1-py312-rocm62/Dockerfile (2)

53-58: Verify that debug symbol stripping and error suppression align with observability strategy.

The cleanup commands strip debug symbols from .so files and use broad error suppression. While reducing image size, stripping eliminates symbolication capability for production crash analysis without image rebuild. The 2>/dev/null || true pattern could mask file permission or format issues during cleanup.

Recommended verification:

  1. Confirm whether this is intentional and consistent across all runtime image variants (CUDA, ROCm)
  2. Verify .so file functionality post-stripping (test imports and library linking)
  3. Clarify whether observability strategy relies on alternative debugging mechanisms (logs, metrics, external symbols) to compensate for stripped binaries

Consider adding diagnostic output (e.g., find ... -print -exec) if error suppression must remain, to aid troubleshooting when cleanup silently skips files.
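A sketch of that diagnostic variant, assuming the py312 site-packages path used by this image; -print is the only addition, so each stripped file is echoed to the build log while failures stay non-fatal:

RUN find /opt/app-root/lib/python3.12/site-packages -name "*.so" -print -exec strip --strip-debug {} \; 2>/dev/null || true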


40-43: rocm-ml-sdk appropriately provides all necessary runtime dependencies for ML workloads.

The change to use rocm-ml-sdk is sound. This meta-package transitively includes:

  • rocm-hip-sdk (HIP runtime libraries and core math libraries: hipblas, hipfft, hiprand, hipsparse, hipsolver, hiptensor)
  • rocm-ml-libraries (MIOpen for ML operations)
  • rocm-core and miopen-hip

The removed packages (rocm-developer-tools, rocm-opencl-sdk, rocm-openmp-sdk, rocm-utils) are development tools or deprecated components; OpenMP runtime is included transitively via rocm-hip-sdk, and RCCL (ROCm's collective communications library for distributed training) is included in rocm-hip-sdk as well. The assertion of "zero functionality impact" is correct for ML runtime workloads.
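One way to confirm these transitive contents is to query the meta-package in a build container; a sketch, assuming the ROCm 6.x dnf repository is already configured:

# direct requirements declared by the rocm-ml-sdk meta-package
dnf repoquery --requires rocm-ml-sdk
# resolve those requirements to the concrete packages that would be installed
dnf repoquery --resolve --requires rocm-ml-sdk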

images/runtime/ray/cuda/2.52.1-py311-cu121/Dockerfile (2)

77-84: Verify cudnn package version compatibility with CUDA 12.1.

This file uses libcudnn8-8.9.0.131-1 for CUDA 12.1, while the py312-cu128 file uses libcudnn9-cuda-12 for CUDA 12.8. Confirm the following (a repo query sketch follows the list):

  1. The package name libcudnn8-${NV_CUDNN_VERSION}.cuda12.1 is correct and available in the RHEL 9 NVIDIA repos.
  2. cudnn 8.9.0.131 is appropriate and stable for CUDA 12.1 (and the version difference vs. the py312-cu128 file is intentional).
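A quick availability check, sketched with dnf and assuming the cuda-rhel9 repository used by the NVIDIA UBI base images is already enabled:

# list all available builds of the cudnn8 packages and look for the pinned version
dnf list --showduplicates 'libcudnn8*' | grep 8.9.0.131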

94-99: Unable to verify claims about CUDA debug symbol stripping risk without repository access.

This review references a comparison file (py312 at lines 101–106) and asserts that stripping debug symbols from .so files in CUDA libraries is untested and high-risk. However, without access to the codebase, I cannot confirm whether the py312 file exists, compare the implementations, or validate the severity of this concern. A developer should manually verify: (1) whether both files use identical cleanup approaches, (2) whether stripping CUDA .so files has known issues or incompatibilities, and (3) whether this change has been tested in the target environment.

images/runtime/ray/cuda/2.52.1-py312-cu128/Dockerfile (2)

101-106: Verify scope of .so file stripping before applying recommendations.

The cleanup command find /opt/app-root/lib/python3.12/site-packages -name "*.so" ... only targets Python site-packages, not CUDA runtime libraries. Python extension modules in site-packages do not require debug symbols for runtime functionality. If untested images are a concern, that should be addressed independently of symbol stripping, which is a standard optimization for Python container images.
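A minimal smoke test for the stripped extension modules, run against the built image (the tag is hypothetical, and this assumes the image's entrypoint allows direct command execution):

# importing compiled-extension-heavy packages exercises the stripped .so files
docker run --rm ray-runtime:py312-cu128 \
    python3 -c "import ray, numpy; print(ray.__version__, numpy.__version__)"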


82-91: Package naming is correct; no verification issue identified.

The libcudnn9-cuda-12-${NV_CUDNN_VERSION} package name is valid according to NVIDIA's official cuDNN 9 package naming conventions. The full package name libcudnn9-cuda-12-9.7.0.66-1 correctly uses the runtime variant for CUDA 12.x with version suffix. cuDNN 9.7.0 is explicitly supported for CUDA 12.8 per NVIDIA's support matrix.

Comment on lines +75 to +76
# Install CUDA runtime cudnn from:
# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/ubi9/runtime/cudnn/Dockerfile


⚠️ Potential issue | 🟡 Minor

Fix misleading comment reference to incorrect CUDA version.

Line 76 references the NVIDIA Dockerfile from "dist/12.8.0" but this file is for CUDA 12.1. This appears to be a copy-paste error from the other Dockerfile. Update the reference to "dist/12.1.1" for accuracy.

Apply this diff:

-# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/ubi9/runtime/cudnn/Dockerfile
+# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.1.1/ubi9/runtime/cudnn/Dockerfile
🤖 Prompt for AI Agents
In images/runtime/ray/cuda/2.52.1-py311-cu121/Dockerfile around lines 75 to 76,
the comment references the NVIDIA Dockerfile path "dist/12.8.0" which is
incorrect for this CUDA 12.1 image; update the reference to "dist/12.1.1" to
accurately point to the matching NVIDIA cudnn runtime Dockerfile. Replace the
"dist/12.8.0" fragment in the URL with "dist/12.1.1" and keep the rest of the
comment unchanged.

@LilyLinh LilyLinh marked this pull request as draft December 9, 2025 11:46
