-
Notifications
You must be signed in to change notification settings - Fork 73
[concept]Remove dependencies in Dockerfile of CUDA and ROCm runtime images for optimization [need testing] #561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…mages for optimization[need testing] Signed-off-by: lilylinh <[email protected]>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
WalkthroughThese changes optimize Ray runtime Docker images by removing CUDA/ROCm development tooling and packages, replacing them with runtime-focused alternatives, and adding post-installation cleanup steps to remove Python caches and debug symbols from shared libraries across four Dockerfile variants. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly related issues
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
images/runtime/ray/rocm/2.52.1-py311-rocm61/Dockerfile (1)
53-58: Verify debug symbol stripping won't break ROCm runtime functionality.Stripping debug symbols from
.sofiles (line 58) can impact:
- Runtime symbol resolution if ROCm libraries have runtime dependencies on symbols
- Diagnostics and debugging if traces or profiling tools depend on symbols
- Correctness of AMDGPU runtime operations
Before merging, confirm that stripping debug symbols from ROCm shared objects is safe for the runtime workloads. Additionally, the hardcoded Python path
/opt/app-root/lib/python3.11/site-packagesassumes Python 3.11 will always be the base image version; if the base image Python version changes in the future, this cleanup will silently fail to run. Consider parameterizing this path or documenting the assumption.The extensive use of
|| truesuppresses all errors during cleanup. While cleanup failures shouldn't block container startup, consider whether some errors warrant logging rather than silent suppression.Consider parameterizing the Python path:
- find /opt/app-root/lib/python3.11/site-packages -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true && \ - find /opt/app-root/lib/python3.11/site-packages -name "*.pyc" -delete 2>/dev/null || true && \ - find /opt/app-root/lib/python3.11/site-packages -name "*.pyo" -delete 2>/dev/null || true && \ - find /opt/app-root/lib/python3.11/site-packages -name "*.so" -exec strip --strip-debug {} \; 2>/dev/null || true + PYTHON_LIB_PATH=$(python3 -c "import site; print(site.getsitepackages()[0])") && \ + find "$PYTHON_LIB_PATH" -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true && \ + find "$PYTHON_LIB_PATH" -name "*.pyc" -delete 2>/dev/null || true && \ + find "$PYTHON_LIB_PATH" -name "*.pyo" -delete 2>/dev/null || true && \ + find "$PYTHON_LIB_PATH" -name "*.so" -exec strip --strip-debug {} \; 2>/dev/null || trueThis makes the cleanup more resilient to Python version changes in the base image.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
images/runtime/ray/cuda/2.52.1-py311-cu121/Dockerfile(2 hunks)images/runtime/ray/cuda/2.52.1-py312-cu128/Dockerfile(2 hunks)images/runtime/ray/rocm/2.52.1-py311-rocm61/Dockerfile(2 hunks)images/runtime/ray/rocm/2.52.1-py312-rocm62/Dockerfile(2 hunks)
🔇 Additional comments (7)
images/runtime/ray/rocm/2.52.1-py311-rocm61/Dockerfile (1)
40-43: Verify that rocm-ml-sdk provides all required runtime libraries.The change from the full ROCm installation to only
rocm-ml-sdkis significant and directly impacts whether Ray/ML workloads will have access to necessary runtime components (e.g., cuBLAS, NCCL equivalents on ROCm). Ensure that testing (noted as still required in the PR) validates thatrocm-ml-sdkalone provides sufficient runtime support for the expected workloads.images/runtime/ray/rocm/2.52.1-py312-rocm62/Dockerfile (2)
53-58: Verify that debug symbol stripping and error suppression align with observability strategy.The cleanup commands strip debug symbols from
.sofiles and use broad error suppression. While reducing image size, stripping eliminates symbolication capability for production crash analysis without image rebuild. The2>/dev/null || truepattern could mask file permission or format issues during cleanup.Recommended verification:
- Confirm whether this is intentional and consistent across all runtime image variants (CUDA, ROCm)
- Verify
.sofile functionality post-stripping (test imports and library linking)- Clarify whether observability strategy relies on alternative debugging mechanisms (logs, metrics, external symbols) to compensate for stripped binaries
Consider adding diagnostic output (e.g.,
find ... -print -exec) if error suppression must remain, to aid troubleshooting when cleanup silently skips files.
40-43: rocm-ml-sdk appropriately provides all necessary runtime dependencies for ML workloads.The change to use
rocm-ml-sdkis sound. This meta-package transitively includes:
rocm-hip-sdk(HIP runtime libraries and core math libraries: hipblas, hipfft, hiprand, hipsparse, hipsolver, hiptensor)rocm-ml-libraries(MIOpen for ML operations)rocm-coreandmiopen-hipThe removed packages (rocm-developer-tools, rocm-opencl-sdk, rocm-openmp-sdk, rocm-utils) are development tools or deprecated components; OpenMP runtime is included transitively via rocm-hip-sdk, and RCCL (ROCm's collective communications library for distributed training) is included in rocm-hip-sdk as well. The assertion of "zero functionality impact" is correct for ML runtime workloads.
images/runtime/ray/cuda/2.52.1-py311-cu121/Dockerfile (2)
77-84: Verify cudnn package version compatibility with CUDA 12.1.This file uses
libcudnn8-8.9.0.131-1for CUDA 12.1, while the py312-cu128 file useslibcudnn9-cuda-12for CUDA 12.8. Confirm that:
- The package name
libcudnn8-${NV_CUDNN_VERSION}.cuda12.1is correct and available in RHEL 9 NVIDIA repos.- cudnn 8.9.0.131 is appropriate/stable for CUDA 12.1 (and the version difference vs. file 1 is intentional).
94-99: Unable to verify claims about CUDA debug symbol stripping risk without repository access.This review references a comparison file (py312 at lines 101–106) and asserts that stripping debug symbols from
.sofiles in CUDA libraries is untested and high-risk. However, without access to the codebase, I cannot confirm whether the py312 file exists, compare the implementations, or validate the severity of this concern. A developer should manually verify: (1) whether both files use identical cleanup approaches, (2) whether stripping CUDA.sofiles has known issues or incompatibilities, and (3) whether this change has been tested in the target environment.images/runtime/ray/cuda/2.52.1-py312-cu128/Dockerfile (2)
101-106: Verify scope of .so file stripping before applying recommendations.The cleanup command
find /opt/app-root/lib/python3.12/site-packages -name "*.so" ...only targets Python site-packages, not CUDA runtime libraries. Python extension modules in site-packages do not require debug symbols for runtime functionality. If untested images are a concern, that should be addressed independently of symbol stripping, which is a standard optimization for Python container images.
82-91: Package naming is correct; no verification issue identified.The
libcudnn9-cuda-12-${NV_CUDNN_VERSION}package name is valid according to NVIDIA's official cuDNN 9 package naming conventions. The full package namelibcudnn9-cuda-12-9.7.0.66-1correctly uses the runtime variant for CUDA 12.x with version suffix. cuDNN 9.7.0 is explicitly supported for CUDA 12.8 per NVIDIA's support matrix.
| # Install CUDA runtime cudnn from: | ||
| # https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/ubi9/runtime/cudnn/Dockerfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix misleading comment reference to incorrect CUDA version.
Line 76 references the NVIDIA Dockerfile from "dist/12.8.0" but this file is for CUDA 12.1. This appears to be a copy-paste error from the other Dockerfile. Update the reference to "dist/12.1.1" for accuracy.
Apply this diff:
-# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/ubi9/runtime/cudnn/Dockerfile
+# https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.1.1/ubi9/runtime/cudnn/Dockerfile📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Install CUDA runtime cudnn from: | |
| # https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/ubi9/runtime/cudnn/Dockerfile | |
| # Install CUDA runtime cudnn from: | |
| # https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.1.1/ubi9/runtime/cudnn/Dockerfile |
🤖 Prompt for AI Agents
In images/runtime/ray/cuda/2.52.1-py311-cu121/Dockerfile around lines 75 to 76,
the comment references the NVIDIA Dockerfile path "dist/12.8.0" which is
incorrect for this CUDA 12.1 image; update the reference to "dist/12.1.1" to
accurately point to the matching NVIDIA cudnn runtime Dockerfile. Replace the
"dist/12.8.0" fragment in the URL with "dist/12.1.1" and keep the rest of the
comment unchanged.
Jira
Description
Changes:
CUDA Dockerfile: Remove development tools and devel packages to reduce image size by ~2.6-4.7 GB
ROCm Dockerfile:
Detail docs
Impact:
How Has This Been Tested?
Need building images and testing
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.