-
Notifications
You must be signed in to change notification settings - Fork 26
[RuntimeEnv] Support VGPU containers #728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces GPU support for containers, improves the Ray installation process within containers, and adds unit tests for the image_uri plugin. The changes are generally good, but I have some feedback primarily focused on security and code maintainability.
My main concerns are:
- The use of
--privilegedfor GPU containers poses a significant security risk. - The GPU setup logic in
image_uri.pyhas a lot of duplicated code and relies on hardcoded paths, which could be brittle. - There's an instance of
subprocess.runwithshell=Truewhich is not a best practice. - Some debugging
printstatements have been left in the code.
I've provided specific comments and suggestions to address these points.
| # Use sets to store unique mount destinations | ||
| volume_mounts: Set[str] = set() | ||
| device_mounts: Set[str] = set() | ||
| mount_commands = ["--privileged"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using --privileged grants the container almost all the capabilities of the host machine, which is a significant security risk. Since you are already explicitly mounting necessary devices and libraries, could the --privileged flag be avoided? It's recommended to use more granular permissions, like specific --cap-add flags, instead of giving full privileges.
| mount_commands = ["--privileged"] | |
| mount_commands = [] |
| # Use glob patterns to discover and mount NVIDIA libraries dynamically | ||
| logger.info("Mounting gpu devices and drivers") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This large block of code for GPU support contains many hardcoded paths and glob patterns for NVIDIA drivers and libraries. This approach can be brittle, as it depends on a specific driver installation layout which might vary across different Linux distributions or driver versions. Have you considered leveraging existing tools like nvidia-container-toolkit or nvidia-container-cli to get the required mount paths and devices? This would make the implementation more robust and less dependent on hardcoded paths.
cdf5b5b to
92a6fa4
Compare
| container_command.append(runtime_env_constants.CONTAINER_ENV_PLACEHOLDER) | ||
|
|
||
|
|
||
| if not os.getenv("NVIDIA_VISIBLE_DEVICES"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define constant for "NVIDIA_VISIBLE_DEVICES" is better.
| # mount ray package site path | ||
| host_site_packages_path = get_ray_site_packages_path() | ||
| # Mount only ray package path. | ||
| # Do NOT overwrite podmans' site packages because it may include necessary packages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not overwrite podman's site packages because it may include necessary packages.
| context.env_vars = try_update_runtime_env_vars( | ||
| context.env_vars, redirected_pyenv_folder | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this modification unnecessary?
| container_command.append("--cap-add=AUDIT_WRITE") | ||
| else: | ||
| # GPU mode | ||
| # Use glob patterns to discover and mount NVIDIA libraries dynamically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use glob patterns to discover and mount NVIDIA libraries dynamically.
| if os.path.exists(path): | ||
| volume_mounts.add(f"{path}:{path}:ro") | ||
|
|
||
| # Define all NVIDIA library and file patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only support NVIDIA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually supports Ant's VGPU. This is an inner feature
4bc15a3 to
ce29f05
Compare
ce29f05 to
74e7241
Compare
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Description
image_uriplugin