
Conversation

@harryge00 (Collaborator) commented Jan 4, 2026

Description


  1. Added support for GPU containers: mount CUDA drivers and devices into podman containers when GPU is enabled.
  2. Always uninstall ray before installing it, because users' container images may already have ray installed, which would conflict with the host's ant-ray (see the sketch after this list).
  3. Mount only the ray package into containers. Previously, the whole site-packages path was mounted from host to container, which is problematic: e.g., vllm may be installed inside the podman container but not on the host (host images are generally more basic), so the mount would hide it.
  4. Added unit tests for the image_uri plugin.
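
To illustrate item 2, a minimal sketch assuming the setup step shells out to pip inside the container's Python environment; the exact command and package names in the PR may differ:

import subprocess

# Illustrative only: remove any ray preinstalled in the user's image so it
# cannot conflict with the host's ant-ray that the plugin wires in afterwards.
subprocess.run(["pip", "uninstall", "-y", "ray"], check=False)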

@gemini-code-assist bot left a comment


Code Review

This pull request introduces GPU support for containers, improves the Ray installation process within containers, and adds unit tests for the image_uri plugin. The changes are generally good, but I have some feedback primarily focused on security and code maintainability.

My main concerns are:

  • The use of --privileged for GPU containers poses a significant security risk.
  • The GPU setup logic in image_uri.py has a lot of duplicated code and relies on hardcoded paths, which could be brittle.
  • There's an instance of subprocess.run with shell=True, which is not a best practice (see the sketch after this list).
  • Some debugging print statements have been left in the code.

I've provided specific comments and suggestions to address these points.
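
On the shell=True point above, a hedged sketch of the safer pattern; the podman command shown is only an example, not the call used in the PR:

import subprocess

# Pass an argument list instead of a shell string: no shell injection surface
# and no unexpected word splitting.
subprocess.run(["podman", "image", "exists", "docker.io/library/ubuntu:22.04"], check=False)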

# Use sets to store unique mount destinations
volume_mounts: Set[str] = set()
device_mounts: Set[str] = set()
mount_commands = ["--privileged"]


critical (security)

Using --privileged grants the container almost all the capabilities of the host machine, which is a significant security risk. Since you are already explicitly mounting necessary devices and libraries, could the --privileged flag be avoided? It's recommended to use more granular permissions, like specific --cap-add flags, instead of giving full privileges.

Suggested change
mount_commands = ["--privileged"]
mount_commands = []
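
A minimal sketch of the reviewer's suggestion, assuming the required device nodes are mounted explicitly elsewhere in this function; the device paths and flags below are illustrative:

# Illustrative only: build narrowly scoped options instead of --privileged.
mount_commands = []
for device in ("/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"):
    mount_commands.extend(["--device", device])
# If SELinux blocks the bind mounts, disable relabeling for this container
# rather than granting full privileges.
mount_commands.append("--security-opt=label=disable")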

Comment on lines +288 to +287
# Use glob patterns to discover and mount NVIDIA libraries dynamically
logger.info("Mounting gpu devices and drivers")


medium

This large block of code for GPU support contains many hardcoded paths and glob patterns for NVIDIA drivers and libraries. This approach can be brittle, as it depends on a specific driver installation layout which might vary across different Linux distributions or driver versions. Have you considered leveraging existing tools like nvidia-container-toolkit or nvidia-container-cli to get the required mount paths and devices? This would make the implementation more robust and less dependent on hardcoded paths.
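
For context, a sketch of the discovery approach the bot mentions; it assumes nvidia-container-cli (from libnvidia-container) is installed on the host and that its plain-text output is one path per line:

import subprocess

# Ask NVIDIA's container tooling which device nodes and libraries a GPU
# container needs, instead of maintaining hardcoded path lists.
result = subprocess.run(
    ["nvidia-container-cli", "list"], capture_output=True, text=True, check=True
)
host_paths = [line.strip() for line in result.stdout.splitlines() if line.strip()]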

container_command.append(runtime_env_constants.CONTAINER_ENV_PLACEHOLDER)


if not os.getenv("NVIDIA_VISIBLE_DEVICES"):
Collaborator


Defining a constant for "NVIDIA_VISIBLE_DEVICES" would be better.
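
A minimal sketch of that suggestion; placing the constant alongside the other runtime-env constants is an assumption, and the constant name is illustrative:

import os

# In runtime_env_constants (assumed location; illustrative name):
NVIDIA_VISIBLE_DEVICES_ENV_VAR = "NVIDIA_VISIBLE_DEVICES"

# At the call site, instead of the string literal:
if not os.getenv(NVIDIA_VISIBLE_DEVICES_ENV_VAR):
    pass  # CPU-only path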

# mount ray package site path
host_site_packages_path = get_ray_site_packages_path()
# Mount only ray package path.
# Do NOT overwrite podmans' site packages because it may include necessary packages.
Collaborator


Do not overwrite podman's site packages because it may include necessary packages.
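
To illustrate item 3 of the description, a hedged sketch of mounting only the ray package directory instead of all of site-packages; get_ray_site_packages_path and volume_mounts come from the surrounding diff, while the container-side destination is an assumption:

import os

host_site_packages_path = get_ray_site_packages_path()
host_ray_path = os.path.join(host_site_packages_path, "ray")
# Destination inside the container is illustrative; the real value would come
# from the image's own site-packages location.
container_ray_path = "/usr/lib/python3.10/site-packages/ray"
volume_mounts.add(f"{host_ray_path}:{container_ray_path}:ro")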

context.env_vars = try_update_runtime_env_vars(
context.env_vars, redirected_pyenv_folder
)

Collaborator


Is this modification necessary?

container_command.append("--cap-add=AUDIT_WRITE")
else:
# GPU mode
# Use glob patterns to discover and mount NVIDIA libraries dynamically
Collaborator


Use glob patterns to discover and mount NVIDIA libraries dynamically.
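
For context, a minimal sketch of glob-based discovery feeding the volume_mounts set from the diff; the patterns below are illustrative and are not the PR's actual list:

import glob
import os

nvidia_lib_patterns = [
    "/usr/lib/x86_64-linux-gnu/libnvidia-*.so*",  # illustrative
    "/usr/lib64/libcuda*.so*",                    # illustrative
]
for pattern in nvidia_lib_patterns:
    for path in glob.glob(pattern):
        if os.path.exists(path):
            volume_mounts.add(f"{path}:{path}:ro")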

if os.path.exists(path):
volume_mounts.add(f"{path}:{path}:ro")

# Define all NVIDIA library and file patterns
Collaborator


Does this only support NVIDIA?

Collaborator Author


This actually supports Ant's VGPU; it is an internal feature.

@harryge00 harryge00 changed the title [RuntimeEnv] Support GPU containers; Uninstall ray before installing … [RuntimeEnv] Support GPU containers Jan 6, 2026
@harryge00 harryge00 changed the title [RuntimeEnv] Support GPU containers [RuntimeEnv] Support VGPU containers Jan 6, 2026
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Jan 21, 2026