
worker_groups.py removes RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, breaking NCCL on H200/NVSwitch #1963

@dmvevents

Description


Summary

nemo_rl/distributed/worker_groups.py (line ~347) unconditionally removes RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES from worker environment variables:

# Remove Ray-specific environment variables, let the worker itself set them.
worker_env_vars.pop("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES", None)

This forces Ray to apply per-actor CUDA_VISIBLE_DEVICES masking (e.g., CUDA_VISIBLE_DEVICES=3 for the fourth GPU), which triggers three confirmed NCCL bugs on H200/NVSwitch (P5en) hardware. As a result, multi-node GRPO training is completely broken on H200.
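The core of the problem is the index remapping that masking introduces. A minimal sketch (the helper name is illustrative, not from the codebase) of how the visible device list diverges from the physical topology:

```python
def visible_devices(cvd_value, num_physical=8):
    """Return the physical GPU ids a process can see, mimicking CUDA's
    CUDA_VISIBLE_DEVICES remapping: visible index i maps to the i-th
    entry of the mask, or to physical id i when no mask is set."""
    if cvd_value is None:
        return list(range(num_physical))
    return [int(x) for x in cvd_value.split(",")]

# Ray's default per-actor masking: the worker assigned physical GPU 3
# sees only that GPU, and always as local device 0, so NCCL's device
# indexing no longer matches the NVSwitch topology.
assert visible_devices("3") == [3]

# With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 preserved, every
# worker sees all 8 GPUs and binds its own via
# torch.cuda.set_device(local_rank), keeping indices aligned.
assert visible_devices(None) == [0, 1, 2, 3, 4, 5, 6, 7]
```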

NCCL Bugs Triggered by GPU Masking

When Ray masks CUDA_VISIBLE_DEVICES to a single GPU per actor, NCCL's internal device indexing diverges from the physical GPU topology. This triggers:

  1. cuMem import penalty (NVIDIA/nccl#1749) -- Confirmed by NVIDIA engineer. p2pMap() in src/transport/p2p.cc iterates over all devices when importing cuMem handles with non-overlapping CUDA_VISIBLE_DEVICES. Causes 3,660ms first-operation penalty (vs 1.5ms with torchrun).

  2. NVLS rank ordering corruption (NVIDIA/nccl#1906) -- Confirmed by NVIDIA engineer. src/transport/nvls.cc allgather is missing a user rank table when GPU indices are permuted by CUDA_VISIBLE_DEVICES. Causes hang or silent data corruption on NVSwitch systems. Only affects NVSwitch (H200 P5en), not NVLink-only (H100 P5).

  3. Multi-channel P2P hang at >8M elements -- Even with NCCL_CUMEM_ENABLE=0 and NCCL_NVLS_ENABLE=0, AllReduce hangs for tensors larger than ~32MB (8-12M float32 elements). This appears to be a separate multi-channel issue triggered by the same GPU masking.
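A quick check of the size threshold in item 3 (8M float32 elements at 4 bytes each lands at the reported ~32 MB):

```python
# AllReduce hangs above roughly 8M float32 elements; float32 is 4 bytes.
FLOAT32_BYTES = 4
threshold_elements = 8 * 1024 * 1024
threshold_mb = threshold_elements * FLOAT32_BYTES / (1024 ** 2)
assert threshold_mb == 32.0  # matches the ~32MB figure above
```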

Benchmarks: Same Hardware, Same NCCL, Different Results

| Method | AllReduce 4KB | AllReduce 933MB | Notes |
| --- | --- | --- | --- |
| torchrun (no GPU masking) | 1.5ms | 1.5ms | All sizes work perfectly |
| Ray (GPU masking forced) | 3,660ms | HANGS forever | 2400x slower, then hangs |

Both tests run on the same P5en.48xlarge nodes (8x H200, NVSwitch), same NCCL 2.27.5, same EFA networking, same container image. The ONLY difference is whether CUDA_VISIBLE_DEVICES is masked per-process.

On P5.48xlarge (H100, NVLink PXN, no NVSwitch), Ray with GPU masking works fine -- confirming this is specific to NVSwitch topology.
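As a sanity check on the slowdown figure quoted in the table:

```python
ray_first_op_ms = 3660.0      # Ray, per-actor GPU masking forced
torchrun_first_op_ms = 1.5    # torchrun, no masking
slowdown = ray_first_op_ms / torchrun_first_op_ms
assert slowdown == 2440.0     # i.e. the ~2400x slowdown reported above
```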

Proposed Fix

Make RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES configurable instead of unconditionally removing it. For example:

# Allow users to opt out of Ray GPU masking (needed for H200/NVSwitch)
if not os.environ.get("NEMO_RL_KEEP_RAY_NOSET_CVD"):
    worker_env_vars.pop("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES", None)

Or expose it as a configuration parameter in the worker group config. When RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 is preserved, each worker sees all 8 GPUs and must call torch.cuda.set_device(local_rank) explicitly -- which NeMo RL workers already do.
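A sketch of how a caller would use the proposed opt-out, assuming the hypothetical NEMO_RL_KEEP_RAY_NOSET_CVD variable from the example above is adopted (names are from this proposal, not the current codebase):

```python
import os

# Opt out of Ray GPU masking (needed on H200/NVSwitch) and preserve
# Ray's no-masking flag so every worker sees all 8 GPUs.
os.environ["NEMO_RL_KEEP_RAY_NOSET_CVD"] = "1"
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"

# Mirror of the proposed guard in worker_groups.py:
worker_env_vars = dict(os.environ)
if not os.environ.get("NEMO_RL_KEEP_RAY_NOSET_CVD"):
    worker_env_vars.pop("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES", None)

# The flag survives, so each worker selects its own GPU with
# torch.cuda.set_device(local_rank) -- which NeMo RL workers already do.
assert "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES" in worker_env_vars
```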

Current Workaround

The only workaround is to patch the installed worker_groups.py in the container's venv to comment out the .pop() line, which is fragile and breaks on upgrades.

Environment

  • Hardware: 4x P5en.48xlarge (8x H200 per node, NVSwitch)
  • NCCL: 2.27.5 with aws-ofi-nccl 1.18.0
  • NeMo RL: Latest main branch
  • Ray: 2.44.1
  • Network: EFA with Libfabric 2.4

Related Issues

  • NVIDIA/nccl#1749 -- cuMem import penalty with non-overlapping CUDA_VISIBLE_DEVICES
  • NVIDIA/nccl#1906 -- NVLS rank ordering corruption on NVSwitch systems
