Describe the bug
Unified placement (vLLM model parallel size > 8) causes an error in the non-colocated code path.
It fails at pynccl group initialization.
I suspect this code path has been buggy for a while but happened to work without crashing; #1264 appears to have brought the bug to the surface.
Steps/Code to reproduce bug
Any non-colocated run with TP=16.
Expected behavior
A non-colocated run with TP=16 should complete pynccl group initialization and run without crashing.
Additional context
qwen30B_GTP16_failure.log