Unified placement group breaks non-colocated code path #1352

@youngeunkwon0405

Describe the bug

Using a unified placement group (vLLM model parallel size > 8) causes an error in the non-colocated code path.
The failure occurs during pynccl group initialization.
I think this code path has been buggy for a while, but it used to run without crashing. I think #1264 brought the bug to the surface.
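
For triage, the triggering condition can be written down as a small check. This is only a sketch, not code from the repository: the helper names are hypothetical, and the 8 GPUs per node figure is an assumption inferred from the "> 8" threshold above.

```python
import math

# Assumption: 8 GPUs per node, inferred from the "> 8" threshold in this issue.
GPUS_PER_NODE = 8


def needs_unified_placement(model_parallel_size: int,
                            gpus_per_node: int = GPUS_PER_NODE) -> bool:
    """Hypothetical helper: True when one model-parallel group cannot fit on a single node."""
    return model_parallel_size > gpus_per_node


def nodes_per_worker_group(model_parallel_size: int,
                           gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Hypothetical helper: number of nodes a single model-parallel group has to span."""
    return math.ceil(model_parallel_size / gpus_per_node)


if __name__ == "__main__":
    for size in (8, 16):
        print(size, needs_unified_placement(size), nodes_per_worker_group(size))
```

Under that assumption, TP=16 spans two nodes and therefore takes the unified placement path, while TP=8 does not.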

Steps/Code to reproduce bug

Any non-colocated run with TP=16.


Additional context

qwen30B_GTP16_failure.log


Metadata

Labels

bug
