[Scheduler][colocated batch gen] Is all_reduce necessary in PollBasedBarrier when non-leader GPUs are always noop? #17841
Closed
jiangyukunok started this conversation in General
Replies: 1 comment 1 reply
-
Looking further into the code, I think the SGLANG_ENABLE_COLOCATED_BATCH_GEN mode may not work when dp size > 1. When there are multiple scheduler processes with attn_tp_rank == 0, we can end up holding messages in the pending queue forever, or having messages processed immediately before the unblock signal is received.
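To illustrate the concern (a hypothetical simulation, not actual SGLang code; names like `group_barrier` are invented): with dp size > 1 there are multiple attn_tp_rank == 0 leaders, and if each one's barrier decision depends only on its own local state, the groups can unblock at different times, so one group processes messages while another still holds them pending.

```python
# Hypothetical illustration of the dp_size > 1 concern.
# Each dp group's barrier outcome is assumed to depend only on its own
# leader's local_arrived flag, so groups can disagree.

def group_barrier(leader_arrived: bool) -> bool:
    """Simulated per-group barrier: the group unblocks iff its leader arrived."""
    return leader_arrived

# Two dp groups whose leaders are in different states.
dp_leaders_arrived = [True, False]
unblocked = [group_barrier(a) for a in dp_leaders_arrived]
# Group 0 would process messages immediately while group 1 keeps holding them.
print(unblocked)  # [True, False]
```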
-
Based on https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/scheduler.py#L817
My understanding is that non-leader (attn_tp_rank != 0) GPU processes always have noop=True, so they constantly contribute True to the MIN all-reduce operation: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/utils/poll_based_barrier.py#L26. This means the all_reduce result is effectively determined solely by the leader GPU's local_arrived value.
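A minimal sketch of why that follows (simulated in plain Python rather than torch.distributed; `min_all_reduce` and `barrier_round` are illustrative names, not SGLang APIs): with booleans reduced as integers under MIN, a fixed True (1) from every non-leader rank can never lower the minimum, so the result equals the leader's flag.

```python
def min_all_reduce(local_values):
    """Simulate an all-reduce with op=MIN: every rank receives the minimum."""
    result = min(local_values)
    return [result] * len(local_values)

def barrier_round(leader_arrived: bool, num_non_leader_ranks: int) -> bool:
    # The leader contributes its real local_arrived flag; non-leader ranks
    # are assumed to always contribute True (i.e. noop=True).
    contributions = [int(leader_arrived)] + [1] * num_non_leader_ranks
    reduced = min_all_reduce(contributions)
    return bool(reduced[0])

# Non-leaders fixed at True cannot change the MIN, so the result
# always equals the leader's flag:
assert barrier_round(leader_arrived=False, num_non_leader_ranks=3) is False
assert barrier_round(leader_arrived=True, num_non_leader_ranks=3) is True
```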
Would it be possible to simplify this by having only the leader GPU track the blocked/unblocked state and removing the synchronization for non-leader GPUs? The broadcast that follows provides synchronization anyway.
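The proposed simplification could be sketched like this (again a pure-Python simulation under the assumptions above; `broadcast` and `simplified_barrier` are hypothetical names): the leader decides from its local flag alone, and the existing broadcast is what actually synchronizes the non-leader ranks.

```python
def broadcast(value, num_ranks):
    """Simulate a broadcast from the leader: every rank receives its value."""
    return [value] * num_ranks

def simplified_barrier(leader_arrived: bool, num_ranks: int):
    # Only the leader tracks the blocked/unblocked state; no all-reduce.
    decision = leader_arrived
    # The subsequent broadcast delivers the decision to all ranks, which is
    # where the non-leader ranks actually synchronize.
    return broadcast(decision, num_ranks)

assert simplified_barrier(True, 4) == [True, True, True, True]
assert simplified_barrier(False, 4) == [False, False, False, False]
```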