Introduce all_reduce_hook to support gradient aggregation across replica groups. #7764
Using replica groups offers the following advantages:
- For stage 3, it ensures that parameter gathering during the forward and backward passes occurs only within the replica group.
- Checkpointing is performed only on `replica_group_rank=0`, guaranteeing a constant checkpoint world size and avoiding the universal checkpoint transformations when scaling up or down (a gating sketch appears at the end of this description).

We could perform the gradient all-reduce within the replica group after backward and before `optimizer.step()`, but we would have to wait for all buckets to complete and thus could not leverage concurrency; a per-bucket hook avoids this, as sketched below.
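To make the concurrency point concrete, here is a minimal sketch of per-bucket gradient aggregation across replica groups, written against PyTorch's DDP communication-hook API rather than DeepSpeed's engine; the group layout, the `build_groups` helper, and the two-stage hierarchical all-reduce are illustrative assumptions, not this PR's implementation:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def build_groups(replica_group_size: int):
    """Hypothetical helper: partition ranks into replica groups (sharding)
    and cross-replica groups (ranks holding the same shard)."""
    world_size, rank = dist.get_world_size(), dist.get_rank()
    assert world_size % replica_group_size == 0
    replica_group = cross_replica_group = None

    # Every rank must create every group, in the same order.
    for start in range(0, world_size, replica_group_size):
        ranks = list(range(start, start + replica_group_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            replica_group = group
    for offset in range(replica_group_size):
        ranks = list(range(offset, world_size, replica_group_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            cross_replica_group = group
    return replica_group, cross_replica_group


def hierarchical_allreduce_hook(groups, bucket):
    """Fires as each gradient bucket becomes ready, so communication for
    early buckets overlaps with backward computation of later ones."""
    replica_group, cross_replica_group = groups
    tensor = bucket.buffer()
    tensor.div_(dist.get_world_size())  # two summing stages -> global average

    # Stage 1: reduce within the replica group.
    fut = dist.all_reduce(tensor, group=replica_group, async_op=True).get_future()

    def cross_replica(fut):
        reduced = fut.value()[0]
        # Stage 2: the aggregation across replica groups that this PR targets.
        return (
            dist.all_reduce(reduced, group=cross_replica_group, async_op=True)
            .get_future()
            .wait()[0]
        )

    return fut.then(cross_replica)


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    replica_group, cross_replica_group = build_groups(replica_group_size=2)
    model = torch.nn.Linear(1024, 1024).cuda()
    ddp_model = DDP(model, process_group=replica_group)
    # The hook replaces DDP's default all-reduce for every bucket.
    ddp_model.register_comm_hook(
        state=(replica_group, cross_replica_group),
        hook=hierarchical_allreduce_hook,
    )
```

The blocking alternative described above would instead issue one collective over the full, flattened gradient after `backward()` returns, serializing all communication behind the last bucket.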
I know MICS has similar functionality, but it currently supports only ZeRO stage 3. Additionally, I want to use this feature for compatibility with architectures like TorchFT.
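As a footnote to the checkpointing advantage above, the gating amounts to saving only from one replica. A minimal sketch, reusing the hypothetical groups from `build_groups` and assuming `replica_group_rank` indexes which replica a rank belongs to; the path template and state are placeholders:

```python
import torch
import torch.distributed as dist


def save_checkpoint(state_dict, replica_group, cross_replica_group,
                    path_template="shard_{:03d}.pt"):
    # Only the first replica (replica_group_rank == 0) writes; each of its
    # ranks saves its own shard. The checkpoint is therefore always
    # replica_group_size files no matter how many replicas are running,
    # so no universal-checkpoint transformation is needed when scaling.
    if dist.get_rank(group=cross_replica_group) == 0:
        shard_rank = dist.get_rank(group=replica_group)
        torch.save(state_dict, path_template.format(shard_rank))
    dist.barrier()
```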