Commit 9bcde1b

Gavin Zhang authored and facebook-github-bot committed
fixed issue with sharded_grads with multiple process groups (pytorch#3268)
Summary:
Pull Request resolved: pytorch#3268

Added a hotfix to an issue in clipping where sharded_grads was not appropriately initialized in the case with multiple process groups.

Reviewed By: tsunghsienlee

Differential Revision: D79853515

fbshipit-source-id: 5ad1ba34898b541cac9b01a746c29daafb1ed44f
1 parent 9a91baf commit 9bcde1b

File tree

1 file changed: 3 additions, 1 deletion

torchrec/optim/clipping.py

Lines changed: 3 additions & 1 deletion
@@ -149,7 +149,9 @@ def clip_grad_norm_(self) -> Optional[Union[float, torch.Tensor]]:
         sharded_grads = {
             pgs: _get_grads(dist_params) for pgs, dist_params in sharded_params.items()
         }
-        all_grads.extend(*sharded_grads.values())
+
+        for grads in sharded_grads.values():
+            all_grads.extend(grads)
 
         # Process replicated parameters and gradients
         replicate_grads = _get_grads(replicate_params)
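Why the one-line change matters: list.extend() accepts exactly one iterable. With a single process group, *sharded_grads.values() unpacks to a single gradient list and the old call happens to work; with multiple process groups it unpacks to several lists and raises a TypeError. The following is a minimal, self-contained sketch (not TorchRec code; the process-group keys and string "gradients" are placeholder values) showing the failure and the fixed pattern:

# Hypothetical per-process-group gradient lists, standing in for the
# output of _get_grads() in torchrec/optim/clipping.py.
sharded_grads = {
    "pg_0": ["grad_a", "grad_b"],
    "pg_1": ["grad_c"],
}

all_grads = []

try:
    # Old pattern: star-unpacking passes two lists to extend(),
    # which only takes one iterable, so this raises a TypeError
    # whenever there is more than one process group.
    all_grads.extend(*sharded_grads.values())
except TypeError as e:
    print(f"old pattern fails: {e}")

# Fixed pattern from this commit: extend once per process group.
for grads in sharded_grads.values():
    all_grads.extend(grads)

print(all_grads)  # ['grad_a', 'grad_b', 'grad_c']

Looping over the per-group gradient lists produces the same flattened all_grads as the single-group case while supporting any number of process groups.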
