
Commit 64e54d5

wconstab authored and pytorchmergebot committed
[Pipelining] Relax scale_grads assert (pytorch#145010)
The assert felt morally valid: if no gradients are scaled, then something is definitely wrong with the setup. In one instance, PP + optimizer-in-backward (in torchtitan) resulted in grad=None after running .backward() and before scaling grads.

On the other hand, the existing assert is too restrictive. It's possible that a model used with pipelining has some parameters that do not receive gradients, and we shouldn't hard-error in these cases (e.g. if a parameter is literally not used, or is frozen). In the extreme case, the whole stage could be frozen, so we do not complain even if no grads are scaled.

Pull Request resolved: pytorch#145010
Approved by: https://github.com/mori360, https://github.com/tianyu-l
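A minimal standalone sketch (not from this PR; stage_module is an illustrative name) of how a frozen parameter ends up with grad=None after .backward(), which is exactly the case the old assert rejected:

import torch
import torch.nn as nn

# Toy "stage" whose second layer is frozen.
stage_module = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
for p in stage_module[1].parameters():
    p.requires_grad_(False)

stage_module(torch.randn(2, 4)).sum().backward()

# Frozen parameters never receive a gradient, so the old
# `assert p.grad is not None` would have fired on them.
for name, p in stage_module.named_parameters():
    print(name, p.grad is None)  # 0.* params: False, 1.* params: True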
1 parent 07e2365 commit 64e54d5

File tree

1 file changed: +3 −3 lines changed


torch/distributed/pipelining/stage.py

Lines changed: 3 additions & 3 deletions

@@ -586,9 +586,9 @@ def scale_grads(self, grad_scale_factor: int) -> None:
         # PP scales only for its own contribution (microbatches), but relies on DP to scale further
         # for DP degree.
         if grad_scale_factor != 1:
-            for name, p in self.submod.named_parameters():
-                assert p.grad is not None, name
-                p.grad.div_(grad_scale_factor)
+            for p in self.submod.parameters():
+                if p.grad is not None:
+                    p.grad.div_(grad_scale_factor)

     def backward_maybe_with_nosync(
         self,
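For intuition about why scale_grads divides at all: each microbatch's backward accumulates into p.grad, so the stage averages by its own microbatch count afterward (DP scaling happens elsewhere, per the comment in the diff). A hedged standalone sketch (stage_module and n_microbatches are illustrative names, not the pipelining API) mirroring the relaxed loop:

import torch
import torch.nn as nn

stage_module = nn.Linear(4, 1)
n_microbatches = 4

# Gradients from each microbatch's backward accumulate into p.grad.
for _ in range(n_microbatches):
    stage_module(torch.randn(2, 4)).sum().backward()

# Same shape as the new scale_grads body: scale only grads that exist.
grad_scale_factor = n_microbatches
if grad_scale_factor != 1:
    for p in stage_module.parameters():
        if p.grad is not None:
            p.grad.div_(grad_scale_factor)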
