deepspeed/runtime/sequence_parallel/ulysses_sp.py (6 additions, 2 deletions)
@@ -670,6 +670,8 @@ class SequenceTiledCompute(torch.autograd.Function):
     """
     A generic autograd function to perform a tiled compute.
 
+    Please note this module re-computes `forward` in the `backward`. So the `forward` occurs twice each iteration. And if you're using activation checkpointing it then occurs thrice.
+
     Please note that this implementation doesn't require DeepSpeed and can work without it. `compute_params` can remain `None` in such a case.
 
     For an easier to understand example see TiledMLP - which is the same as this autograd function but without the generalization code.
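The note being added describes the same recompute-in-backward trick used by activation checkpointing. As a rough illustration of the pattern only (a minimal sketch, not the DeepSpeed implementation; `TinyTiledCompute`, `fn`, and `num_shards` are hypothetical names, and `fn` is assumed to preserve the sequence length), a tiled autograd function can run `forward` tile by tile under `no_grad`, then re-run it per tile inside `backward` to produce gradients:

import torch


class TinyTiledCompute(torch.autograd.Function):
    """Minimal sketch: shard the sequence dim, recompute forward in backward."""

    @staticmethod
    def forward(ctx, fn, x, num_shards):
        ctx.fn = fn
        ctx.num_shards = num_shards
        ctx.save_for_backward(x)
        with torch.no_grad():
            # run the compute tile by tile; no intermediate activations are
            # kept alive for backward, only the final concatenated output
            shards = torch.chunk(x, num_shards, dim=1)
            return torch.cat([fn(s) for s in shards], dim=1)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        x_shards = torch.chunk(x, ctx.num_shards, dim=1)
        grad_shards = torch.chunk(grad_output, ctx.num_shards, dim=1)
        grads = []
        for xs, gs in zip(x_shards, grad_shards):
            xs = xs.detach().requires_grad_(True)
            with torch.enable_grad():
                out = ctx.fn(xs)  # `forward` recomputed here (second time per iteration)
            out.backward(gs)      # parameter grads accumulate into .grad as a side effect
            grads.append(xs.grad)
        # one return value per forward() input: fn, x, num_shards
        return None, torch.cat(grads, dim=1), None

Usage would look like `y = TinyTiledCompute.apply(mlp_block, hidden_states, 8)`, with `hidden_states` shaped `[batch, seq, hidden]` so that the sequence dimension is what gets sharded.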
@@ ... @@ (TiledMLP docstring)
-    Perform a tiled MLP computation to massively reduce memory usage needed to compute MLP when using very long sequence lengths
+    Perform a tiled MLP computation to massively reduce memory usage needed to compute MLP when using very long sequence lengths.
+
+    Please note this module re-computes `forward` in the `backward`. So the `forward` occurs twice each iteration. And if you're using activation checkpointing it then occurs thrice.
 
-    For a general tiled compute implementation that can handle any `forward` see `SequenceTiledCompute`
+    For a general tiled compute implementation that can handle any `forward` see `SequenceTiledCompute`.
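To make the "massively reduce memory usage" claim concrete, here is a rough back-of-envelope sketch. All sizes are made-up example values (not from the PR), assuming the MLP's intermediate activation is bf16 with shape [batch, seq, 4 * hidden]:

# Hypothetical example: a 512k-token sequence through an 8192-hidden MLP in bf16.
batch, seq, hidden, bytes_per_el = 1, 512_000, 8192, 2
full = batch * seq * 4 * hidden * bytes_per_el   # intermediate activation kept for backward
num_shards = 16
tiled = full // num_shards                       # only one tile's intermediate is live at a time
print(f"untiled: {full / 2**30:.1f} GiB, tiled x{num_shards}: {tiled / 2**30:.1f} GiB")

The trade-off, as the docstring being added notes, is that `forward` runs again during `backward`, so compute goes up while peak activation memory drops roughly by the shard count.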