#################################
Speed up models by compiling them
#################################

Compiling your PyTorch model can result in significant speedups, especially on the latest generations of GPUs.
This guide shows you how to apply ``torch.compile`` correctly in your code.

.. note::

    This requires PyTorch >= 2.0.


----


*********************************
Apply torch.compile to your model
*********************************
Compiling a model in a script together with Fabric is as simple as adding one line of code, calling :func:`torch.compile`:

.. code-block:: python

    import torch
    import lightning as L

    # Set up Fabric
    fabric = L.Fabric(devices=1)

    # Define the model
    model = ...

    # Compile the model
    model = torch.compile(model)

    # `fabric.setup()` should come after `torch.compile()`
    model = fabric.setup(model)


.. important::

    You should compile the model **before** calling ``fabric.setup()`` as shown above for an optimal integration with features in Fabric.

The newly added call to ``torch.compile()`` by itself doesn't do much. It just wraps the model in a "compiled model".
The actual optimization will start when calling ``forward()`` on the model for the first time:

.. code-block:: python

    # 1st execution compiles the model (slow)
    output = model(input)

    # All future executions will be fast (for inputs of the same size)
    output = model(input)
    output = model(input)
    ...

This is important to know when you measure the speed of a compiled model and compare it to a regular model.
You should always *exclude* the first call to ``forward()`` from your measurements, since it includes the compilation time.

.. collapse:: Full example with benchmark

    Below is an example that measures the speedup you get when compiling the InceptionV3 model from TorchVision.

    .. code-block:: python

        import statistics
        import torch
        import torchvision.models as models
        import lightning as L


        @torch.no_grad()
        def benchmark(model, input, num_iters=10):
            """Runs the model on the input several times and returns the median execution time."""
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            times = []
            for _ in range(num_iters):
                start.record()
                model(input)
                end.record()
                torch.cuda.synchronize()
                times.append(start.elapsed_time(end) / 1000)
            return statistics.median(times)


        fabric = L.Fabric(accelerator="cuda", devices=1)

        model = models.inception_v3()
        input = torch.randn(16, 3, 512, 512, device=fabric.device)

        # Compile!
        compiled_model = torch.compile(model)

        # Set up the model with Fabric
        model = fabric.setup(model)
        compiled_model = fabric.setup(compiled_model)

        # Warm up the compiled model before we benchmark
        compiled_model(input)

        # Run multiple forward passes and time them
        eager_time = benchmark(model, input)
        compile_time = benchmark(compiled_model, input)

        # Compare the speedup for the compiled execution
        speedup = eager_time / compile_time
        print(f"Eager median time: {eager_time:.4f} seconds")
        print(f"Compile median time: {compile_time:.4f} seconds")
        print(f"Speedup: {speedup:.1f}x")

    On an NVIDIA A100 SXM4 40GB with PyTorch 2.2.0 and CUDA 12.1, we get the following speedup:

    .. code-block:: text

        Eager median time: 0.0254 seconds
        Compile median time: 0.0185 seconds
        Speedup: 1.4x


----

******************
Avoid graph breaks
******************

When ``torch.compile`` looks at the code in your model's ``forward()`` method, it will try to compile as much of the code as possible.
If there are regions in the code that it doesn't understand, it will introduce a so-called "graph break" that essentially splits the code into optimized and unoptimized parts.
Graph breaks aren't a deal breaker, since the optimized parts should still run faster.
But if you want to get the most out of ``torch.compile``, you might want to invest time in rewriting the problematic sections of code that produce the breaks.

You can check whether your model produces graph breaks by calling ``torch.compile`` with ``fullgraph=True``:

.. code-block:: python

    # Force an error if there is a graph break in the model
    model = torch.compile(model, fullgraph=True)

Be aware that the error messages produced here are often quite cryptic, so you will likely have to do some `troubleshooting <https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html>`_ to fully optimize your model.
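
As an illustration, data-dependent control flow in Python is a common cause of graph breaks.
The hypothetical toy functions below sketch the kind of rewrite this involves: branching on a tensor value splits the graph, while an equivalent ``torch.where`` keeps the whole computation in one graph.

.. code-block:: python

    import torch


    def forward_with_break(x):
        # Branching on a tensor value is data-dependent and splits the graph
        if x.sum() > 0:
            return x * 2
        return x - 1


    def forward_no_break(x):
        # torch.where expresses the same branch as a single tensor op
        return torch.where(x.sum() > 0, x * 2, x - 1)


    # Compiles cleanly; compiling `forward_with_break` with fullgraph=True would error
    compiled = torch.compile(forward_no_break, fullgraph=True)
    output = compiled(torch.randn(4))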


----


*******************
Avoid recompilation
*******************

As mentioned before, the compilation of the model happens the first time you call ``forward()``.
At this point, PyTorch will inspect the input tensor(s) and optimize the compiled code for the particular shape, data type, and other properties the input has.
If the shape of the input remains the same across all calls to ``forward()``, PyTorch will reuse the compiled code it generated and you will get the best speedup.
However, if these properties change across subsequent calls to ``forward()``, PyTorch will be forced to recompile the model for the new shapes, which will significantly slow down your training if it happens on every iteration.
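
You can ask PyTorch to tell you when (and why) it recompiles, which helps diagnose these slowdowns.
A minimal sketch, assuming a recent PyTorch version (the available logging options vary between releases):

.. code-block:: python

    import torch

    # Log a message with the trigger reason every time the model recompiles
    torch._logging.set_logs(recompiles=True)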

**When your training suddenly becomes slow, it's probably because PyTorch is recompiling the model!**
Here are some common scenarios when this can happen:

- Your Trainer code switches from training to validation/testing and the input shape changes, triggering a recompilation.
- Your dataset size is not divisible by the batch size, and the dataloader has ``drop_last=False`` (the default).
  The last batch in your training loop will be smaller and trigger a recompilation (see the sketch after this list).
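
For the last-batch scenario, a minimal sketch of the fix (assuming a generic ``dataset`` object) is to drop the incomplete batch so every iteration sees the same input shape:

.. code-block:: python

    from torch.utils.data import DataLoader

    # Dropping the last, smaller batch keeps the batch dimension constant,
    # so the compiled model never has to recompile for it
    train_loader = DataLoader(dataset, batch_size=16, drop_last=True)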

Ideally, you should try to make the input shape(s) to ``forward()`` static.
However, when this is not possible, you can ask PyTorch to compile the code in a way that takes possible changes to the input shapes into account.

.. code-block:: python

    # On PyTorch < 2.2
    model = torch.compile(model, dynamic=True)

A model compiled with ``dynamic=True`` will typically be slower than a model compiled with static shapes, but it avoids the extreme cost of recompiling on every iteration.
On PyTorch 2.2 and later, ``torch.compile`` detects dynamism automatically and you should no longer need to set this.
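
If you know in advance which dimension will vary, you can also mark it as dynamic explicitly rather than compiling the whole model dynamically.
Below is a minimal sketch using ``torch._dynamo.mark_dynamic`` (a lower-level API whose behavior may vary across PyTorch versions) to mark the batch dimension:

.. code-block:: python

    import torch

    compiled_model = torch.compile(model)

    input = torch.randn(16, 3, 512, 512)
    # Tell the compiler that dim 0 (the batch size) may change between calls,
    # so the generated code should not specialize on the value 16
    torch._dynamo.mark_dynamic(input, 0)
    output = compiled_model(input)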

.. collapse:: Example with dynamic shapes

    The code below shows an example where the model recompiles for several seconds because the input shape changed.
    You can compare the timing results by toggling ``torch._dynamo.config.automatic_dynamic_shapes`` between ``True`` and ``False``:

    .. code-block:: python

        import time
        import torch
        import torchvision.models as models
        import lightning as L

        fabric = L.Fabric(accelerator="cuda", devices=1)

        model = models.inception_v3()

        # Disable automatic dynamic-shape detection (it is enabled by default)
        torch._dynamo.config.automatic_dynamic_shapes = False

        compiled_model = torch.compile(model)
        compiled_model = fabric.setup(compiled_model)

        input = torch.randn(16, 3, 512, 512, device=fabric.device)
        t0 = time.time()
        compiled_model(input)
        torch.cuda.synchronize()
        print(f"1st forward: {time.time() - t0:.2f} seconds.")

        input = torch.randn(8, 3, 512, 512, device=fabric.device)  # note the change in shape
        t0 = time.time()
        compiled_model(input)
        torch.cuda.synchronize()
        print(f"2nd forward: {time.time() - t0:.2f} seconds.")

    With ``automatic_dynamic_shapes=True``:

    .. code-block:: text

        1st forward: 41.90 seconds.
        2nd forward: 89.27 seconds.

    With ``automatic_dynamic_shapes=False``:

    .. code-block:: text

        1st forward: 42.12 seconds.
        2nd forward: 47.77 seconds.

    Numbers produced with NVIDIA A100 SXM4 40GB, PyTorch 2.2.0, CUDA 12.1.

----


***********************************
Experiment with compilation options
***********************************

There are optional settings that, depending on your model, can give additional speedups.

**CUDA Graphs:** By enabling CUDA Graphs, CUDA will record all computations in a graph and replay it every time forward and backward is called.
The requirement is that your model must be static, i.e., the input shape must not change and your model must execute the same operations every time.
Enabling CUDA Graphs often results in a significant speedup, but sometimes also increases the memory usage of your model.

.. code-block:: python

    # Enable CUDA Graphs
    compiled_model = torch.compile(model, mode="reduce-overhead")

    # This does the same
    compiled_model = torch.compile(model, options={"triton.cudagraphs": True})

|

**Shape padding:** The specific shape/size of the tensors involved in the computation of your model (input, activations, weights, gradients, etc.) can have an impact on performance.
With shape padding enabled, ``torch.compile`` can extend the tensors by padding them to a size that gives better memory alignment.
Naturally, the trade-off here is that it will consume a bit more memory.

.. code-block:: python

    # Default is False
    compiled_model = torch.compile(model, options={"shape_padding": True})
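
|

**Kernel autotuning:** Another predefined mode is ``mode="max-autotune"``, which spends extra compile time benchmarking kernel configurations and, on GPUs, also enables CUDA Graphs.
Whether it pays off depends on your model, so measure it the same way as in the benchmark example above.

.. code-block:: python

    # Trade a longer compilation phase for potentially faster kernels
    compiled_model = torch.compile(model, mode="max-autotune")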

You can find a full list of compile options in the `PyTorch documentation <https://pytorch.org/docs/stable/generated/torch.compile.html>`_.

----


*******************************************************
(Experimental) Apply torch.compile over FSDP, DDP, etc.
*******************************************************

As stated earlier, we recommend that you compile the model before calling ``fabric.setup()``.
However, if you are using DDP or FSDP with Fabric, the compilation won't incorporate the distributed calls inside these wrappers by default.
As an experimental feature, you can let ``fabric.setup()`` reapply the ``torch.compile`` call after the model gets wrapped in DDP/FSDP internally.
In the future, this option will become the default.

.. code-block:: python

    # Choose a distributed strategy like DDP or FSDP
    fabric = L.Fabric(devices=2, strategy="ddp")

    # Compile the model
    model = torch.compile(model)

    # Default: `fabric.setup()` will not reapply the compilation over DDP/FSDP
    model = fabric.setup(model, _reapply_compile=False)

    # Recompile the model over DDP/FSDP (experimental)
    model = fabric.setup(model, _reapply_compile=True)


----


**************************************
A note about torch.compile in practice
**************************************

In practice, you will find that ``torch.compile`` often doesn't work well and can even be counter-productive.
Compilation may fail with cryptic error messages that are impossible to debug without help from the PyTorch team.
It is also not uncommon that ``torch.compile`` will produce a significantly *slower* model or one with much higher memory usage.
On top of that, the compilation phase itself can be incredibly slow, taking several minutes to finish.
For these reasons, we recommend that you don't waste too much time trying to apply ``torch.compile`` during development, and rather evaluate its effectiveness toward the end when you are about to launch long-running, expensive experiments.
Always compare the speed and memory usage of the compiled model against the original model!
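
A minimal sketch for the memory comparison, reusing the ``compiled_model`` and ``input`` from the benchmark example above (CUDA only):

.. code-block:: python

    import torch

    torch.cuda.reset_peak_memory_stats()
    compiled_model(input)
    torch.cuda.synchronize()
    # Peak GPU memory used during the forward pass, in MiB
    print(f"Peak memory: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")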

|