**Figure 1**: `torch.compile` is a JIT compiler for PyTorch code. You can wrap functions, nn.Modules, and other callables in `torch.compile`.
There are multiple ways to use `torch.compile`. You can use it as a kernel generator (as in Figure 1), where we compile a single function. But you can also apply `torch.compile` to your full nn.Module model or to submodules of it. Depending on the structure of the model and your requirements (e.g. compile times), [we recommend applying `torch.compile` in different places](https://docs.pytorch.org/docs/stable/torch.compiler_troubleshooting.html#setting-expectations).
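As a rough sketch of both styles (the function and module below are made-up examples, not vLLM code):

```python
import torch
import torch.nn as nn

# Compile a standalone function: here torch.compile acts as a kernel generator.
@torch.compile
def fused_gelu_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x) * y

# Compile a full nn.Module (or just a submodule of a larger model).
class MLP(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

model = torch.compile(MLP())       # whole-module compilation
out = model(torch.randn(8, 256))   # first call triggers JIT compilation
out2 = fused_gelu_mul(torch.randn(8, 256), torch.randn(8, 256))
```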
**Figure 3**: `torch.compile` captures straight-line graphs of Tensor operations and works around unsupported operations like torch.save.
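To see this behavior concretely, here is a small sketch using `torch._dynamo.explain`, a PyTorch diagnostic whose exact output format varies across versions; the `step` function below is invented for illustration:

```python
import os
import tempfile

import torch

def step(x: torch.Tensor) -> torch.Tensor:
    y = torch.sin(x) + torch.cos(x)  # straight-line Tensor ops: captured in one graph
    # torch.save is not traceable, so Dynamo inserts a graph break around it.
    torch.save(y, os.path.join(tempfile.gettempdir(), "y.pt"))
    return y * 2                     # captured in a second graph after the break

# explain() reports how many graphs and graph breaks Dynamo produced for this input.
print(torch._dynamo.explain(step)(torch.randn(4)))
```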
### 2\. Backend (TorchInductor): Optimization and Kernel Generation
**Figure 4**: Compiled artifacts are cached after cold start and can be reused across machines to ensure fast, consistent startup when set up correctly.
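One way to persist and share these caches, sketched below under the assumption of a shared cache directory (the path is hypothetical), is to point TorchInductor's cache environment variables at durable storage before importing torch:

```python
import os

# Keep Inductor's on-disk caches on persistent/shared storage so compiled
# artifacts survive restarts and can be reused on identical machines
# (same GPU, same PyTorch/vLLM versions). Set these before importing torch.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/shared/inductor_cache"  # hypothetical path
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"                  # enable the FX graph cache

import torch  # noqa: E402  (imported after the cache env vars are set)
```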
### Dynamic Batch Sizes and Specialization
Use `compile_sizes: [1, 2, 4]` in your config to trigger this specialization.
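A hedged sketch of what that might look like with vLLM's Python API; the exact configuration surface (`compilation_config` and its fields) can differ between vLLM versions, and the model name here is just an example:

```python
from vllm import LLM

# Ask vLLM to compile specialized graphs for a handful of common batch sizes,
# in addition to the general dynamic-shape graph.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",          # example model
    compilation_config={"compile_sizes": [1, 2, 4]},   # sizes to specialize for
)
```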
**Figure 6**: Piecewise CUDA Graphs in vLLM capture and replay supported GPU kernel sequences for low-overhead execution, while skipping unsupported operations like cascade attention.
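The capture-and-replay idea itself can be illustrated with PyTorch's public CUDA Graph API; this is a conceptual sketch, not vLLM's actual piecewise implementation:

```python
import torch

if torch.cuda.is_available():
    static_x = torch.randn(8, 1024, device="cuda")

    def gpu_segment(x: torch.Tensor) -> torch.Tensor:
        # A "supported" straight-line sequence of GPU kernels.
        return torch.relu(x @ x.T) * 0.5

    # Warm up on a side stream, then capture the kernel sequence once.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            gpu_segment(static_x)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = gpu_segment(static_x)

    # Replay: copy new data into the static input buffer and relaunch the
    # whole captured kernel sequence with a single, low-overhead call.
    static_x.copy_(torch.randn(8, 1024, device="cuda"))
    g.replay()
    print(static_out.sum())
```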
## Custom Compiler Passes in vLLM
A common pattern in quantized MLPs is SiLU activation followed by a quantized down-projection.
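To make the fusion target concrete, here is a hedged sketch of the unfused pattern in plain PyTorch; `fp8_quantize` is a simplified stand-in for a real quant kernel, not vLLM's implementation:

```python
import torch

def fp8_quantize(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Simplified per-tensor FP8 quant: scale, clamp to the FP8 range, cast.
    # In the unfused regime this re-reads the full activation from global
    # memory, which is exactly the traffic a fused kernel eliminates.
    finfo = torch.finfo(torch.float8_e4m3fn)
    return (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

def mlp_act_then_quant(gate_up: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    gate, up = gate_up.chunk(2, dim=-1)
    act = torch.nn.functional.silu(gate) * up   # SiLU-and-mul activation
    return fp8_quantize(act, scale)             # quantized input for the down-projection
```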
**Figure 7**: On Llama 3.1 405B quantized to FP8, tested on 8x AMD MI300s, fused kernels (`fusion`, in yellow) outperformed both `default` (using torch ops for RMSNorm and SiLU and custom FP8 quant kernel) and `custom` (unfused custom kernels).
**Figure 8**: Detailed throughput speedup comparing `fusion` and `default` regimes above. If all quantization overhead (8%) was removed via fusion, the theoretical maximum improvement to throughput would be 8%, and we can see that improvement reached in some cases.