Commit 5389715

Fix figure captions (html)
Signed-off-by: Luka Govedič <[email protected]>
1 parent c83c355 commit 5389715

1 file changed: +9 -9 lines changed

_posts/2025-08-20-torch-compile.md

Lines changed: 9 additions & 9 deletions
@@ -1,6 +1,6 @@
 ---
 layout: post
-title: "Introduction to `torch.compile` and How It Works with vLLM"
+title: "Introduction to <code>torch.compile</code> and How It Works with vLLM"
 author: "[Luka Govedič](https://github.com/proexpertprog) (Red Hat), [Richard Zou](https://github.com/zou3519) (Meta), Addie Stevens (Red Hat), [Kaichao You](https://github.com/youkaichao) (Tsinghua University), [Michael Goin](https://github.com/mgoin) (Red Hat), Saša Zelenović (Red Hat)"
 image: /assets/logos/vllm-logo-text-light.png
 ---
@@ -24,7 +24,7 @@ In the following example, `torch.compile` produces a single fused kernel for all
 <picture>
 <img src="/assets/figures/2025-torch-compile/figure1.png" width="80%">
 </picture><br>
-**Figure 1**: `torch.compile` is a JIT compiler for PyTorch code. You can wrap functions, nn.Modules, and other callables in `torch.compile`.
+<b>Figure 1</b>: <code>torch.compile</code> is a JIT compiler for PyTorch code. You can wrap functions, nn.Modules, and other callables in <code>torch.compile</code>.
 </p>

 There are multiple ways to use `torch.compile`. You can use it as a kernel generator (like in Figure 1), where we compile a function. But you can also apply `torch.compile` to your full nn.Module model or submodules of it. Depending on the structure of the model and your requirements (e.g. compile times), [we recommend applying `torch.compile` in different places](https://docs.pytorch.org/docs/stable/`torch.compile`r_troubleshooting.html#setting-expectations).
@@ -37,7 +37,7 @@ One way of optimizing models is to write custom CPU/CUDA operations that perform
 <picture>
 <img src="/assets/figures/2025-torch-compile/figure2.png" width="80%">
 </picture><br>
-**Figure 2**: `torch.compile` gives you fast baseline performance to save YOU development time from tuning model performance.
+<b>Figure 2</b>: <code>torch.compile</code> gives you fast baseline performance to save YOU development time from tuning model performance.
 </p>

 ## How `torch.compile` Works
@@ -54,7 +54,7 @@ In the following code example, torch.save is an unsupported operation: `torch.co
 <picture>
 <img src="/assets/figures/2025-torch-compile/figure3.png" width="80%">
 </picture><br>
-**Figure 3**: `torch.compile` captures straight-line graphs of Tensor operations and works around unsupported operations like torch.save.
+<b>Figure 3</b>: <code>torch.compile</code> captures straight-line graphs of Tensor operations and works around unsupported operations like torch.save.
 </p>

 ### 2\. Backend (TorchInductor): Optimization and Kernel Generation
@@ -82,7 +82,7 @@ The compiled artifacts and the cache can be reused across machines with the same
 <picture>
 <img src="/assets/figures/2025-torch-compile/figure4.png" width="80%">
 </picture><br>
-**Figure 4**: Compiled artifacts are cached after cold start and can be reused across machines to ensure fast, consistent startup when set up correctly.
+<b>Figure 4</b>: Compiled artifacts are cached after cold start and can be reused across machines to ensure fast, consistent startup when set up correctly.
 </p>

 ### Dynamic Batch Sizes and Specialization
@@ -96,7 +96,7 @@ Use `compile_sizes: [1, 2, 4]` in your config to trigger this specialization. Un
 <img src="/assets/figures/2025-torch-compile/figure5_a.png" width="80%">
 <img src="/assets/figures/2025-torch-compile/figure5_b.png" width="80%">
 </picture><br>
-**Figure 5**: How to specify specializing compilation on specific batch sizes.
+<b>Figure 5</b>: How to specify specializing compilation on specific batch sizes.
 </p>

 ### Piecewise CUDA Graphs
@@ -107,7 +107,7 @@ Not all operations are compatible with CUDA Graphs; for example, [cascade attent
 <picture>
 <img src="/assets/figures/2025-torch-compile/figure6.png" width="80%">
 </picture><br>
-**Figure 6**: Piecewise CUDA Graphs in vLLM capture and replay supported GPU kernel sequences for low-overhead execution, while skipping unsupported operations like cascade attention.
+<b>Figure 6</b>: Piecewise CUDA Graphs in vLLM capture and replay supported GPU kernel sequences for low-overhead execution, while skipping unsupported operations like cascade attention.
 </p>

 ## Custom Compiler Passes in vLLM
@@ -131,14 +131,14 @@ A common pattern in quantized MLPs is SiLU activation followed by a quantized do
 <picture>
 <img src="/assets/figures/2025-torch-compile/figure7.png" width="80%">
 </picture><br>
-**Figure 7**: On Llama 3.1 405B quantized to FP8, tested on 8x AMD MI300s, fused kernels (`fusion`, in yellow) outperformed both `default` (using torch ops for RMSNorm and SiLU and custom FP8 quant kernel) and `custom` (unfused custom kernels).
+<b>Figure 7</b>: On Llama 3.1 405B quantized to FP8, tested on 8x AMD MI300s, fused kernels (<code>fusion</code>, in yellow) outperformed both <code>default</code> (using torch ops for RMSNorm and SiLU and custom FP8 quant kernel) and <code>custom</code> (unfused custom kernels).
 </p>

 <p align="center">
 <picture>
 <img src="/assets/figures/2025-torch-compile/figure8.png" width="80%">
 </picture><br>
-Detailed throughput speedup comparing `fusion` and `default` regimes above. If all quantization overhead (8%) was removed via fusion, the theoretical maximum improvement to throughput would be 8%, and we can see that improvement reached in some cases.
+<b>Figure 8</b>: Detailed throughput speedup comparing <code>fusion</code> and <code>default</code> regimes above. If all quantization overhead (8%) was removed via fusion, the theoretical maximum improvement to throughput would be 8%, and we can see that improvement reached in some cases.
 </p>

 > [!NOTE]
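
The Figure 1 caption edited above notes that `torch.compile` can wrap functions, nn.Modules, and other callables. A minimal, hedged sketch of both usages is shown below; the function and module are illustrative placeholders, not code from the post or the commit.

import torch
import torch.nn as nn

@torch.compile  # JIT-compiles this function on its first call
def gelu_and_scale(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Pointwise ops like these are candidates for fusion into one kernel.
    return torch.nn.functional.gelu(x) * scale

class TinyMLP(nn.Module):  # placeholder module for illustration
    def __init__(self, dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

compiled_model = torch.compile(TinyMLP())  # whole-module compilation

x = torch.randn(8, 64)
print(gelu_and_scale(x, 2.0).shape)  # triggers compilation, then runs
print(compiled_model(x).shape)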
