Commit c83c355

Fix figure link
Signed-off-by: Luka Govedič <[email protected]>
1 parent: 9c2de3e

1 file changed: +9 -9 lines

_posts/2025-08-20-torch-compile.md

Lines changed: 9 additions & 9 deletions
@@ -22,7 +22,7 @@ In the following example, `torch.compile` produces a single fused kernel for all
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure1.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure1.png" width="80%">
 </picture><br>
 **Figure 1**: `torch.compile` is a JIT compiler for PyTorch code. You can wrap functions, nn.Modules, and other callables in `torch.compile`.
 </p>
@@ -35,7 +35,7 @@ One way of optimizing models is to write custom CPU/CUDA operations that perform
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure2.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure2.png" width="80%">
 </picture><br>
 **Figure 2**: `torch.compile` gives you fast baseline performance to save YOU development time from tuning model performance.
 </p>
@@ -52,7 +52,7 @@ In the following code example, torch.save is an unsupported operation: `torch.co
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure3.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure3.png" width="80%">
 </picture><br>
 **Figure 3**: `torch.compile` captures straight-line graphs of Tensor operations and works around unsupported operations like torch.save.
 </p>
@@ -80,7 +80,7 @@ The compiled artifacts and the cache can be reused across machines with the same
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure4.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure4.png" width="80%">
 </picture><br>
 **Figure 4**: Compiled artifacts are cached after cold start and can be reused across machines to ensure fast, consistent startup when set up correctly.
 </p>
@@ -93,8 +93,8 @@ Use `compile_sizes: [1, 2, 4]` in your config to trigger this specialization. Un
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure5_a.png" width="80%">
-<img src="/assets/figures/torch-compile/figure5_b.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure5_a.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure5_b.png" width="80%">
 </picture><br>
 **Figure 5**: How to specify specializing compilation on specific batch sizes.
 </p>
@@ -105,7 +105,7 @@ Not all operations are compatible with CUDA Graphs; for example, [cascade attent
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure6.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure6.png" width="80%">
 </picture><br>
 **Figure 6**: Piecewise CUDA Graphs in vLLM capture and replay supported GPU kernel sequences for low-overhead execution, while skipping unsupported operations like cascade attention.
 </p>
@@ -129,14 +129,14 @@ A common pattern in quantized MLPs is SiLU activation followed by a quantized do
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure7.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure7.png" width="80%">
 </picture><br>
 **Figure 7**: On Llama 3.1 405B quantized to FP8, tested on 8x AMD MI300s, fused kernels (`fusion`, in yellow) outperformed both `default` (using torch ops for RMSNorm and SiLU and custom FP8 quant kernel) and `custom` (unfused custom kernels).
 </p>
 
 <p align="center">
 <picture>
-<img src="/assets/figures/torch-compile/figure8.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure8.png" width="80%">
 </picture><br>
 Detailed throughput speedup comparing `fusion` and `default` regimes above. If all quantization overhead (8%) was removed via fusion, the theoretical maximum improvement to throughput would be 8%, and we can see that improvement reached in some cases.
 </p>