Commit 4f994c8

Img widths 100%

Signed-off-by: Luka Govedič <[email protected]>

1 parent 7ff945a · commit 4f994c8

File tree

1 file changed: +9 -9 lines changed


_posts/2025-08-20-torch-compile.md

Lines changed: 9 additions & 9 deletions
@@ -22,7 +22,7 @@ In the following example, torch.compile produces a single fused kernel for all p
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure1.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure1.png" width="100%">
 </picture><br>
 <b>Figure 1</b>: torch.compile is a JIT compiler for PyTorch code. You can wrap functions, nn.Modules, and other callables in torch.compile.
 </p>
@@ -35,7 +35,7 @@ One way of optimizing models is to write custom CPU/CUDA operations that perform
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure2.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure2.png" width="100%">
 </picture><br>
 <b>Figure 2</b>: torch.compile gives you fast baseline performance to save YOU development time from tuning model performance.
 </p>
@@ -52,7 +52,7 @@ In the following code example, torch.save is an unsupported operation: torch.com
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure3.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure3.png" width="100%">
 </picture><br>
 <b>Figure 3</b>: torch.compile captures straight-line graphs of Tensor operations and works around unsupported operations like torch.save.
 </p>
@@ -80,7 +80,7 @@ The compiled artifacts and the cache can be reused across machines with the same
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure4.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure4.png" width="100%">
 </picture><br>
 <b>Figure 4</b>: Compiled artifacts are cached after cold start and can be reused across machines to ensure fast, consistent startup when set up correctly.
 </p>
@@ -93,8 +93,8 @@ Use `compile_sizes: [1, 2, 4]` in your config to trigger this specialization. Un
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure5_a.png" width="80%">
-<img src="/assets/figures/2025-torch-compile/figure5_b.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure5_a.png" width="100%">
+<img src="/assets/figures/2025-torch-compile/figure5_b.png" width="100%">
 </picture><br>
 <b>Figure 5</b>: How to specify specializing compilation on specific batch sizes.
 </p>
@@ -105,7 +105,7 @@ Not all operations are compatible with CUDA Graphs; for example, [cascade attent
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure6.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure6.png" width="100%">
 </picture><br>
 <b>Figure 6</b>: Piecewise CUDA Graphs in vLLM capture and replay supported GPU kernel sequences for low-overhead execution, while skipping unsupported operations like cascade attention.
 </p>
@@ -129,14 +129,14 @@ A common pattern in quantized MLPs is SiLU activation followed by a quantized do
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure7.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure7.png" width="100%">
 </picture><br>
 <b>Figure 7</b>: On Llama 3.1 405B quantized to FP8, tested on 8x AMD MI300s, fused kernels (<code>fusion</code>, in yellow) outperformed both <code>default</code> (using torch ops for RMSNorm and SiLU and custom FP8 quant kernel) and <code>custom</code> (unfused custom kernels).
 </p>
 
 <p align="center">
 <picture>
-<img src="/assets/figures/2025-torch-compile/figure8.png" width="80%">
+<img src="/assets/figures/2025-torch-compile/figure8.png" width="100%">
 </picture><br>
 <b>Figure 8</b>: Detailed throughput speedup comparing <code>fusion</code> and <code>default</code> regimes above. If all quantization overhead (8%) was removed via fusion, the theoretical maximum improvement to throughput would be 8%, and we can see that improvement reached in some cases.
 </p>
