**Thunder** makes optimizing PyTorch models easy, augmenting them with custom kernels, fusions, quantization, distributed strategies, and more.
For **end users**, Thunder comes with plugins that provide model speed-ups out of the box, for optimal utilization of the latest generation of hardware.
For **performance experts**, Thunder is the most ergonomic framework for understanding, modifying, and optimizing AI models through composable transformations.
Thunder is a source-to-source deep learning compiler for PyTorch that focuses on making it simple to optimize models for training and inference.
It provides:
- a simple, Pythonic IR capturing the entire computation
- a rich system of transforms that simultaneously operate on the computation IR, the model, and the weights
- an extensible dispatch mechanism to fusers and optimized kernel libraries
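To make the idea of a simple, straight-line Pythonic IR concrete, here is a toy tracer — purely illustrative, not Thunder's actual machinery — that records each operation performed on proxy values and emits a flat program, which is the general shape of a Thunder trace:

```python
# Toy sketch of a straight-line, Pythonic IR (illustrative only; this is
# NOT Thunder's implementation). Each operation on a Proxy appends one
# line to a shared trace, yielding a flat program with no control flow.

class Proxy:
    counter = 0

    def __init__(self, trace, name=None):
        self.trace = trace
        if name is None:
            name = f"t{Proxy.counter}"
            Proxy.counter += 1
        self.name = name

    def _record(self, op, other):
        out = Proxy(self.trace)
        self.trace.append(f"{out.name} = {op}({self.name}, {other.name})")
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def trace_fn(fn, n_args):
    """Run fn on proxies and return the recorded straight-line program."""
    Proxy.counter = 0  # deterministic temp names per trace
    trace = []
    args = [Proxy(trace, name=f"a{i}") for i in range(n_args)]
    out = fn(*args)
    trace.append(f"return {out.name}")
    return trace


# Example: tracing x * y + x produces a three-line program
for line in trace_fn(lambda x, y: x * y + x, 2):
    print(line)
```

Because the trace is just a list of Python statements, transforms over it are ordinary list-to-list functions — which is what makes them easy to compose.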
With Thunder you can:
- profile deep learning programs easily, map individual ops to kernels and inspect programs interactively
- programmatically replace sequences of operations with optimized ones and see the effect on performance
- acquire full computation graphs without graph breaks by flexibly extending the interpreter
- modify programs to fully utilize bleeding edge kernel libraries on specific hardware
- write models for single GPU and transform them to run distributed
- quickly iterate on mixed precision and quantization strategies to search for combinations that minimally affect quality
- bundle all optimizations in composable recipes, so they can be ported across model families
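The "composable recipes" idea can be sketched in plain Python (hypothetical helper names, not Thunder's API): if each transform maps a program to a program, a recipe is just an ordered bundle of transforms, and recipes themselves compose:

```python
# Illustrative sketch of composable optimization recipes. The transform
# and recipe names below are made up for illustration; only the
# composition pattern is the point.
from functools import reduce

def cast_to_bf16(program):
    return program + ["cast parameters to bfloat16"]

def fuse_elementwise(program):
    return program + ["fuse adjacent elementwise ops"]

def shard_across_gpus(program):
    return program + ["shard weights across GPUs"]

def make_recipe(*transforms):
    """Compose transforms left-to-right into a single callable."""
    return lambda program: reduce(lambda p, t: t(p), transforms, program)

# A speed recipe can be reused as one step of a distributed recipe,
# so the same bundle ports across model families unchanged.
speed_recipe = make_recipe(cast_to_bf16, fuse_elementwise)
distributed_recipe = make_recipe(speed_recipe, shard_across_gpus)

print(distributed_recipe(["original program"]))
```

Because recipes are ordinary callables, swapping one quantization or precision strategy for another is a one-line change — which is what makes rapid iteration on those combinations practical.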
Ultimately, you should think about Thunder as a highly efficient tool to go from “unoptimized” to “optimized”.
If that is of interest to you, read on to Install Thunder and get started quickly.
Although Thunder is a tool for optimizing models, rather than an opaque compiler that gets you speedups out of the box, here is a set of benchmarks.
Perf-wise, out of the box Thunder is in the ballpark of torch.compile, especially when using CUDAGraphs. Note, however, that Thunder is not a competitor to torch.compile! It can actually use torch.compile as one of its fusion executors.
The script `examples/quickstart/hf_llm.py` demonstrates how to benchmark a model for text generation, forward pass, forward pass with loss, and a full forward + backward computation.

On an H100 with torch=2.8.0, nvfuser-cu128-torch28, and Transformers 4.55.4, running Llama 3.2 1B, we see the following timings:
```
Transformers with torch.compile and CUDAGraphs (reduce-overhead mode): 521ms
Transformers with torch.compile but no CUDAGraphs (default mode): 814ms
Transformers without torch.compile: 1493ms
Thunder with CUDAGraphs: 542ms
```
## Plugins
Thunder works in three stages:
1. ⚡️ It acquires your model by interpreting Python bytecode and producing a straight-line Python program
1. ⚡️ It transforms the model and computation trace to make it distributed and change precision