
Commit 3ad29c7

Merge branch 'main' into copilot/fix-dtensor-check-runtime

2 parents b6e761e + f946af6


50 files changed: +1374 −272 lines

.azure/docker-build.yml

Lines changed: 5 additions & 5 deletions

```diff
@@ -38,17 +38,17 @@ jobs:
   - job: build_push
     strategy:
       matrix:
-        "cuda 12.6 | torch 2.8.0 | cudnn FE v1.10.0":
-          { CUDA_VERSION: "12.6.3", TORCH_VERSION: "2.8.0", TRITON_VERSION: "3.4.0", CUDNN_FRONTEND_VERSION: "1.10.0" }
-        "cuda 12.6 | torch nightly | cudnn FE v1.10.0":
-          { CUDA_VERSION: "12.6.3", TORCH_VERSION: "main", TORCH_INSTALL: "source", CUDNN_FRONTEND_VERSION: "1.10.0" }
+        "cuda 12.8 | torch 2.8.0 | cudnn FE v1.15.0":
+          { CUDA_VERSION: "12.8.1", TORCH_VERSION: "2.8.0", TRITON_VERSION: "3.4.0", CUDNN_FRONTEND_VERSION: "1.15.0" }
+        "cuda 12.8 | torch nightly | cudnn FE v1.15.0":
+          { CUDA_VERSION: "12.8.1", TORCH_VERSION: "main", TORCH_INSTALL: "source", CUDNN_FRONTEND_VERSION: "1.15.0" }
         #'cuda 12.1': # this version - '8.9.5.29-1+cuda12.1' for 'libcudnn8' was not found
     # how much time to give 'run always even if cancelled tasks' before stopping them
     cancelTimeoutInMinutes: "2"
     timeoutInMinutes: "95"
     variables:
       UBUNTU_VERSION: "24.04"
-      PYTHON_VERSION: "3.10"
+      PYTHON_VERSION: "3.12"
       imageRepository: "pytorchlightning/lightning-thunder"
       imageTag: "ubuntu$(UBUNTU_VERSION)-cuda$(CUDA_VERSION)-cudnn-fe$(CUDNN_FRONTEND_VERSION)-py$(PYTHON_VERSION)-pt_${TORCH_VERSION/v/}"
     pool: "lit-rtx-3090"
```
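The `imageTag` variable above is built from the matrix values via Azure variable substitution. As a sanity check, the expansion for the new matrix entry can be reproduced with plain Python string formatting (a sketch; the variable names mirror the YAML above, and bash's `${TORCH_VERSION/v/}` replaces only the first `v`):

```python
# Reproduce the Azure imageTag template expansion for the new matrix entry
values = {
    "UBUNTU_VERSION": "24.04",
    "CUDA_VERSION": "12.8.1",
    "CUDNN_FRONTEND_VERSION": "1.15.0",
    "PYTHON_VERSION": "3.12",
    "TORCH_VERSION": "2.8.0",
}
tag = (
    f"ubuntu{values['UBUNTU_VERSION']}-cuda{values['CUDA_VERSION']}"
    f"-cudnn-fe{values['CUDNN_FRONTEND_VERSION']}-py{values['PYTHON_VERSION']}"
    # ${TORCH_VERSION/v/} strips the first 'v' (a no-op for "2.8.0")
    f"-pt_{values['TORCH_VERSION'].replace('v', '', 1)}"
)
print(tag)  # ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0
```

This matches the image names referenced by the other workflow files in this commit (those add a `-dev` suffix).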

.azure/gpu-coverage.yml

Lines changed: 2 additions & 2 deletions

```diff
@@ -21,7 +21,7 @@ jobs:
     strategy:
       matrix:
         "w/ torch 2.7.1":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.7.1-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
     # how much time to give 'run always even if cancelled tasks' before stopping them
     cancelTimeoutInMinutes: "2"
     pool: "lit-rtx-3090"
@@ -65,7 +65,7 @@ jobs:
           chmod +x codecov

           # install this package
-          python setup.py develop
+          pip install -e .
        displayName: "Install package & ..."

      - bash: bash scripts/sanity-check.sh
```

.azure/gpu-tests.yml

Lines changed: 9 additions & 9 deletions

```diff
@@ -16,28 +16,28 @@ jobs:
     strategy:
       matrix:
         "main w/ torch 2.8.0":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
           testing: "main"
         "ops w/ torch 2.8.0":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
           testing: "ops"
         "grads w/ torch 2.8.0":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
           testing: "grads"
         "distributed w/ torch 2.8.0":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
           testing: "distributed"
         "main w/ torch-nightly":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
           testing: "main"
         "ops w/ torch-nightly":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
           testing: "ops"
         "grads w/ torch-nightly":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
           testing: "grads"
         "distributed w/ torch-nightly":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
           testing: "distributed"
     # how much time to give 'run always even if cancelled tasks' before stopping them
     cancelTimeoutInMinutes: "2"
@@ -82,7 +82,7 @@ jobs:
           chmod +x codecov

           # install this package
-          python setup.py develop
+          pip install -e .
        displayName: "Install package & ..."

      - bash: bash scripts/sanity-check.sh
```

.azure/notebook-runs.yml

Lines changed: 3 additions & 3 deletions

```diff
@@ -16,9 +16,9 @@ jobs:
     strategy:
       matrix:
         "notebooks w/ torch 2.8":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
         "notebooks w/ torch-nightly":
-          docker-image: "ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+          docker-image: "ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
     # how long to run the job before automatically cancelling
     timeoutInMinutes: "45"
     # how much time to give 'run always even if cancelled tasks' before stopping them
@@ -53,7 +53,7 @@ jobs:
           cat requirements/base.txt
           pip install -U -r requirements/notebooks.txt
           # install this package
-          python setup.py develop
+          pip install -e .
           # double check on test requirements
           echo "Install special requirements for notebooks"
        displayName: "Install package & ..."
```

.lightning/workflows/all-tests.yaml

Lines changed: 6 additions & 6 deletions

```diff
@@ -9,24 +9,24 @@ interruptible: False
 parametrize:
   matrix:
     image:
-      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
-      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
+      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
     testing: ["main", "ops", "grads"]
     machine: ["L4"]
   exclude: []
   include:
-    - image: "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
+    - image: "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
       testing: "distributed"
       machine: "L4_X_2"
-    - image: "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+    - image: "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
       testing: "distributed"
       machine: "L4_X_2"

 env:
   CI: "true" # skip some tests with CI
   NCCL_DEBUG: "INFO"
   NCCL_IGNORE_DISABLED_P2P: "1"
-  TORCH_VERSION: "2.7.1"
+  TORCH_VERSION: "2.8.0"
   CUDA_LAUNCH_BLOCKING: "1" # for debugging purposes, to get better stack traces

 run: |
@@ -49,7 +49,7 @@ run: |
   chmod +x codecov

   # install this package
-  python setup.py develop
+  pip install -e .

   bash scripts/sanity-check.sh
```

.lightning/workflows/notebooks.yaml

Lines changed: 3 additions & 3 deletions

```diff
@@ -10,8 +10,8 @@ interruptible: False
 parametrize:
   matrix:
     image:
-      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
-      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_main-dev"
+      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
+      - "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_main-dev"
   exclude: []
   include: []

@@ -29,7 +29,7 @@ run: |
   # double check on test requirements
   pip install -q -U -r requirements/base.txt -r requirements/notebooks.txt
   # install this package
-  python setup.py develop
+  pip install -e .

   bash scripts/sanity-check.sh
```

.lightning/workflows/transformer-engine.yaml

Lines changed: 3 additions & 2 deletions

```diff
@@ -7,7 +7,7 @@ trigger:
 timeout: "30" # minutes
 machine: "L4"
 interruptible: False
-image: "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.6.3-cudnn-fe1.10.0-py3.10-pt_2.8.0-dev"
+image: "pytorchlightning/lightning-thunder:ubuntu24.04-cuda12.8.1-cudnn-fe1.15.0-py3.12-pt_2.8.0-dev"
 parametrize:
   matrix:
     test_file:
@@ -20,11 +20,12 @@ run: |
   pip list
   set -ex

+  pip install wheel
   # conda install -c conda-forge libstdcxx-ng
   # sudo apt install libstdc++6 libstdc++-*-dev
   pip install . -U -q -r requirements/test.txt
   # Need to explicitly point to cudnn.h as it is installed at a non-standard location
   # Ref: https://github.com/NVIDIA/TransformerEngine/issues/918#issuecomment-2187703769
-  CPLUS_INCLUDE_PATH="/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/include/" pip install --no-build-isolation 'transformer_engine[pytorch]'
+  CPLUS_INCLUDE_PATH="/usr/local/lib/python3.12/dist-packages/nvidia/cudnn/include/" pip install --no-build-isolation 'transformer_engine[pytorch]'
   pip list # for debugging purposes
   pytest thunder/tests/${test_file} -v -rs
```
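Note that the `CPLUS_INCLUDE_PATH` above hardcodes the Python version (`python3.10` before, `python3.12` now), which is exactly what this commit had to touch. A sketch of a version-independent alternative, assuming the cuDNN headers come from the `nvidia-cudnn` pip wheel as in the referenced TransformerEngine issue:

```python
import sysconfig

# Derive the active interpreter's site-packages directory rather than
# hardcoding /usr/local/lib/python3.X/dist-packages; the cuDNN headers
# from the nvidia-cudnn wheel live under <site-packages>/nvidia/cudnn/include
site_packages = sysconfig.get_paths()["purelib"]
cudnn_include = f"{site_packages}/nvidia/cudnn/include"
print(cudnn_include)
```

The printed path could then feed `CPLUS_INCLUDE_PATH` from the shell, so a future Python bump would not require editing this workflow again.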

README.md

Lines changed: 38 additions & 46 deletions

````diff
@@ -13,18 +13,10 @@
 &#160;

 <strong>Source-to-source compiler for PyTorch.</strong>
-Fast. Understandable. Extensible.
+Understandable. Inspectable. Extensible.

 </div>

-______________________________________________________________________
-
-**Thunder** makes optimizing PyTorch models easy, augmenting them with custom kernels, fusions, quantization, distributed strategies, and more.
-
-For **end users**, Thunder comes with plugins that provide model speed-ups out of the box, for optimal utilization of last generation hardware.
-
-For **performance experts**, Thunder is the most ergonomic framework for understanding, modifying, and optimizing AI models through composable transformations.
-
 <div align='center'>

 <pre>
@@ -36,6 +28,28 @@ For **performance experts**, Thunder is the most ergonomic framework for underst

 </div>

+Thunder is a source-to-source deep learning compiler for PyTorch that focuses on making it simple to optimize models for training and inference.
+
+It provides:
+
+- a simple, Pythonic IR capturing the entire computation
+- a rich system of transforms that simultaneously operate on the computation IR, the model, and the weights
+- an extensible dispatch mechanism to fusers and optimized kernel libraries
+
+With Thunder you can:
+
+- profile deep learning programs easily, map individual ops to kernels and inspect programs interactively
+- programmatically replace sequences of operations with optimized ones and see the effect on performance
+- acquire full computation graphs without graph breaks by flexibly extending the interpreter
+- modify programs to fully utilize bleeding edge kernel libraries on specific hardware
+- write models for single GPU and transform them to run distributed
+- quickly iterate on mixed precision and quantization strategies to search for combinations that minimally affect quality
+- bundle all optimizations in composable recipes, so they can be ported across model families
+
+Ultimately, you should think about Thunder as a highly efficient tool to go from "unoptimized" to "optimized".
+
+If that is of interest to you, read on to Install Thunder and get started quickly.
+
 <div align='center'>

 [![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/Lightning-AI/lightning-thunder/blob/main/LICENSE)
@@ -168,7 +182,7 @@ torch.testing.assert_close(y, model(x))

 ## Examples

-### Speed up LLM training
+### LLM training

 Install LitGPT (without updating other dependencies)

@@ -194,7 +208,7 @@ out = thunder_model(inp)
 out.sum().backward()
 ```

-### Speed up HuggingFace BERT inference
+### HuggingFace BERT inference

 Install Hugging Face Transformers (recommended version is `4.50.2` and above)

@@ -228,7 +242,7 @@ out = thunder_model(**inp)
 print(out)
 ```

-### Speed up HuggingFace DeepSeek R1 distill inference
+### HuggingFace DeepSeek R1 distill inference

 Install Hugging Face Transformers (recommended version is `4.50.2` and above)

@@ -264,22 +278,7 @@ out = thunder_model.generate(
 print(out)
 ```

-To get an idea of the speedups, just run
-
-```bash
-python examples/quickstart/hf_llm.py
-```
-
-Here what you get on a L4 machine from [Lightning Studio](https://lightning.ai):
-
-```bash
-Eager: 2273.22ms
-Thunder: 1254.39ms
-```
-
-81% faster 🏎️! Quite the speedup ⚡️
-
-### Speed up Vision Transformer inference
+### Vision Transformer inference

 ```python
 import thunder
@@ -300,28 +299,21 @@ thunder_model = thunder.compile(model)
 out = thunder_model(inp)
 ```

-### Benchmarking HF models
+### Benchmarks

-The script `examples/quickstart/hf_benchmarks.py` demonstrates how to benchmark a model for text generation, forward pass, forward pass with loss, and a full forward + backward computation.
+Although Thunder is a tool for optimizing models, rather than an opaque compiler that gets you speedups out of the box, here is a set of benchmarks.

-On an H100 with torch=2.7.0 and nvfuser-cu126-torch27, running deepseek-ai/DeepSeek-R1-Distill-Llama-1.5B, the thunder executors (NVFuser and torch.compile) achieve the following speedups:
+Perf-wise, out of the box Thunder is in the ballpark of torch.compile, especially when using CUDAGraphs. Note however that Thunder is not a competitor to torch.compile! It can actually use torch.compile as one of its fusion executors.

-```
-Text generation:
-Thunder (nvfuser): 3.36× faster
-Thunder (torch.compile): 3.42× faster
+The script `examples/quickstart/hf_llm.py` demonstrates how to benchmark a model for text generation, forward pass, forward pass with loss, and a full forward + backward computation.

-Forward pass:
-Thunder (nvfuser): 1.51× faster
-Thunder (torch.compile): 1.63× faster
+On an H100 with torch=2.8.0, nvfuser-cu128-torch28, and Transformers 4.55.4, running Llama 3.2 1B, we see the following timings:

-Forward pass + loss:
-Thunder (nvfuser): 1.55× faster
-Thunder (torch.compile): 1.64× faster
-
-Forward + backward:
-Thunder (nvfuser): 1.51× faster
-Thunder (torch.compile): 1.69× faster
+```
+Transformers with torch.compile and CUDAGraphs (reduce-overhead mode): 521ms
+Transformers with torch.compile but no CUDAGraphs (default mode): 814ms
+Transformers without torch.compile: 1493ms
+Thunder with CUDAGraphs: 542ms
 ```

 ## Plugins
@@ -352,7 +344,7 @@ Thunder works in three stages:

 1. ⚡️ It acquires your model by interpreting Python bytecode and producing a straight-line Python program

-1. ️⚡️ It transforms the computation trace to make it distributed, change precision
+1. ️⚡️ It transforms the model and computation trace to make it distributed, change precision

 1. ⚡️ It routes parts of the trace for execution
````
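The benchmark timings added to the README in the hunk above imply concrete speedups relative to eager; computing them is plain arithmetic on the quoted numbers (nothing here calls Thunder itself):

```python
# Timings (ms) quoted in the new README benchmark section: Llama 3.2 1B on H100
timings_ms = {
    "torch.compile + CUDAGraphs (reduce-overhead)": 521,
    "torch.compile (default)": 814,
    "eager (no torch.compile)": 1493,
    "Thunder + CUDAGraphs": 542,
}

baseline = timings_ms["eager (no torch.compile)"]
speedups = {name: round(baseline / ms, 2) for name, ms in timings_ms.items()}
for name, s in speedups.items():
    print(f"{name}: {s}x vs eager")
```

So Thunder with CUDAGraphs comes in at roughly 2.75x over eager, close behind torch.compile's reduce-overhead mode at about 2.87x, which is consistent with the "in the ballpark of torch.compile" claim.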

dockers/README.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -6,14 +6,14 @@ You can build it on your own, note it takes lots of time, be prepared.

 ```bash
 # build with specific arguments
-docker image build -t lightning:ubuntu-cuda-py3.10-cuda12.1.1 -f dockers/ubuntu-cuda/Dockerfile --build-arg "CUDA_VERSION=12.1.1" .
+docker image build -t lightning:ubuntu-cuda-py3.12-cuda12.8 -f dockers/ubuntu-cuda/Dockerfile --build-arg "CUDA_VERSION=12.1.1" .
 ```

 To run your docker use

 ```bash
 docker image list
-docker run --rm -it pytorch-lightning:ubuntu-cuda-py3.10-cuda11.7.0 bash
+docker run --rm -it pytorch-lightning:ubuntu-cuda-py3.12-cuda12.8 bash
 ```

 ## Run docker image with GPUs
@@ -33,5 +33,5 @@ sudo systemctl restart docker
 and later run the docker image with `--gpus=all`. For example,

 ```bash
-docker run --rm -it --gpus=all pytorchlightning/lightning:ubuntu-cuda-py3.10-cuda12.1.0
+docker run --rm -it --gpus=all pytorchlightning/lightning:ubuntu-cuda-py3.12-cuda12.1.0
 ```
````
