Commit 43bddb3

Update benchmarks: 10s through 10min, Rust vs Python
1 parent ea3c7a0 commit 43bddb3

1 file changed: +9 -3 lines changed

README.md

Lines changed: 9 additions & 3 deletions
@@ -169,13 +169,19 @@ systemctl --user enable --now ace-step-gen
 
 ## Performance
 
-Benchmarked on RTX 3090 (24GB), F32 + TF32 tensor cores. Uses the [cuDNN ConvTranspose1d patch](https://github.com/huggingface/candle/pull/3383).
+Benchmarked on RTX 3090 (24GB), F32 + TF32 tensor cores. Rust uses the [cuDNN ConvTranspose1d patch](https://github.com/huggingface/candle/pull/3383).
 
 | Duration | Python (PyTorch) | Rust (candle) | Ratio |
 |----------|-----------------|---------------|-------|
-| 10s | 0.88s | 0.59s | **1.5x faster** |
+| 10s | 0.88s | 0.67s | **1.3x faster** |
 | 30s | 1.38s | 1.25s | **1.1x faster** |
+| 1 min | 2.65s | 2.33s | **1.1x faster** |
+| 2 min | 4.75s | 5.19s | 1.1x slower |
 | 4 min | 9.26s | 12.04s | 1.3x slower |
+| 6 min | 15.33s | 21.36s | 1.4x slower |
+| 7 min | 19.13s | 27.68s | 1.4x slower |
+| 8 min | 22.70s | OOM ||
+| 10 min | 30.79s | OOM ||
 
 <details>
 <summary>Per-stage breakdown (30s)</summary>
@@ -188,7 +194,7 @@ Benchmarked on RTX 3090 (24GB), F32 + TF32 tensor cores. Uses the [cuDNN ConvTra
 
 </details>
 
-Rust wins at short/medium durations. At longer durations PyTorch's Cutlass tensor-core VAE kernels (cuDNN v9 engine API) give it an edge. Without the [candle patch](https://github.com/huggingface/candle/pull/3383), VAE decode is ~3s (100x slower ConvTranspose1d).
+Rust wins up to ~1 min. Beyond that, PyTorch's Cutlass tensor-core VAE kernels (cuDNN v9 engine API) and better memory efficiency give it an edge — Rust OOMs on 24GB at 8+ minutes while Python handles the full 10 minutes. Without the [candle patch](https://github.com/huggingface/candle/pull/3383), VAE decode is ~3s at 30s (100x slower ConvTranspose1d).
 
 ## Running Tests
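
The Ratio column of the updated benchmark table can be cross-checked from the two timing columns. The sketch below is not part of the commit or the repository. It assumes the ratio is the slower time divided by the faster time, rounded to one decimal place, and the helper name `ratio_label` is hypothetical. Timings are copied from the table; the OOM rows are omitted because they have no Rust timing.

```python
# Wall-clock timings in seconds, copied from the README table in this commit.
# Each entry is (python_seconds, rust_seconds). The 8 min and 10 min rows are
# left out because the Rust run hit OOM there.
TIMINGS = {
    "10s": (0.88, 0.67),
    "30s": (1.38, 1.25),
    "1 min": (2.65, 2.33),
    "2 min": (4.75, 5.19),
    "4 min": (9.26, 12.04),
    "6 min": (15.33, 21.36),
    "7 min": (19.13, 27.68),
}


def ratio_label(python_s: float, rust_s: float) -> str:
    """Describe Rust relative to Python (hypothetical helper, assumed rounding)."""
    if rust_s <= python_s:
        return f"{python_s / rust_s:.1f}x faster"
    return f"{rust_s / python_s:.1f}x slower"


if __name__ == "__main__":
    for duration, (python_s, rust_s) in TIMINGS.items():
        print(f"{duration:>6}: {ratio_label(python_s, rust_s)}")
```

Under that assumption, the same helper gives the removed 10s row's label as well: `ratio_label(0.88, 0.59)` evaluates to `1.5x faster`.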
