
Commit 729ed09 (parent 11eaa3f)

README: link upstream candle PRs for required patches

File tree: 1 file changed, +6 −2 lines


README.md: 6 additions & 2 deletions
@@ -112,6 +112,10 @@ For cuDNN-accelerated ConvTranspose1d (faster VAE decode):
 cargo build --release --features cudnn
 ```
 
+Requires a [candle fork](https://github.com/Marenz/candle/tree/fast-conv-transpose1d-no-cudnn) with two upstream PRs:
+- [cuDNN ConvTranspose1d](https://github.com/huggingface/candle/pull/3383) — 100x faster VAE decode vs the default CPU fallback kernel
+- [public `Model::clear_kv_cache` for Qwen3](https://github.com/huggingface/candle/pull/3381) — needed to reset KV state between inference calls
+
 Depending on your system, you may need additional environment variables for the CUDA build — see [AGENTS.md](AGENTS.md) for platform-specific notes.
 
 ### Metal (macOS)
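A fork like the one referenced in this hunk is typically consumed via Cargo's `[patch]` table. The following is only a sketch: the branch name comes from the fork URL above, but which candle crates the project actually depends on is an assumption.

```toml
# Sketch: override the published candle crates with the fork until the
# upstream PRs land. Crate selection here is an assumption, not taken
# from this repository's Cargo.toml.
[patch.crates-io]
candle-core = { git = "https://github.com/Marenz/candle", branch = "fast-conv-transpose1d-no-cudnn" }
candle-nn = { git = "https://github.com/Marenz/candle", branch = "fast-conv-transpose1d-no-cudnn" }
```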
@@ -157,7 +161,7 @@ cargo build --release --bin generation-daemon --features cuda,audio-ogg
 
 ## Performance
 
-Benchmarked on RTX 3090 (24GB), F32 + TF32 tensor cores. Uses a local candle patch adding cuDNN ConvTranspose1d.
+Benchmarked on RTX 3090 (24GB), F32 + TF32 tensor cores. Uses the [cuDNN ConvTranspose1d patch](https://github.com/huggingface/candle/pull/3383).
 
 | Duration | Python (PyTorch) | Rust (candle) | Ratio |
 |----------|-----------------|---------------|-------|
@@ -176,7 +180,7 @@ Benchmarked on RTX 3090 (24GB), F32 + TF32 tensor cores. Uses a local candle pat
 
 </details>
 
-Rust wins at short/medium durations. At longer durations PyTorch's Cutlass tensor-core VAE kernels (cuDNN v9 engine API) give it an edge. Without the candle patch, VAE decode is ~3s (100x slower ConvTranspose1d).
+Rust wins at short/medium durations. At longer durations PyTorch's Cutlass tensor-core VAE kernels (cuDNN v9 engine API) give it an edge. Without the [candle patch](https://github.com/huggingface/candle/pull/3383), VAE decode is ~3s (100x slower ConvTranspose1d).
 
 ## Running Tests
 
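The `clear_kv_cache` PR exists because cached attention state from one prompt must be dropped before decoding an unrelated one. Below is a minimal standalone sketch of that pattern; the types here are toy stand-ins, not candle's actual Qwen3 model (the real PR simply makes the existing `Model::clear_kv_cache` public).

```rust
// Toy illustration of why a public KV-cache reset matters between
// independent inference calls. `Model` here is a hypothetical stand-in.
struct Model {
    kv_cache: Vec<f32>, // stand-in for per-layer key/value tensors
}

impl Model {
    fn new() -> Self {
        Model { kv_cache: Vec::new() }
    }

    // Each decode step appends to the cache, as autoregressive
    // attention does with keys/values.
    fn decode_step(&mut self, token: f32) -> f32 {
        self.kv_cache.push(token);
        self.kv_cache.iter().sum()
    }

    // Without this reset, a second prompt would attend to stale state
    // left over from the previous generation.
    fn clear_kv_cache(&mut self) {
        self.kv_cache.clear();
    }
}

fn main() {
    let mut model = Model::new();
    model.decode_step(1.0);
    model.decode_step(2.0);
    assert_eq!(model.kv_cache.len(), 2);

    // Reset between independent inference calls.
    model.clear_kv_cache();
    assert!(model.kv_cache.is_empty());
}
```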