Release 0.0.7 · turboderp-org/exllamav3

Support SeedOssForCausalLM
Support ApertusForCausalLM
Support Qwen3NextForCausalLM¹
Reduced CPU overhead
Fix support for non-AVX2 CPUs
Optimized GEMM kernels
Faster quantization, especially on Blackwell
Quant optimizer utils
Much lower overhead from quantized cache
Tensor split option for MoE layers with large experts
Add recurrent model support to generator
Generator now allows allocating pages on the fly
Many more improvements and bugfixes

¹ Qwen3-Next currently requires Triton and Flash Linear Attention. causal-conv1d is recommended but not required. A Triton-free implementation is in the works for v0.0.8.
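
To check whether the packages named in the footnote are present before loading a Qwen3-Next model, something like the following sketch may help. It is not part of exllamav3. The import names (`triton`, `fla`, `causal_conv1d`) are assumptions about how those packages expose themselves in Python, and the helper names are hypothetical.

```python
from importlib.util import find_spec

# Display name -> assumed top-level import name (assumptions, not taken from the release notes).
REQUIRED = {"Triton": "triton", "Flash Linear Attention": "fla"}
RECOMMENDED = {"causal-conv1d": "causal_conv1d"}


def missing(modules):
    """Return the display names whose top-level module cannot be located."""
    return [name for name, mod in modules.items() if find_spec(mod) is None]


def check_qwen3_next_deps():
    """Hypothetical helper: report which footnote dependencies are absent."""
    return {
        "required_missing": missing(REQUIRED),
        "recommended_missing": missing(RECOMMENDED),
    }


if __name__ == "__main__":
    report = check_qwen3_next_deps()
    if report["required_missing"]:
        print("Missing (required):", ", ".join(report["required_missing"]))
    if report["recommended_missing"]:
        print("Missing (recommended):", ", ".join(report["recommended_missing"]))
    if not any(report.values()):
        print("All Qwen3-Next dependencies found.")
```

This only confirms that the modules can be located; it does not verify versions or GPU compatibility.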

Full Changelog: v0.0.6...v0.0.7
