0.0.7
- Support SeedOssForCausalLM
- Support ApertusForCausalLM
- Support Qwen3NextForCausalLM¹
- Reduced CPU overhead
- Fix support for non-AVX2 CPUs
- Optimized GEMM kernels
- Faster quantization, especially on Blackwell
- Quant optimizer utils
- Much lower overhead from quantized cache
- Tensor split option for MoE layers with large experts
- Add recurrent model support to generator
- Generator now allows allocating pages on the fly
- Many more improvements and bugfixes
¹ Qwen3-Next currently requires Triton and Flash Linear Attention. causal-conv1d is recommended but not required. A Triton-free implementation is in the works for v0.0.8.
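
To check ahead of time whether an environment meets the Qwen3-Next requirements in the footnote above, something like the following sketch may help. It is not part of this release. The import names are assumptions: Flash Linear Attention is assumed to import as `fla` and causal-conv1d as `causal_conv1d`. The helper names `missing_packages` and `check_qwen3_next_deps` are hypothetical.

```python
import importlib.util

# Assumed top-level import names (not confirmed by these release notes):
# Triton -> "triton", Flash Linear Attention -> "fla", causal-conv1d -> "causal_conv1d".
REQUIRED = {"Triton": "triton", "Flash Linear Attention": "fla"}
RECOMMENDED = {"causal-conv1d": "causal_conv1d"}


def missing_packages(packages):
    """Return display names of packages whose top-level module cannot be located."""
    return [
        name
        for name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


def check_qwen3_next_deps():
    """Return (missing_required, missing_recommended) for the Qwen3-Next footnote."""
    return missing_packages(REQUIRED), missing_packages(RECOMMENDED)


if __name__ == "__main__":
    required, recommended = check_qwen3_next_deps()
    if required:
        print("Missing required packages:", ", ".join(required))
    if recommended:
        print("Missing recommended packages:", ", ".join(recommended))
    if not required and not recommended:
        print("All Qwen3-Next dependencies found.")
```

The check only locates the modules without importing them, so it does not verify that the installed versions are compatible or that a GPU is usable.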
Full Changelog: v0.0.6...v0.0.7