Releases: turboderp-org/exllamav3
0.0.16
- Fix regression breaking tensor-parallel inference
- Allow TP text-model to work with vision tower
Full Changelog: v0.0.15...v0.0.16
0.0.15
- Support Glm4vForConditionalGeneration
- Support Glm4vMoeForConditionalGeneration
- Fix some tokenizer issues
- QoL improvements
Full Changelog: v0.0.14...v0.0.15
0.0.14
- Fix small regression in Gemma and Mistral vision towers
Full Changelog: v0.0.13...v0.0.14
0.0.13
- Support Qwen3-VL and Qwen3-VL MoE
- Minor bugfixes
Full Changelog: v0.0.12...v0.0.13
0.0.12
- Support MiniMaxM2ForCausalLM
- Graphs (reduce CPU overhead)
- Misc. optimizations
- Allow loading FP8 tensors (for quantization only, converted to FP16 on the fly)
- Fix some bugs
Full Changelog: v0.0.11...v0.0.12
0.0.11
- Fix issue with TP loading of models quantized with v0.0.9 or later
Full Changelog: v0.0.10...v0.0.11
0.0.10
- Fix issue preventing AsyncGenerator from working with new requeue option
Full Changelog: v0.0.9...v0.0.10
0.0.9
- Lock MCG and MUL1 multipliers, no longer flagged as experimental
- Switch to the MCG codebook by default for new models (use --codebook 3inst for the previous default)
- Add more calibration data
- Increase default calibration size to 250 rows (use --cal_rows 100 for the previous default)
- Fix quantized cache for bsz > 1
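To reproduce pre-0.0.9 quantization behavior, both new defaults can be overridden together. A minimal sketch of such an invocation, assuming the quantization entry point is a convert script with -i/-o input and output arguments (script name and those two flags are assumptions; --codebook and --cal_rows are the options named above):

```shell
# Hypothetical invocation: quantize with the pre-0.0.9 defaults.
# --codebook 3inst restores the old codebook; --cal_rows 100 restores
# the old calibration size. Script name and -i/-o flags are assumed.
python convert.py -i ./my-model -o ./my-model-exl3 \
    --codebook 3inst \
    --cal_rows 100
```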
- Fix kernel selection on A100
- A few more TP-related fixes
Full Changelog: v0.0.8...v0.0.9
0.0.8
- New GEMM kernel tuning scheme
- Fix banned strings regression
- Fix some memory leaks
- Fix potential stack overflow in cache defrag
Full Changelog: v0.0.7...v0.0.8
0.0.7
- Support SeedOssForCausalLM
- Support ApertusForCausalLM
- Support Qwen3NextForCausalLM¹
- Reduced CPU overhead
- Fix support for non-AVX2 CPUs
- Optimized GEMM kernels
- Faster quantization, especially on Blackwell
- Quant optimizer utils
- Much lower overhead from quantized cache
- Tensor split option for MoE layers with large experts
- Add recurrent model support to generator
- Generator now allows allocating pages on the fly
- Many more improvements and bugfixes
¹ Qwen3-Next currently requires Triton and Flash Linear Attention; causal-conv1d is recommended but not required. A Triton-free implementation is in the works for v0.0.8.
Full Changelog: v0.0.6...v0.0.7