Releases: turboderp-org/exllamav3

0.0.16

25 Nov 16:57

  • Fix regression breaking tensor-parallel inference
  • Allow the TP text model to work with a vision tower

Full Changelog: v0.0.15...v0.0.16

0.0.15

16 Nov 12:55

  • Support Glm4vForConditionalGeneration
  • Support Glm4vMoeForConditionalGeneration
  • Fix some tokenizer issues
  • QoL improvements

Full Changelog: v0.0.14...v0.0.15

0.0.14

10 Nov 00:38

  • Fix small regression in Gemma and Mistral vision towers

Full Changelog: v0.0.13...v0.0.14

0.0.13

09 Nov 22:04

  • Support Qwen3-VL and Qwen3-VL MoE
  • Minor bugfixes

Full Changelog: v0.0.12...v0.0.13

0.0.12

01 Nov 17:27

  • Support MiniMaxM2ForCausalLM
  • Graphs (reduced CPU overhead)
  • Misc. optimizations
  • Allow loading FP8 tensors (for quantization only; converted to FP16 on the fly)
  • Fix some bugs

Full Changelog: v0.0.11...v0.0.12

0.0.11

17 Oct 15:35

  • Fix issue with TP loading of models quantized with v0.0.9 or later

Full Changelog: v0.0.10...v0.0.11

0.0.10

15 Oct 12:51

  • Fix issue preventing AsyncGenerator from working with the new requeue option

Full Changelog: v0.0.9...v0.0.10

0.0.9

13 Oct 21:42

  • Lock MCG and MUL1 multipliers; no longer flagged as experimental
  • Switch to the MCG codebook by default for new models (use --codebook 3inst for the previous default)
  • Add more calibration data
  • Increase default calibration size to 250 rows (use --cal_rows 100 for the previous default)
  • Fix quantized cache for batch sizes > 1
  • Fix kernel selection on A100
  • A few more TP-related fixes

Full Changelog: v0.0.8...v0.0.9
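
To quantize a model with the pre-0.0.9 behavior, the two flags mentioned above can be combined in one conversion run. A minimal sketch follows; only --codebook 3inst and --cal_rows 100 come from these release notes, while the convert.py entry point and the -i/-o arguments are assumptions about the conversion CLI:

```shell
# Sketch: quantize with the pre-0.0.9 defaults.
# --codebook 3inst and --cal_rows 100 are taken from the release notes;
# the script name and input/output flags are assumptions.
python convert.py \
    -i /path/to/model \
    -o /path/to/output \
    --codebook 3inst \
    --cal_rows 100
```

Omitting both flags uses the new defaults: the MCG codebook and 250 calibration rows.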

0.0.8

09 Oct 22:12

  • New GEMM kernel tuning scheme
  • Fix banned strings regression
  • Fix some memory leaks
  • Fix potential stack overflow in cache defrag

Full Changelog: v0.0.7...v0.0.8

0.0.7

28 Sep 15:44

  • Support SeedOssForCausalLM
  • Support ApertusForCausalLM
  • Support Qwen3NextForCausalLM¹
  • Reduced CPU overhead
  • Fix support for non-AVX2 CPUs
  • Optimized GEMM kernels
  • Faster quantization, especially on Blackwell
  • Quant optimizer utils
  • Much lower overhead from quantized cache
  • Tensor split option for MoE layers with large experts
  • Add recurrent model support to generator
  • Generator now allows allocating pages on the fly
  • Many more improvements and bugfixes

¹ Qwen3-Next currently requires Triton and Flash Linear Attention; causal-conv1d is recommended but not required. A Triton-free implementation is in the works for v0.0.8.
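
The footnote's dependencies can typically be installed from PyPI. A sketch, assuming the usual package names for these projects (the exact names and any version pins are not specified in the release notes):

```shell
# Sketch: install optional dependencies for Qwen3-Next support.
# Package names are assumptions based on the projects' common PyPI releases.
pip install triton flash-linear-attention   # required for Qwen3-Next
pip install causal-conv1d                   # recommended, not required
```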

Full Changelog: v0.0.6...v0.0.7