Releases: turboderp-org/exllamav3
0.0.16
- Fix regression breaking tensor-parallel inference
- Allow TP text-model to work with vision tower
Full Changelog: v0.0.15...v0.0.16
0.0.15
- Support Glm4vForConditionalGeneration
- Support Glm4vMoeForConditionalGeneration
- Fix some tokenizer issues
- QoL improvements
Full Changelog: v0.0.14...v0.0.15
0.0.14
- Fix small regression in Gemma and Mistral vision towers
Full Changelog: v0.0.13...v0.0.14
0.0.13
- Support Qwen3-VL and Qwen3-VL MoE
- Minor bugfixes
Full Changelog: v0.0.12...v0.0.13
0.0.12
- Support MiniMaxM2ForCausalLM
- Graphs (reduce CPU overhead)
- Misc. optimizations
- Allow loading FP8 tensors (for quantization only, converted to FP16 on the fly)
- Fix some bugs
Full Changelog: v0.0.11...v0.0.12
0.0.11
- Fix issue with TP loading of models quantized with v0.0.9 or later
Full Changelog: v0.0.10...v0.0.11
0.0.10
- Fix issue preventing AsyncGenerator from working with new requeue option
Full Changelog: v0.0.9...v0.0.10
0.0.9
- Lock MCG and MUL1 multipliers, no longer flagged as experimental
- Switch to the MCG codebook by default for new models (use --codebook 3inst for the previous default)
- Add more calibration data
- Increase default calibration size to 250 rows (use --cal_rows 100 for the previous default)
- Fix quantized cache for bsz > 1
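To reproduce pre-0.0.9 quantization behavior, both new defaults can be overridden together. A minimal sketch of such an invocation, assuming the quantization entry point is a convert script with -i/-o input and output arguments (script name and those two flags are assumptions; --codebook and --cal_rows are the options named above):

```shell
# Hypothetical invocation: quantize with the pre-0.0.9 defaults.
# --codebook 3inst restores the old codebook; --cal_rows 100 restores
# the old calibration size. Script name and -i/-o flags are assumed.
python convert.py -i ./my-model -o ./my-model-exl3 \
    --codebook 3inst \
    --cal_rows 100
```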
- Fix kernel selection on A100
- A few more TP-related fixes
Full Changelog: v0.0.8...v0.0.9
0.0.8
- New GEMM kernel tuning scheme
- Fix banned strings regression
- Fix some memory leaks
- Fix potential stack overflow in cache defrag
Full Changelog: v0.0.7...v0.0.8
0.0.7
- Support SeedOssForCausalLM
- Support ApertusForCausalLM
- Support Qwen3NextForCausalLM¹
- Reduced CPU overhead
- Fix support for non-AVX2 CPUs
- Optimized GEMM kernels
- Faster quantization, especially on Blackwell
- Quant optimizer utils
- Much lower overhead from quantized cache
- Tensor split option for MoE layers with large experts
- Add recurrent model support to generator
- Generator now allows allocating pages on the fly
- Many more improvements and bugfixes
¹ Qwen3-Next currently requires Triton and Flash Linear Attention; causal-conv1d is recommended but not required. A Triton-free implementation is in the works for v0.0.8.
Full Changelog: v0.0.6...v0.0.7