Releases · ModelCloud/GPTQModel
GPT-QModel v6.0.3
Notable Changes:
Quantization and inference
- Major ParoQuant improvements across quantization speed, inference, and accuracy.
- Added Paro inference support and a new layer optimizer.
- Auto-enables AMP for the fast Paro implementation to better match reference behavior.
- Added Paro rotation autotuning and fixed BF16 rotation support for the fused CUDA kernel.
- Improved Paro stability with seeding fixes, cleanup, learned channel scale clamping, and contiguous tensor handling fixes.
- Fixed a layer output replay/re-capture regression.
- Added FOEM (First-Order Error Matters) for more accurate quantized LLM compensation, plus follow-up fixes to its data processing pipeline.
- Replaced the old marlin_fp16 backend behavior with environment-flag control for FP32 reduction.
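FOEM's premise, that the first-order error left in a layer's *output* after rounding matters and can be compensated, can be illustrated with a toy sketch. This shows the general idea of output-aware error compensation against calibration activations, not FOEM's actual algorithm; all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))   # calibration activations
W = rng.standard_normal((64, 32))    # fp weights (in_features x out_features)

# plain round-to-nearest symmetric int4, per output channel ([-7, 7] range)
s = np.abs(W).max(axis=0) / 7.0
Q = np.clip(np.round(W / s), -7, 7)
err_plain = np.linalg.norm(X @ W - X @ (Q * s))

# first-order compensation: re-fit each channel's scale by least squares
# on the layer OUTPUT (not the weights), using the calibration activations
Y, Z = X @ W, X @ Q
s_star = (Y * Z).sum(axis=0) / (Z * Z).sum(axis=0)
err_comp = np.linalg.norm(Y - Z * s_star)

print(err_comp <= err_plain)         # → True (least-squares scales cannot do worse)
```

Because each channel's scale is re-fit against the calibration output, the compensated error is never larger than the plain round-to-nearest error.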
Model and backend support
- Added support for Gemma4, MiniCPMO, MiniCPMV, and GLM4-MoE-Lite.
- Added PrismAI/Bonsai model support for inference.
- Fixed Qwen3_5QModel definition issues.
- Fixed Qwen 3.5 rotary embedding behavior.
- Fixed AWQ layer grouping for qwen3_5_moe, llama4, qwen2_moe, and qwen3_next.
- Fixed awq_processor.dynamic so skipped layers are handled correctly.
- Improved dtype compatibility.
- Hugging Face kernels are now gated off on Python no-GIL builds until upstream wheel support is fixed.
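The no-GIL gating in the last bullet can be probed from the standard library; a minimal sketch of how such a gate typically looks (GPTQModel's actual check may differ, and `hf_kernels_available` is a hypothetical helper):

```python
import sysconfig

def is_free_threaded_build() -> bool:
    """True on a free-threaded (no-GIL) CPython build (3.13t+)."""
    # Py_GIL_DISABLED is 1 when CPython was compiled with --disable-gil;
    # it is 0 or absent on regular builds.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

def hf_kernels_available() -> bool:
    """Hypothetical gate: skip HF kernels on no-GIL builds (no wheels yet)."""
    return not is_free_threaded_build()

print(hf_kernels_available())
```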
Evaluation, calibration, and usability
- Integrated Evalution into the workflow.
- Added evalution.VLLM and evalution.SGLang backends.
- Fixed SGLang evaluation engine initialization.
- Automatically determines MODEL_COMPAT_FAST_LAYER_COUNT.
- Improved calibration data device handling.
- Updated tokenizer handling, and collation now respects tokenizer padding_size.
- Improved import performance by lazy-loading _DEVICE_THREAD_POOL.
- Cleaned up warning behavior and added an option to suppress warnings.
- Removed forced random seed overrides.
Dependency and compatibility updates
- Updated pypcre to 0.2.14.
- Pinned logbar to >=0.4.1.
- Updated transformers and defuser package versions.
- Fixed SAVE_PATH handling and import path resolution issues.
Breaking and removed
- Removed GPTQModel.upload_to_hub().
- Removed MLX export support.
What's Changed
- [CI] fix pkgs' order & fix flashinfer version was overridden by @CSY-ModelCloud in #2575
- allow to disable warning by @CSY-ModelCloud in #2576
- lazy load _DEVICE_THREAD_POOL, to speed up import by @CSY-ModelCloud in #2577
- remove disable env check by @CSY-ModelCloud in #2578
- [CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2579
- Update pypcre version to 0.2.14 by @Qubitium in #2581
- Nothing to see here... by @Qubitium in #2456
- dtype compat by @Qubitium in #2582
- fix test_moe_config by @ZX-ModelCloud in #2583
- fix new format test by @ZX-ModelCloud in #2586
- [CI] add test config by @CSY-ModelCloud in #2587
- fix Qwen3_5QModel definition by @ZX-ModelCloud in #2588
- speed up paroquant quant speed and resolve accuracy issues by @Qubitium in #2590
- append last commit to version by @CSY-ModelCloud in #2591
- speedup paroquant test by @ZX-ModelCloud in #2592
- [CI] generate release matrix from torch registry by @CSY-ModelCloud in #2593
- Evalution integration by @Qubitium in #2585
- move eval.sh to tests by @Qubitium in #2594
- remove warning by @Qubitium in #2595
- [CI] use new docker image by @CSY-ModelCloud in #2596
- [CI] install required pkg by @CSY-ModelCloud in #2597
- Automatically Determine MODEL_COMPAT_FAST_LAYER_COUNT by @ZX-ModelCloud in #2598
- [CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2599
- Fix: Paroquant impl accuracy by @Qubitium in #2601
- remove forced random seed override in cls proper by @Qubitium in #2603
- Paro test by @Qubitium in #2604
- [FIX] incorrect SAVE_PATH by @ZX-ModelCloud in #2605
- pin logbar to >= 0.4.1 by @Qubitium in #2606
- Update the evalution scores by @ZX-ModelCloud in #2600
- Paro: auto enable amp for fast impl to sync with reference by @Qubitium in #2607
- paro: fix seeding and cleanup by @Qubitium in #2609
- gate hf kernel to non-nogil builds of python until upstream fixes wheels by @Qubitium in #2610
- [CI] use Ubuntu 24.04 docker image by @CSY-ModelCloud in #2612
- Fix layer output re-capture (replay) regression by @Qubitium in #2611
- remove legacy ppl codes by @Qubitium in #2613
- replace marlin_fp16 backend with env flag control for fp32 reduction … by @Qubitium in #2614
- [CI] default py 3.14t & install latest Evalution by @CSY-ModelCloud in #2616
- [CI] fix Evalution is private by @CSY-ModelCloud in #2617
- update tokenicer by @Qubitium in #2618
- make collate respect tokenizer padding_size by @Qubitium in #2620
- paro: clamp learned channel scales to avoid collapse by @Qubitium in #2622
- Calibration data device by @avtc in #2608
- [FIX] qwen3_5 rotary_embedding by @ZX-ModelCloud in #2624
- Temporarily disable gptqmodel split_by feature by @ZX-ModelCloud in #2625
- use evalution.VLLM by @CSY-ModelCloud in #2615
- use evalution.SGLang by @ZX-ModelCloud in #2626
- paro: enter the dragon by @Qubitium in #2623
- [CI] use torch 2.11 by @CSY-ModelCloud in #2627
- [FIX] sglang evaluation engine initialization error. by @ZX-ModelCloud in #2629
- [MODEL] Add minicpmo support by @ZX-ModelCloud in #2630
- [CI] update CI path by @CSY-ModelCloud in #2633
- [FIX] qwen3_5_moe / llama4 / qwen2_moe / qwen3_next awq layer grouping by @ZX-ModelCloud in #2634
- Remove GPTQModel.upload_to_hub() api by @ZX-ModelCloud in #2635
- remove export to mlx option by @ZX-ModelCloud in #2636
- [MODEL] supports minicpmv by @ZX-ModelCloud in #2637
- Paro: layer optimizer by @Qubitium in #2628
- Paro inference by @Qubitium in #2638
- PrismAI/Bonsai Model Support (inference only) by @Qubitium in #2640
- Update README.md by @Qubitium in #2641
- Update transformers and defuser package versions by @Qubitium in #2642
- [CI] install gguf for test_local_model_paths by @CSY-ModelCloud in #2645
- fix imported path not found by @CSY-ModelCloud in #2646
- [MODEL] support glm4_moe_lite by @ZX-ModelCloud in #2644
- [FEATURE] Add FOEM: First-Order Error Matters; Accurate Compensation for Quantized LLM by @Xingyu-Zheng in #2639
- Revise README with latest news and article references by @Qubitium in #2647
- FIX paroquant bf16 rotation support for fused cuda kernel by @Qubitium in #2648
- paroquant rotation autotune by @Qubitium in #2649
- [FIX] In awq_processor, dynamic did not correctly skip layers. by @ZX-ModelCloud in #2650
- ruff fix by @Qubitium in #2651
- Ruff fix by @Qubitium in #2652
- update readme by @Qubitium in #2653
- fix: ensure contiguous tensors by @Qubitium in #2655
- fix failed test by @ZX-ModelCloud in https://github.com/ModelCl...
GPT-QModel v5.8.0
Notable Changes
- Transformers 5.3.0 compatibility.
- Video Quantization Support
- Added support for video input during quantization.
- MoE & Model Support
- Added support for Qwen 3.5 and Qwen 3.5 MoE.
- Expanded compatibility for Qwen 3 variants including MoE / VL / Omni / Next.
- Added support for LLada2 block diffusion LLM models.
- Improved compatibility for Mixtral, Phi-4, Nemotron Ultra, BaiChuan, ChatGLM, Yi, and GLM4V.
- Fixed multiple MoE-specific AWQ and multi-GPU issues, including routing, module tree, position embeddings, and device mismatches.
- AWQ / GPTQ Kernels
- Added CPU fused AWQ kernels for torch_fused and hf_kernel.
- Added torch_int8 AWQ kernel.
- Added BitBLAS AWQ kernel.
- Ported Intel int8 GPTQ/AWQ kernels.
- Updated kernel selection to prefer HF kernels where they provide the best performance and compatibility.
- Added BitBLAS fallback protection and fixed BitBLAS accuracy and qzero remap regressions.
- Quantization Improvements
- Replaced greedy search with ternary search in SmoothBSE.
- Fixed SmoothMAD overly aggressive clipping.
- Added layer-level dynamic skip for fast quantization.
- Added early stop when all remaining layers are skipped during quantization.
- Fixed AWQ OOM and dequantization-related issues.
- Runtime & Dequantization
- Added optional CPU int64 g_idx cache for TorchQuantLinear dequantization.
- Improved TorchFused dequantization and fp32 dtype support.
- Removed unnecessary symmetric handling in dequantize_gemm.
- Fixed rotary embedding device mismatch by storing per-device rotary copies.
- Added warmup protection for threaded timing.
- Defuser Integration
- Integrated defuser.convert_hf_model().
- Integrated defuser.materialize_model().
- Integrated defuser.replace_fused_blocks().
- Improved defuser meta/offload compatibility and fused block handling.
- Compatibility Fixes
- Improved compatibility with older and newer Hugging Face Transformers / Optimum versions.
- Fixed import compatibility issues in models/utils.
- Fixed rotary / embedding config compatibility with older HF and model variants.
- Improved tokenizer and model compatibility updates related to tokenicer.
- Fixed OSS compatibility issues.
- Kernel / Backend Changes
- Hard deprecated ExLLaMA v1 kernel.
- Exposed the Triton patcher as an externally callable API.
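The ternary-search change in SmoothBSE (Quantization Improvements above) relies on the search objective being approximately unimodal in the tuned parameter. A generic sketch of the technique, with a toy loss standing in for SmoothBSE's actual objective:

```python
def ternary_search_min(f, lo, hi, iters=100):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2          # minimum lies in [lo, m2]
        else:
            lo = m1          # minimum lies in [m1, hi]
    return (lo + hi) / 2

# toy unimodal loss with its minimum at 0.7
best = ternary_search_min(lambda a: (a - 0.7) ** 2, 0.0, 1.0)
print(round(best, 3))        # → 0.7
```

Each iteration discards a third of the interval, so unlike a greedy grid sweep the cost grows only logarithmically with the target precision.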
What's Changed
- support video input for quantization by @techshoww in #2386
- feat: moe-router-bypass-batch-size by @avtc in #2349
- [CI] use UV as python manager by @CSY-ModelCloud in #2415
- [CI] fix deps installation & gpu service api path by @CSY-ModelCloud in #2416
- [CI] auto release GPU if job has sth wrong or unrecoverable by @CSY-ModelCloud in #2417
- [CI] save log to disk & fix deps installation by @CSY-ModelCloud in #2418
- Replace Greedy with Ternary Search for SmoothBSE by @namgyu-youn in #2419
- Feature/LLada2 support: Block Diffusion LLM by @blazingbhavneek in #2422
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2426
- [MODEL] supports qwen3_5 by @ZX-ModelCloud in #2427
- [FIX] eval bug for qwen3_5 quantized model by @ZX-ModelCloud in #2428
- [MODEL] supports qwen3_5_moe by @ZX-ModelCloud in #2433
- Update tokenicer dependency version to 0.0.7 by @Qubitium in #2434
- Optional CPU g_idx int64 cache for TorchQuantLinear dequant path by @Qubitium in #2431
- fix import compat issues for models/utils that is locked to higher ve… by @Qubitium in #2436
- call defuser.convert_hf_model() by @ZX-ModelCloud in #2437
- Update defuser dependency version to 0.0.3 by @Qubitium in #2439
- quantize mlp experts module for qwen3_5_moe by @ZX-ModelCloud in #2443
- Fix typo in setup.py causing wheel build failure (sys.abiflag -> sys.abiflags) by @beomchan0 in #2444
- call defuser's materialize_model() by @ZX-ModelCloud in #2446
- Update defuser dependency version to 0.0.4 by @Qubitium in #2447
- port intel's int8 gptq/awq kernel over by @Qubitium in #2438
- expose triton patcher as externally callable by @Qubitium in #2448
- docs by @Qubitium in #2449
- Add AWQ support for CPU fused kernels (torch_fused & hf_kernel) by @jiqing-feng in #2445
- Cleanupx by @Qubitium in #2450
- Make HF kernels for gptq/awq highest priority as they are the highest… by @Qubitium in #2451
- rm sym in dequantize_gemm by @jiqing-feng in #2452
- fix awq rotary device mismatch. store per-device copy of rotary by @Qubitium in #2453
- add torch_int8 awq kernel by @Qubitium in #2454
- [CI] move check log to a new step by @CSY-ModelCloud in #2455
- cleanup hf kernel gptq/awq post_init loading by @Qubitium in #2457
- fix SmoothMAD overly-aggressive clipping by @Qubitium in #2459
- upgrade defuser version to 0.0.5 by @ZX-ModelCloud in #2460
- [FIX] test_qwen3_5_moe by @ZX-ModelCloud in #2461
- Update defuser dependency version to 0.0.6 by @Qubitium in #2462
- fix awq oom by @CSY-ModelCloud in #2458
- [CI] CUDA 131 + Torch 2.10.0 + Python 3.13 by @CSY-ModelCloud in #2463
- Fix the module_tree in Qwen3_5_Moe to correctly support AWQ by @ZX-ModelCloud in #2464
- [CI] fix git link cannot be installed by uv by @CSY-ModelCloud in #2465
- [FIX] GEMM can't pack by @ZX-ModelCloud in #2466
- [CI] add peft for test_asym_gptq_v1 & check log after test by @CSY-ModelCloud in #2467
- [CI] get path error from log & install pre-compiled bitblas by @CSY-ModelCloud in #2468
- [CI] fix log files were saved with wrong runid by @CSY-ModelCloud in #2469
- [FIX] where qwen3_5_moe got incorrect position_embeddings during AWQ quantization by @ZX-ModelCloud in #2470
- Update pypcre version to 0.2.13 by @CSY-ModelCloud in #2471
- read dependencies from requirements.txt by @CSY-ModelCloud in #2472
- add setuptools to requirements.txt by @CSY-ModelCloud in #2474
- set minimum setuptools version to 78.1.1 by @CSY-ModelCloud in #2475
- [FIX] device mismatch issue that occurred during multi-GPU AWQ quantization in moe Model by @ZX-ModelCloud in #2476
- [CI] auto uninstall unneeded pkgs by @CSY-ModelCloud in #2478
- fix ci failed tests by @ZX-ModelCloud in #2477
- update mixtral's module_tree by @ZX-ModelCloud in #2480
- Fix CI by @Qubitium in #2481
- [CI] add pypi as backup by @CSY-ModelCloud in #2482
- Ci fixes 2 by @Qubitium in #2483
- CI Tests Fix 3 by @Qubitium in #2484
- [CI] fix old models need old transformers by @CSY-ModelCloud in #2485
- fix failed test by @ZX-ModelCloud in #2486
- [CI] install latest bitblas & fix missing pkgs by @CSY-ModelCloud in #2487
- BaiChuan fix by @Qubitium in #2488
- Ci fix 5 by @Qubitium in #2489
- Shell/Src module buffer registration mismatch + Qwen 2.5 VL patch by @Qubitium in #2490
- [CI] install latest evalplus wheel by @CSY-ModelCloud in #2492
- [CI] throw error for fast check by @CSY-ModelCloud in #2493
- [FIX] test_post_quant_eora by @ZX-ModelCloud in https://github.com/ModelCloud/GPTQ...
GPT-QModel v5.7.0
Notable Changes:
- Feature: MoE.Routing control (Bypass or Override) by @avtc in #2235
- Feature: Use FailSafe Naive Quantization when GPTQ fails due to MoE uneven routing by @ZX-ModelCloud in #2293
- Feature: ability to pause/resume quantization via 'p' key by @avtc in #2294
- Glm4v support by @LRL2-ModelCloud in #2303
- Failsafe smoothers by @Qubitium in #2304
- New median strategy and SmoothPercentileAsymmetric smoother by @Qubitium in
- Support for Qwen2.5-Omni calibration data, including audio, by @ChenShisen in #2309
- Add Smooth trigger based on group_size by @Qubitium in #2312
- Voxtral support by @LRL2-ModelCloud in #2315
- Better compat with triton-windows and other alternative triton packages by @Qubitium in #2395
- Dynamically map format/backend to kernel by @Qubitium in #2353
- Add EXAONE4 support by @namgyu-youn in #2405
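The FailSafe feature above falls back to naive quantization when GPTQ cannot proceed, e.g. when MoE uneven routing leaves an expert with no calibration activations. A minimal sketch of such a fallback, assuming round-to-nearest as the naive path; `quantize_with_failsafe` and `failing_solver` are hypothetical illustrations, not GPTQModel's API:

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Naive round-to-nearest symmetric quantization (needs no calibration data)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0) / qmax
    scale[scale == 0] = 1.0              # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q, scale

def quantize_with_failsafe(w, gptq_solver):
    """Hypothetical wrapper: try GPTQ; fall back to RTN when the solve fails."""
    try:
        return gptq_solver(w)
    except Exception:                    # e.g. singular Hessian: no routed tokens
        q, scale = rtn_quantize(w)
        return q * scale                 # dequantized fallback weights

w = np.random.default_rng(1).standard_normal((8, 4))

def failing_solver(_):
    raise RuntimeError("expert received no calibration activations")

w_hat = quantize_with_failsafe(w, failing_solver)
print(w_hat.shape)                       # → (8, 4)
```

RTN is less accurate than Hessian-based GPTQ, but it always succeeds, which is the point of a failsafe path.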
What's Changed
- [FIX] unittest by @ZX-ModelCloud in #2291
- [FIX] marlin forward by @ZX-ModelCloud in #2296
- FIX fast_hadamard_transform import by @LRL2-ModelCloud in #2298
- do not log moe errors if failsafe enabled by @Qubitium in #2299
- [CI] allow cancel action by @CSY-ModelCloud in #2300
- Fix non-rtn packing by @Qubitium in #2302
- log q vs weight abs.mean for loss column by @Qubitium in #2306
- fix inverted failsafe log condition by @Qubitium in #2310
- New median strategy and SmoothPercentileAsymmetric smoother by @Qubitium in #2311
- Allow failsafe to be none by @Qubitium in #2313
- move non-inference affecting fields to meta on save by @Qubitium in #2314
- [FIX] GPTQModel.load() can now correctly load non-quantized models. by @ZX-ModelCloud in #2317
- FIX hf kernel by @jiqing-feng in #2319
- [CI] test_qwen3_moe add eval task: GSM8K_PLATINUM_COT and MMLU_STEM by @ZX-ModelCloud in #2320
- Release 5.7 Prep by @Qubitium in #2318
- [FIX] Exclude unrouted MoE experts on load by @ZX-ModelCloud in #2321
- [FIX] Skip empty subset by @ZX-ModelCloud in #2322
- [FIX] GLM-4.5-Air quantize fail by @ZX-ModelCloud in #2323
- fix: offload_to_disk=True uses more vram than offload_to_disk=False by @avtc in #2325
- Fix import no_init_weights from transformers by @jiqing-feng in #2329
- [FIX] qqq quantize by @ZX-ModelCloud in #2330
- chery pick: attempt to fix terminal state after pause/resume handlers by @avtc in #2327
- [FIX] quantization to fail for non-MoE models by @ZX-ModelCloud in #2333
- Device check by @jiqing-feng in #2334
- FIX moe flag passing not passing nested ci test by @Qubitium in #2337
- Use safer checks for nullable properties where they may not exists at… by @Qubitium in #2338
- Fix unit test by @Qubitium in #2339
- Group module_tree/subsection parsing related tests to module_tree folder by @Qubitium in #2340
- Group kernel tests by @Qubitium in #2341
- Lifecycle: Move awq.pack_module to submodule_finalize() from process() by @ZX-ModelCloud in #2335
- Partial Revert 2235: temp remove moe bypass by @ZX-ModelCloud in #2343
- Re apply compute device filter by @Qubitium in #2345
- Re-apply moe routing bypass by @ZX-ModelCloud in #2347
- Fix: Zero point underflow in AWQ Exllama v2 kernel by @12345txy in #2351
- Remove unnecessary +1/-1 inference/packing zero-point offset for AWQ Exllama v2 kernel by @Qubitium in #2352
- Normalize AWQ.qcfg zero_point to sym property by @Qubitium in #2355
- FIX sym True with AWQ by @ZX-ModelCloud in #2357
- Prepare for 5.7 by @Qubitium in #2358
- [FIX] self_attn.q_proj was not quantized in the Moonlight Model by @ZX-ModelCloud in #2360
- [FIX] torch_fused inference error by @ZX-ModelCloud in #2362
- [FIX] FORMAT.LLM_AWQ was incorrectly quantized as FORMAT.GEMM by @ZX-ModelCloud in #2364
- [CI] load all tests include sub dirs & merge some small tests in to one file by @CSY-ModelCloud in #2363
- Fix evalplus output filename mismatch by @juraev in #2365
- [FIX] FORMAT.GEMV and FORMAT.GEMV_FAST could not be quantized by @ZX-ModelCloud in #2366
- [CI] add deps config for CI tests by @CSY-ModelCloud in #2368
- [FIX] unittest by @ZX-ModelCloud in #2370
- [FIX] In AWQProcessor, the failsafe threshold_value should be calculated based on the scale group, not the entire layer by @ZX-ModelCloud in #2369
- [CI] fix ci didn't read correct yaml by @CSY-ModelCloud in #2371
- [FIX] ci unittest by @ZX-ModelCloud in #2372
- [FIX] test_q4_bitblas and test_qqq by @ZX-ModelCloud in #2373
- [CI] add test_integration deps by @CSY-ModelCloud in #2374
- [CI] fix torch version was upgraded by deps by @CSY-ModelCloud in #2377
- select_quant_linear should always receive a non-null device by @Qubitium in #2376
- [CI] uninstall pynvml by @CSY-ModelCloud in #2378
- [FIX] failed ci test by @ZX-ModelCloud in #2380
- [FIX] test_gptq by @ZX-ModelCloud in #2382
- [FIX] correct has_captured_input_ids() logic by using > 0 check by @ZX-ModelCloud in #2383
- [FIX] test_model by @ZX-ModelCloud in #2384
- [FIX] unit test by @ZX-ModelCloud in #2385
- [CI] use new docker by @CSY-ModelCloud in #2387
- [FIX] ci test by @ZX-ModelCloud in #2388
- [FIX] unittest by @ZX-ModelCloud in #2389
- [FIX] missing ExllamaV2 kernels initialization in AutoRound by @ZX-ModelCloud in #2390
- [CI] keep uv up to date by @CSY-ModelCloud in #2391
- [FIX] test_awq by @ZX-ModelCloud in #2392
- [FIX] Incorrectly selected device by @ZX-ModelCloud in #2394
- [FIX] quantization failure for Qwen2/2.5/3 VL models with FlashAttention-2 by @ZX-ModelCloud in #2396
- [FIX] test_ovis2 and test_ovis_1_6_llama by @ZX-ModelCloud in #2397
- [FIX] test_stage_modules by @ZX-ModelCloud in #2398
- [CI] list test files with py file & fix duplicated test names by @CSY-ModelCloud in #2399
- [FIX] test_pause_resume by @ZX-ModelCloud in #2400
- [CI] update sort, root test files first by @CSY-ModelCloud in #2401
- [FIX] exllama_v1 kernel crash by @ZX-ModelCloud in #2402
- [FIX] test_chatglm by @ZX-ModelCloud in #2406
- set tokenicer>=0.0.6 by @CSY-ModelCloud in #2407
- Fix tokenizer_class incompatibility with transformers 5.0 by @juraev in #2403
- [FIX] model_test by @ZX-ModelCloud in #2410
- fixed ValueError: invalid pyproject.toml config: project.license. con… by @CSY-ModelCloud in htt...
GPT-QModel v5.6.12
Notable Changes:
- uv compat
- Both uv and pip installs will now display UI progress for external wheel/dependency downloads.
What's Changed
- [FIX] failed unittest by @ZX-ModelCloud in #2286
- fix wheel name mismatches with version name by @CSY-ModelCloud in #2288
- Setup download progress by @Qubitium in #2289
- Update latest news section in README.md by @Qubitium in #2290
Full Changelog: v5.6.10...v5.6.12
GPT-QModel v5.6.10
Notable Changes:
- Triton check by @Qubitium in #2274
- Fix bitblas support for gptq_v2 format by @xxxxyu in #2281
- Fix awq triton kernel has invalid properties by @Qubitium in #2279
What's Changed
- Add kernel selection log by @ZX-ModelCloud in #2275
- Update README.md by @Qubitium in #2276
- Update pypcre depend by @Qubitium in #2277
- Update version.py by @Qubitium in #2278
- Add macos unit tests by @CSY-ModelCloud in #2282
- Update README.md by @Qubitium in #2283
Full Changelog: v5.6.6...v5.6.10
GPT-QModel v5.6.8
Notable Changes:
What's Changed
- Add kernel selection log by @ZX-ModelCloud in #2275
- Update README.md by @Qubitium in #2276
Full Changelog: v5.6.6...v5.6.8
GPT-QModel v5.6.6
Notable Changes:
- Use static cuda ctx for triton kernel launch by @Qubitium in #2269
- Remove random-word depend by @LRL2-ModelCloud in #2266
- Update PyPcre depend from 0.2.7 to 0.2.8 by @Qubitium in #2267
What's Changed
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2265
- Update version.py by @Qubitium in #2268
- Ready 5.6.6 by @Qubitium in #2270
Full Changelog: v5.6.2...v5.6.6
GPT-QModel v5.6.4
What's Changed
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2265
- remove random-word depend by @LRL2-ModelCloud in #2266
- Update pypcre version from 0.2.7 to 0.2.8 by @Qubitium in #2267
- Update version.py by @Qubitium in #2268
Full Changelog: v5.6.2...v5.6.4
GPT-QModel v5.6.2
Notable Changes
- FIX JIT PyTorch extension pack_cpu_ext stall by @ZX-ModelCloud in #2248
- Refactor Kernel External Dependency Validation by @LRL2-ModelCloud in #2249
- FIX some models not honoring model.config.use_cache by force pass use_cache=false by @LRL2-ModelCloud in #2246
- FIX Incorrect Triton dequant_kernel for 3-bit GPTQ (INT3) leads to Triton compile error / wrong dequantization #2251 by @KingdalfGoodman in #2258
- Support llm-awq by @ZX-ModelCloud in #2252
What's Changed
- Update version.py by @Qubitium in #2247
- Update README.md by @davedgd in #2250
- [CI] add torch 2.9.1 by @CSY-ModelCloud in #2254
- Update license declaration in pyproject.toml by @CSY-ModelCloud in #2259
- Modify setup by @Qubitium in #2260
- Add release notes for version 5.6.2 by @Qubitium in #2261
- fix test_quant_formats.py by @LRL2-ModelCloud in #2262
- [CI] mount dataset dir to /monster/data/model/dataset by @CSY-ModelCloud in #2263
- fix parsing args by @CSY-ModelCloud in #2264
New Contributors
- @KingdalfGoodman made their first contribution in #2258
Full Changelog: v5.6.0...v5.6.2
GPT-QModel v5.6.0
Notable Changes:
- HF Kernel for CPU: AMX, AVX2, AVX512 optimized by @jiqing-feng in #2232
- Fix: Resolve performance regression during initial forward pass with offload_to_disk by @avtc in #2239
- Auto module tree by @LRL2-ModelCloud in #2204
- Afmoe support by @LRL2-ModelCloud in #2243
- Add dots1 by @Qubitium in #2231
What's Changed
- Update description and code about GPTAQ in README.md by @wayneguow in #2202
- Update test cases for qwen2.5-vl and qwen3-vl by @wayneguow in #2203
- Optimize minimax m2 modelling forward pass by @avtc in #2176
- remove gemm ipex by @LRL2-ModelCloud in #2206
- Bump actions/checkout from 5 to 6 in the github-actions group by @dependabot[bot] in #2207
- Update device-smi dependency version to 0.5.2 by @Qubitium in #2208
- Fix loading an AWQ-quantized model with GPTQModel when it is not actu… by @LRL2-ModelCloud in #2209
- fix exllama v2 post init by @LRL2-ModelCloud in #2211
- [FIX] Add fallback for "module_dir" and "entry key" lookup by @ZX-ModelCloud in #2210
- Update unit_tests.yml by @Qubitium in #2213
- fix mps backend does not implement float64 by @Qubitium in #2216
- [FIX] _apply_quant() not being called with awq by @ZX-ModelCloud in #2218
- Fix AWQ Extension by @LRL2-ModelCloud in #2217
- Auto AWQ kernel selection for Transformers compat by @Qubitium in #2214
- Fix add bias for torch_fuse by @jiqing-feng in #2223
- [CI] Add torch_fused test with Bias by @ZX-ModelCloud in #2222
- [FIX] device_map with cpu only causing CpuOffload hooks to be injected by @ZX-ModelCloud in #2225
- fix awq apply_scale and apply_clip multi thread issue by @LRL2-ModelCloud in #2224
- Fix CI test not passing by @Qubitium in #2226
- Monkeypatch lm-eval latest broken imports by @Qubitium in #2227
- make file can be pytest called by @CSY-ModelCloud in #2228
- CI Fix awq weight mean by @LRL2-ModelCloud in #2229
- fix pycharm auto imported wrong path by @CSY-ModelCloud in #2230
- [FIX] TorchFusedAwqQuantLinear selection by @ZX-ModelCloud in #2233
- [CI] update CI path by @CSY-ModelCloud in #2236
- [Model] Mistral3 support by @LRL2-ModelCloud in #2238
- Update setup.py by @Qubitium in #2240
- Increase MAX_JOBS from 4 to 8 in release.yml by @Qubitium in #2241
- [FIX] non-persistent buffer was saved incorrectly by @ZX-ModelCloud in #2242
New Contributors
- @wayneguow made their first contribution in #2202