Releases: ModelCloud/GPTQModel

GPT-QModel v6.0.3

02 Apr 23:56
6a65d69

Notable Changes:

Quantization and inference

  • Major ParoQuant improvements to quantization speed, inference performance, and accuracy.
  • Added Paro inference support and a new layer optimizer.
  • Auto-enables AMP for the fast Paro implementation to better match reference behavior.
  • Added Paro rotation autotuning and fixed BF16 rotation support for the fused CUDA kernel.
  • Improved Paro stability with seeding fixes, cleanup, learned channel scale clamping, and contiguous tensor handling fixes.
  • Fixed a layer output replay/re-capture regression.
  • Added FOEM (First-Order Error Matters) for more accurate quantized LLM compensation, plus follow-up fixes to its data processing pipeline.
  • Replaced the old marlin_fp16 backend behavior with environment-flag control for FP32 reduction.
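The environment-flag change above can be sketched generically. Note that the release notes do not state the actual variable name; `GPTQMODEL_MARLIN_FP32` below is hypothetical, used only to illustrate the common pattern of gating a behavior (here, FP32 reduction) behind an environment flag:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean on/off flag."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# Hypothetical flag name for illustration only; the release notes do not
# name the actual variable that controls FP32 reduction.
use_fp32_reduction = env_flag("GPTQMODEL_MARLIN_FP32", default=True)
```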

Model and backend support

  • Added support for Gemma4, MiniCPMO, MiniCPMV, and GLM4-MoE-Lite.
  • Added PrismML/Bonsai model support for inference.
  • Fixed Qwen3_5QModel definition issues.
  • Fixed Qwen 3.5 rotary embedding behavior.
  • Fixed AWQ layer grouping for qwen3_5_moe, llama4, qwen2_moe, and qwen3_next.
  • Fixed awq_processor.dynamic so skipped layers are handled correctly.
  • Improved dtype compatibility.
  • Hugging Face kernels are now gated off on Python no-GIL builds until upstream wheel support is fixed.

Evaluation, calibration, and usability

  • Integrated evaluation into the workflow.
  • Added evaluation backends for vLLM and SGLang.
  • Fixed SGLang evaluation engine initialization.
  • Automatically determines MODEL_COMPAT_FAST_LAYER_COUNT.
  • Improved calibration data device handling.
  • Updated tokenizer handling; collation now respects the tokenizer's padding_side.
  • Improved import performance by lazy-loading _DEVICE_THREAD_POOL.
  • Cleaned up warning behavior and added an option to suppress warnings.
  • Removed forced random seed overrides.
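The lazy-loading change above follows a standard Python pattern: defer construction of an expensive module-level object until first use so that importing the package stays cheap. A minimal sketch of the idea, not GPTQModel's actual code (the pool size and accessor name are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

_DEVICE_THREAD_POOL = None  # created on first use, not at import time

def get_device_thread_pool() -> ThreadPoolExecutor:
    """Create the shared thread pool lazily, on first access only."""
    global _DEVICE_THREAD_POOL
    if _DEVICE_THREAD_POOL is None:
        _DEVICE_THREAD_POOL = ThreadPoolExecutor(max_workers=4)
    return _DEVICE_THREAD_POOL
```

Every caller goes through the accessor, so the import itself pays no thread-pool startup cost.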

Dependency and compatibility updates

  • Updated pypcre to 0.2.14.
  • Pinned logbar to >=0.4.1.
  • Updated transformers and defuser package versions.
  • Fixed SAVE_PATH handling and import path resolution issues.
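The version updates above correspond to requirement specifiers along these lines (package names taken from the notes; the exact constraints in the project's requirements files may differ):

```
pypcre==0.2.14
logbar>=0.4.1
```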

Breaking and removed

  • Removed GPTQModel.upload_to_hub().
  • Removed MLX export support.

What's Changed

GPT-QModel v5.8.0

19 Mar 16:35
9980f01

Notable Changes

  • Transformers 5.3.0 compatibility.

  • Video Quantization Support

    • Added support for video input during quantization.
  • MoE & Model Support

    • Added support for Qwen 3.5 and Qwen 3.5 MoE.
    • Expanded compatibility for Qwen 3 variants including MoE / VL / Omni / Next.
    • Added support for LLada2 block diffusion LLM models.
    • Improved compatibility for Mixtral, Phi-4, Nemotron Ultra, BaiChuan, ChatGLM, Yi, and GLM4V.
    • Fixed multiple MoE-specific AWQ and multi-GPU issues, including routing, module tree, position embeddings, and device mismatches.
  • AWQ / GPTQ Kernels

    • Added CPU fused AWQ kernels for torch_fused and hf_kernel.
    • Added torch_int8 AWQ kernel.
    • Added BitBLAS AWQ kernel.
    • Ported Intel int8 GPTQ/AWQ kernels.
    • Updated kernel selection to prefer HF kernels where they provide the best performance and compatibility.
    • Added BitBLAS fallback protection and fixed BitBLAS accuracy and qzero remap regressions.
  • Quantization Improvements

    • Replaced greedy search with ternary search in SmoothBSE.
    • Fixed overly aggressive clipping in SmoothMAD.
    • Added layer-level dynamic skip for fast quantization.
    • Added early stop when all remaining layers are skipped during quantization.
    • Fixed AWQ OOM and dequantization-related issues.
  • Runtime & Dequantization

    • Added optional CPU int64 g_idx cache for TorchQuantLinear dequantization.
    • Improved TorchFused dequantization and fp32 dtype support.
    • Removed unnecessary symmetric handling in dequantize_gemm.
    • Fixed rotary embedding device mismatch by storing per-device rotary copies.
    • Added warmup protection for threaded timing.
  • Defuser Integration

    • Integrated defuser.convert_hf_model().
    • Integrated defuser.materialize_model().
    • Integrated defuser.replace_fused_blocks().
    • Improved defuser meta/offload compatibility and fused block handling.
  • Compatibility Fixes

    • Improved compatibility with older and newer Hugging Face Transformers / Optimum versions.
    • Fixed import compatibility issues in models/utils.
    • Fixed rotary / embedding config compatibility with older HF and model variants.
    • Improved tokenizer and model compatibility related to tokenicer.
    • Fixed OSS compatibility issues.
  • Kernel / Backend Changes

    • Hard deprecated ExLLaMA v1 kernel.
    • Exposed the Triton patcher as an externally callable API.
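The SmoothBSE change above swaps a greedy scan for ternary search, which finds the minimum of a unimodal 1-D objective in logarithmically many evaluations instead of stepping through every candidate. A generic sketch of the technique; the quadratic objective below is a placeholder, not the real SmoothBSE cost:

```python
def ternary_search_min(f, lo: float, hi: float, iters: int = 100) -> float:
    """Locate the minimizer of a unimodal function f on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2  # the minimum lies in [lo, m2]
        else:
            lo = m1  # the minimum lies in [m1, hi]
    return (lo + hi) / 2.0

# Placeholder unimodal cost with its minimum at x = 1.5.
best = ternary_search_min(lambda x: (x - 1.5) ** 2, 0.0, 4.0)
```

Each iteration discards a third of the interval, so the search interval shrinks by a factor of (2/3) per step regardless of how expensive each cost evaluation is.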

What's Changed

GPT-QModel v5.7.0

10 Feb 10:09
ed96f2e

Notable Changes:

What's Changed

GPT-QModel v5.6.12

17 Dec 11:28
1a19cd0

Notable Changes:

  • uv compatibility.
  • Both uv and pip installs now display UI progress for external wheel/dependency downloads.

What's Changed

Full Changelog: v5.6.10...v5.6.12

GPT-QModel v5.6.10

16 Dec 10:13
70a507d

Notable Changes:

What's Changed

New Contributors

Full Changelog: v5.6.6...v5.6.10

GPT-QModel v5.6.8

16 Dec 04:11
711b214

Notable Changes:

What's Changed

Full Changelog: v5.6.6...v5.6.8

v5.6.6

15 Dec 10:35
9a79b62

Notable Changes:

What's Changed

Full Changelog: v5.6.2...v5.6.6

GPT-QModel v5.6.4

15 Dec 08:27
61e5e7f

What's Changed

Full Changelog: v5.6.2...v5.6.4

GPT-QModel v5.6.2

12 Dec 10:04
d97478f

Notable Changes

What's Changed

New Contributors

Full Changelog: v5.6.0...v5.6.2

GPT-QModel v5.6.0

09 Dec 11:53
b63b373

Notable Changes:

What's Changed

New Contributors