
Commit f4b78b7

[Docs] Adding 0.7.0 docs homepage updates (#1756)
Adding 0.7.0 docs homepage updates

Signed-off-by: Aidan Reilly <[email protected]>

1 parent e1f50f1

File tree

1 file changed: +9 -6 lines

docs/index.md

Lines changed: 9 additions & 6 deletions
@@ -15,18 +15,21 @@

## Recent Updates

+ !!! info "QuIP and SpinQuant-style Transforms"
+     The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow you to quantize models after injecting Hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low-bit weight and activation quantization.
+
!!! info "DeepSeekV3-style Block Quantization Support"
22+
Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8.md).
23+
24+
!!! info "FP4 Quantization - now with MoE and non-uniform support"
25+
Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
26+
1827
!!! info "Llama4 Quantization Support"
1928
Quantize a Llama4 model to [W4A16](examples/quantization_w4a16.md) or [NVFP4](examples/quantization_w4a16.md). The checkpoint produced can seamlessly run in vLLM.
2029

2130
!!! info "Large Model Support with Sequential Onloading"
2231
As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe.md).
2332

- !!! info "Preliminary FP4 Quantization Support"
-     Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4.md) and [fp4 activation support](examples/quantization_w4a4_fp4.md). Support is currently preliminary and additional support will be added for MoEs.
-
- !!! info "Updated AWQ Support"
-     Improved support for MoEs with better handling of larger models
-
!!! info "Axolotl Sparse Finetuning Integration"
    Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
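The transform-then-quantize flow described in the "QuIP and SpinQuant-style Transforms" entry above can be sketched roughly as follows. This is a minimal sketch, not the code from the linked examples: the `llmcompressor.modifiers.transform` import path, the `transform_type` argument, and the model ID are assumptions, and quip_example.py / spinquant_example.py remain the authoritative references.

```python
# Minimal sketch: inject Hadamard transforms, then quantize to W4A16.
# Import path, constructor arguments, and model ID are assumptions, not
# taken from the linked examples.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier  # assumed path

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

recipe = [
    # Rotations reduce weight/activation outliers before quantization,
    # improving accuracy recovery at low bit widths.
    QuIPModifier(transform_type="random-hadamard"),  # assumed argument
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)
model.save_pretrained("Llama-3.1-8B-Instruct-quip-w4a16", save_compressed=True)
```

`SpinQuantModifier` would be expected to occupy the same recipe slot ahead of the quantization modifier.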
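The DeepSeekV3-style block quantization entry is notable for being data-free. Below is a rough sketch under the assumption that the preset scheme is named `FP8_BLOCK` and using a placeholder Qwen3 checkpoint; the linked W8A8 example has the exact settings.

```python
# Data-free block quantization sketch; scheme name and model ID are assumptions.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-8B"  # placeholder
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Block-wise FP8 weight scales are computed directly from the weights,
# so no calibration dataset is passed to oneshot.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)
model.save_pretrained("Qwen3-8B-FP8-BLOCK", save_compressed=True)
```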
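For the FP4 entry, a hedged sketch of weight-and-activation quantization, assuming the preset scheme is named `NVFP4` and that a small registered calibration set suffices to fit activation scales; the dataset choice and sample counts below are placeholders rather than values from the linked examples.

```python
# NVFP4 weight+activation quantization sketch; scheme name, dataset, and
# sample counts are illustrative assumptions.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# FP4 activations need calibrated scales, so a small dataset is supplied.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",        # assumed calibration set
    max_seq_length=2048,
    num_calibration_samples=512,
)
model.save_pretrained("Llama-3.1-8B-Instruct-NVFP4", save_compressed=True)
```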
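The Llama4 entry points at the W4A16 and NVFP4 examples; the two recipe variants look roughly like the following, with everything else (model loading, the `oneshot` call, saving) following the same pattern as the sketches above. Scheme names are assumptions where noted.

```python
# Two recipe variants matching the Llama4 entry; swap one into the oneshot
# calls shown in the earlier sketches.
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

# W4A16: GPTQ solves for 4-bit weights against calibration data.
w4a16_recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# NVFP4: weights and activations in FP4 (scheme name assumed, as above).
nvfp4_recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
```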
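Finally, the sequential onloading entry describes how very large models fit on a single GPU. A sketch of how that workflow is typically driven, assuming the model is first loaded into CPU memory and that `oneshot` accepts an explicit `pipeline` selector (both assumptions; the Big Modeling with Sequential Onloading example is the authoritative reference).

```python
# Sequential onloading sketch: weights stay in system RAM and layers are
# moved to the GPU one at a time during calibration. device_map, pipeline,
# and the model ID are assumptions.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-72B-Instruct"  # placeholder large model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="cpu"  # keep weights off the GPU
)

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=512,
    pipeline="sequential",  # assumed flag; usually selected automatically
)
```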

0 commit comments
