25 commits
- `a1a5fda` Created FAQ page (cajeonrh, Oct 2, 2025)
- `5a95ce3` Update docs/getting-started/faq.md (cajeonrh, Oct 2, 2025)
- `190050e` Update docs/getting-started/faq.md (cajeonrh, Oct 2, 2025)
- `2a15992` Update docs/getting-started/faq.md (cajeonrh, Oct 2, 2025)
- `046ce57` Update docs/getting-started/faq.md (cajeonrh, Oct 2, 2025)
- `930a046` [Transforms] Update examples for R4 and `transform_block_size` option… (brian-dellabetta, Sep 30, 2025)
- `d1f31c2` [Training] Deprecate Training Surpport (#1882) (dsikka, Sep 30, 2025)
- `701b164` Improve stability of flaky perplexity test (#1884) (fynnsu, Sep 30, 2025)
- `ff37d39` [Qwen3Next] Add FP8 Quantization Example (#1886) (shanjiaz, Oct 1, 2025)
- `abd867a` Run `ruff format` twice in `make style` (#1893) (fynnsu, Oct 1, 2025)
- `e72b470` [transforms] update examples so hadacore kernel is used by default (#… (brian-dellabetta, Oct 1, 2025)
- `78e3189` [Qwen3Next] Add calibration support and NVFP4 Example (#1889) (dsikka, Oct 1, 2025)
- `34a4602` [examples] fix vision_tower/multi_modal_projector regexes (#1871) (brian-dellabetta, Oct 1, 2025)
- `5dd06ff` [Qwen3VLMoe] Add linearized definition and FP8 Quantization Example (… (dsikka, Oct 1, 2025)
- `d2840e8` Update README.md with Qwen3 Support (#1891) (dsikka, Oct 1, 2025)
- `3d11809` [Tests] Add recovery-based validation to LM-Eval tests (#1750) (rahul-tuli, Oct 1, 2025)
- `2e7bc94` v0.8.0 New in this release (#1892) (aireilly, Oct 1, 2025)
- `dc4470a` [Tests] Workaround qwen_2_5_vl (#1894) (kylesayrs, Oct 1, 2025)
- `34edf9d` Update upper bounds for some dependencies (#1890) (dhuangnm, Oct 1, 2025)
- `4600822` Merge branch 'main' into INFERENG-1867 (brian-dellabetta, Oct 2, 2025)
- `85ed1ed` Update docs/getting-started/faq.md (cajeonrh, Oct 2, 2025)
- `8758004` Update docs/getting-started/faq.md (cajeonrh, Oct 2, 2025)
- `a6e7ba6` Updating numbering of FAQ as 2 FAQs got merged. (cajeonrh, Oct 2, 2025)
- `198ce26` Added FAQ box to Getting Started page and updated FAQs to include fee… (cajeonrh, Oct 6, 2025)
- `286d259` Merge branch 'main' into INFERENG-1867 (cajeonrh, Oct 7, 2025)

38 changes: 38 additions & 0 deletions docs/getting-started/faq.md
@@ -0,0 +1,38 @@
# Frequently Asked Questions

Below are the most frequently asked questions about using LLM Compressor. If you do not see your question answered here, please file an issue: [LLM Compressor Issues](https://github.com/vllm-project/llm-compressor/issues).

**1. Why doesn't my model run any faster after I compress it?**

This usually happens when the model is loaded through transformers rather than an inference server that supports the compressed-tensors format. Loading through transformers provides no inference benefit: the weights are decompressed and forward passes run in the original precision, with no optimized compressed inference at runtime. Instead, run the model in vLLM or another inference server that provides optimized inference for quantized models.
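
For example, serving a checkpoint saved by LLM Compressor with vLLM's offline API can be as simple as the sketch below (the model path is a placeholder for your own compressed checkpoint):

```python
from vllm import LLM, SamplingParams

# vLLM detects the compressed-tensors config in the checkpoint and serves
# the model with its optimized quantized kernels.
llm = LLM(model="./Meta-Llama-3-8B-Instruct-FP8-Dynamic")  # placeholder path
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What does LLM Compressor do?"], params)
print(outputs[0].outputs[0].text)
```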

**2. Do we support sglang?**

Suggested change:
**2. Do we support sglang?**
**2. Are models compressed using LLM Compressor supported with SGLang?**


There is minimal support for compressed-tensors models in SGLang, but it is neither maintained nor tested by our team, and much of the integration relies on vLLM. For the most up-to-date and tested integration, vLLM is recommended.

**3. How do I select the appropriate strategy for compression?**

Suggested change:
**3. How do I select the appropriate strategy for compression?**
**3. How do I choose the right quantization scheme?**


Choosing a scheme depends on your hardware availability and your inference requirements. Refer to the [Compression Schemes Guide](../guides/compression_schemes.md).
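
As a rough sketch of how that choice shows up in a recipe, assuming the preset scheme names below fit your hardware, the scheme is simply passed to the quantization modifier:

```python
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

# Weight-only INT4 (W4A16): reduces memory footprint, runs on most GPUs,
# but requires calibration data for GPTQ.
w4a16_recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Dynamic FP8 weights and activations: data-free, but needs FP8-capable
# hardware (e.g. NVIDIA Ada/Hopper or newer).
fp8_recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
```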

**4. What are the memory requirements for compression?**

Refer to [Memory Requirements for LLM Compressor](compress.md#memory-requirements-for-llm-compressor).

**5. What layers should be quantized?**

Suggested change:
**5. What layers should be quantized?**
**5. Which model layers should be quantized?**


Typically, all linear layers are quantized except the `lm_head` layer. Because `lm_head` is the final layer of the model, it is especially sensitive to quantization, and quantizing it would impact the model's accuracy. For example, [this code snippet shows how to ignore the `lm_head` layer](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/llama3_example.py#L18).
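
A minimal sketch of that pattern (the model ID and save directory are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer to FP8, but keep the sensitive lm_head layer
# in full precision via the `ignore` list.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```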

Mixture of Experts (MoE) models also contain components that are sensitive to quantization, such as the gate and routing layers. For example, [this code snippet shows how to ignore the gates](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/qwen_example.py#L60).
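
Extending the same `ignore` list for an MoE model might look like the sketch below; the regex is illustrative, since the exact router/gate module names vary by architecture:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Keep both lm_head and the expert router/gate layers in full precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],  # "re:" marks a regex pattern
)
```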

Multimodal models (e.g., vision-language models) pair a language model with another component that handles image, audio, or video input in addition to text. In these cases, the non-textual component is typically excluded from quantization, as it generally has fewer parameters and is more sensitive to quantization.

For more information, see [Quantizing Multimodal Audio Models](https://github.com/vllm-project/llm-compressor/tree/main/examples/multimodal_audio) and [Quantizing Multimodal Vision-Language Models](https://github.com/vllm-project/llm-compressor/tree/main/examples/multimodal_vision).

**6. What environment should be used for installing LLM Compressor?**

vLLM and LLM Compressor should be installed in separate environments, as their dependencies may conflict.

**7. Does LLM Compressor have multi-GPU support?**

LLM Compressor handles all GPU placement for you. For data-free pathways, all available GPUs are used, and anything that does not fit is offloaded from the allocated GPUs. For pathways that require calibration data, model layers are onloaded sequentially onto a single GPU. This behavior applies to LLM Compressor 0.6-0.8.

Collaborator comment: We essentially do not have multi-GPU support right now.

#1809


8 changes: 8 additions & 0 deletions docs/getting-started/index.md
@@ -38,4 +38,12 @@ Follow the guides below to get started with LLM Compressor and optimize your mod

[:octicons-arrow-right-24: Deployment Guide](deploy.md)

- :material-rocket-launch:{ .lg .middle } FAQ

---

Get answers to the most frequently asked questions about LLM Compressor.

[:octicons-arrow-right-24: FAQ](faq.md)

</div>