Created FAQ page first draft #1896

# Frequently Asked Questions

Below are the most frequently asked questions when using LLM Compressor. If you do not see your question here, please file an issue: [LLM Compressor Issues](https://github.com/vllm-project/llm-compressor/issues).

**1. Why doesn't my model run any faster after I compress it?**

This usually happens when the model is loaded through transformers rather than through an inference server that supports the compressed-tensors format. Loading the model through transformers provides no inference benefit: forward passes are run on the decompressed model, and there is no optimized compressed inference at runtime. Instead, run the model in vLLM or another inference server that supports optimized inference for quantized models.

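For example, once a checkpoint has been saved in the compressed-tensors format, it can be served with vLLM. A minimal sketch, assuming a locally saved checkpoint (the path below is a placeholder):

```python
from vllm import LLM, SamplingParams

# Placeholder path to a checkpoint saved by LLM Compressor
# in the compressed-tensors format.
llm = LLM(model="./Meta-Llama-3-8B-Instruct-FP8-Dynamic")

# vLLM runs the quantized weights with optimized kernels, which is
# where the speedup from compression actually shows up.
outputs = llm.generate(
    ["What does quantization do to a model?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```
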
**2. Does LLM Compressor support SGLang?**

There is minimal support for compressed-tensors models in SGLang, but it is neither maintained nor tested by our team. Much of the integration relies on vLLM. For the most up-to-date and tested integration, vLLM is recommended.

**3. How do I select the appropriate strategy for compression?**

Choosing a scheme involves understanding your hardware availability and inference requirements. Refer to the [Compression Schemes Guide](../guides/compression_schemes.md).

**4. What are the memory requirements for compression?**

Refer to [Memory Requirements for LLM Compressor](compress.md#memory-requirements-for-llm-compressor).

**5. What layers should be quantized?**

Typically, all linear layers are quantized except the `lm_head` layer. Because `lm_head` is the last layer of the model, it is particularly sensitive to quantization, and quantizing it can noticeably hurt the model's accuracy. For example, [this code snippet shows how to ignore the lm_head layer](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/llama3_example.py#L18).

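In recipe terms, this is done through the `ignore` argument. Below is a minimal sketch along the lines of the linked FP8 example; the model ID and save directory are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to FP8, but leave the sensitive lm_head
# in full precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
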
Mixture of Experts (MoE) models also contain components, such as gate and routing layers, that are sensitive to quantization. For example, [this code snippet shows how to ignore the gates](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/qwen_example.py#L60).

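The same `ignore` mechanism is used for MoE gates, typically via regex patterns. The patterns below are illustrative; the actual module names depend on the architecture, so check the linked example for the ones used with Qwen MoE models.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Skip lm_head plus the MoE gate/routing layers, which are small but
# accuracy-critical. The regexes here are examples; adjust them to the
# module names of your model.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",                # per-layer expert router
        "re:.*mlp.shared_expert_gate$",  # shared-expert gate, if present
    ],
)
```
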
Multimodal models (e.g., vision-language models) pair a language model with another component that handles image, audio, or video input alongside text. In these cases, the non-textual component is excluded from quantization, as it generally has fewer parameters and is more sensitive.

For more information, see [Quantizing Multimodal Audio Models](https://github.com/vllm-project/llm-compressor/tree/main/examples/multimodal_audio) and [Quantizing Multimodal Vision-Language Models](https://github.com/vllm-project/llm-compressor/tree/main/examples/multimodal_vision).

**6. What environment should be used for installing LLM Compressor?**

Install vLLM and LLM Compressor in separate environments, as their dependencies may conflict.

**7. Does LLM Compressor have multi-GPU support?**

LLM Compressor handles all GPU movement for you. For data-free pathways, it leverages all available GPUs and offloads anything that does not fit onto the allocated GPUs. For pathways that require calibration data, model layers are sequentially onloaded onto a single GPU. This behavior applies to LLM Compressor 0.6-0.8.

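For a data-requiring pathway, nothing extra is needed on your side: load the model normally and call `oneshot` with a calibration dataset, and LLM Compressor takes care of onloading layers to the GPU. A minimal sketch, loosely following the W4A16 examples (the model ID, dataset name, and sample counts are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ needs calibration data, so layers are onloaded to a single GPU
# one at a time during calibration; no manual device placement is needed.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",      # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
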
> Reviewer comment: We essentially do not have multi-GPU support right now.
>
> Reviewer comment: Is `dispatch_for_generation` used for data-free at the moment?