Created FAQ page first draft #1896
Open
cajeonrh wants to merge 24 commits into vllm-project:main from cajeonrh:INFERENG-1867
Commits (24)
- `a1a5fda` Created FAQ page (cajeonrh)
- `5a95ce3` Update docs/getting-started/faq.md (cajeonrh)
- `190050e` Update docs/getting-started/faq.md (cajeonrh)
- `2a15992` Update docs/getting-started/faq.md (cajeonrh)
- `046ce57` Update docs/getting-started/faq.md (cajeonrh)
- `930a046` [Transforms] Update examples for R4 and `transform_block_size` option… (brian-dellabetta)
- `d1f31c2` [Training] Deprecate Training Surpport (#1882) (dsikka)
- `701b164` Improve stability of flaky perplexity test (#1884) (fynnsu)
- `ff37d39` [Qwen3Next] Add FP8 Quantization Example (#1886) (shanjiaz)
- `abd867a` Run `ruff format` twice in `make style` (#1893) (fynnsu)
- `e72b470` [transforms] update examples so hadacore kernel is used by default (#… (brian-dellabetta)
- `78e3189` [Qwen3Next] Add calibration support and NVFP4 Example (#1889) (dsikka)
- `34a4602` [examples] fix vision_tower/multi_modal_projector regexes (#1871) (brian-dellabetta)
- `5dd06ff` [Qwen3VLMoe] Add linearized definition and FP8 Quantization Example (… (dsikka)
- `d2840e8` Update README.md with Qwen3 Support (#1891) (dsikka)
- `3d11809` [Tests] Add recovery-based validation to LM-Eval tests (#1750) (rahul-tuli)
- `2e7bc94` v0.8.0 New in this release (#1892) (aireilly)
- `dc4470a` [Tests] Workaround qwen_2_5_vl (#1894) (kylesayrs)
- `34edf9d` Update upper bounds for some dependencies (#1890) (dhuangnm)
- `4600822` Merge branch 'main' into INFERENG-1867 (brian-dellabetta)
- `85ed1ed` Update docs/getting-started/faq.md (cajeonrh)
- `8758004` Update docs/getting-started/faq.md (cajeonrh)
- `a6e7ba6` Updating numbering of FAQ as 2 FAQs got merged. (cajeonrh)
- `198ce26` Added FAQ box to Getting Started page and updated FAQs to include fee… (cajeonrh)
docs/getting-started/faq.md (new file)
# Frequently Asked Questions

Below are the most frequently asked questions about using LLM Compressor. If you do not see your question here, please file an issue: [LLM Compressor Issues](https://github.com/vllm-project/llm-compressor/issues).
**1. Why doesn't my model run any faster after I compress it?**

This usually happens when the model is loaded through transformers rather than through an inference server that supports the compressed-tensors format. Loading the model through transformers provides no inference benefit: forward passes run on the decompressed model, and there is no optimized compressed inference at runtime. Instead, run the model in vLLM or another inference server that supports optimized inference for quantized models.
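For example, a compressed-tensors checkpoint can be served directly with vLLM. The snippet below is only a minimal sketch; the model ID is a placeholder for any compressed checkpoint produced by LLM Compressor.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint ID: substitute any compressed-tensors model
# produced by LLM Compressor (a local path or a Hugging Face Hub ID).
llm = LLM(model="your-org/Meta-Llama-3-8B-Instruct-FP8-Dynamic")

# vLLM keeps the weights in their compressed form and uses optimized kernels.
outputs = llm.generate(
    ["What does compression change about inference?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```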
**2. Do we support SGLang?**

There is minimal support for compressed-tensors models in SGLang, but it is neither maintained nor tested by our team, and much of that integration relies on vLLM. For the most up-to-date and tested integration, vLLM is recommended.
**3. How do I select the appropriate strategy for compression?**

This involves understanding your hardware availability and inference requirements. Refer to [Compression Schemes Guide](../guides/compression_schemes.md).
**4. What are the memory requirements for compression?**

Refer to [Memory Requirements for LLM Compressor](compress.md#memory-requirements-for-llm-compressor).
**5. What layers should be quantized?**

Typically, all linear layers are quantized except the `lm_head` layer. Because `lm_head` is the last layer of the model, it is particularly sensitive to quantization, and quantizing it can noticeably hurt accuracy. For example, [this code snippet shows how to ignore the lm_head layer](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/llama3_example.py#L18).
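As a rough sketch of that pattern (the model ID and save path below are placeholders, and exact import paths can shift between LLM Compressor releases), an FP8 recipe that leaves `lm_head` unquantized looks roughly like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer to FP8, but leave the sensitive lm_head alone.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic")
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic")
```

The saved directory can then be served with vLLM, as in question 1.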
Mixture of Experts (MoE) models are also sensitive to quantization because of components such as their gate and routing layers. For example, [this code snippet shows how to ignore the gates](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/qwen_example.py#L60).
Multimodal models (e.g., vision-language models) pair a language model with another component that handles image, audio, or video input in addition to text. In these cases, the non-textual component is excluded from quantization, as it generally has fewer parameters and is more sensitive to quantization.

For more information, see [Quantizing Multimodal Audio Models](https://github.com/vllm-project/llm-compressor/tree/main/examples/multimodal_audio) and [Quantizing Multimodal Vision-Language Models](https://github.com/vllm-project/llm-compressor/tree/main/examples/multimodal_vision).
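In both of these cases, the sensitive modules are typically listed in the recipe's `ignore` field, often using `re:` regex patterns. The patterns below are purely illustrative; the real module names depend on the model architecture, so inspect the model's module tree before reusing them.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Illustrative ignore list: skip the lm_head, the MoE gate/router layers, and
# the vision components of a vision-language model. The regexes are
# hypothetical and must be adapted to the specific model's module names.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",                # MoE router / gate projections
        "re:visual.*",                   # vision tower
        "re:.*multi_modal_projector.*",  # vision-to-text projector
    ],
)
```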
**6. What environment should be used for installing LLM Compressor?**

vLLM and LLM Compressor should be installed in separate environments, as their dependencies may conflict.
**7. Does LLM Compressor have multi-GPU support?**

LLM Compressor handles all GPU movement for you. For data-free pathways, it uses all available GPUs and offloads anything that does not fit onto the allocated GPUs. For pathways that require calibration data, model layers are onloaded sequentially onto a single GPU. This behavior applies to LLM Compressor 0.6 through 0.8.
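In practice this means no manual device placement is needed: the model is loaded as usual and LLM Compressor decides where layers live. The sketch below uses a placeholder model ID and dataset name, and its `oneshot` parameters reflect common llmcompressor example usage, which may differ between releases.

```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Placeholder model ID; the model is loaded normally, with no .to("cuda") calls.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype="auto"
)

# A data-dependent pathway (GPTQ needs calibration samples): LLM Compressor
# onloads one layer at a time onto a single GPU while calibrating.
oneshot(
    model=model,
    dataset="open_platypus",  # placeholder calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
)
```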