Created FAQ page first draft #1896
Open
cajeonrh wants to merge 25 commits into vllm-project:main from cajeonrh:INFERENG-1867
+46 −0
Commits (25)
a1a5fda  Created FAQ page (cajeonrh)
5a95ce3  Update docs/getting-started/faq.md (cajeonrh)
190050e  Update docs/getting-started/faq.md (cajeonrh)
2a15992  Update docs/getting-started/faq.md (cajeonrh)
046ce57  Update docs/getting-started/faq.md (cajeonrh)
930a046  [Transforms] Update examples for R4 and `transform_block_size` option… (brian-dellabetta)
d1f31c2  [Training] Deprecate Training Surpport (#1882) (dsikka)
701b164  Improve stability of flaky perplexity test (#1884) (fynnsu)
ff37d39  [Qwen3Next] Add FP8 Quantization Example (#1886) (shanjiaz)
abd867a  Run `ruff format` twice in `make style` (#1893) (fynnsu)
e72b470  [transforms] update examples so hadacore kernel is used by default (#… (brian-dellabetta)
78e3189  [Qwen3Next] Add calibration support and NVFP4 Example (#1889) (dsikka)
34a4602  [examples] fix vision_tower/multi_modal_projector regexes (#1871) (brian-dellabetta)
5dd06ff  [Qwen3VLMoe] Add linearized definition and FP8 Quantization Example (… (dsikka)
d2840e8  Update README.md with Qwen3 Support (#1891) (dsikka)
3d11809  [Tests] Add recovery-based validation to LM-Eval tests (#1750) (rahul-tuli)
2e7bc94  v0.8.0 New in this release (#1892) (aireilly)
dc4470a  [Tests] Workaround qwen_2_5_vl (#1894) (kylesayrs)
34edf9d  Update upper bounds for some dependencies (#1890) (dhuangnm)
4600822  Merge branch 'main' into INFERENG-1867 (brian-dellabetta)
85ed1ed  Update docs/getting-started/faq.md (cajeonrh)
8758004  Update docs/getting-started/faq.md (cajeonrh)
a6e7ba6  Updating numbering of FAQ as 2 FAQs got merged. (cajeonrh)
198ce26  Added FAQ box to Getting Started page and updated FAQs to include fee… (cajeonrh)
286d259  Merge branch 'main' into INFERENG-1867 (cajeonrh)
docs/getting-started/faq.md
@@ -0,0 +1,29 @@
# LLM Compressor Frequently Asked Questions

Below are the most frequently asked questions about LLM Compressor. If you do not see your question here, please ask it by opening an issue: [LLM Compressor Issues](https://github.com/vllm-project/llm-compressor/issues).

**1. Why doesn't my model run any faster after I compress it?**

This usually happens when the model is loaded through transformers rather than an inference server that supports the compressed-tensors format. Loading the model through transformers provides no inference benefit: forward passes run on the decompressed model, and there is no optimized compressed-inference support at runtime. Instead, run the model in vLLM or another inference server that provides optimized inference for quantized models.
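
As an illustration, here is a minimal sketch of serving a compressed checkpoint with vLLM; the model path and prompt are placeholders for your own.

```python
from vllm import LLM, SamplingParams

# Placeholder path: point this at your own compressed-tensors checkpoint.
llm = LLM(model="./my-model-W8A8-compressed")

# vLLM reads the compressed-tensors config from the checkpoint and runs
# optimized quantized kernels during generation.
outputs = llm.generate(
    ["What does quantization do?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```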
**2. Do we support SGLang?**

There is minimal support for compressed-tensors models in SGLang, but it is neither maintained nor tested by our team, and much of the integration relies on vLLM. For the most up-to-date and tested integration, vLLM is recommended.
**3. How do I select the appropriate strategy for compression?**

Choosing a scheme depends on your hardware availability and inference requirements. Refer to the [Compression Schemes Guide](../guides/compression_schemes.md).
**4. How much memory or time will a given algorithm take with my model?**

Refer to [Memory Requirements for LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress/#memory-requirements-for-llm-compressor).
**5. What are the memory requirements?**

Refer to [Memory Requirements for LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress/#memory-requirements-for-llm-compressor).
**6. What layers should be quantized?**

All linear layers go through basic quantization except the `lm_head` layer. Because `lm_head` is the last layer of the model and is sensitive to quantization, quantizing it would noticeably impact the model's accuracy. For example, [this code snippet shows how to ignore the lm_head layer](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/llama3_example.py#L18).
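
Following the linked example, a minimal sketch of a recipe that quantizes all `Linear` layers while ignoring `lm_head` might look like this (the model ID and output directory are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer to FP8 (dynamic activations), but leave the
# accuracy-sensitive lm_head in its original precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"  # placeholder output dir
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```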
Mixture of Experts (MoE) models also contain quantization-sensitive components, such as the gate and routing layers, which are typically left unquantized. For example, [this code snippet shows how to ignore the gates](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/qwen_example.py#L60).
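
A sketch of extending the ignore list for an MoE model, assuming the routing modules match a name like `mlp.gate` (the exact module names and regex vary by architecture):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Keep lm_head and the MoE routing/gate layers unquantized; adjust the regex
# to match the gate module names of the specific MoE architecture.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],
)
```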