Created FAQ page first draft #1896
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello @cajeonrh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new Frequently Asked Questions (FAQ) page within the "Getting Started" section of the documentation. The primary goal is to centralize answers to common user queries about LLM Compressor, thereby enhancing user self-service and clarity on topics such as model performance post-compression, integration with other tools like sglang, and practical guidance on compression strategies and memory requirements.
Code Review
This pull request adds a new FAQ page, which is a great addition to the documentation. The content is relevant and covers important user questions. I've identified a few areas for improvement, mainly related to Markdown link formatting, content clarity, and consistency. There are several instances of incorrect link syntax that need to be fixed across the document. I've also suggested consolidating a couple of redundant questions and using relative paths for internal links to improve maintainability.
Commits (including upstream changes pulled in from main during a rebase):

- Apply review suggestions to the FAQ draft (co-authored with gemini-code-assist[bot])
- Update the SpinQuant and QuIP examples to use `transform_block_size` and the latest R4 feature (vllm-project#1870)
- Emit a DeprecationWarning when `train` is called; fine-tuning support moves to the Axolotl integration
- Reduce flakiness in `tests/llmcompressor/transformers/compression/test_quantization.py:test_perplexity` by filtering out samples where fewer than 25% of tokens have training labels
- Add a Qwen3-Next FP8 quantization example
- Run `ruff format` before `ruff check --fix` so long files are formatted before linting
- Default the `examples/transform` scripts to `transform_type="hadamard"` and `transform_block_size=128` (vllm-project#1883)
- Add Qwen3-Next calibration support and an NVFP4 example
- Loosen the multimodal ignore regexes so prefixed `vision_tower` and `multi_modal_projector` layers are still ignored (vllm-project#1871)
- Linearize the Qwen3-VL MoE layer so the model can be quantized and run in vLLM (vllm-project#1874)
- Update docs links pending vllm-project#1886, vllm-project#1874, and vllm-project#1889
- Make recovery-based testing the default for lm-eval tests: compressed models must retain ≥95% (configurable) of base-model performance instead of meeting absolute thresholds (vllm-project#1750)
- Add a "What's new" section to the docs front page
- Fix the qwen_2_5_vl tests
- Bump dependency upper bounds to match recent releases
Force-pushed from 21cb4f0 to 34edf9d.
Thanks Cassie! Added a couple of suggestions below.
Also, some of your links are incorrectly formatted; the syntax should be `[link text](link url)`.
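For example, the memory-requirements link that appears later in this FAQ would be written as:

```markdown
Refer to [Memory Requirements for LLM Compressor](compress.md#memory-requirements-for-llm-compressor).
```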
Consolidating 2 similar FAQs into 1 and updating incorrect link formatting Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Incorporating feedback with wording Co-authored-by: Fynn Schmitt-Ulms <[email protected]>
Signed-off-by: Cassie Jeon <[email protected]>
Thanks Fynn! I've incorporated your feedback.
Looks good! One comment: add a note on multimodal models for question 5.
Could we add a quick question on installation? vLLM and llmcompressor should be used in separate environments, as they may have dependency mismatches.
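A minimal sketch of that setup (the environment names are illustrative):

```bash
# Keep compression and serving in separate virtual environments so that
# llmcompressor and vllm dependency pins cannot conflict.
python -m venv compress-env
compress-env/bin/pip install llmcompressor

python -m venv serve-env
serve-env/bin/pip install vllm
```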
The other common question we get asked is about multi-GPU support.
Can we add the following?
- LLM Compressor handles all GPU movement for you.
- For data-free pathways, we leverage all available GPUs and offload anything that doesn't fit onto the allocated GPUs. If using pathways that require data, we sequentially onload model layers onto a single GPU. This is the case for LLM Compressor 0.6-0.8.
…dback Signed-off-by: Cassie Jeon <[email protected]>
I've incorporated feedback, added more questions, and also added a FAQ box on the Getting Started page. Please let me know if I missed anything.
Looks great! Thanks for making those changes!
Looks like you need to fix DCO though. There are some instructions here: https://github.com/vllm-project/llm-compressor/pull/1896/checks?check_run_id=52066401360.
This is usually the case when loading your model through transformers rather than an inference server that supports the compressed-tensors format. Loading the model through transformers provides no inference benefit: forward passes are run with the model decompressed, and optimized inference with the compressed representation is not supported at runtime. Instead, run the model in vLLM or another inference server that supports optimized inference for quantized models.
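As a hedged sketch of serving such a checkpoint in vLLM (the model id below is a placeholder for any compressed-tensors checkpoint produced by LLM Compressor):

```python
from vllm import LLM, SamplingParams

# Placeholder id: substitute your own compressed checkpoint or Hub repo
llm = LLM(model="my-org/Meta-Llama-3-8B-Instruct-FP8-dynamic")

outputs = llm.generate(
    ["The Swiss Alps are"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```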
**2. Do we support sglang?**
Suggested change:
- **2. Do we support sglang?**
+ **2. Are models compressed using LLM Compressor supported with SGLang?**
I'm really not a fan of using casual pronouns like "us", "we", "my". This may sound pedantic, but speaking from personal experience contributing to other OSS repos, words like "we" have the effect of alienating open source contributors. LLM Compressor is owned by everyone; the Red Hat / LLM Compressor team helps to maintain and shepherd it.
There is minimal support for compressed-tensors models in sglang, but it is neither maintained nor tested by the LLM Compressor team, and much of the integration relies on vLLM. For the most up-to-date and tested integration, vLLM is recommended.
**3. How do I select the appropriate strategy for compression?**
Suggested change:
- **3. How do I select the appropriate strategy for compression?**
+ **3. How do I choose the right quantization scheme?**
We should add a section titled "Where can I learn more about LLM Compressor?" which links to talks we've given.
https://www.youtube.com/watch?v=caLYSZMVQ1c
https://www.youtube.com/watch?v=GrhuqQDmBk8
https://www.youtube.com/watch?v=WVenRmF4dPY
https://www.youtube.com/watch?v=G1WNlLxPLSE
Refer to [Memory Requirements for LLM Compressor](compress.md#memory-requirements-for-llm-compressor).
**5. What layers should be quantized?**
Suggested change:
- **5. What layers should be quantized?**
+ **5. Which model layers should be quantized?**
**7. Does LLM Compressor have multi-GPU support?**
LLM Compressor handles all GPU movement for you. For data-free pathways, all available GPUs are leveraged, and anything that doesn't fit is offloaded from the allocated GPUs. For pathways that require data, model layers are sequentially onloaded onto a single GPU. This is the case for LLM Compressor 0.6-0.8.
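As a minimal sketch of a data-free pathway (the model id and scheme are illustrative, and exact import paths may vary by release), the model only needs to be loaded with `device_map="auto"`; LLM Compressor manages placement from there:

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# device_map="auto" spreads weights across the available GPUs;
# anything that does not fit is offloaded automatically.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # example model id
    device_map="auto",
    torch_dtype="auto",
)

# FP8 dynamic quantization is data-free: no calibration set required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)
```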
We essentially do not have multi-GPU support right now.
SUMMARY:
Created a FAQ page under the "Getting Started" section
TEST PLAN:
Requesting review of content