Commit 74aa01e

dsikka and yiliu30 authored and committed

[Docs] Clean-up + Example ReadMe updates (vllm-project#2399)

SUMMARY:
- Remove marlin24 examples
- Clean up existing README docs
- Add examples/README.md file explaining repo structure
- Update MoE README.md

Signed-off-by: yiliu30 <yi4.liu@intel.com>

1 parent f8880a2 commit 74aa01e

File tree

22 files changed: +202 −351 lines changed

docs/.nav.yml

Lines changed: 4 additions & 1 deletion

```diff
@@ -32,8 +32,11 @@ nav:
   - Memory Requirements: guides/memory.md
   - Runtime Performance: guides/runtime.md
   - Examples:
-    - examples/index.md
+    - examples/README.md
     - examples/*
+  - Experimental:
+    - experimental/README.md
+    - experimental/*
   - Developer:
     - developer/index.md
     - developer/*
```

docs/api/index.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -19,4 +19,4 @@ oneshot(
 ```
 
 For advanced usage, you can configure individual modifiers and apply them directly to models.
-See the [Examples](../examples/index.md) section for detailed usage patterns.
+See the [Examples](https://github.com/vllm-project/llm-compressor/tree/main/examples) section for detailed usage patterns.
````

docs/examples/index.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

docs/scripts/gen_files.py

Lines changed: 45 additions & 0 deletions

```diff
@@ -82,6 +82,16 @@ def migrate_examples():
     examples_path = project_root / "examples"
     files = []
 
+    # Add the main examples README.md
+    main_readme = examples_path / "README.md"
+    if main_readme.exists():
+        files.append(
+            ProcessFile(
+                root_path=main_readme.relative_to(project_root),
+                docs_path=Path("examples/README.md"),
+            )
+        )
+
     # Find all README.md files 2 levels down (examples/EXAMPLE_NAME/README.md)
     for example_dir in examples_path.iterdir():
         if (
@@ -101,6 +111,40 @@ def migrate_examples():
     process_files(files, project_root)
 
 
+def migrate_experimental():
+    project_root = find_project_root()
+    experimental_path = project_root / "experimental"
+    files = []
+
+    # Add the main experimental README.md
+    main_readme = experimental_path / "README.md"
+    if main_readme.exists():
+        files.append(
+            ProcessFile(
+                root_path=main_readme.relative_to(project_root),
+                docs_path=Path("experimental/README.md"),
+            )
+        )
+
+    # Find all README.md files 2 levels down (experimental/EXPERIMENTAL_NAME/README.md)
+    for experimental_dir in experimental_path.iterdir():
+        if (
+            not experimental_dir.is_dir()
+            or not (readme_path := experimental_dir / "README.md").exists()
+        ):
+            continue
+
+        experimental_name = experimental_dir.name
+        files.append(
+            ProcessFile(
+                root_path=readme_path.relative_to(project_root),
+                docs_path=Path(f"experimental/{experimental_name}.md"),
+            )
+        )
+
+    process_files(files, project_root)
+
+
 def migrate_readme_to_index():
     """Copy README.md files to index.md for MkDocs compatibility.
@@ -127,4 +171,5 @@ def migrate_readme_to_index():
 
 migrate_developer_docs()
 migrate_examples()
+migrate_experimental()
 migrate_readme_to_index()
```
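The README-discovery pattern used by `migrate_experimental()` above can be sketched in a self-contained form. This is an illustrative stand-in, not the repo's code: `find_readmes` and the demo directory names are hypothetical, and the real function wraps each path in a `ProcessFile` instead of returning it directly.

```python
# Sketch of the one-level README discovery in migrate_experimental():
# collect root/README.md, then each immediate subdirectory's README.md.
import tempfile
from pathlib import Path


def find_readmes(root: Path) -> list[Path]:
    """Collect root/README.md plus each subdirectory's README.md."""
    found = []
    if (main := root / "README.md").exists():
        found.append(main)
    for sub in sorted(root.iterdir()):
        # Skip plain files and subdirectories without a README,
        # mirroring the walrus-operator check in the diff above.
        if not sub.is_dir() or not (readme := sub / "README.md").exists():
            continue
        found.append(readme)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "experimental"
    (root / "demo_feature").mkdir(parents=True)
    (root / "empty_dir").mkdir()  # no README -> skipped
    (root / "README.md").write_text("# Experimental\n")
    (root / "demo_feature" / "README.md").write_text("# Demo\n")
    names = [p.relative_to(root).as_posix() for p in find_readmes(root)]
    print(names)  # ['README.md', 'demo_feature/README.md']
```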

examples/README.md

Lines changed: 33 additions & 0 deletions

```diff
@@ -0,0 +1,33 @@
+---
+weight: -4
+---
+
+# LLM Compressor Examples
+
+The LLM Compressor examples are organized primarily by quantization scheme. Each folder contains model-specific examples showing how to apply that quantization scheme to a particular model.
+
+Some examples are additionally grouped by model type, such as:
+- `multimodal_audio`
+- `multimodal_vision`
+- `quantizing_moe`
+
+Other examples are grouped by algorithm, such as:
+- `awq`
+- `autoround`
+
+## How to find the right example
+
+- If you are interested in quantizing a specific model, start by browsing the model-type folders (for example, `multimodal_audio`, `multimodal_vision`, or `quantizing_moe`).
+- If you don’t see your model there, decide which quantization scheme you want to use (e.g., FP8, FP4, INT4, INT8, or KV cache / attention quantization) and look in the corresponding `quantization_***` folder.
+- Each quantization scheme folder contains at least one LLaMA 3 example, which can be used as a general reference for other models.
+
+## Where to start if you’re unsure
+
+If you’re unsure which quantization scheme to use, a good starting point is a data-free pathway, such as `w8a8_fp8`, found under `quantization_w8a8_fp8`. For more details on available schemes and when to use them, see the Compression Schemes [guide](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/compression_schemes/).
+
+## Need help?
+
+If you don’t see your model or aren’t sure which quantization scheme applies, feel free to open an issue and someone from the community will be happy to help.
+
+!!! note
+    We are currently updating and improving our documentation and examples structure. Feedback is very welcome during this transition.
```
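For orientation, a data-free FP8-dynamic pathway like `w8a8_fp8` is typically expressed as a one-modifier recipe. The fragment below is a hedged sketch of what such a recipe can look like, not the exact file shipped in `quantization_w8a8_fp8`; the stage name and argument values here are illustrative, so check that folder for the authoritative version.

```yaml
# Illustrative recipe sketch for a data-free FP8-dynamic scheme.
# Stage name and values are assumptions; see quantization_w8a8_fp8
# in this repo for the exact recipe.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: "FP8_DYNAMIC"
```

Because the scheme is dynamic and weight-only to calibrate, no calibration dataset is needed, which is why it makes a good first attempt.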

examples/awq/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-# Quantizing Models with Activation-Aware Quantization (AWQ) #
+# AWQ Quantization #
 
 Activation Aware Quantization (AWQ) is a state-of-the-art technique to quantize the weights of large language models which involves using a small calibration dataset to calibrate the model. The AWQ algorithm utilizes calibration data to derive scaling factors which reduce the dynamic range of weights while minimizing accuracy loss to the most salient weight values.
 
```
examples/big_models_with_sequential_onloading/README.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,4 +1,5 @@
-# Big Modeling with Sequential Onloading #
+# Big Model Quantization with Sequential Onloading
+
 ## What is Sequential Onloading? ##
 Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into memory—which can easily require hundreds of gigabytes—this method loads and compresses one layer at a time. The outputs are offloaded before the next layer is processed, dramatically reducing peak memory usage while maintaining high compression fidelity.
 
```

examples/model_free_ptq/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-# Quantizing models without a model definition
+# Model-free Quantization
 
 `model_free_ptq` provides a PTQ pathway for data-free schemes (such as FP8 Dynamic Per Token or FP8 Block). Specifically, this pathway removes the requirement for a model definition or the need to load the model through transformers. If you are interested in applying a data-free scheme, there are two key scenarios in which applying this pathway may make sense for your model:
 
```

examples/multimodal_audio/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-# Quantizing Multimodal Audio Models #
+# Multimodal Audio Model Quantization
 
 https://github.com/user-attachments/assets/6732c60b-1ebe-4bed-b409-c16c4415dff5
 
```

examples/multimodal_vision/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-# Quantizing Multimodal Vision-Language Models #
+# Multimodal Vision-Language Quantization #
 
 <p align="center" style="text-align: center;">
     <img src=http://images.cocodataset.org/train2017/000000231895.jpg alt="sample image from MS COCO dataset"/>
```

0 commit comments
