Commit 83d2001

dsikka and claude authored
[Docs] Add Entrypoints section to User Guides (#2518)
## Summary

Adds a new **Entrypoints** section under User Guides with detailed documentation for both PTQ entrypoints:

- **Entrypoints overview** (`guides/entrypoints/index.md`) — decision table comparing `oneshot` vs `model_free_ptq` to help users choose the right entrypoint
- **oneshot** (`guides/entrypoints/oneshot.md`) — full lifecycle (preprocessing, calibration, postprocessing), all arguments organized by category (model, recipe, dataset, pipeline, misc), calibration pipeline descriptions, and examples for FP8 data-free, GPTQ W4A16, and Llama4 MoE NVFP4 with a proper ignore list
- **model_free_ptq** (`guides/entrypoints/model-free-ptq.md`) — when to use (data-free schemes, no transformers definition, oneshot fallback), how it works internally (file-by-file safetensors processing), standard flow vs NVFP4 microscale flow (with `reindex_fused_weights`), ignore patterns, and supported schemes table

Also updates `.nav.yml` to nest the three pages under `Entrypoints` in User Guides.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
1 parent a9e2488 commit 83d2001

File tree

4 files changed, +381 −2 lines changed

docs/.nav.yml

Lines changed: 6 additions & 2 deletions
```diff
@@ -30,13 +30,17 @@ nav:
       - key-models/mistral-large-3/index.md
       - FP8 Example: key-models/mistral-large-3/fp8-example.md
   - User Guides:
+      - Entrypoints:
+          - guides/entrypoints/index.md
+          - oneshot: guides/entrypoints/oneshot.md
+          - model-free-ptq: guides/entrypoints/model-free-ptq.md
+      - Compression Schemes: guides/compression_schemes.md
+      - Observers: guides/observers.md
       - Big Models and Distributed Support:
           - Model Loading: guides/big_models_and_distributed/model_loading.md
           - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
           - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
-      - Compression Schemes: guides/compression_schemes.md
       - Saving a Compressed Model: guides/saving_a_model.md
-      - Observers: guides/observers.md
       - Memory Requirements: guides/memory.md
       - Runtime Performance: guides/runtime.md
   - Examples:
```

docs/guides/entrypoints/index.md

Lines changed: 26 additions & 0 deletions
# Entrypoints

LLM Compressor provides two entrypoints for post-training quantization (PTQ), each suited to different scenarios.

## Choosing an Entrypoint

| | [`oneshot`](oneshot.md) | [`model_free_ptq`](model-free-ptq.md) |
|---|---|---|
| **Can apply calibration data** | Yes | No — data-free only |
| **Requires HF model definition** | Yes | No |
| **Supports GPTQ / AWQ / SmoothQuant** | Yes | No |
| **Supports FP8 / NVFP4 data-free** | Yes | Yes |
| **Works when model has no transformers definition** | No | Yes |
| **Fallback when `oneshot` fails** | N/A | Yes |

## oneshot

Use `oneshot` when your quantization algorithm or scheme **requires calibration data**, such as GPTQ, AWQ, SmoothQuant, or static activation quantization (FP8 or INT8 with static per-tensor activations). It loads the model through Hugging Face `transformers`, runs calibration forward passes, and applies recipe-defined modifiers.

[:octicons-arrow-right-24: oneshot documentation](oneshot.md)

## model_free_ptq

Use `model_free_ptq` when your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16) and either the model has no Hugging Face model definition or `oneshot` fails for your model. It works directly on the safetensors checkpoint without loading the model through `transformers`.

[:octicons-arrow-right-24: model_free_ptq documentation](model-free-ptq.md)
docs/guides/entrypoints/model-free-ptq.md

Lines changed: 139 additions & 0 deletions
# model_free_ptq

`model_free_ptq` is a PTQ entrypoint for **data-free quantization schemes** that operates directly on safetensors checkpoint files, without requiring a Hugging Face model definition or loading the model through `transformers`.

## When to Use

Use `model_free_ptq` when:

- Your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16, MXFP4/MXFP8)
- The model **does not have a Hugging Face transformers definition** (e.g. a newly released model not yet in transformers)
- `oneshot` **fails** for your model

For schemes that require calibration data (GPTQ, AWQ, SmoothQuant, static activation quantization), use [`oneshot`](oneshot.md) instead.

## Basic Usage

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="meta-llama/Meta-Llama-3-8B-Instruct",
    save_directory="Meta-Llama-3-8B-Instruct-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
    device="cuda:0",
)
```
## How It Works

`model_free_ptq` processes each `.safetensors` file in the checkpoint independently, without ever loading the full model into memory as a `torch.nn.Module`. For each file:

1. **Validate** — check that all quantizable tensors can be quantized with the given scheme
2. **Initialize** — create a minimal `torch.nn.Linear` module for each weight tensor
3. **Calibrate** — compute scale and zero point directly from the weight tensor (data-free)
4. **Compress** — call `compress_module` from `compressed-tensors` to pack/quantize the weights
5. **Save** — write the compressed tensors back to disk

After all files are processed, the safetensors index and model config are updated with the quantization metadata.

Multiple files can be processed in parallel using the `max_workers` argument.
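Step 3 above (data-free calibration) amounts to deriving quantization parameters from the weight tensor alone. A minimal sketch for a symmetric scheme follows; this is a pure-Python illustration of the idea, not the actual `compressed-tensors` code:

```python
def datafree_scale(weight, qmax):
    # Symmetric, data-free calibration: choose the scale so that the
    # tensor's absolute maximum maps onto the largest representable
    # quantized value (e.g. qmax = 448.0 for FP8 e4m3).
    amax = max(abs(x) for x in weight)
    return amax / qmax

# A weight whose largest magnitude is 2.0, scaled to a toy type with qmax=4:
print(datafree_scale([-2.0, 0.5, 1.0], qmax=4.0))  # 0.5
```

Because no activations are involved, this step needs only the tensors in the current safetensors file, which is what makes independent file-by-file processing possible.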
## Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model_stub` | `str \| PathLike` | *(required)* | HuggingFace model ID or path to a local directory containing safetensors files |
| `save_directory` | `str \| PathLike` | *(required)* | Directory to save the quantized checkpoint |
| `scheme` | `QuantizationScheme \| str` | *(required)* | Quantization scheme to apply. Can be a preset string (e.g. `"FP8_BLOCK"`, `"NVFP4A16"`) or a `QuantizationScheme` object |
| `ignore` | `Iterable[str]` | `()` | Module names or regex patterns to skip. Modules ending in `"norm"` are always ignored automatically |
| `max_workers` | `int` | `1` | Number of parallel worker threads for processing safetensors files |
| `device` | `str \| torch.device \| None` | `None` | Device to use for quantization. Defaults to GPU if available, otherwise CPU |
| `converter` | `Converter \| None` | `None` | Optional `compressed-tensors` converter to apply before quantization, e.g. to convert modelopt-format checkpoints to compressed-tensors format |
## Standard Flow (Non-Microscale Schemes)

For schemes without a global scale (e.g. `FP8_BLOCK`, `FP8_DYNAMIC`), call `model_free_ptq` directly:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="unsloth/Kimi-K2-Thinking-BF16",
    save_directory="Kimi-K2-Thinking-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```
## Microscale Flow (NVFP4)

NVFP4 requires a **global scale** that is fused across related weight groups (e.g. qkv projections, gate/up projections). For this fusion to work correctly, the weights of each fused group must reside in the **same safetensors shard**.

Standard model checkpoints often split these weights across different shards. To fix this, run the `reindex_fused_weights` CLI tool first to reorganize the checkpoint:

```bash
llmcompressor.reindex_fused_weights \
    unsloth/Kimi-K2-Thinking-BF16 \
    Kimi-K2-Thinking-BF16-reindexed \
    --num_workers=10
```

Then run `model_free_ptq` on the reindexed checkpoint:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="Kimi-K2-Thinking-BF16-reindexed",
    save_directory="Kimi-K2-Thinking-NVFP4A16",
    scheme="NVFP4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```

!!! note
    Reindexing is only required for **NVFP4**, which uses a global scale. MXFP4 does not use a global scale and does not require reindexing.
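Why co-location matters can be pictured with a small grouping helper: weights that share one global scale map to a common group key, so every member of the group must be readable from the same shard when that scale is computed. The grouping rule and names below are a hypothetical illustration, not LLM Compressor's actual logic:

```python
import re

def fused_group_key(name: str) -> str:
    # Hypothetical grouping rule: q/k/v projections in a layer share a
    # global scale, as do gate/up projections, so map each member onto a
    # common key for its fused group.
    name = re.sub(r"\.(q_proj|k_proj|v_proj)\.weight$", ".qkv", name)
    name = re.sub(r"\.(gate_proj|up_proj)\.weight$", ".gate_up", name)
    return name

print(fused_group_key("model.layers.0.self_attn.q_proj.weight"))
# model.layers.0.self_attn.qkv
```

`reindex_fused_weights` exists to make such groups land in one shard, since the original checkpoint's sharding ignores fusion boundaries.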
## Ignoring Layers

The `ignore` argument accepts module name strings or regex patterns prefixed with `re:`. Modules whose names end in `"norm"` are automatically ignored regardless of the `ignore` list.

```python
ignore=[
    "lm_head",             # exact name match
    "re:.*gate$",          # regex: any module ending in "gate"
    "model.embed_tokens",  # exact name match
]
```
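The matching behavior described above can be sketched as follows. This is an illustrative sketch, not the library's exact implementation (in particular, whether patterns are anchored or searched within the name is an assumption here):

```python
import re

def is_ignored(name: str, ignore: list[str]) -> bool:
    # Modules ending in "norm" are always skipped, mirroring the
    # documented automatic behavior.
    if name.endswith("norm"):
        return True
    for pattern in ignore:
        if pattern.startswith("re:"):
            # "re:"-prefixed entries are treated as regex patterns.
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            # Everything else is an exact module-name match.
            return True
    return False

print(is_ignored("model.layers.0.mlp.gate", ["re:.*gate$"]))     # True
print(is_ignored("model.layers.0.mlp.up_proj", ["re:.*gate$"]))  # False
```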
## Supported Schemes

`model_free_ptq` supports any data-free weight quantization scheme. Common presets:

| Scheme | Description |
|--------|-------------|
| `FP8_DYNAMIC` | FP8 weights with dynamic per-token activation quantization |
| `FP8_BLOCK` | FP8 weights with block-wise scaling (Blackwell-optimized) |
| `NVFP4A16` | NVFP4 weight-only quantization with FP8 group scales and a global scale |
| `MXFP4/MXFP8` | MXFP4 or MXFP8 quantization with MX-format microscales |

Note: some of these schemes, such as NVFP4 and MXFP4, may achieve better accuracy recovery when applied with a calibration algorithm that requires data, such as GPTQ. Consider comparing results against `oneshot`.

For the full list of supported schemes and formats, see [Compression Schemes](../compression_schemes.md).

docs/guides/entrypoints/oneshot.md

Lines changed: 210 additions & 0 deletions
# oneshot

`oneshot` is the primary entrypoint for post-training quantization (PTQ) when your algorithm or scheme requires calibration data. It loads a model through Hugging Face `transformers`, applies recipe-defined modifiers (such as GPTQ, AWQ, SmoothQuant, or QuantizationModifier), and optionally saves the compressed result.

## When to Use

Use `oneshot` when:

- Your quantization algorithm **requires calibration data** (GPTQ, AWQ, SmoothQuant, AutoRound)
- Your scheme uses **static activation quantization** that requires calibration (FP8 per-tensor, INT8 per-tensor, NVFP4 with activations)
- Your model has a **Hugging Face model definition** available via `transformers`

For data-free schemes, models without a transformers definition, or cases where `oneshot` fails, see [`model_free_ptq`](model-free-ptq.md).

## Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```
## Lifecycle

The `oneshot` entrypoint runs three phases:

1. **Preprocessing**
    - Loads the model and tokenizer/processor from the provided identifiers or objects
    - Unties input and output embedding layers if they share tensors
    - Patches `save_pretrained` to support compressed-tensors serialization

2. **Calibration**
    - Wraps the model in a MoE calibration context (if applicable) to ensure all experts receive calibration data
    - Initializes modifiers defined in the recipe via a global `CompressionSession`
    - Runs calibration forward passes through the selected [pipeline](#calibration-pipelines)
    - Finalizes modifiers, applying any post-calibration transformations

3. **Postprocessing**
    - Saves the compressed model, tokenizer/processor, recipe, and config to `output_dir` (if specified)
    - Weights are saved in compressed safetensors format via `compressed-tensors`
## Arguments

### Model Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model` | `str \| PreTrainedModel` | *(required)* | HuggingFace model ID, local path, or a pre-loaded model instance |
| `tokenizer` | `str \| PreTrainedTokenizerBase \| None` | `None` | Tokenizer ID or path. Inferred from `model` if not set |
| `processor` | `str \| ProcessorMixin \| None` | `None` | Processor ID or path (for multimodal models). Inferred from `model` if not set |
| `config_name` | `str \| None` | `None` | Config name or path if different from `model` |
| `precision` | `str` | `"auto"` | Precision to cast model weights to on load (e.g. `"float16"`, `"bfloat16"`, `"auto"`) |
| `trust_remote_code_model` | `bool` | `False` | Allow custom model code from the repository |
| `save_compressed` | `bool` | `True` | Whether to save weights in compressed format |
| `model_revision` | `str` | `"main"` | Model version (branch, tag, or commit) |
### Recipe Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `recipe` | `str \| list \| None` | `None` | Path to a recipe file, a list of paths, or a modifier object / list of modifier objects |
| `recipe_args` | `list[str] \| None` | `None` | Recipe argument overrides in `"key=value"` format |
| `stage` | `str \| None` | `None` | Specific recipe stage to run. Runs all stages if not set |
### Dataset Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `dataset` | `str \| Dataset \| DatasetDict \| DataLoader \| None` | `None` | Dataset name (HuggingFace), a pre-loaded Dataset/DatasetDict, or a PyTorch DataLoader |
| `dataset_config_name` | `str \| None` | `None` | HuggingFace dataset configuration name |
| `dataset_path` | `str \| None` | `None` | Path to a local dataset (JSON, CSV, or DVC) |
| `num_calibration_samples` | `int` | `512` | Number of samples to use for calibration |
| `max_seq_length` | `int` | `384` | Maximum sequence length after tokenization. Longer sequences are truncated |
| `batch_size` | `int` | `1` | Calibration batch size |
| `data_collator` | `str \| Callable` | `"truncation"` | Batch collation strategy. `"truncation"` or `"padding"`, or a custom callable |
| `shuffle_calibration_samples` | `bool` | `True` | Whether to shuffle the dataset before selecting calibration samples |
| `text_column` | `str` | `"text"` | Dataset column to use as text input to the tokenizer/processor |
| `concatenate_data` | `bool` | `False` | Whether to concatenate samples to fill `max_seq_length` |
| `streaming` | `bool` | `False` | Stream data from a cloud-hosted dataset |
| `preprocessing_num_workers` | `int \| None` | `None` | Number of workers for dataset preprocessing |
| `dataloader_num_workers` | `int` | `0` | Number of workers for the DataLoader. Set to 2+ for faster loading if RAM allows |
| `moe_calibrate_all_experts` | `bool` | `True` | Route all tokens through all experts during calibration. Required for accurate MoE quantization |
| `min_tokens_per_module` | `float \| None` | `None` | Minimum fraction of tokens a module must receive. Logs a warning if unmet. Mainly relevant for MoE models |
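The two built-in `data_collator` strategies differ in how they equalize sequence lengths within a batch. A toy sketch of the idea (illustrative only; the library's actual collators operate on tokenized tensors):

```python
def collate_truncation(batch, max_len):
    # "truncation": cut every sequence to max_len, so the batch needs
    # no padding tokens.
    return [seq[:max_len] for seq in batch]

def collate_padding(batch, pad_id=0):
    # "padding": pad shorter sequences up to the longest one in the batch.
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

print(collate_truncation([[1, 2, 3], [4]], max_len=2))  # [[1, 2], [4]]
print(collate_padding([[1, 2, 3], [4]]))                # [[1, 2, 3], [4, 0, 0]]
```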
### Pipeline Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `pipeline` | `str \| None` | `"independent"` | Calibration pipeline to use. See [Calibration Pipelines](#calibration-pipelines) |
| `sequential_targets` | `list[str] \| None` | `None` | Layer targets for the sequential pipeline (typically a single decoder layer class). Defaults to `no_split_modules` from the HF model definition |
| `sequential_offload_device` | `str` | `"cpu"` | Device to offload intermediate activations between sequential layers. Use `"cuda:1"` if a second GPU is available |
| `quantization_aware_calibration` | `bool` | `True` | Apply quantization during the calibration forward pass in the sequential pipeline |
| `sequential_prefetch` | `bool` | `False` | Prefetch the next batch in a background thread during sequential pipeline calibration |
### Miscellaneous Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `output_dir` | `str \| None` | `None` | Directory to save the compressed model. Nothing is saved if `None` |
| `log_dir` | `str \| None` | `None` | Directory to write timestamped log files. Nothing is logged to file if `None` |
## Calibration Pipelines

The `pipeline` argument controls how calibration forward passes are run through the model.

| Pipeline | Description | Best For |
|----------|-------------|----------|
| `independent` | Each modifier manages its own forward passes independently *(default)* | Most use cases |
| `sequential` | Runs calibration layer by layer, offloading intermediate activations between layers | Large models that don't fit in GPU memory |
| `datafree` | Runs initialization and finalization without any forward passes | Data-free weight-only quantization |
| `basic` | Single set of forward passes shared across all modifiers | Simple post-hoc calibration |
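The sequential pipeline's layer-by-layer dataflow can be sketched with plain functions standing in for decoder layers. This is a toy illustration of the idea, not the actual implementation: every calibration batch passes through layer *i*, the resulting activations are offloaded (e.g. to CPU), and they become the inputs to layer *i+1*, so only one layer needs to be resident on the GPU at a time.

```python
def sequential_calibrate(layers, batches, offload=lambda x: x):
    # Process the model one layer at a time: run every batch through the
    # current layer, stash (offload) the activations, then move on.
    acts = list(batches)
    for layer in layers:
        acts = [offload(layer(a)) for a in acts]
    return acts

# Toy "layers" acting on scalar activations:
layers = [lambda x: x + 1, lambda x: x * 2]
print(sequential_calibrate(layers, [1, 2]))  # [4, 6]
```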
## Examples

### FP8 Data-Free Quantization

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```
### GPTQ W4A16

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.gptq import GPTQModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    num_calibration_samples=512,
    max_seq_length=2048,
    output_dir="Meta-Llama-3-8B-Instruct-W4A16-GPTQ",
)
```
### MoE Model with All-Expert Calibration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modeling.llama4 import SequentialLlama4TextMoe  # noqa: F401
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=[
            "re:.*lm_head",
            "re:.*self_attn",
            "re:.*router",
            "re:.*vision_model.*",
            "re:.*multi_modal_projector.*",
            "Llama4TextAttention",
        ],
    ),
    num_calibration_samples=20,
    max_seq_length=2048,
    moe_calibrate_all_experts=True,
    output_dir="Llama-4-Scout-17B-NVFP4",
)
```
## Saving

The recommended way to save is via the `output_dir` argument, which automatically saves the model weights in compressed safetensors format along with the tokenizer/processor, recipe, and config:

```python
oneshot(..., output_dir="./my-compressed-model")
```

Alternatively, you can save manually after the call:

```python
model = oneshot(model=model, recipe=recipe)
model.save_pretrained("./my-compressed-model", save_compressed=True)
tokenizer.save_pretrained("./my-compressed-model")
```

For more details on save options, see [Saving a Compressed Model](../saving_a_model.md).
