Commit 83d2001

dsikka and claude authored
[Docs] Add Entrypoints section to User Guides (#2518)
## Summary

Adds a new **Entrypoints** section under User Guides with detailed documentation for both PTQ entrypoints:

- **Entrypoints overview** (`guides/entrypoints/index.md`) — decision table comparing `oneshot` vs `model_free_ptq` to help users choose the right entrypoint
- **oneshot** (`guides/entrypoints/oneshot.md`) — full lifecycle (preprocessing, calibration, postprocessing), all arguments organized by category (model, recipe, dataset, pipeline, misc), calibration pipeline descriptions, and examples for FP8 data-free, GPTQ W4A16, and Llama4 MoE NVFP4 with a proper ignore list
- **model_free_ptq** (`guides/entrypoints/model-free-ptq.md`) — when to use (data-free schemes, no transformers definition, oneshot fallback), how it works internally (file-by-file safetensors processing), standard flow vs NVFP4 microscale flow (with `reindex_fused_weights`), ignore patterns, and supported schemes table

Also updates `.nav.yml` to nest the three pages under `Entrypoints` in User Guides.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
1 parent a9e2488 commit 83d2001

File tree

4 files changed, +381 −2 lines changed

docs/.nav.yml

Lines changed: 6 additions & 2 deletions
```diff
@@ -30,13 +30,17 @@ nav:
       - key-models/mistral-large-3/index.md
       - FP8 Example: key-models/mistral-large-3/fp8-example.md
   - User Guides:
+      - Entrypoints:
+          - guides/entrypoints/index.md
+          - oneshot: guides/entrypoints/oneshot.md
+          - model-free-ptq: guides/entrypoints/model-free-ptq.md
+      - Compression Schemes: guides/compression_schemes.md
+      - Observers: guides/observers.md
       - Big Models and Distributed Support:
           - Model Loading: guides/big_models_and_distributed/model_loading.md
           - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
           - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
-      - Compression Schemes: guides/compression_schemes.md
       - Saving a Compressed Model: guides/saving_a_model.md
-      - Observers: guides/observers.md
       - Memory Requirements: guides/memory.md
       - Runtime Performance: guides/runtime.md
   - Examples:
```

docs/guides/entrypoints/index.md

Lines changed: 26 additions & 0 deletions
# Entrypoints

LLM Compressor provides two entrypoints for post-training quantization (PTQ), each suited to different scenarios.

## Choosing an Entrypoint

| | [`oneshot`](oneshot.md) | [`model_free_ptq`](model-free-ptq.md) |
|---|---|---|
| **Can apply calibration data** | Yes | No — data-free only |
| **Requires HF model definition** | Yes | No |
| **Supports GPTQ / AWQ / SmoothQuant** | Yes | No |
| **Supports FP8 / NVFP4 data-free** | Yes | Yes |
| **Works when model has no transformers definition** | No | Yes |
| **Fallback when `oneshot` fails** | N/A | Yes |

## oneshot

Use `oneshot` when your quantization algorithm or scheme **requires calibration data**, such as GPTQ, AWQ, SmoothQuant, or static activation quantization (FP8 or INT8 with static per-tensor activations). It loads the model through Hugging Face `transformers`, runs calibration forward passes, and applies recipe-defined modifiers.

[:octicons-arrow-right-24: oneshot documentation](oneshot.md)

## model_free_ptq

Use `model_free_ptq` when your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16) and either the model has no Hugging Face model definition or `oneshot` fails for your model. It works directly on the safetensors checkpoint without loading the model through `transformers`.

[:octicons-arrow-right-24: model_free_ptq documentation](model-free-ptq.md)
docs/guides/entrypoints/model-free-ptq.md

Lines changed: 139 additions & 0 deletions
# model_free_ptq

`model_free_ptq` is a PTQ entrypoint for **data-free quantization schemes** that operates directly on safetensors checkpoint files, without requiring a Hugging Face model definition or loading the model through `transformers`.

## When to Use

Use `model_free_ptq` when:

- Your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16, MXFP4/MXFP8)
- The model **does not have a Hugging Face transformers definition** (e.g. a newly released model not yet in transformers)
- `oneshot` **fails** for your model

For schemes that require calibration data (GPTQ, AWQ, SmoothQuant, static activation quantization), use [`oneshot`](oneshot.md) instead.

## Basic Usage

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="meta-llama/Meta-Llama-3-8B-Instruct",
    save_directory="Meta-Llama-3-8B-Instruct-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
    device="cuda:0",
)
```
## How It Works

`model_free_ptq` processes each `.safetensors` file in the checkpoint independently, without ever loading the full model into memory as a `torch.nn.Module`. For each file:

1. **Validate** — check that all quantizable tensors can be quantized with the given scheme
2. **Initialize** — create a minimal `torch.nn.Linear` module for each weight tensor
3. **Calibrate** — compute scale and zero point directly from the weight tensor (data-free)
4. **Compress** — call `compress_module` from `compressed-tensors` to pack/quantize the weights
5. **Save** — write the compressed tensors back to disk

After all files are processed, the safetensors index and model config are updated with the quantization metadata.

Multiple files can be processed in parallel using the `max_workers` argument.
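Step 3 above (data-free calibration) amounts to deriving quantization parameters from the weight tensor alone. A minimal sketch for a symmetric scheme follows; this is a pure-Python illustration of the idea, not the actual `compressed-tensors` code:

```python
def datafree_scale(weight, qmax):
    # Symmetric, data-free calibration: choose the scale so that the
    # tensor's absolute maximum maps onto the largest representable
    # quantized value (e.g. qmax = 448.0 for FP8 e4m3).
    amax = max(abs(x) for x in weight)
    return amax / qmax

# A weight whose largest magnitude is 2.0, scaled to a toy type with qmax=4:
print(datafree_scale([-2.0, 0.5, 1.0], qmax=4.0))  # 0.5
```

Because no activations are involved, this step needs only the tensors in the current safetensors file, which is what makes independent file-by-file processing possible.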
## Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model_stub` | `str \| PathLike` | *(required)* | HuggingFace model ID or path to a local directory containing safetensors files |
| `save_directory` | `str \| PathLike` | *(required)* | Directory to save the quantized checkpoint |
| `scheme` | `QuantizationScheme \| str` | *(required)* | Quantization scheme to apply. Can be a preset string (e.g. `"FP8_BLOCK"`, `"NVFP4A16"`) or a `QuantizationScheme` object |
| `ignore` | `Iterable[str]` | `()` | Module names or regex patterns to skip. Modules ending in `"norm"` are always ignored automatically |
| `max_workers` | `int` | `1` | Number of parallel worker threads for processing safetensors files |
| `device` | `str \| torch.device \| None` | `None` | Device to use for quantization. Defaults to GPU if available, otherwise CPU |
| `converter` | `Converter \| None` | `None` | Optional `compressed-tensors` converter to apply before quantization, e.g. to convert modelopt-format checkpoints to compressed-tensors format |
## Standard Flow (Non-Microscale Schemes)

For schemes without a global scale (e.g. `FP8_BLOCK`, `FP8_DYNAMIC`), call `model_free_ptq` directly:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="unsloth/Kimi-K2-Thinking-BF16",
    save_directory="Kimi-K2-Thinking-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```
## Microscale Flow (NVFP4)

NVFP4 requires a **global scale** that is fused across related weight groups (e.g. qkv projections, gate/up projections). For this fusion to work correctly, the weights of each fused group must reside in the **same safetensors shard**.

Standard model checkpoints often split these weights across different shards. To fix this, run the `reindex_fused_weights` CLI tool first to reorganize the checkpoint:

```bash
llmcompressor.reindex_fused_weights \
    unsloth/Kimi-K2-Thinking-BF16 \
    Kimi-K2-Thinking-BF16-reindexed \
    --num_workers=10
```

Then run `model_free_ptq` on the reindexed checkpoint:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="Kimi-K2-Thinking-BF16-reindexed",
    save_directory="Kimi-K2-Thinking-NVFP4A16",
    scheme="NVFP4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```

!!! note
    Reindexing is only required for **NVFP4**, which uses a global scale. MXFP4 does not use a global scale and does not require reindexing.
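Why co-location matters can be pictured with a small grouping helper: weights that share one global scale map to a common group key, so every member of the group must be readable from the same shard when that scale is computed. The grouping rule and names below are a hypothetical illustration, not LLM Compressor's actual logic:

```python
import re

def fused_group_key(name: str) -> str:
    # Hypothetical grouping rule: q/k/v projections in a layer share a
    # global scale, as do gate/up projections, so map each member onto a
    # common key for its fused group.
    name = re.sub(r"\.(q_proj|k_proj|v_proj)\.weight$", ".qkv", name)
    name = re.sub(r"\.(gate_proj|up_proj)\.weight$", ".gate_up", name)
    return name

print(fused_group_key("model.layers.0.self_attn.q_proj.weight"))
# model.layers.0.self_attn.qkv
```

`reindex_fused_weights` exists to make such groups land in one shard, since the original checkpoint's sharding ignores fusion boundaries.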
## Ignoring Layers

The `ignore` argument accepts module name strings or regex patterns prefixed with `re:`. Modules whose names end in `"norm"` are automatically ignored regardless of the `ignore` list.

```python
ignore=[
    "lm_head",             # exact name match
    "re:.*gate$",          # regex: any module ending in "gate"
    "model.embed_tokens",  # exact name match
]
```
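The matching behavior described above can be sketched as follows. This is an illustrative sketch, not the library's exact implementation (in particular, whether patterns are anchored or searched within the name is an assumption here):

```python
import re

def is_ignored(name: str, ignore: list[str]) -> bool:
    # Modules ending in "norm" are always skipped, mirroring the
    # documented automatic behavior.
    if name.endswith("norm"):
        return True
    for pattern in ignore:
        if pattern.startswith("re:"):
            # "re:"-prefixed entries are treated as regex patterns.
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            # Everything else is an exact module-name match.
            return True
    return False

print(is_ignored("model.layers.0.mlp.gate", ["re:.*gate$"]))     # True
print(is_ignored("model.layers.0.mlp.up_proj", ["re:.*gate$"]))  # False
```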
## Supported Schemes

`model_free_ptq` supports any data-free weight quantization scheme. Common presets:

| Scheme | Description |
|--------|-------------|
| `FP8_DYNAMIC` | FP8 weights with dynamic per-token activation quantization |
| `FP8_BLOCK` | FP8 weights with block-wise scaling (Blackwell-optimized) |
| `NVFP4A16` | NVFP4 weight-only quantization with FP8 group scales and a global scale |
| `MXFP4/MXFP8` | MXFP4 or MXFP8 quantization with MX-format microscales |

Note: some of these schemes, such as NVFP4 and MXFP4, may achieve better accuracy recovery when applied with a calibration algorithm that requires data, such as GPTQ. Consider comparing results against `oneshot`.

For the full list of supported schemes and formats, see [Compression Schemes](../compression_schemes.md).

docs/guides/entrypoints/oneshot.md

Lines changed: 210 additions & 0 deletions
# oneshot

`oneshot` is the primary entrypoint for post-training quantization (PTQ) when your algorithm or scheme requires calibration data. It loads a model through Hugging Face `transformers`, applies recipe-defined modifiers (such as GPTQ, AWQ, SmoothQuant, or QuantizationModifier), and optionally saves the compressed result.

## When to Use

Use `oneshot` when:

- Your quantization algorithm **requires calibration data** (GPTQ, AWQ, SmoothQuant, AutoRound)
- Your scheme uses **static activation quantization** that requires calibration (FP8 per-tensor, INT8 per-tensor, NVFP4 with activations)
- Your model has a **Hugging Face model definition** available via `transformers`

For data-free schemes, models without a transformers definition, or cases where `oneshot` fails, see [`model_free_ptq`](model-free-ptq.md).

## Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```
## Lifecycle

The `oneshot` entrypoint runs three phases:

1. **Preprocessing**
    - Loads the model and tokenizer/processor from the provided identifiers or objects
    - Unties input and output embedding layers if they share tensors
    - Patches `save_pretrained` to support compressed-tensors serialization

2. **Calibration**
    - Wraps the model in a MoE calibration context (if applicable) to ensure all experts receive calibration data
    - Initializes modifiers defined in the recipe via a global `CompressionSession`
    - Runs calibration forward passes through the selected [pipeline](#calibration-pipelines)
    - Finalizes modifiers, applying any post-calibration transformations

3. **Postprocessing**
    - Saves the compressed model, tokenizer/processor, recipe, and config to `output_dir` (if specified)
    - Weights are saved in compressed safetensors format via `compressed-tensors`
## Arguments

### Model Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model` | `str \| PreTrainedModel` | *(required)* | HuggingFace model ID, local path, or a pre-loaded model instance |
| `tokenizer` | `str \| PreTrainedTokenizerBase \| None` | `None` | Tokenizer ID or path. Inferred from `model` if not set |
| `processor` | `str \| ProcessorMixin \| None` | `None` | Processor ID or path (for multimodal models). Inferred from `model` if not set |
| `config_name` | `str \| None` | `None` | Config name or path if different from `model` |
| `precision` | `str` | `"auto"` | Precision to cast model weights to on load (e.g. `"float16"`, `"bfloat16"`, `"auto"`) |
| `trust_remote_code_model` | `bool` | `False` | Allow custom model code from the repository |
| `save_compressed` | `bool` | `True` | Whether to save weights in compressed format |
| `model_revision` | `str` | `"main"` | Model version (branch, tag, or commit) |
### Recipe Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `recipe` | `str \| list \| None` | `None` | Path to a recipe file, a list of paths, or a modifier object / list of modifier objects |
| `recipe_args` | `list[str] \| None` | `None` | Recipe argument overrides in `"key=value"` format |
| `stage` | `str \| None` | `None` | Specific recipe stage to run. Runs all stages if not set |
### Dataset Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `dataset` | `str \| Dataset \| DatasetDict \| DataLoader \| None` | `None` | Dataset name (HuggingFace), a pre-loaded Dataset/DatasetDict, or a PyTorch DataLoader |
| `dataset_config_name` | `str \| None` | `None` | HuggingFace dataset configuration name |
| `dataset_path` | `str \| None` | `None` | Path to a local dataset (JSON, CSV, or DVC) |
| `num_calibration_samples` | `int` | `512` | Number of samples to use for calibration |
| `max_seq_length` | `int` | `384` | Maximum sequence length after tokenization. Longer sequences are truncated |
| `batch_size` | `int` | `1` | Calibration batch size |
| `data_collator` | `str \| Callable` | `"truncation"` | Batch collation strategy. `"truncation"` or `"padding"`, or a custom callable |
| `shuffle_calibration_samples` | `bool` | `True` | Whether to shuffle the dataset before selecting calibration samples |
| `text_column` | `str` | `"text"` | Dataset column to use as text input to the tokenizer/processor |
| `concatenate_data` | `bool` | `False` | Whether to concatenate samples to fill `max_seq_length` |
| `streaming` | `bool` | `False` | Stream data from a cloud-hosted dataset |
| `preprocessing_num_workers` | `int \| None` | `None` | Number of workers for dataset preprocessing |
| `dataloader_num_workers` | `int` | `0` | Number of workers for the DataLoader. Set to 2+ for faster loading if RAM allows |
| `moe_calibrate_all_experts` | `bool` | `True` | Route all tokens through all experts during calibration. Required for accurate MoE quantization |
| `min_tokens_per_module` | `float \| None` | `None` | Minimum fraction of tokens a module must receive. Logs a warning if unmet. Mainly relevant for MoE models |
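The two built-in `data_collator` strategies differ in how they equalize sequence lengths within a batch. A toy sketch of the idea (illustrative only; the library's actual collators operate on tokenized tensors):

```python
def collate_truncation(batch, max_len):
    # "truncation": cut every sequence to max_len, so the batch needs
    # no padding tokens.
    return [seq[:max_len] for seq in batch]

def collate_padding(batch, pad_id=0):
    # "padding": pad shorter sequences up to the longest one in the batch.
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

print(collate_truncation([[1, 2, 3], [4]], max_len=2))  # [[1, 2], [4]]
print(collate_padding([[1, 2, 3], [4]]))                # [[1, 2, 3], [4, 0, 0]]
```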
### Pipeline Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `pipeline` | `str \| None` | `"independent"` | Calibration pipeline to use. See [Calibration Pipelines](#calibration-pipelines) |
| `sequential_targets` | `list[str] \| None` | `None` | Layer targets for the sequential pipeline (typically a single decoder layer class). Defaults to `no_split_modules` from the HF model definition |
| `sequential_offload_device` | `str` | `"cpu"` | Device to offload intermediate activations between sequential layers. Use `"cuda:1"` if a second GPU is available |
| `quantization_aware_calibration` | `bool` | `True` | Apply quantization during the calibration forward pass in the sequential pipeline |
| `sequential_prefetch` | `bool` | `False` | Prefetch the next batch in a background thread during sequential pipeline calibration |
### Miscellaneous Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `output_dir` | `str \| None` | `None` | Directory to save the compressed model. Nothing is saved if `None` |
| `log_dir` | `str \| None` | `None` | Directory to write timestamped log files. Nothing is logged to file if `None` |
## Calibration Pipelines

The `pipeline` argument controls how calibration forward passes are run through the model.

| Pipeline | Description | Best For |
|----------|-------------|----------|
| `independent` | Each modifier manages its own forward passes independently *(default)* | Most use cases |
| `sequential` | Runs calibration layer by layer, offloading intermediate activations between layers | Large models that don't fit in GPU memory |
| `datafree` | Runs initialization and finalization without any forward passes | Data-free weight-only quantization |
| `basic` | Single set of forward passes shared across all modifiers | Simple post-hoc calibration |
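The sequential pipeline's layer-by-layer dataflow can be sketched with plain functions standing in for decoder layers. This is a toy illustration of the idea, not the actual implementation: every calibration batch passes through layer *i*, the resulting activations are offloaded (e.g. to CPU), and they become the inputs to layer *i+1*, so only one layer needs to be resident on the GPU at a time.

```python
def sequential_calibrate(layers, batches, offload=lambda x: x):
    # Process the model one layer at a time: run every batch through the
    # current layer, stash (offload) the activations, then move on.
    acts = list(batches)
    for layer in layers:
        acts = [offload(layer(a)) for a in acts]
    return acts

# Toy "layers" acting on scalar activations:
layers = [lambda x: x + 1, lambda x: x * 2]
print(sequential_calibrate(layers, [1, 2]))  # [4, 6]
```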
## Examples

### FP8 Data-Free Quantization

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```
### GPTQ W4A16

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.gptq import GPTQModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    num_calibration_samples=512,
    max_seq_length=2048,
    output_dir="Meta-Llama-3-8B-Instruct-W4A16-GPTQ",
)
```
### MoE Model with All-Expert Calibration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modeling.llama4 import SequentialLlama4TextMoe  # noqa: F401
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=[
            "re:.*lm_head",
            "re:.*self_attn",
            "re:.*router",
            "re:.*vision_model.*",
            "re:.*multi_modal_projector.*",
            "Llama4TextAttention",
        ],
    ),
    num_calibration_samples=20,
    max_seq_length=2048,
    moe_calibrate_all_experts=True,
    output_dir="Llama-4-Scout-17B-NVFP4",
)
```
## Saving

The recommended way to save is via the `output_dir` argument, which automatically saves the model weights in compressed safetensors format along with the tokenizer/processor, recipe, and config:

```python
oneshot(..., output_dir="./my-compressed-model")
```

Alternatively, you can save manually after the call:

```python
model = oneshot(model=model, recipe=recipe)
model.save_pretrained("./my-compressed-model", save_compressed=True)
tokenizer.save_pretrained("./my-compressed-model")
```

For more details on save options, see [Saving a Compressed Model](../saving_a_model.md).
