[Docs] Add Entrypoints section to User Guides #2518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
dsikka
wants to merge
4
commits into
main
Choose a base branch
from
entrypoint_docs
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+381
−2
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
2f5a72f
[Docs] Add Entrypoints section to User Guides
dsikka 732d822
Merge branch 'main' into entrypoint_docs
dsikka 6f73f02
[Docs] Update Llama4 MoE example with proper ignore list in oneshot.md
dsikka c9c666d
Merge branch 'main' into entrypoint_docs
dsikka File filter
# Entrypoints

LLM Compressor provides two entrypoints for post-training quantization (PTQ), each suited to different scenarios.

## Choosing an Entrypoint

| | [`oneshot`](oneshot.md) | [`model_free_ptq`](model-free-ptq.md) |
|---|---|---|
| **Can apply calibration data** | Yes | No — data-free only |
| **Requires HF model definition** | Yes | No |
| **Supports GPTQ / AWQ / SmoothQuant** | Yes | No |
| **Supports FP8 / NVFP4 data-free** | Yes | Yes |
| **Works when model has no transformers definition** | No | Yes |
| **Fallback when `oneshot` fails** | — | Yes |

## oneshot

Use `oneshot` when your quantization algorithm or scheme **requires calibration data**, such as GPTQ, AWQ, SmoothQuant, or static activation quantization (FP8 per-tensor, INT8). It loads the model through Hugging Face `transformers`, runs calibration forward passes, and applies recipe-defined modifiers.

[:octicons-arrow-right-24: oneshot documentation](oneshot.md)

## model_free_ptq

Use `model_free_ptq` when your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16) and either the model has no Hugging Face model definition, or `oneshot` fails for your model. It works directly on the safetensors checkpoint without loading the model through `transformers`.

[:octicons-arrow-right-24: model_free_ptq documentation](model-free-ptq.md)
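As a quick orientation, the minimal invocations of the two entrypoints (drawn from the linked pages) can be compared side by side:

```python
# Data-free: operates directly on the checkpoint files, no model definition needed
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="meta-llama/Meta-Llama-3-8B-Instruct",
    save_directory="Meta-Llama-3-8B-Instruct-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
)

# Calibration-capable: loads the model through transformers and applies a recipe
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```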
# model_free_ptq

`model_free_ptq` is a PTQ entrypoint for **data-free quantization schemes** that operates directly on safetensors checkpoint files without requiring a Hugging Face model definition or loading the model through `transformers`.

## When to Use

Use `model_free_ptq` when:

- Your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16, MXFP4/MXFP8)
- The model **does not have a Hugging Face transformers definition** (e.g. a newly released model not yet in transformers)
- `oneshot` **fails** for your model

For schemes that require calibration data (GPTQ, AWQ, SmoothQuant, static activation quantization), use [`oneshot`](oneshot.md) instead.

## Basic Usage

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="meta-llama/Meta-Llama-3-8B-Instruct",
    save_directory="Meta-Llama-3-8B-Instruct-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
    device="cuda:0",
)
```
## How It Works

`model_free_ptq` processes each `.safetensors` file in the checkpoint independently, without ever loading the full model into memory as a `torch.nn.Module`. For each file:

1. **Validate** — check that all quantizable tensors can be quantized with the given scheme
2. **Initialize** — create a minimal `torch.nn.Linear` module for each weight tensor
3. **Calibrate** — compute scale and zero point directly from the weight tensor (data-free)
4. **Compress** — call `compress_module` from `compressed-tensors` to pack/quantize the weights
5. **Save** — write the compressed tensors back to disk

After all files are processed, the safetensors index and model config are updated with the quantization metadata.

Multiple files can be processed in parallel using the `max_workers` argument.
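The per-file flow above can be sketched as a simple parallel map over independent files. This is an illustrative sketch only — the function names (`validate`, `compress_weights`, `save`) are hypothetical stand-ins, not the real llmcompressor internals:

```python
# Illustrative sketch of the per-file flow; helper names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def process_file(path: str) -> str:
    # 1. Validate that this file's tensors are quantizable with the scheme
    validate(path)
    # 2-4. Initialize a minimal Linear per weight, calibrate scales data-free,
    #      then compress/pack the weights
    compressed = compress_weights(path)
    # 5. Write the compressed tensors back to disk
    return save(compressed)

def run(files: list[str], max_workers: int = 1) -> list[str]:
    # Files are independent, so they can be processed in parallel (max_workers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, files))

# Toy stand-ins so the sketch runs end to end
def validate(path): pass
def compress_weights(path): return path + ".compressed"
def save(obj): return obj
```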
## Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model_stub` | `str \| PathLike` | — | HuggingFace model ID or path to a local directory containing safetensors files |
| `save_directory` | `str \| PathLike` | — | Directory to save the quantized checkpoint |
| `scheme` | `QuantizationScheme \| str` | — | Quantization scheme to apply. Can be a preset string (e.g. `"FP8_BLOCK"`, `"NVFP4A16"`) or a `QuantizationScheme` object |
| `ignore` | `Iterable[str]` | `()` | Module names or regex patterns to skip. Modules ending in `"norm"` are always ignored automatically |
| `max_workers` | `int` | `1` | Number of parallel worker threads for processing safetensors files |
| `device` | `str \| torch.device \| None` | `None` | Device to use for quantization. Defaults to GPU if available, otherwise CPU |
| `converter` | `Converter \| None` | `None` | Optional `compressed-tensors` converter to apply before quantization, e.g. to convert modelopt-format checkpoints to compressed-tensors format |
## Standard Flow (Non-Microscale Schemes)

For schemes without a global scale (e.g. `FP8_BLOCK`, `FP8_DYNAMIC`), call `model_free_ptq` directly:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="unsloth/Kimi-K2-Thinking-BF16",
    save_directory="Kimi-K2-Thinking-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```
## Microscale Flow (NVFP4)

NVFP4 requires a **global scale** that is fused across related weight groups (e.g. qkv projections, gate/up projections). For this fusion to work correctly, the weights of each fused group must reside in the **same safetensors shard**.

Standard model checkpoints often split these weights across different shards. To fix this, run the `reindex_fused_weights` CLI tool first to reorganize the checkpoint:

```bash
llmcompressor.reindex_fused_weights \
    unsloth/Kimi-K2-Thinking-BF16 \
    Kimi-K2-Thinking-BF16-reindexed \
    --num_workers=10
```

Then run `model_free_ptq` on the reindexed checkpoint:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="Kimi-K2-Thinking-BF16-reindexed",
    save_directory="Kimi-K2-Thinking-NVFP4A16",
    scheme="NVFP4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```

!!! note
    Reindexing is only required for **NVFP4**, which uses a global scale. MXFP4 does not use a global scale and does not require reindexing.
## Ignoring Layers

The `ignore` argument accepts module name strings or regex patterns prefixed with `re:`. Modules whose names end in `"norm"` are automatically ignored regardless of the `ignore` list.

```python
ignore=[
    "lm_head",            # exact name match
    "re:.*gate$",         # regex: any module ending in "gate"
    "model.embed_tokens", # exact name match
]
```
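The matching rules above can be illustrated with a small sketch. This is not the real matching code (which lives in `compressed-tensors`), just a minimal model of the behavior the docs describe:

```python
# Sketch of how "re:"-prefixed patterns and exact names are matched against
# module names; the real matching logic may differ in details.
import re

def is_ignored(name: str, ignore: list[str]) -> bool:
    # Modules whose names end in "norm" are always skipped
    if name.endswith("norm"):
        return True
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Regex match anchored at the start of the module name
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            # Otherwise an exact name match
            return True
    return False
```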
## Supported Schemes

`model_free_ptq` supports any data-free weight quantization scheme. Common presets:

| Scheme | Description |
|--------|-------------|
| `FP8_DYNAMIC` | FP8 weights with dynamic per-token activation quantization |
| `FP8_BLOCK` | FP8 weights with block-wise scaling (Blackwell-optimized) |
| `NVFP4A16` | NVFP4 weight-only quantization with FP8 group scales and a global scale |
| `MXFP4/MXFP8` | MXFP4 or MXFP8 quantization with MX-format microscales |

Note: schemes such as NVFP4 and MXFP4 may recover more accuracy when applied with a calibration algorithm that requires data, such as GPTQ. Consider comparing results using `oneshot`.

For the full list of supported schemes and formats, see [Compression Schemes](../compression_schemes.md).
# oneshot

`oneshot` is the primary entrypoint for post-training quantization (PTQ) when your algorithm or scheme requires calibration data. It loads a model through Hugging Face `transformers`, applies recipe-defined modifiers (such as GPTQ, AWQ, SmoothQuant, or QuantizationModifier), and optionally saves the compressed result.

## When to Use

Use `oneshot` when:

- Your quantization algorithm **requires calibration data** (GPTQ, AWQ, SmoothQuant, AutoRound)
- Your scheme uses **static activation quantization** that requires calibration (FP8 per-tensor, INT8 per-tensor, NVFP4 with activations)
- Your model has a **Hugging Face model definition** available via `transformers`

For data-free schemes on models without a transformers definition, or when `oneshot` fails, see [`model_free_ptq`](model-free-ptq.md).
## Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```
## Lifecycle

The `oneshot` entrypoint runs three phases:

1. **Preprocessing**
    - Loads the model and tokenizer/processor from the provided identifiers or objects
    - Unties input and output embedding layers if they share tensors
    - Patches `save_pretrained` to support compressed-tensors serialization

2. **Calibration**
    - Wraps the model in a MoE calibration context (if applicable) to ensure all experts receive calibration data
    - Initializes modifiers defined in the recipe via a global `CompressionSession`
    - Runs calibration forward passes through the selected [pipeline](#calibration-pipelines)
    - Finalizes modifiers, applying any post-calibration transformations

3. **Postprocessing**
    - Saves the compressed model, tokenizer/processor, recipe, and config to `output_dir` (if specified)
    - Weights are saved in a compressed SafeTensors format via `compressed-tensors`
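The three phases above can be sketched as a skeleton. The class and function names here are illustrative stand-ins, not the real llmcompressor internals:

```python
# Hedged sketch of the oneshot lifecycle; names are illustrative only.
class SketchModifier:
    """Toy modifier exposing the initialize/finalize hooks the lifecycle describes."""
    def __init__(self):
        self.events = []
    def initialize(self):
        self.events.append("initialize")
    def calibrate(self, batch):
        self.events.append("calibrate")
    def finalize(self):
        self.events.append("finalize")

def oneshot_sketch(modifier, batches, output_dir=None):
    # 1. Preprocessing: load model/processor, untie embeddings, patch saving
    # 2. Calibration: initialize modifiers, run forward passes, finalize
    modifier.initialize()
    for batch in batches:
        modifier.calibrate(batch)
    modifier.finalize()
    # 3. Postprocessing: save the compressed model only if output_dir was given
    return "saved" if output_dir else "not saved"
```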
## Arguments

### Model Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model` | `str \| PreTrainedModel` | — | HuggingFace model ID, local path, or a pre-loaded model instance |
| `tokenizer` | `str \| PreTrainedTokenizerBase \| None` | `None` | Tokenizer ID or path. Inferred from `model` if not set |
| `processor` | `str \| ProcessorMixin \| None` | `None` | Processor ID or path (for multimodal models). Inferred from `model` if not set |
| `config_name` | `str \| None` | `None` | Config name or path if different from `model` |
| `precision` | `str` | `"auto"` | Precision to cast model weights to on load (e.g. `"float16"`, `"bfloat16"`, `"auto"`) |
| `trust_remote_code_model` | `bool` | `False` | Allow custom model code from the repository |
| `save_compressed` | `bool` | `True` | Whether to save weights in compressed format |
| `model_revision` | `str` | `"main"` | Model version (branch, tag, or commit) |
### Recipe Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `recipe` | `str \| list \| None` | `None` | Path to a recipe file, a list of paths, or a modifier object / list of modifier objects |
| `recipe_args` | `list[str] \| None` | `None` | Recipe argument overrides in `"key=value"` format |
| `stage` | `str \| None` | `None` | Specific recipe stage to run. Runs all stages if not set |
### Dataset Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `dataset` | `str \| Dataset \| DatasetDict \| DataLoader \| None` | `None` | Dataset name (HuggingFace), a pre-loaded Dataset/DatasetDict, or a PyTorch DataLoader |
| `dataset_config_name` | `str \| None` | `None` | HuggingFace dataset configuration name |
| `dataset_path` | `str \| None` | `None` | Path to a local dataset (JSON, CSV, or DVC) |
| `num_calibration_samples` | `int` | `512` | Number of samples to use for calibration |
| `max_seq_length` | `int` | `384` | Maximum sequence length after tokenization. Longer sequences are truncated |
| `batch_size` | `int` | `1` | Calibration batch size |
| `data_collator` | `str \| Callable` | `"truncation"` | Batch collation strategy. `"truncation"` or `"padding"`, or a custom callable |
| `shuffle_calibration_samples` | `bool` | `True` | Whether to shuffle the dataset before selecting calibration samples |
| `text_column` | `str` | `"text"` | Dataset column to use as text input to the tokenizer/processor |
| `concatenate_data` | `bool` | `False` | Whether to concatenate samples to fill `max_seq_length` |
| `streaming` | `bool` | `False` | Stream data from a cloud-hosted dataset |
| `preprocessing_num_workers` | `int \| None` | `None` | Number of workers for dataset preprocessing |
| `dataloader_num_workers` | `int` | `0` | Number of workers for the DataLoader. Set to 2+ for faster loading if RAM allows |
| `moe_calibrate_all_experts` | `bool` | `True` | Route all tokens through all experts during calibration. Required for accurate MoE quantization |
| `min_tokens_per_module` | `float \| None` | `None` | Minimum fraction of tokens a module must receive. Logs a warning if unmet. Mainly relevant for MoE models |
### Pipeline Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `pipeline` | `str \| None` | `"independent"` | Calibration pipeline to use. See [Calibration Pipelines](#calibration-pipelines) |
| `sequential_targets` | `list[str] \| None` | `None` | Layer targets for the sequential pipeline (typically a single decoder layer class). Defaults to `no_split_modules` from the HF model definition |
| `sequential_offload_device` | `str` | `"cpu"` | Device to offload intermediate activations between sequential layers. Use `"cuda:1"` if a second GPU is available |
| `quantization_aware_calibration` | `bool` | `True` | Apply quantization during the calibration forward pass in the sequential pipeline |
| `sequential_prefetch` | `bool` | `False` | Prefetch the next batch in a background thread during sequential pipeline calibration |
### Miscellaneous Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `output_dir` | `str \| None` | `None` | Directory to save the compressed model. Nothing is saved if `None` |
| `log_dir` | `str \| None` | `None` | Directory to write timestamped log files. Nothing is logged to file if `None` |
## Calibration Pipelines

The `pipeline` argument controls how calibration forward passes are run through the model.

| Pipeline | Description | Best For |
|----------|-------------|----------|
| `independent` | Each modifier manages its own forward passes independently *(default)* | Most use cases |
| `sequential` | Runs calibration layer-by-layer, offloading intermediate activations between layers | Large models that don't fit in GPU memory |
| `datafree` | Runs initialization and finalization without any forward passes | Data-free weight-only quantization |
| `basic` | Single set of forward passes shared across all modifiers | Simple post-hoc calibration |
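For instance, a model too large for a single GPU might be calibrated with the sequential pipeline. In this fragment, the decoder class name and output directory are illustrative assumptions, and `model` and `recipe` are assumed to be defined as in the examples below:

```python
oneshot(
    model=model,                      # a large pre-loaded model
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=recipe,
    pipeline="sequential",
    sequential_targets=["LlamaDecoderLayer"],  # assumed decoder class for a Llama-style model
    sequential_offload_device="cpu",           # offload intermediate activations between layers
    output_dir="my-model-sequential",          # illustrative path
)
```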
## Examples

### FP8 Data-Free Quantization

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```
### GPTQ W4A16

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.gptq import GPTQModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    num_calibration_samples=512,
    max_seq_length=2048,
    output_dir="Meta-Llama-3-8B-Instruct-W4A16-GPTQ",
)
```
### MoE Model with All-Expert Calibration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modeling.llama4 import SequentialLlama4TextMoe  # noqa: F401
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=[
            "re:.*lm_head",
            "re:.*self_attn",
            "re:.*router",
            "re:.*vision_model.*",
            "re:.*multi_modal_projector.*",
            "Llama4TextAttention",
        ],
    ),
    num_calibration_samples=20,
    max_seq_length=2048,
    moe_calibrate_all_experts=True,
    output_dir="Llama-4-Scout-17B-NVFP4",
)
```
## Saving

The recommended way to save is via the `output_dir` argument, which automatically saves the model weights in compressed SafeTensors format along with the tokenizer/processor, recipe, and config:

```python
oneshot(..., output_dir="./my-compressed-model")
```

Alternatively, you can save manually after the call:

```python
model = oneshot(model=model, recipe=recipe)
model.save_pretrained("./my-compressed-model", save_compressed=True)
tokenizer.save_pretrained("./my-compressed-model")
```

For more details on save options, see [Saving a Compressed Model](../saving_a_model.md).