# oneshot

`oneshot` is the primary entrypoint for post-training quantization (PTQ) when your algorithm or scheme requires calibration data. It loads a model through Hugging Face `transformers`, applies recipe-defined modifiers (such as GPTQ, AWQ, SmoothQuant, or QuantizationModifier), and optionally saves the compressed result.

## When to Use

Use `oneshot` when:

- Your quantization algorithm **requires calibration data** (GPTQ, AWQ, SmoothQuant, AutoRound)
- Your scheme uses **static activation quantization** that requires calibration (FP8 per-tensor, INT8 per-tensor, NVFP4 with activations)
- Your model has a **Hugging Face model definition** available via `transformers`

For data-free schemes on models without a transformers definition, or when `oneshot` fails, see [`model_free_ptq`](model-free-ptq.md).

## Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```

## Lifecycle

The `oneshot` entrypoint runs three phases:

1. **Preprocessing**
   - Loads the model and tokenizer/processor from the provided identifiers or objects
   - Unties input and output embedding layers if they share tensors
   - Patches `save_pretrained` to support compressed-tensors serialization

2. **Calibration**
   - Wraps the model in a MoE calibration context (if applicable) to ensure all experts receive calibration data
   - Initializes modifiers defined in the recipe via a global `CompressionSession`
   - Runs calibration forward passes through the selected [pipeline](#calibration-pipelines)
   - Finalizes modifiers, applying any post-calibration transformations

3. **Postprocessing**
   - Saves the compressed model, tokenizer/processor, recipe, and config to `output_dir` (if specified)
   - Weights are saved in a compressed SafeTensors format via `compressed-tensors`
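The phase ordering can be sketched in plain Python. This is purely conceptual — the function names below are illustrative placeholders, not part of the llmcompressor API:

```python
# Conceptual sketch of the oneshot lifecycle. Each placeholder function
# stands in for the real phase; none of these names are llmcompressor internals.

def preprocess(state):
    # Load model/tokenizer, untie embeddings, patch save_pretrained
    state["phases"].append("preprocess")

def calibrate(state):
    # Initialize modifiers, run calibration forward passes, finalize
    state["phases"].append("calibrate")

def postprocess(state):
    # Save compressed model, tokenizer/processor, recipe, and config
    state["phases"].append("postprocess")

def oneshot_lifecycle():
    state = {"phases": []}
    # The three phases always run in this fixed order
    for phase in (preprocess, calibrate, postprocess):
        phase(state)
    return state["phases"]

print(oneshot_lifecycle())  # ['preprocess', 'calibrate', 'postprocess']
```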

## Arguments

### Model Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model` | `str \| PreTrainedModel` | — | Hugging Face model ID, local path, or a pre-loaded model instance |
| `tokenizer` | `str \| PreTrainedTokenizerBase \| None` | `None` | Tokenizer ID or path. Inferred from `model` if not set |
| `processor` | `str \| ProcessorMixin \| None` | `None` | Processor ID or path (for multimodal models). Inferred from `model` if not set |
| `config_name` | `str \| None` | `None` | Config name or path if different from `model` |
| `precision` | `str` | `"auto"` | Precision to cast model weights to on load (e.g. `"float16"`, `"bfloat16"`, `"auto"`) |
| `trust_remote_code_model` | `bool` | `False` | Allow custom model code from the repository |
| `save_compressed` | `bool` | `True` | Whether to save weights in compressed format |
| `model_revision` | `str` | `"main"` | Model version (branch, tag, or commit) |

### Recipe Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `recipe` | `str \| list \| None` | `None` | Path to a recipe file, a list of paths, or a modifier object / list of modifier objects |
| `recipe_args` | `list[str] \| None` | `None` | Recipe argument overrides in `"key=value"` format |
| `stage` | `str \| None` | `None` | Specific recipe stage to run. Runs all stages if not set |
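Instead of constructing modifier objects in Python, a recipe can also be written as a YAML file and passed by path. A minimal sketch of such a file, following llmcompressor's stage/modifier recipe conventions (the stage name `quant_stage` is an arbitrary label chosen here):

```yaml
# recipe.yaml — one stage containing a single quantization modifier
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: FP8_DYNAMIC
      ignore: ["lm_head"]
```

This is then passed as `oneshot(model=model, recipe="recipe.yaml", ...)`, and individual recipe values can be overridden at call time via `recipe_args` entries in `"key=value"` form.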

### Dataset Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `dataset` | `str \| Dataset \| DatasetDict \| DataLoader \| None` | `None` | Dataset name (Hugging Face Hub), a pre-loaded Dataset/DatasetDict, or a PyTorch DataLoader |
| `dataset_config_name` | `str \| None` | `None` | Hugging Face dataset configuration name |
| `dataset_path` | `str \| None` | `None` | Path to a local dataset (JSON, CSV, or DVC) |
| `num_calibration_samples` | `int` | `512` | Number of samples to use for calibration |
| `max_seq_length` | `int` | `384` | Maximum sequence length after tokenization. Longer sequences are truncated |
| `batch_size` | `int` | `1` | Calibration batch size |
| `data_collator` | `str \| Callable` | `"truncation"` | Batch collation strategy: `"truncation"`, `"padding"`, or a custom callable |
| `shuffle_calibration_samples` | `bool` | `True` | Whether to shuffle the dataset before selecting calibration samples |
| `text_column` | `str` | `"text"` | Dataset column to use as text input to the tokenizer/processor |
| `concatenate_data` | `bool` | `False` | Whether to concatenate samples to fill `max_seq_length` |
| `streaming` | `bool` | `False` | Stream data from a cloud-hosted dataset |
| `preprocessing_num_workers` | `int \| None` | `None` | Number of workers for dataset preprocessing |
| `dataloader_num_workers` | `int` | `0` | Number of workers for the DataLoader. Set to 2+ for faster loading if RAM allows |
| `moe_calibrate_all_experts` | `bool` | `True` | Route all tokens through all experts during calibration. Required for accurate MoE quantization |
| `min_tokens_per_module` | `float \| None` | `None` | Minimum fraction of tokens a module must receive. Logs a warning if unmet. Mainly relevant for MoE models |
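To illustrate how the two built-in `data_collator` strategies differ, here is a conceptual sketch in plain Python operating on lists of token IDs. This is not the library's implementation — just the general idea of the two strategies:

```python
# Conceptual sketch of the two built-in collation strategies.
# Not the actual llmcompressor collator code.

def collate_truncation(batch):
    # Truncate every sequence to the shortest length in the batch,
    # so the batch forms a rectangular tensor without pad tokens
    n = min(len(seq) for seq in batch)
    return [seq[:n] for seq in batch]

def collate_padding(batch, pad_id=0):
    # Pad every sequence up to the longest length in the batch
    n = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (n - len(seq)) for seq in batch]

batch = [[1, 2, 3, 4], [5, 6]]
print(collate_truncation(batch))  # [[1, 2], [5, 6]]
print(collate_padding(batch))     # [[1, 2, 3, 4], [5, 6, 0, 0]]
```

With the default `batch_size=1` the two strategies behave identically, since there is nothing to align within a batch.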

### Pipeline Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `pipeline` | `str \| None` | `"independent"` | Calibration pipeline to use. See [Calibration Pipelines](#calibration-pipelines) |
| `sequential_targets` | `list[str] \| None` | `None` | Layer targets for the sequential pipeline (typically a single decoder layer class). Defaults to `no_split_modules` from the HF model definition |
| `sequential_offload_device` | `str` | `"cpu"` | Device to offload intermediate activations between sequential layers. Use `"cuda:1"` if a second GPU is available |
| `quantization_aware_calibration` | `bool` | `True` | Apply quantization during the calibration forward pass in the sequential pipeline |
| `sequential_prefetch` | `bool` | `False` | Prefetch the next batch in a background thread during sequential pipeline calibration |

### Miscellaneous Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `output_dir` | `str \| None` | `None` | Directory to save the compressed model. Nothing is saved if `None` |
| `log_dir` | `str \| None` | `None` | Directory to write timestamped log files. Nothing is logged to file if `None` |

## Calibration Pipelines

The `pipeline` argument controls how calibration forward passes are run through the model.

| Pipeline | Description | Best For |
|----------|-------------|----------|
| `independent` | Each modifier manages its own forward passes independently *(default)* | Most use cases |
| `sequential` | Runs calibration layer-by-layer, offloading intermediate activations between layers | Large models that don't fit in GPU memory |
| `datafree` | Runs initialization and finalization without any forward passes | Data-free weight-only quantization |
| `basic` | Single set of forward passes shared across all modifiers | Simple post-hoc calibration |
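The key difference between `basic` and `sequential` calibration can be sketched with plain functions standing in for decoder layers. This is illustrative only — the real pipelines operate on `torch` modules with device offloading:

```python
# Conceptual sketch: a "basic"-style pipeline runs the whole model per batch,
# while a "sequential"-style pipeline runs one layer at a time over all batches,
# caching intermediate activations so only one layer needs GPU memory at once.

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
batches = [1, 2, 3]

def basic_pipeline(layers, batches):
    outputs = []
    for x in batches:            # full forward pass per batch
        for layer in layers:
            x = layer(x)
        outputs.append(x)
    return outputs

def sequential_pipeline(layers, batches):
    activations = list(batches)
    for layer in layers:         # one layer resident at a time
        # a modifier would calibrate `layer` here using its cached inputs,
        # then the updated activations propagate to the next layer
        activations = [layer(x) for x in activations]
    return activations

assert basic_pipeline(layers, batches) == sequential_pipeline(layers, batches)
print(sequential_pipeline(layers, batches))  # [1, 3, 5]
```

Both orderings produce identical final activations; the sequential variant trades extra activation bookkeeping for a much smaller peak memory footprint.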

## Examples

### FP8 Data-Free Quantization

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    recipe=QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-FP8",
)
```

### GPTQ W4A16

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.gptq import GPTQModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    num_calibration_samples=512,
    max_seq_length=2048,
    output_dir="Meta-Llama-3-8B-Instruct-W4A16-GPTQ",
)
```

### MoE Model with All-Expert Calibration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modeling.llama4 import SequentialLlama4TextMoe  # noqa: F401
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=[
            "re:.*lm_head",
            "re:.*self_attn",
            "re:.*router",
            "re:.*vision_model.*",
            "re:.*multi_modal_projector.*",
            "Llama4TextAttention",
        ],
    ),
    num_calibration_samples=20,
    max_seq_length=2048,
    moe_calibrate_all_experts=True,
    output_dir="Llama-4-Scout-17B-NVFP4",
)
```

## Saving

The recommended way to save is via the `output_dir` argument, which automatically saves the model weights in compressed SafeTensors format along with the tokenizer/processor, recipe, and config:

```python
oneshot(..., output_dir="./my-compressed-model")
```

Alternatively, you can save manually after the call:

```python
model = oneshot(model=model, recipe=recipe)
model.save_pretrained("./my-compressed-model", save_compressed=True)
tokenizer.save_pretrained("./my-compressed-model")
```

For more details on save options, see [Saving a Compressed Model](../saving_a_model.md).