| Resources | Extra links to relevant resources |\[[Link](#resources)\]|

This one-line command runs a minimal example workflow of training and exporting a draft model, which:

- Evaluates the acceptance rate on [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts)
- Exports a checkpoint ready for deployment

## Training Draft Model with Online Base Model

For small base models that fit in GPU memory, we can collocate them with draft models and train with the following command:

```bash
./launch_train.sh --model $BASE_MODEL \
    --output_dir $OUTPUT_DIR \
    --data Daring-Anteater/train.jsonl \
    --num_gpu $NUM_GPU \
    --num_epochs $NUM_EPOCH \
    --eagle_config eagle_config.json
```

This command will launch `main.py` with `accelerate`. See [section: interact with modelopt.torch.speculative](#interact-with-modelopttorchspeculative) for more details.

The saved modelopt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., PTQ and QAT.
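
For instance, a post-training quantization pass over the trained checkpoint could look roughly like the sketch below (the checkpoint path, calibration prompts, and quantization format are hypothetical placeholders; the `llm_ptq` example covers the supported quantization workflows):

```python
# Sketch only: PTQ of the saved checkpoint with ModelOpt. Paths and config choice are hypothetical.
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

mto.enable_huggingface_checkpointing()  # restore ModelOpt-converted modules when loading

ckpt = "<path to the output directory>"
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(ckpt)

def forward_loop(m):
    # Calibrate on a few representative prompts (use real calibration data in practice).
    for prompt in ["Hello, how are you?", "Explain speculative decoding in one sentence."]:
        m(**tokenizer(prompt, return_tensors="pt"))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```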

## Training Draft Model with Offline Base Model

For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of storage, depending on dataset size.

First, dump the base model's hidden states with the following command:
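
Conceptually, this step runs every training sample through the frozen base model once and writes the resulting hidden states to disk, roughly as in the sketch below (model name, file layout, and data handling are hypothetical placeholders, not the actual dump script):

```python
# Conceptual sketch of offline hidden-state collection; not the actual dump script.
import torch
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.2-1B-Instruct"  # hypothetical base model
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

Path("hidden_states").mkdir(exist_ok=True)
conversations = ["Hello, how are you?"]  # in practice: the full training set
for idx, text in enumerate(conversations):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Persist the per-token hidden states the draft model will later be trained against.
    torch.save(
        {
            "input_ids": inputs["input_ids"].cpu(),
            "hidden_states": [h.cpu() for h in outputs.hidden_states],
        },
        f"hidden_states/sample_{idx:08d}.pt",
    )
```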

### TRT-LLM

When deploying the exported checkpoint with TRT-LLM, disable KV cache block reuse in the serving configuration:

```yaml
kv_cache_config:
  enable_block_reuse: false
```

Please refer to [TRT-LLM Doc: Speculative Decoding](https://nvidia.github.io/TensorRT-LLM/examples/llm_speculative_decoding.html) for detailed usage.

### SGLang

Please refer to [SGLang Doc: Speculative Decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-3-Decoding) for detailed usage.

### Deploying Quantized model

See more details on deployment of quantized model to TRTLLM [here](../llm_ptq/README.md).

## Advanced Usage

### Data Synthesis

To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data. This ensures that the draft model's output distribution closely aligns with that of the base model.

To prepare such data, we launch an inference server with the base model:
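
Once the server is up, the synthesis step sends prompts from the training set to the served base model and records its responses as new conversations. A minimal sketch using an OpenAI-compatible client (endpoint, model name, and file handling are hypothetical placeholders; the repository provides its own generation script):

```python
# Sketch of regenerating assistant turns with the served base model.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # hypothetical endpoint

with open("synthesized.jsonl", "w") as out:
    for prompt in ["Hello, how are you?"]:  # in practice: prompts from the training set
        response = client.chat.completions.create(
            model="<served base model name>",
            messages=[{"role": "user", "content": prompt}],
        )
        record = {
            "conversations": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response.choices[0].message.content},
            ]
        }
        out.write(json.dumps(record) + "\n")
```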

To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.

For large scale data generation, please see [SLURM prepare data](SLURM_prepare_data.md) for SLURM support.

### Draft Vocabulary Compression

We can optionally use a smaller vocab size for the draft model for faster training and inference. For example, Llama3.2-1B has a vocab size of 128256. In this example, we construct a draft vocab mapping of size 32k by finding the most frequently occurring tokens in our training set:

This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft tokens to target tokens. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
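
As an illustration of how the mapping is applied at inference time (file path and token id below are hypothetical):

```python
import torch

# d2t[i] is the offset from draft-vocab token id i to its id in the full target vocabulary.
d2t = torch.load("save_dir/d2t.pt")

draft_token = 123  # hypothetical token id produced by the draft model
target_token = draft_token + d2t[draft_token]
```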

### Configuring Draft Model

For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings by providing an additional JSON dict. In this example, we override `draft_vocab_size` in `eagle_config.json`:

```json
{
    "draft_vocab_size": 32000
}
```

### Interact with `modelopt.torch.speculative`

`main.py` provides an example for converting a HF base model for speculative decoding and training it. It consists of a few simple steps:

First, load the base model and tokenizer from Hugging Face:
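
A typical loading step might look like the following (the model name is a hypothetical placeholder; the conversion to a speculative-decoding model and the Hugging Face `Trainer` setup that follow are shown in `main.py`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.2-1B-Instruct"  # hypothetical base model
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)
```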

After training completes, save the trainer state and the trained model:

```python
trainer.save_state()
trainer.save_model("<path to the output directory>")
```

We omitted details like tokenizer initialization for simplicity. A complete training example is provided in `main.py`, along with `launch_train.sh`, a bash script that launches training with Hugging Face Accelerate.

### Model Validation

After training the draft model, we can evaluate the saved modelopt checkpoint on MT-Bench by:

```bash
python ar_validate.py --model_path $OUTPUT_DIR
```

Alternatively, we can export the checkpoint and run evaluation on serving frameworks; see the deployment sections above.