This one-line command runs a minimal example workflow of training and exporting a draft model. It:

- Evaluates the acceptance rate on [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts)
- Exports a checkpoint ready for deployment
## Training Draft Model with Online Base Model
For small base models that fit in GPU memory, we can collocate them with draft models and train with the following command:
```bash
./launch_train.sh --model $BASE_MODEL \
    --output_dir $OUTPUT_DIR \
    --data Daring-Anteater/train.jsonl \
    --num_gpu $NUM_GPU \
    --num_epochs $NUM_EPOCH \
    --eagle_config eagle_config.json
```
This command will launch `main.py` with `accelerate`. See [section: interact with modelopt.torch.speculative](#interact-with-modelopttorchspeculative) for more details.
The saved modelopt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., PTQ and QAT.
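
For instance, the trained checkpoint could go through post-training quantization with `modelopt.torch.quantization`. The sketch below is illustrative only; the checkpoint path, calibration prompts, and chosen quantization config are placeholders, and the supported workflow is described in the [llm_ptq example](../llm_ptq/README.md):

```python
# Illustrative PTQ sketch; paths, prompts, and the config choice are placeholders.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "<path to the trained checkpoint>"
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(ckpt)
calib_prompts = ["Hello, how are you?"]  # replace with a real calibration set

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect statistics.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```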
## Training Draft Model with Offline Base Model
For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of storage depending on dataset size.
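
As a rough, illustrative estimate (the exact layout depends on the model and configuration): storing four 4096-dimensional hidden-state vectors per token in bf16 costs about 32 KB per token, so a corpus of one billion training tokens would occupy roughly 32 TB.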
First, dump the base model's hidden states. See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) for a simple example that uses data parallelism (DP) to accelerate hidden-state generation.
Then, train the draft model with the `--offline-data` argument:
```bash
./launch_train.sh --model $BASE_MODEL \
    --output_dir $OUTPUT_DIR \
    --data $DATA \
    --num_gpu $NUM_GPU \
    --num_epochs $NUM_EPOCH \
    --eagle_config eagle_config.json \
    --offline-data $HIDDEN_STATES_DIR
```
## Model Validation
After training the draft model, we can evaluate the saved modelopt checkpoint on MT-Bench by:
```bash
python ar_validate.py --model_path $OUTPUT_DIR
```
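
The reported acceptance rate reflects how often the base model accepts the draft model's proposed tokens during verification; a higher rate generally translates into a larger end-to-end speedup.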
Alternatively, we can export the checkpoint and run evaluation on serving frameworks. See sections below.
### TensorRT-LLM

Please refer to [TRT-LLM Doc: Speculative Decoding](https://nvidia.github.io/TensorRT-LLM/examples/llm_speculative_decoding.html) for detailed usage.
### SGLang
Please refer to [SGLang Doc: Speculative Decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-3-Decoding) for detailed usage.
### Deploying Quantized Model
See more details on deploying a quantized model to TRT-LLM [here](../llm_ptq/README.md).
## Advanced Usage
### Data Synthesis
To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data. This ensures that the draft model's output distribution closely aligns with that of the base model.
To prepare such data, we launch an inference server with the base model and use it to generate responses to the training prompts. To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
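
As an illustration only (the endpoint URL, model name, prompt source, and output format below are assumptions rather than this example's actual scripts), the generation loop could look roughly like this against an OpenAI-compatible server:

```python
# Hypothetical sketch: collect base-model responses for draft-model training
# through an OpenAI-compatible endpoint (e.g. one started with `vllm serve $BASE_MODEL`).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompts = ["Explain speculative decoding in one paragraph."]  # replace with training prompts

with open("synthetic_train.jsonl", "w") as fout:
    for prompt in prompts:
        reply = client.chat.completions.create(
            model="base-model",  # the name under which the server registered the base model
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Save the full conversation so the draft model is trained on the base model's outputs.
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]}
        fout.write(json.dumps(record) + "\n")
```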
For large scale data generation, please see [SLURM prepare data](SLURM_prepare_data.md) for SLURM support.
### Draft Vocabulary Compression
We can optionally use a smaller vocabulary for the draft model for faster training and inference. For example, Llama3.2-1B has a vocab size of 128256; in this example, we construct a draft vocab mapping of size 32k by finding the most frequently occurring tokens in our training set.

This produces a `d2t.pt` file in `save_dir`, which maps draft tokens to target tokens. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
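
As a rough sketch of the idea (not the script used in this example; the token-id source and file paths are placeholders), such a mapping could be built as follows:

```python
# Hypothetical sketch of building a draft-to-target vocab mapping (d2t):
# keep the top-k most frequent target-token ids as the draft vocabulary and
# store offsets so that target_token = draft_token + d2t[draft_token].
from collections import Counter
import torch

DRAFT_VOCAB_SIZE = 32000

# Replace with token ids from your tokenized training set.
tokenized_training_samples = [[1, 15, 15, 2], [1, 42, 15, 2]]

counter = Counter()
for ids in tokenized_training_samples:
    counter.update(ids)

# The most frequent target-token ids form the draft vocabulary (sorted for determinism).
top_ids = sorted(tid for tid, _ in counter.most_common(DRAFT_VOCAB_SIZE))

# d2t[draft_token] = target_token - draft_token
d2t = torch.tensor([tid - i for i, tid in enumerate(top_ids)], dtype=torch.long)
torch.save(d2t, "d2t.pt")  # the real script stores this as `d2t.pt` under `save_dir`
```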
### Configuring Draft Model
For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings by providing an additional JSON dict. In this example, we override `draft_vocab_size` in `eagle_config.json`:
```json
{
    "draft_vocab_size": 32000
}
```
### Interact with `modelopt.torch.speculative`
`main.py` provides an example of converting a HF base model for speculative decoding and training it. It consists of a few simple steps:
First, load the base model and tokenizer from Hugging Face:
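
A minimal sketch of this step (the model name and dtype handling are placeholders; see `main.py` for the exact arguments):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-3.2-1B-Instruct"  # hypothetical example
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
```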
The model is then converted for speculative decoding with `modelopt.torch.speculative` and trained with a Hugging Face `Trainer`; see `main.py` for the full flow. Finally, save the trainer state and the trained model:

```python
trainer.save_state()
trainer.save_model("<path to the output directory>")
```

We omit details such as tokenizer handling here for simplicity; the complete training example is in `main.py`, and `launch_train.sh` launches it with Hugging Face Accelerate.