
Commit 95bbe5d

update eagle offline to example
Signed-off-by: h-guo18 <[email protected]>
1 parent 26c203a commit 95bbe5d


examples/speculative_decoding/README.md

Lines changed: 111 additions & 82 deletions
@@ -15,8 +15,11 @@ This example focuses on training with Hugging Face. To train with Megatron‑LM,
 | **Section** | **Description** | **Jump To** |
 | :------------: | :------------: | :------------: |
 | Pre-Requisites | Required & optional dependencies | \[[Link](#pre-requisites)\] |
-| Simplified Workflow | Train, evaluate, and export eagle model with one-line command | \[[Link](#getting-started-simplified-workflow)\] |
-| Complete Workflow | Full example with configurable training pipeline | \[[Link](#complete-workflow)\] |
+| Simplified Workflow | Train, evaluate, and export EAGLE model with one-line command | \[[Link](#getting-started-simplified-workflow)\] |
+| Online Training | Train draft model alongside base model in GPU memory | \[[Link](#training-draft-model-with-online-base-model)\] |
+| Offline Training | Train draft model using pre-computed hidden states | \[[Link](#training-draft-model-with-offline-base-model)\] |
+| After Training | Evaluation, export and deployment | \[[Link](#model-validation)\] |
+| Advanced Usage | Data synthesis, vocab compression, and configuration | \[[Link](#advanced-usage)\] |
 | Support Matrix | Supported models for speculative decoding training | \[[Link](#support-matrix)\] |
 | Speculation Module Checkpoints | View pre-trained speculation modules ready to deploy! | \[[Link](#speculation-module-checkpoints)\] |
 | Resources | Extra links to relevant resources | \[[Link](#resources)\] |
@@ -61,13 +64,111 @@ This one-line command runs a minimal example workflow of training and exporting
 - Evaluates the acceptance rate on [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts)
 - Exports a checkpoint ready for deployment
 
-## Complete Workflow
+## Training Draft Model with Online Base Model
 
-This section presents a more comprehensive example for customizing speculative decoding training with Modelopt, including optional steps to enhance training quality and efficiency.
+For small base models that fit in GPU memory, we can collocate them with draft models and train with the following command:
 
-### (Optional) Data Synthesis
+```bash
+./launch_train.sh --model $BASE_MODEL \
+    --output_dir $OUTPUT_DIR \
+    --data Daring-Anteater/train.jsonl \
+    --num_gpu $NUM_GPU \
+    --num_epochs $NUM_EPOCH \
+    --eagle_config eagle_config.json
+```
+
+This command will launch `main.py` with `accelerate`. See [section: interact with modelopt.torch.speculative](#interact-with-modelopttorchspeculative) for more details.
+The saved modelopt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., PTQ and QAT.
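A PTQ pass over the saved checkpoint can look roughly like the following (a minimal sketch, not part of this example's scripts; the checkpoint path, the `FP8_DEFAULT_CFG` choice, and the calibration loop are placeholders):

```python
# Sketch only: post-training quantization of the trained ModelOpt checkpoint.
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

mto.enable_huggingface_checkpointing()  # lets from_pretrained restore ModelOpt state
model = AutoModelForCausalLM.from_pretrained("<modelopt_checkpoint_dir>")  # hypothetical path

def forward_loop(model):
    # Run a small set of calibration prompts through the model here.
    ...

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```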
+
+## Training Draft Model with Offline Base Model
+
+For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of storage depending on dataset size.
+
+First, dump the base model's hidden states with the following command:
+
+```bash
+python collect_hidden_states/compute_hidden_states_hf.py \
+    --model $BASE_MODEL \
+    --input-file Daring-Anteater/train.jsonl \
+    --output-dir $HIDDEN_STATES_DIR
+```
+
+Then, train the draft model with the `--offline-data` argument:
+
+```bash
+./launch_train.sh --model $BASE_MODEL \
+    --output_dir $OUTPUT_DIR \
+    --data $DATA \
+    --num_gpu $NUM_GPU \
+    --num_epochs $NUM_EPOCH \
+    --eagle_config eagle_config.json \
+    --offline-data $HIDDEN_STATES_DIR
+```
+
+## Model Validation
+
+After training the draft model, we can evaluate the saved modelopt checkpoint on MT-Bench with:
+
+```bash
+python ar_validate.py --model_path $OUTPUT_DIR
+```
+
+Alternatively, we can export the checkpoint and run evaluation on serving frameworks. See the sections below.
+
+## Export
+
+```bash
+python export_hf_checkpoint.py --model_path $OUTPUT_DIR --export_path $EXPORT_PATH
+```
+
+This exports the model from a ModelOpt checkpoint to a deployment-compatible format.
+
+## Deployment
+
+The exported checkpoint can be deployed on TRT-LLM or SGLang.
+
+### TRT-LLM
+
+To serve the checkpoint with TRT-LLM, run trtllm-serve with:
+
+```bash
+trtllm-serve <base_model_checkpoint> --host 0.0.0.0 --port 8000 --backend pytorch --max_batch_size 32 --max_num_tokens 8192 --max_seq_len 8192 --extra_llm_api_options extra-llm-api-config.yml
+```
+
+where `extra-llm-api-config.yml` is:
+
+```yaml
+enable_attention_dp: false
+disable_overlap_scheduler: true
+enable_autotuner: false
+
+cuda_graph_config:
+  max_batch_size: 1
+
+speculative_config:
+  decoding_type: Eagle
+  max_draft_len: 3
+  speculative_model_dir: <draft_model_checkpoint>
 
-To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data, ensuring that the draft model’s output distribution closely aligns with that of the base model.
+kv_cache_config:
+  enable_block_reuse: false
+```
+
+Please refer to [TRT-LLM Doc: Speculative Decoding](https://nvidia.github.io/TensorRT-LLM/examples/llm_speculative_decoding.html) for detailed usage.
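Once the server is up, it can be smoke-tested through the OpenAI-compatible API that trtllm-serve exposes (an illustrative snippet; the model id and prompt are placeholders):

```python
# Illustrative client call against the trtllm-serve endpoint launched above.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="<base_model_checkpoint>",  # same id passed to trtllm-serve
    messages=[{"role": "user", "content": "What is speculative decoding?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```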
+
+### SGLang
+
+Please refer to [SGLang Doc: Speculative Decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-3-Decoding) for detailed usage.
+
+### Deploying Quantized Model
+
+See more details on deploying quantized models to TRT-LLM [here](../llm_ptq/README.md).
+
+## Advanced Usage
+
+### Data Synthesis
+
+To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data. This ensures that the draft model's output distribution closely aligns with that of the base model.
 
 To prepare such data, we launch an inference server with the base model:
 
@@ -78,7 +179,7 @@ vllm serve meta-llama/Llama-3.2-1B-Instruct --api-key token-abc123 --port 8000
 
 Note: Add `--quantization=modelopt` flag for quantized models.
 
-Then, we generate conversations with base model and prompts from Daring-Anteater:
+Then, we generate conversations with the base model using prompts from Daring-Anteater:
 
 ```bash
 python server_generate.py --data_path Daring-Anteater/train.jsonl --output_path synthetic/train.jsonl
@@ -88,7 +189,7 @@ To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
 
 For large-scale data generation, please see [SLURM prepare data](SLURM_prepare_data.md) for SLURM support.
 
-### (Optional) Draft Vocabulary Compression
+### Draft Vocabulary Compression
 
 We can optionally use a smaller vocab size for the draft model for faster training and inference. E.g., Llama3.2-1B has a vocab size of 128256. In this example, we construct a draft vocab mapping of size 32k by finding the most frequently occurring tokens in our training set:
 
@@ -98,7 +199,7 @@ python calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct --data
 
 This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
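For illustration, applying that mapping at inference time is a simple indexed add (a minimal sketch; the token id below is hypothetical):

```python
# Illustrative use of the d2t mapping produced by calibrate_draft_vocab.py.
import torch

d2t = torch.load("save_dir/d2t.pt")  # per-draft-token offsets into the target vocabulary
draft_token = 123                    # hypothetical draft-vocab token id
target_token = draft_token + int(d2t[draft_token])
```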
 
-### (Optional) Configuring Draft Model
+### Configuring Draft Model
 
 For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings by providing an additional JSON dict. In this example, we override `draft_vocab_size` in `eagle_config.json`:
 
@@ -108,7 +209,7 @@ For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](htt
 }
 ```
 
-### Training Draft Model with Modelopt
+### Interact with `modelopt.torch.speculative`
 
 `main.py` provides an example for converting a HF base model for speculative decoding and training it. It consists of a few simple steps:
 First, load the base model and tokenizer from Hugging Face:
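The loading code itself is unchanged by this commit and therefore not shown in the diff; as context, it follows the standard Hugging Face pattern (a minimal sketch, reusing the model id from the data-synthesis example):

```python
# Sketch of the loading step referenced above; the exact arguments in main.py may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-3.2-1B-Instruct"  # model id used elsewhere in this example
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)
```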
@@ -162,78 +263,6 @@ trainer.save_state()
 trainer.save_model("<path to the output directory>")
 ```
 
-We omitted details like tokenizer initialization for simplicity. A complete training example is provided in `main.py`, along with a bash script to launch training with Hugging Face Accelerate in `launch_train.sh`, which can be run by:
-
-```bash
-./launch_train.sh --model $BASE_MODEL \
-    --output_dir $OUTPUT_DIR \
-    --data $DATA \
-    --num_gpu $NUM_GPU \
-    --num_epochs 10 \
-    --eagle_config eagle_config.json #This is where we optionally overwrite default eagle configs
-```
-
-The saved modelopt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., PTQ and QAT.
-
-### Model Validation
-
-After training draft model, we can evaluate the saved modelopt checkpoint on MT-bench by:
-
-```bash
-python ar_validate.py --model_path $OUTPUT_DIR
-```
-
-Alternatively, we can export the checkpoint and run evaluation on serving frameworks. See sections below.
-
-### Export
-
-```bash
-python export_hf_checkpoint.py --model_path $OUTPUT_DIR --export_path $EXPORT_PATH
-```
-
-This exports the model from a ModelOpt checkpoint to a deployment‑compatible format.
-
-### Deployment
-
-The exported checkpoint can be deployed on TRT-LLM or SGLang.
-
-#### TRT-LLM
-
-To serve the checkpoint with trtllm, run trtllm-serve with:
-
-```bash
-trtllm-serve <base_model_checkpoint> --host 0.0.0.0 --port 8000 --backend pytorch --max_batch_size 32 --max_num_tokens 8192 --max_seq_len 8192 --extra_llm_api_options extra-llm-api-config.yml
-```
-
-, with `extra-llm-api-config.yml` being
-
-```yaml
-enable_attention_dp: false
-disable_overlap_scheduler: true
-enable_autotuner: false
-
-cuda_graph_config:
-  max_batch_size: 1
-
-speculative_config:
-  decoding_type: Eagle
-  max_draft_len: 3
-  speculative_model_dir: <draft_model_checkpoint>
-
-kv_cache_config:
-  enable_block_reuse: false
-```
-
-Please refer to [TRT-LLM Doc: Speculative Decoding](https://nvidia.github.io/TensorRT-LLM/examples/llm_speculative_decoding.html) for detailed usage.
-
-#### SGLang
-
-Please refer to [SGLang Doc: Speculative Decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-3-Decoding) for detailed usage.
-
-#### Deploying Quantized model
-
-See more details on deployment of quantized model to TRTLLM [here](../llm_ptq/README.md).
-
 ## Support Matrix
 
 | Model | Medusa | EAGLE1/2 | EAGLE3 |
