Speculative decoding accelerates auto-regressive generation in large language models (LLMs) by leveraging a lightweight draft model to predict the next γ tokens. The main LLM then verifies these candidate tokens in a single forward pass. If the draft model correctly predicts α tokens, the LLM can accept and generate α+1 tokens per verification step, significantly improving generation speed.
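For intuition, the draft-and-verify loop can be sketched as below. This is a generic greedy-decoding illustration, not ModelOpt's EAGLE implementation; `draft_model`, `target_model`, and `gamma` are stand-ins for the draft module and the base LLM.

```python
import torch

def speculative_step(target_model, draft_model, input_ids, gamma=4):
    """One draft-then-verify step of greedy speculative decoding (illustrative sketch)."""
    prompt_len = input_ids.shape[1]

    # 1) Draft model proposes the next `gamma` tokens auto-regressively.
    draft_ids = input_ids
    for _ in range(gamma):
        next_token = draft_model(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_token], dim=-1)
    proposed = draft_ids[:, prompt_len:]  # the gamma candidate tokens

    # 2) Target model scores prompt + candidates in a single forward pass.
    target_logits = target_model(draft_ids).logits
    # Target's greedy choice at each candidate position.
    target_choice = target_logits[:, prompt_len - 1 : -1, :].argmax(-1)

    # 3) Accept the longest matching prefix (alpha tokens), then append the target's
    #    own token at the first mismatch, yielding alpha + 1 new tokens.
    matches = (proposed == target_choice)[0].long()
    alpha = int(matches.cumprod(dim=0).sum())
    bonus = target_logits[:, prompt_len - 1 + alpha, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, proposed[:, :alpha], bonus], dim=-1)
```

Even when the draft agrees on none of the candidates, the step still emits one token from the target model, so throughput never falls below standard decoding.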
This folder contains an end-to-end runnable speculative decoding fine‑tuning pipeline in which Llama‑3.2‑1B (Hugging Face) is trained on the Daring‑Anteater dataset.
This example focuses on training with Hugging Face. To train with Megatron‑LM, see the [Megatron‑LM example](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).
## Contents
To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
For large-scale data generation, see [SLURM prepare data](SLURM_prepare_data.md) for SLURM support.
### (Optional) Draft Vocabulary Compression
We can optionally use a smaller vocabulary size for the draft model for faster training and inference. For example, Llama3.2-1B has a vocabulary size of 128,256. In this example, we construct a draft vocab mapping of size 32k by selecting the most frequently occurring tokens in our training set:
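The repository includes a script for this calibration step; conceptually, it amounts to counting token frequencies over the training data and keeping the top 32k ids, roughly as in the sketch below. The file names, data field, and on-disk format here are illustrative assumptions, not the exact script interface:

```python
import json
from collections import Counter

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")  # assumed base model

counter = Counter()
with open("daring_anteater_train.jsonl") as f:        # assumed training file
    for line in f:
        text = json.loads(line)["text"]               # assumed field name
        counter.update(tokenizer(text)["input_ids"])

draft_vocab_size = 32000
# The most frequent full-vocab ids become the draft vocab; draft id i maps to top_ids[i].
top_ids = [tok for tok, _ in counter.most_common(draft_vocab_size)]
d2t = torch.tensor(top_ids, dtype=torch.long)         # draft id -> full-vocab id
torch.save(d2t, "save_dir/d2t.pt")
```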
The resulting `d2t.pt` file in `save_dir` is the mapping from draft-vocabulary ids to full-vocabulary ids that our draft model reads later.
### (Optional) Configuring Draft Model
For EAGLE-1 and EAGLE-3 we provide a [default model architecture config](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override the default settings by providing an additional JSON dict. In this example, we override `draft_vocab_size` in `eagle_config.json`:
```json
{
  "draft_vocab_size": 32000
}
```
### Training Draft Model with ModelOpt
`main.py` provides an example of converting an HF base model for speculative decoding and training it. It consists of a few simple steps:
First, load the base model and tokenizer from Hugging Face:
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",  # illustrative base model; use your own HF path
    torch_dtype="auto",
)
```
Then, we convert the model to a speculative decoding model:
```python
import modelopt.torch.speculative as mtsp

# `config` holds the EAGLE settings, e.g. the `draft_vocab_size` override from eagle_config.json
mtsp.convert(model, [("eagle", config)])
```
After training with a standard Hugging Face `Trainer` (see `main.py` for the full setup), save the training state and the final model:

```python
trainer.save_state()
trainer.save_model("<path to the output directory>")
```
We omit details such as tokenizer initialization for simplicity. A complete training example is provided in `main.py`, along with a bash script, `launch_train.sh`, that launches training with Hugging Face Accelerate and can be run with:
```bash
./launch_train.sh --model $BASE_MODEL \
    --output_dir $OUTPUT_DIR \
    --data $DATA \
    --num_gpu $NUM_GPU \
    --num_epochs 10 \
    --eagle_config eagle_config.json # optionally override the default EAGLE settings
```
The saved ModelOpt checkpoint is architecturally similar to an HF model and can be further optimized through **ModelOpt**, e.g., with PTQ and QAT.
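For instance, post-training quantization could be applied with ModelOpt's quantization API, roughly as sketched below; the chosen config and the calibration dataloader are assumptions, and the llm_ptq example covers the supported workflow in detail.

```python
import modelopt.torch.quantization as mtq

# Calibration loop: run a small amount of representative data through the model.
def forward_loop(model):
    for batch in calib_dataloader:  # assumed dataloader of tokenized samples
        model(**batch)

# Apply FP8 post-training quantization using one of ModelOpt's predefined configs.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```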
After training the draft model, we can evaluate the saved ModelOpt checkpoint on MT-Bench by running:
```bash
python ar_validate.py --model_path $OUTPUT_DIR
```
Alternatively, we can export the checkpoint and run evaluation on serving frameworks; see the sections below.
## Speculation Module Checkpoints
Ready-to-deploy speculation module checkpoints \[[🤗 Hugging Face - NVIDIA TensorRT Model Optimizer Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)\]
Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang)!
```bash
# Launch distributed data generation on SLURM nodes (see SLURM_prepare_data.md)
bash distributed_generate/launch.sh $SLURM_JOB_ID vllm TinyLlama/TinyLlama-1.1B-Chat-v1.0 /data/train/ /data/output /scripts/ 0 10 n1,n2,n3,n4 "\"You are a helpful assistant.\""
```
`/scripts/` is the absolute path to `modelopt/examples/speculative_decoding`, which contains `server_generate.py` and `distributed_generate`.

This launches a vLLM server (SGLang is also available) on each node. Each node works through 10 shards of data (10 × `max_lines_per_shard` samples), so with the four nodes n1-n4 the first 40 shards of data are processed.