Speculative decoding accelerates auto-regressive generation in large language models (LLMs) by leveraging a lightweight draft model to predict the next γ tokens. The main LLM then verifies these candidate tokens in a single forward pass. If the draft model correctly predicts α tokens, the LLM can accept and generate α+1 tokens per verification step, significantly improving generation speed.
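To make the accept/verify loop concrete, here is a minimal greedy-verification sketch in plain Python. It is only conceptual: `draft_model` and `target_model` are placeholder callables, and a real implementation verifies all draft tokens in a single batched forward pass of the target model.

```python
def speculative_step(target_model, draft_model, tokens, gamma=4):
    """One speculative decoding step: draft gamma tokens, then verify greedily."""
    # 1) The draft model proposes the next `gamma` tokens auto-regressively.
    draft, ctx = [], list(tokens)
    for _ in range(gamma):
        nxt = draft_model(ctx)
        draft.append(nxt)
        ctx.append(nxt)

    # 2) The target model accepts the longest prefix that matches its own greedy
    #    choices, then always contributes one token of its own (the "+1").
    accepted = list(tokens)
    for nxt in draft:
        expected = target_model(accepted)  # in practice: one batched forward pass
        if nxt != expected:
            accepted.append(expected)      # draft rejected; keep the target's token
            return accepted
        accepted.append(nxt)
    accepted.append(target_model(accepted))  # all gamma drafts accepted -> bonus token
    return accepted


# Toy usage with trivial "models" that just return the current sequence length.
draft = target = lambda ctx: len(ctx)
print(speculative_step(target, draft, [1, 2, 3]))  # -> [1, 2, 3, 3, 4, 5, 6, 7]
```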
This folder contains an end-to-end runnable speculative decoding fine-tuning pipeline in which Llama-3.2-1B (Hugging Face) is trained on the Daring-Anteater dataset.
This example focuses on training with Hugging Face. To train with Megatron-LM, see the [Megatron-LM example](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).
## Contents
To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
For large-scale data generation, please see [SLURM prepare data](SLURM_prepare_data.md) for SLURM support.
### (Optional) Draft Vocabulary Compression
We can optionally use a smaller vocabulary for the draft model for faster training and inference. For example, Llama3.2-1B has a vocab size of 128256; in this example, we construct a draft vocab mapping of size 32k from the most frequently occurring tokens in our training set:
This will produce a `d2t.pt` file in `save_dir`, the mapping from the draft vocabulary to the full vocabulary that will be read by our draft model later.
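As a rough illustration of what such a mapping contains (a conceptual sketch only, not the script used by this example; the on-disk format ModelOpt expects may differ): the draft model predicts over a reduced vocabulary, and `d2t[draft_id]` gives the corresponding token id in the full vocabulary.

```python
from collections import Counter

import torch

# Toy stand-in for the tokenized training set (lists of full-vocab token ids).
tokenized_training_set = [[101, 2009, 2003, 102], [101, 7592, 2003, 102]]

counts = Counter()
for ids in tokenized_training_set:
    counts.update(ids)

# Keep the most frequent token ids as the draft vocabulary (32k in this example).
draft_vocab_size = 32000
top_ids = [tok for tok, _ in counts.most_common(draft_vocab_size)]
d2t = torch.tensor(top_ids, dtype=torch.long)  # d2t[draft_id] -> full-vocab id
torch.save(d2t, "d2t.pt")
```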
### (Optional) Configuring Draft Model
For EAGLE-1 and EAGLE-3 we provide a [default model architecture config](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override the default settings by providing an additional JSON dict. In this example, we override `draft_vocab_size` in `eagle_config.json`:
```json
{
  "draft_vocab_size": 32000
}
```
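Conceptually, this JSON dict is merged on top of the default architecture settings before the draft model is built. A rough sketch of that merge (the default dict below is an illustrative stand-in, not ModelOpt's actual default config):

```python
import json

# Illustrative subset of the defaults; see the linked default config in ModelOpt.
default_architecture_config = {"draft_vocab_size": 128256}

# Overrides supplied via --eagle_config eagle_config.json.
with open("eagle_config.json") as f:
    overrides = json.load(f)

architecture_config = {**default_architecture_config, **overrides}
print(architecture_config["draft_vocab_size"])  # -> 32000
```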
### Training Draft Model with ModelOpt
`main.py` provides an example of converting an HF base model for speculative decoding and training it. It consists of a few simple steps:
First, load the base model and tokenizer from Hugging Face:
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("<path to the base model>")
tokenizer = transformers.AutoTokenizer.from_pretrained("<path to the base model>")
```
Then, we convert the model to a speculative decoding model:
```python
import modelopt.torch.speculative as mtsp

# "config" is the EAGLE config dict, including any overrides from eagle_config.json
mtsp.convert(model, [("eagle", config)])
```

After training with a Hugging Face `Trainer` (set up as in `main.py`), we save the result:

```python
trainer.save_state()
trainer.save_model("<path to the output directory>")
```
We omitted details like tokenizer initialization for simplicity. A complete training example is provided in `main.py`, along with a bash script to launch training with Hugging Face Accelerate in `launch_train.sh`, which can be run by:
```bash
./launch_train.sh --model $BASE_MODEL \
--output_dir $OUTPUT_DIR \
--data $DATA \
--num_gpu $NUM_GPU \
--num_epochs 10 \
--eagle_config eagle_config.json # optionally override the default EAGLE config
```
The saved ModelOpt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., with PTQ and QAT.
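For example, the trained checkpoint could be post-training quantized with ModelOpt's quantization API. The sketch below reuses the `model` object from the training example above; the quantization config and the calibration data (here an empty placeholder) depend on your deployment target.

```python
import modelopt.torch.quantization as mtq

# Placeholder for a small calibration set; replace with real batches.
calib_batches = []

def forward_loop(model):
    # Run the calibration batches through the model to collect activation statistics.
    for batch in calib_batches:
        model(**batch)

# FP8 post-training quantization (illustrative choice of config).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```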
After training the draft model, we can evaluate the saved ModelOpt checkpoint on MT-Bench by running:
```bash
python ar_validate.py --model_path $OUTPUT_DIR
```
Alternatively, we can export the checkpoint and run evaluation in a serving framework; see the sections below.
Please refer to [TRT-LLM Doc: Speculative Decoding](https://nvidia.github.io/TensorRT-LLM/examples/llm_speculative_decoding.html) for detailed usage.
#### SGLang
Please refer to [SGLang Doc: Speculative Decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-3-Decoding) for detailed usage.
#### Deploying Quantized Model
See more details on deploying quantized models to TensorRT-LLM in the [llm_ptq example](../llm_ptq/).
## Speculation Module Checkpoints
Ready-to-deploy speculation module checkpoints \[[🤗 Hugging Face - NVIDIA TensorRT Model Optimizer Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)\]
Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [SGLang](https://github.com/sgl-project/sglang)!
```bash
bash distributed_generate/launch.sh $SLURM_JOB_ID vllm TinyLlama/TinyLlama-1.1B-Chat-v1.0 /data/train/ /data/output /scripts/ 0 10 n1,n2,n3,n4 "\"You are a helpful assistant.\""
```
`/scripts/` is the absolute path to `modelopt/examples/speculative_decoding`, which contains `server_generate.py` and `distributed_generate`.
This will launch a vLLM server (SGLang is also available) on each node. Each node will work through 10 shards of data (10 * `max_lines_per_shard` samples).
In this case, the first 40 shards of data will be processed.
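As a quick sanity check of that arithmetic, here is a small sketch assuming shards are assigned to nodes in contiguous blocks (an assumption about `launch.sh`, not something it guarantees):

```python
nodes = ["n1", "n2", "n3", "n4"]
start_shard, shards_per_node = 0, 10  # the "0 10" arguments in the command above

for i, node in enumerate(nodes):
    first = start_shard + i * shards_per_node
    print(f"{node}: shards {first}..{first + shards_per_node - 1}")
# 4 nodes x 10 shards each -> shards 0..39, i.e. the first 40 shards
```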