Speculative decoding accelerates auto-regressive generation in large language models (LLMs) by leveraging a lightweight draft model to predict the next γ tokens. The main LLM then verifies these candidate tokens in a single forward pass. If the draft model correctly predicts α tokens, the LLM can accept and generate α+1 tokens per verification step, significantly improving generation speed.
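To make the accept/verify loop concrete, here is a minimal greedy-verification sketch in plain Python. It is only conceptual: `draft_model` and `target_model` are placeholder callables, and a real implementation verifies all draft tokens in a single batched forward pass of the target model.

```python
def speculative_step(target_model, draft_model, tokens, gamma=4):
    """One speculative decoding step: draft gamma tokens, then verify greedily."""
    # 1) The draft model proposes the next `gamma` tokens auto-regressively.
    draft, ctx = [], list(tokens)
    for _ in range(gamma):
        nxt = draft_model(ctx)
        draft.append(nxt)
        ctx.append(nxt)

    # 2) The target model accepts the longest prefix that matches its own greedy
    #    choices, then always contributes one token of its own (the "+1").
    accepted = list(tokens)
    for nxt in draft:
        expected = target_model(accepted)  # in practice: one batched forward pass
        if nxt != expected:
            accepted.append(expected)      # draft rejected; keep the target's token
            return accepted
        accepted.append(nxt)
    accepted.append(target_model(accepted))  # all gamma drafts accepted -> bonus token
    return accepted


# Toy usage with trivial "models" that just return the current sequence length.
draft = target = lambda ctx: len(ctx)
print(speculative_step(target, draft, [1, 2, 3]))  # -> [1, 2, 3, 3, 4, 5, 6, 7]
```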
This folder contains an end-to-end runnable speculative decoding fine-tuning pipeline in which Llama-3.2-1B (Hugging Face) is trained on the Daring-Anteater dataset.
This example focuses on training with Hugging Face. To train with Megatron-LM, see the [Megatron-LM example](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).
## Contents
To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
For large-scale data generation, please see [SLURM prepare data](SLURM_prepare_data.md) for SLURM support.
### (Optional) Draft Vocabulary Compression
We can optionally use a smaller vocabulary for the draft model for faster training and inference. For example, Llama3.2-1B has a vocab size of 128256; in this example, we construct a draft vocab mapping of size 32k from the most frequently occurring tokens in our training set:
This will produce a `d2t.pt` file in `save_dir`, the mapping from the draft vocabulary to the full vocabulary that will be read by our draft model later.
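As a rough illustration of what such a mapping contains (a conceptual sketch only, not the script used by this example; the on-disk format ModelOpt expects may differ): the draft model predicts over a reduced vocabulary, and `d2t[draft_id]` gives the corresponding token id in the full vocabulary.

```python
from collections import Counter

import torch

# Toy stand-in for the tokenized training set (lists of full-vocab token ids).
tokenized_training_set = [[101, 2009, 2003, 102], [101, 7592, 2003, 102]]

counts = Counter()
for ids in tokenized_training_set:
    counts.update(ids)

# Keep the most frequent token ids as the draft vocabulary (32k in this example).
draft_vocab_size = 32000
top_ids = [tok for tok, _ in counts.most_common(draft_vocab_size)]
d2t = torch.tensor(top_ids, dtype=torch.long)  # d2t[draft_id] -> full-vocab id
torch.save(d2t, "d2t.pt")
```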
### (Optional) Configuring Draft Model
For EAGLE-1 and EAGLE-3 we provide a [default model architecture config](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override the default settings by providing an additional JSON dict. In this example, we override `draft_vocab_size` in `eagle_config.json`:
```json
{
  "draft_vocab_size": 32000
}
```
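Conceptually, this JSON dict is merged on top of the default architecture settings before the draft model is built. A rough sketch of that merge (the default dict below is an illustrative stand-in, not ModelOpt's actual default config):

```python
import json

# Illustrative subset of the defaults; see the linked default config in ModelOpt.
default_architecture_config = {"draft_vocab_size": 128256}

# Overrides supplied via --eagle_config eagle_config.json.
with open("eagle_config.json") as f:
    overrides = json.load(f)

architecture_config = {**default_architecture_config, **overrides}
print(architecture_config["draft_vocab_size"])  # -> 32000
```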
### Training Draft Model with ModelOpt
`main.py` provides an example of converting an HF base model for speculative decoding and training it. It consists of a few simple steps:
First, load the base model and tokenizer from Hugging Face:
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("<path to the base model>")
tokenizer = transformers.AutoTokenizer.from_pretrained("<path to the base model>")
```
Then, we convert the model to a speculative decoding model:
```python
import modelopt.torch.speculative as mtsp

# "config" is the EAGLE config dict, including any overrides from eagle_config.json
mtsp.convert(model, [("eagle", config)])
```

After training with a Hugging Face `Trainer` (set up as in `main.py`), we save the result:

```python
trainer.save_state()
trainer.save_model("<path to the output directory>")
```
We omitted details like tokenizer initialization for simplicity. A complete training example is provided in `main.py`, along with a bash script to launch training with Hugging Face Accelerate in `launch_train.sh`, which can be run by:
```bash
./launch_train.sh --model $BASE_MODEL \
--output_dir $OUTPUT_DIR \
--data $DATA \
--num_gpu $NUM_GPU \
--num_epochs 10 \
--eagle_config eagle_config.json # optionally override the default EAGLE config
```
The saved ModelOpt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., with PTQ and QAT.
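For example, the trained checkpoint could be post-training quantized with ModelOpt's quantization API. The sketch below reuses the `model` object from the training example above; the quantization config and the calibration data (here an empty placeholder) depend on your deployment target.

```python
import modelopt.torch.quantization as mtq

# Placeholder for a small calibration set; replace with real batches.
calib_batches = []

def forward_loop(model):
    # Run the calibration batches through the model to collect activation statistics.
    for batch in calib_batches:
        model(**batch)

# FP8 post-training quantization (illustrative choice of config).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```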
After training the draft model, we can evaluate the saved ModelOpt checkpoint on MT-Bench by running:
```bash
python ar_validate.py --model_path $OUTPUT_DIR
```
Alternatively, we can export the checkpoint and run evaluation in a serving framework; see the sections below.
Please refer to [TRT-LLM Doc: Speculative Decoding](https://nvidia.github.io/TensorRT-LLM/examples/llm_speculative_decoding.html) for detailed usage.
#### SGLang
Please refer to [SGLang Doc: Speculative Decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-3-Decoding) for detailed usage.
#### Deploying Quantized Model
See more details on deploying quantized models to TensorRT-LLM in the [llm_ptq example](../llm_ptq/).
## Speculation Module Checkpoints
Ready-to-deploy speculation module checkpoints \[[🤗 Hugging Face - NVIDIA TensorRT Model Optimizer Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)\]
Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [SGLang](https://github.com/sgl-project/sglang)!
```bash
bash distributed_generate/launch.sh $SLURM_JOB_ID vllm TinyLlama/TinyLlama-1.1B-Chat-v1.0 /data/train/ /data/output /scripts/ 0 10 n1,n2,n3,n4 "\"You are a helpful assistant.\""
```
`/scripts/` is the absolute path to `modelopt/examples/speculative_decoding`, which contains `server_generate.py` and `distributed_generate`.
This will launch a vLLM server (SGLang is also available) on each node. Each node will work through 10 shards of data (10 * `max_lines_per_shard` samples).
In this case, the first 40 shards of data will be processed.
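As a quick sanity check of that arithmetic, here is a small sketch assuming shards are assigned to nodes in contiguous blocks (an assumption about `launch.sh`, not something it guarantees):

```python
nodes = ["n1", "n2", "n3", "n4"]
start_shard, shards_per_node = 0, 10  # the "0 10" arguments in the command above

for i, node in enumerate(nodes):
    first = start_shard + i * shards_per_node
    print(f"{node}: shards {first}..{first + shards_per_node - 1}")
# 4 nodes x 10 shards each -> shards 0..39, i.e. the first 40 shards
```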