This one-line command runs a minimal example workflow of training and exporting a draft model. It:

- Evaluates the acceptance rate on [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts)
- Exports a checkpoint ready for deployment
## Training Draft Model with Online Base Model
For small base models that fit in GPU memory, we can collocate them with draft models and train with the following command:
```bash
./launch_train.sh --model $BASE_MODEL \
    --output_dir $OUTPUT_DIR \
    --data Daring-Anteater/train.jsonl \
    --num_gpu $NUM_GPU \
    --num_epochs $NUM_EPOCH \
    --eagle_config eagle_config.json
```
This command will launch `main.py` with `accelerate`. See [section: interact with modelopt.torch.speculative](#interact-with-modelopttorchspeculative) for more details.
The saved modelopt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., PTQ and QAT.
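
For instance, the trained checkpoint could go through post-training quantization with `modelopt.torch.quantization`. The sketch below is illustrative only; the checkpoint path, calibration prompts, and chosen quantization config are placeholders, and the supported workflow is described in the [llm_ptq example](../llm_ptq/README.md):

```python
# Illustrative PTQ sketch; paths, prompts, and the config choice are placeholders.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "<path to the trained checkpoint>"
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(ckpt)
calib_prompts = ["Hello, how are you?"]  # replace with a real calibration set

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect statistics.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```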
## Training Draft Model with Offline Base Model
For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of storage depending on dataset size.
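
As a rough, illustrative estimate (the exact layout depends on the model and configuration): storing four 4096-dimensional hidden-state vectors per token in bf16 costs about 32 KB per token, so a corpus of one billion training tokens would occupy roughly 32 TB.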
First, dump the base model's hidden states. See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) for a simple example that uses data parallelism (DP) to accelerate hidden-state generation.
Then, train the draft model with the `--offline-data` argument:
```bash
./launch_train.sh --model $BASE_MODEL \
    --output_dir $OUTPUT_DIR \
    --data $DATA \
    --num_gpu $NUM_GPU \
    --num_epochs $NUM_EPOCH \
    --eagle_config eagle_config.json \
    --offline-data $HIDDEN_STATES_DIR
```
## Model Validation
After training the draft model, we can evaluate the saved modelopt checkpoint on MT-Bench by:
```bash
python ar_validate.py --model_path $OUTPUT_DIR
```
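
The reported acceptance rate reflects how often the base model accepts the draft model's proposed tokens during verification; a higher rate generally translates into a larger end-to-end speedup.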
Alternatively, we can export the checkpoint and run evaluation on serving frameworks. See sections below.
### TensorRT-LLM

Please refer to [TRT-LLM Doc: Speculative Decoding](https://nvidia.github.io/TensorRT-LLM/examples/llm_speculative_decoding.html) for detailed usage.
### SGLang
Please refer to [SGLang Doc: Speculative Decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-3-Decoding) for detailed usage.
### Deploying Quantized Model
See more details on deploying a quantized model to TRT-LLM [here](../llm_ptq/README.md).
## Advanced Usage
### Data Synthesis
To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data. This ensures that the draft model's output distribution closely aligns with that of the base model.
To prepare such data, we launch an inference server with the base model and use it to generate responses to the training prompts. To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
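
As an illustration only (the endpoint URL, model name, prompt source, and output format below are assumptions rather than this example's actual scripts), the generation loop could look roughly like this against an OpenAI-compatible server:

```python
# Hypothetical sketch: collect base-model responses for draft-model training
# through an OpenAI-compatible endpoint (e.g. one started with `vllm serve $BASE_MODEL`).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompts = ["Explain speculative decoding in one paragraph."]  # replace with training prompts

with open("synthetic_train.jsonl", "w") as fout:
    for prompt in prompts:
        reply = client.chat.completions.create(
            model="base-model",  # the name under which the server registered the base model
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Save the full conversation so the draft model is trained on the base model's outputs.
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]}
        fout.write(json.dumps(record) + "\n")
```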
For large scale data generation, please see [SLURM prepare data](SLURM_prepare_data.md) for SLURM support.
### Draft Vocabulary Compression
We can optionally use a smaller vocabulary for the draft model for faster training and inference. For example, Llama3.2-1B has a vocab size of 128256; in this example, we construct a draft vocab mapping of size 32k by finding the most frequently occurring tokens in our training set.

This produces a `d2t.pt` file in `save_dir`, which maps draft tokens to target tokens. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
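
As a rough sketch of the idea (not the script used in this example; the token-id source and file paths are placeholders), such a mapping could be built as follows:

```python
# Hypothetical sketch of building a draft-to-target vocab mapping (d2t):
# keep the top-k most frequent target-token ids as the draft vocabulary and
# store offsets so that target_token = draft_token + d2t[draft_token].
from collections import Counter
import torch

DRAFT_VOCAB_SIZE = 32000

# Replace with token ids from your tokenized training set.
tokenized_training_samples = [[1, 15, 15, 2], [1, 42, 15, 2]]

counter = Counter()
for ids in tokenized_training_samples:
    counter.update(ids)

# The most frequent target-token ids form the draft vocabulary (sorted for determinism).
top_ids = sorted(tid for tid, _ in counter.most_common(DRAFT_VOCAB_SIZE))

# d2t[draft_token] = target_token - draft_token
d2t = torch.tensor([tid - i for i, tid in enumerate(top_ids)], dtype=torch.long)
torch.save(d2t, "d2t.pt")  # the real script stores this as `d2t.pt` under `save_dir`
```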
### Configuring Draft Model
For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings by providing an additional JSON dict. In this example, we override `draft_vocab_size` in `eagle_config.json`:
```json
{
    "draft_vocab_size": 32000
}
```
### Interact with `modelopt.torch.speculative`
`main.py` provides an example of converting a HF base model for speculative decoding and training it. It consists of a few simple steps:
First, load the base model and tokenizer from Hugging Face:
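
A minimal sketch of this step (the model name and dtype handling are placeholders; see `main.py` for the exact arguments):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-3.2-1B-Instruct"  # hypothetical example
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
```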
The model is then converted for speculative decoding with `modelopt.torch.speculative` and trained with a Hugging Face `Trainer`; see `main.py` for the full flow. Finally, save the trainer state and the trained model:

```python
trainer.save_state()
trainer.save_model("<path to the output directory>")
```

We omit details such as tokenizer handling here for simplicity; the complete training example is in `main.py`, and `launch_train.sh` launches it with Hugging Face Accelerate.