`examples/nemo_run/prune_distill/README.md`

After structured pruning, the compressed model may show some accuracy degradation, which the distillation stage then recovers.

## Flow Stages

The Simplified Flow runs the following steps:

1. 01_import — Import HuggingFace model to NeMo format
1. 02a_eval_teacher — Evaluate teacher model on 5% of MMLU benchmark
1. 02b_prune — Apply structured pruning to create a compressed student model
1. 03_distill — Knowledge distillation from teacher to pruned student model
1. 04a_eval_student — Evaluate student model on 5% of MMLU benchmark
1. 04b_export — Export final compressed model to HuggingFace format

```mermaid
graph TD;
    01_import-->02a_eval_teacher;
    01_import-->02b_prune;
    02b_prune-->03_distill;
    03_distill-->04a_eval_student;
    03_distill-->04b_export;
```
## Results
Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing it from 8B to 6B parameters) and distill for ~28,000 steps (determined by sequence length, default 4096) with a learning rate of 1e-4 and a global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix). This amounts to roughly 90 billion tokens and a total of ~6k H100 GPU hours.

| Model | Tokens/sec\* | MMLU |
| --- | --- | --- |
| Qwen3-6B Pruned+Distilled from 8B | 6950 | 72.5 |
| Qwen3-4B Original (comparison) | 5210 | 70.0 |

The resulting compressed student maintains competitive accuracy while being significantly faster, with fewer parameters than the teacher. It also outperforms the existing Qwen3-4B model in both accuracy and throughput!

\*_Measured on H100 using TRT-LLM, FP8 precision_
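
As a quick sanity check of the token budget (back-of-the-envelope arithmetic, not a figure from the source): ~28,000 steps × 768 sequences per batch × 4096 tokens per sequence ≈ 88 billion tokens, consistent with the ~90 billion quoted above.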
## Usage

This will download and process the ClimbMix dataset, creating the necessary data files.

After launching the NeMo container with the specified mounts, change the contents of the `SLURM_CONFIG` in `nemo_prune_kd_flow.py` to reflect your environment, and then perform the following:

Launch the example with the `nemo_prune_kd_flow.py` script. To use a model other than the default (Qwen3-8B), add the `--model-name <hf-model-name> --base-recipe <recipe-name>` flags, using the model's HuggingFace name and the NeMo recipe names listed [here](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). Provide the processed dataset path using the `--data-dir` flag.

> **_NOTE:_** You can omit the `--use-slurm` flag to run locally for testing, and optionally add `--mock-run` to use a mock dataset.
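
For illustration, invocations might look like the following sketch (the `python` entry point and the dataset path are assumptions; the flags are those documented above):

```bash
# Hypothetical Slurm launch; the dataset path is a placeholder.
python nemo_prune_kd_flow.py \
    --data-dir /path/to/climbmix_processed \
    --use-slurm

# Quick local smoke test with a mock dataset (omits --use-slurm):
python nemo_prune_kd_flow.py --mock-run
```
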
## Supported models
Locally, this script currently supports models that can be trained on a single node with 8 x 80GB GPUs. On Slurm, you can configure the number of nodes/GPUs for training and pruning with the `--nodes` and `--train-gpus` flags.
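
For example, scaling out on Slurm might look like this sketch (node and GPU counts are illustrative assumptions, not recommendations from this README):

```bash
# Hypothetical multi-node Slurm run; adjust --nodes/--train-gpus to your cluster.
python nemo_prune_kd_flow.py \
    --data-dir /path/to/climbmix_processed \
    --use-slurm \
    --nodes 4 \
    --train-gpus 8
```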