Commit 3090a98

Review suggestions

Signed-off-by: Asha Anoosheh <[email protected]>

1 parent: d632578

3 files changed: +22 -37 lines changed

examples/nemo_run/common/process_climbmix.py

Lines changed: 3 additions & 11 deletions
@@ -41,27 +41,19 @@ def get_args():
         default="Qwen/Qwen3-8B",
         help="Tokenizer to use for preprocessing",
     )
-    parser.add_argument(
-        "--subset-indices",
-        help="Comma-separated subset indices to download",
-    )
     return parser.parse_args()


 if __name__ == "__main__":
     args = get_args()
-    Path(args.output_dir).mkdir(exist_ok=True)
+    Path(args.output_dir).mkdir(parents=True, exist_ok=True)

     # create raw and processed directories
     raw_dir = Path(args.output_dir) / "climbmix_raw"
     proc_dir = Path(args.output_dir) / "climbmix_proc"

     # only download the subset of the data
-    if args.subset_indices:
-        subset_idx = [int(i) for i in args.subset_indices.split(",")]
-    else:
-        subset_idx = SUBSET_IDX
-    subset_filenames = [f"part_{i}.jsonl" for i in subset_idx]
+    subset_filenames = [f"part_{i}.jsonl" for i in SUBSET_IDX]

     # download raw data
     snapshot_download(
@@ -72,7 +64,7 @@ def get_args():
     )

     # preprocess (tokenize)
-    print("Processing ClimbMix dataset...")
+    print("Tokenizing ClimbMix dataset...")
     input_paths = [raw_dir / name for name in subset_filenames]
     megatron_preprocess_data(
         input_paths,
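Note: the mkdir change above matters when `--output-dir` points at a nested path whose parent directories do not exist yet. A minimal sketch of the difference (the path below is purely illustrative):

```python
from pathlib import Path

out = Path("/tmp/climbmix_demo/data/output")  # illustrative nested output dir

# Before this commit: mkdir(exist_ok=True) raises FileNotFoundError
# if "/tmp/climbmix_demo/data" does not already exist.
# out.mkdir(exist_ok=True)

# After this commit: missing parent directories are created as well.
out.mkdir(parents=True, exist_ok=True)
```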

examples/nemo_run/prune_distill/README.md

Lines changed: 9 additions & 12 deletions
@@ -2,10 +2,6 @@

 # NeMo Pruning + Knowledge Distillation Simplified Flow Example

-[Slurm Examples](ADVANCED.md) |
-[Advanced Topics](ADVANCED.md) |
-[NeMo Integration](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/modelopt)
-
 </div>

 ## Overview
@@ -36,14 +32,15 @@ graph TD;

 ## Results

-Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing from 8B to 6B parameters) and distill for ~14,000 steps with a learning rate of 1e-4 and global batch size of 768 using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix). (This is about 90 billion tokens and takes a total of ~6k H100 GPU hours)
+Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing from 8B to 6B parameters) and distill for ~28,000 steps (determined by the sequence length, default 4096) with a learning rate of 1e-4 and a global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix). (This is about 90 billion tokens and takes a total of ~6k H100 GPU hours.)

-|                           | Tokens per Second | MMLU |
-|---------------------------|-------------------|------|
-| Qwen3-8B Original         | 4420              | 74.9 |
-| Qwen3-6B Pruned+Distilled | 6950              | 72.5 |
+|                                   | Tokens per Second | MMLU |
+|-----------------------------------|-------------------|------|
+| Qwen3-8B Original                 | 4420              | 74.9 |
+| Qwen3-6B Pruned+Distilled from 8B | 6950              | 72.5 |
+| Qwen3-4B Original (comparison)    | 5210              | 70.0 |

-The resulting compressed model maintains competitive performance while being significantly faster with a smaller memory footprint.
+The resulting compressed student maintains competitive accuracy while running significantly faster and using less memory than the teacher. It also achieves both higher accuracy and higher throughput than the existing Qwen3-4B model.

 ## Usage

@@ -58,7 +55,7 @@ To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia
 Example docker command:

 ```bash
-docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.09 bash
+docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer:/opt/TensorRT-Model-Optimizer --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.09 bash
 ```

 You will also need to set your Huggingface token with `export HF_TOKEN=<your-token>`. You may also need to enable write access to the docker container to the `examples/nemo_run` folder by doing `chmod 777 nemo_run` so that logs can be written.
@@ -84,7 +81,7 @@ From the `nemo_run` folder, launch the example with the `nemo_prune_kd_flow.py`
 To perform Pruning + Knowledge Distillation, run:

 ```bash
-python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --data-dir /path/to/climbix_proc --use-slurm
+python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --data-dir /path/to/climbmix_proc --use-slurm
 ```

 ## Supported models
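Note: the updated "~28,000 steps" figure in the Results hunk above follows from the token budget and batch settings changed later in this commit (~90B tokens, global batch size 768, sequence length 4096). A quick back-of-the-envelope check in Python, using only numbers that appear in this diff:

```python
# Step count implied by the token budget (values from this commit).
NUM_TOKENS = int(90e9)   # ~90 billion training tokens
DISTILL_GBS = 768        # global batch size

for seq_len in (8192, 4096):  # old vs. new default SEQUENCE_LENGTH
    steps = NUM_TOKENS // (DISTILL_GBS * seq_len)
    print(f"seq_len={seq_len}: ~{steps:,} steps")
# seq_len=8192: ~14,305 steps  (the README's previous "~14,000")
# seq_len=4096: ~28,610 steps  (the README's new "~28,000")
```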

examples/nemo_run/prune_distill/nemo_prune_kd_flow.py

Lines changed: 10 additions & 14 deletions
@@ -41,9 +41,9 @@ def get_args():
         default="prune_distill_flow",
     )
     parser.add_argument(
-        "--model-name",
+        "--model-id-or-path",
         type=str,
-        help="Name of the HF model",
+        help="ID or path of the HF model",
         default="Qwen/Qwen3-8B",
     )
     parser.add_argument(
@@ -55,12 +55,6 @@
             "<model_name>_<model_size>(_<long_sequence_length> or other special settings)"
         ),
     )
-    parser.add_argument(
-        "--hf-tokenizer",
-        type=str,
-        help="Name of HF model to use for tokenizer.",
-        default="Qwen/Qwen3-8B",
-    )
     parser.add_argument(
         "--prune-target-num-layers",
         type=int,
@@ -119,10 +113,12 @@ def main(args):
             seq_length=SEQUENCE_LENGTH,
         )
     else:
+        if not args.data_dir:
+            raise ValueError("--data-dir must be provided unless --mock-run is enabled.")
         tokenizer = run.Config(
             get_nmt_tokenizer,
             library="huggingface",
-            model_name=args.hf_tokenizer,
+            model_name=args.model_id_or_path,
         )
         data = run.Config(
             PreTrainingDataModule,
@@ -140,7 +136,7 @@ def main(args):
     import_model = run.Partial(
         llm.import_ckpt,
         model=model_module.model(),
-        source=f"hf://{args.model_name}",
+        source=f"hf://{args.model_id_or_path}",
         output_path=initial_model_out,
         overwrite=True,
     )
@@ -154,7 +150,7 @@ def main(args):
         nemo_checkpoint=initial_model_out,
         save_path=pruned_model_out,
     )
-    prune.tokenizer_path = args.hf_tokenizer
+    prune.tokenizer_path = args.model_id_or_path
     prune.pruning_config.target_num_layers = args.prune_target_num_layers
     prune.devices = 1
     prune.pp_size = 1
@@ -304,7 +300,7 @@ def main(args):

     # # # # # # # # # # # # # # # # # # # # # #
     # # # # # CONFIGURABLE PARAMETERS # # # # #
-    SEQUENCE_LENGTH = 8192
+    SEQUENCE_LENGTH = 4096
     PRUNE_MBS = 4
     DISTILL_MBS = 2
     VAL_BATCHES = 32
@@ -318,9 +314,9 @@ def main(args):
         DISTILL_STEPS = 20
         VAL_INTERVAL = 10
     else:
-        PRUNE_SAMPLES = 1024
+        PRUNE_SAMPLES = 512
         DISTILL_GBS = 768
-        _NUM_TOKENS = 89694564352
+        _NUM_TOKENS = int(90e9)
         DISTILL_STEPS = int(_NUM_TOKENS / DISTILL_GBS / SEQUENCE_LENGTH)
         VAL_INTERVAL = 1000
     # # # # # # # # # # # # # # # # # # # # # #
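Note: one net effect of the `nemo_prune_kd_flow.py` changes above is that a single `--model-id-or-path` value now drives the checkpoint import, the tokenizer, and the pruning tokenizer path (the separate `--hf-tokenizer` flag is gone), and non-mock runs must pass `--data-dir`. A small illustrative sketch; the argparse setup below is a stand-in, not the script's exact argument definitions:

```python
import argparse

# Stand-in parser mirroring the flags touched by this commit.
parser = argparse.ArgumentParser()
parser.add_argument("--model-id-or-path", type=str, default="Qwen/Qwen3-8B",
                    help="ID or path of the HF model")
parser.add_argument("--data-dir", type=str, default=None)
parser.add_argument("--mock-run", action="store_true")
args = parser.parse_args(["--mock-run"])

# One identifier now feeds all three places that previously could diverge
# between --model-name and --hf-tokenizer:
import_source = f"hf://{args.model_id_or_path}"   # llm.import_ckpt source
tokenizer_name = args.model_id_or_path            # get_nmt_tokenizer model_name
pruning_tokenizer_path = args.model_id_or_path    # prune.tokenizer_path

# Mirrors the new guard: real (non-mock) runs must point at preprocessed data.
if not args.mock_run and not args.data_dir:
    raise ValueError("--data-dir must be provided unless --mock-run is enabled.")

print(import_source, tokenizer_name, pruning_tokenizer_path)
```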
