93 commits
c758ad5
The main compression function for a model
danielkorzekwa Oct 27, 2025
8af9903
Code formatting
danielkorzekwa Oct 27, 2025
5ba6c27
Model search space configuration used by test_compress.py test.
danielkorzekwa Oct 27, 2025
0bc5d84
Tokenizer used by test_compress.py test.
danielkorzekwa Oct 27, 2025
87d4fa5
Tokenizer utility used by test_compress.py test
danielkorzekwa Oct 27, 2025
ced1e99
e2e tests for compress.py
danielkorzekwa Oct 27, 2025
5de0bdc
Add convert_llama3_config_to_decilm_config + unit test
danielkorzekwa Oct 27, 2025
800414c
Remove unused bypass distillation config files.
danielkorzekwa Oct 27, 2025
16abcc9
Moving integration tests to tests/experimental to not trigger CICD
danielkorzekwa Oct 27, 2025
a5ba1c7
update docs
danielkorzekwa Oct 27, 2025
1bda391
Replace mprint with print and replace osp.join with path1 / path2 not…
danielkorzekwa Oct 27, 2025
bb38401
Refactor file checking assertions to use .is_file() and .exists()
danielkorzekwa Oct 27, 2025
8415548
Add a new dependency section to setup.py for the modelopt.torch._comp…
danielkorzekwa Oct 27, 2025
b1b1833
Move test_convert_llama3_config_to_decilm_config.py to tests/experime…
danielkorzekwa Oct 27, 2025
d4ffc91
Merge branch 'feature/compress' into dkorzekwa/e2e_compression_test
kevalmorabia97 Oct 27, 2025
6f28e4a
Fix: Add missing LICENSE headers
kevalmorabia97 Oct 27, 2025
016fb63
Use spawn_multiprocess_job for test_compress test (to be able to use …
danielkorzekwa Oct 28, 2025
0ccf1c4
Add comments.
danielkorzekwa Oct 28, 2025
58439ca
Add _save_dummy_dataset to the test_compress.py
danielkorzekwa Oct 28, 2025
2e5f776
Refactoring: Move torch distributed env variables to dist_utils.py
danielkorzekwa Oct 28, 2025
6274db5
Refactoring: move torch distributed variables to dist_utils
danielkorzekwa Oct 28, 2025
d942e0a
Move os.environ["WANDB_DISABLED"] = "true" to dist_utils.py
danielkorzekwa Oct 28, 2025
f765921
Implement integration test for mnt.convert() for the _compress algori…
danielkorzekwa Oct 28, 2025
de876d6
Implement mtn.convert() for compress() algorithm.
danielkorzekwa Oct 28, 2025
72bdc7a
Merge branch 'dkorzekwa/e2e_compression_test' into dkorzekwa/llama3_t…
danielkorzekwa Oct 28, 2025
40d28af
Merge branch 'dkorzekwa/llama3_to_decilm_convertion' into dkorzekwa/n…
danielkorzekwa Oct 28, 2025
f7fe23c
Fix broken test - incorrect package names.
danielkorzekwa Oct 28, 2025
3d1d286
Merge branch 'dkorzekwa/llama3_to_decilm_convertion' into dkorzekwa/n…
danielkorzekwa Oct 28, 2025
a210483
Implementing nas.convert for compress algorithm.
danielkorzekwa Oct 28, 2025
739f868
Improve docs
danielkorzekwa Oct 28, 2025
b06d22b
Merge branch 'dkorzekwa/e2e_compression_test' into dkorzekwa/llama3_t…
danielkorzekwa Oct 28, 2025
9352978
Merge branch 'dkorzekwa/llama3_to_decilm_convertion' into dkorzekwa/n…
danielkorzekwa Oct 28, 2025
20a3c5e
Code cleanup.
danielkorzekwa Oct 28, 2025
18cb88b
Merge branch 'feature/compress' into dkorzekwa/llama3_to_decilm_conve…
danielkorzekwa Oct 28, 2025
1033c81
Fix import
danielkorzekwa Oct 28, 2025
0680c45
simplify code
danielkorzekwa Oct 29, 2025
2d9da30
implementing compress_nas_plugin
danielkorzekwa Oct 29, 2025
febab44
code clean up.
danielkorzekwa Oct 29, 2025
86bf394
code clean up
danielkorzekwa Oct 29, 2025
86e04a0
create conftest.py with shared test logic for compress tests.
danielkorzekwa Oct 29, 2025
ae61644
code cleanup
danielkorzekwa Oct 29, 2025
2998cdb
Merge branch 'dkorzekwa/llama3_to_decilm_convertion' into dkorzekwa/n…
danielkorzekwa Oct 29, 2025
3778ec2
code refactoring
danielkorzekwa Oct 29, 2025
d940000
refactoring
danielkorzekwa Oct 29, 2025
0bf9a92
move test utilities from conftest.py to test_utils.py
danielkorzekwa Oct 29, 2025
b56df9a
Improve comments
danielkorzekwa Oct 29, 2025
fd63130
Merge branch 'feature/compress' into dkorzekwa/nas_convert
danielkorzekwa Oct 29, 2025
9bfcc21
Added TODO.
danielkorzekwa Oct 29, 2025
1dc89c4
Implement mtn.search() for the compress algorithm
danielkorzekwa Oct 29, 2025
6bfa3ec
Refactoring
danielkorzekwa Oct 29, 2025
6d45e33
code refactoring
danielkorzekwa Oct 29, 2025
f9e09d9
Correct import paths
danielkorzekwa Oct 30, 2025
a0cfd13
Change llama_checkpoint_path, can't be inside of ckpts folder
danielkorzekwa Oct 30, 2025
2c2995c
Initial commit for compress tutorial
danielkorzekwa Oct 31, 2025
b152689
Update compress tutorial and implement main.py for compress tutorial.
danielkorzekwa Oct 31, 2025
24e30e6
Update compress tutorial
danielkorzekwa Oct 31, 2025
21f115e
Create a yaml file for llama 3.2-1B model compression
danielkorzekwa Oct 31, 2025
d19b9ab
fix input model path in the unit test.
danielkorzekwa Oct 31, 2025
78d7a87
compress tutorial
danielkorzekwa Oct 31, 2025
9753b8d
Merge branch 'feature/compress' into dkorzekwa/nas_search
danielkorzekwa Oct 31, 2025
f71c1b6
Code refactoring
danielkorzekwa Oct 31, 2025
7eb2fd7
refactoring
danielkorzekwa Oct 31, 2025
3eb39f9
code clean up
danielkorzekwa Oct 31, 2025
ca16d77
Merge branch 'dkorzekwa/nas_search' into dkorzekwa/compress_tutorial
danielkorzekwa Oct 31, 2025
8360de9
Implement compress cli tool.
danielkorzekwa Oct 31, 2025
9230d81
Add running mtn.search() to compress cli tool.
danielkorzekwa Oct 31, 2025
28b5c13
update docs
danielkorzekwa Oct 31, 2025
a7eba4b
Replace dummy dataset with Nemotron-Post-Training-Dataset-v2
danielkorzekwa Nov 1, 2025
21ed59b
Refactoring
danielkorzekwa Nov 1, 2025
e3ed0a4
Update docs
danielkorzekwa Nov 1, 2025
9e09e8f
Refactoring. Change the compress tutorial from Llama 3.2 1B-instruct …
danielkorzekwa Nov 1, 2025
abb39f3
Improve logging
danielkorzekwa Nov 1, 2025
64b33e2
Update docs
danielkorzekwa Nov 1, 2025
21a602c
Update compress tutorial
danielkorzekwa Nov 1, 2025
9a381fe
Update compress tutorial ffn search space
danielkorzekwa Nov 1, 2025
c47e0af
Update tutorial
danielkorzekwa Nov 1, 2025
ce8d53a
Implement mip_only mode.
danielkorzekwa Nov 2, 2025
c754419
Improve logging. Convert HF to DeciLM checkpoint only once (single-gpu)
danielkorzekwa Nov 2, 2025
6505631
update docs
danielkorzekwa Nov 2, 2025
734c32c
Update compress tutorial with --mip_only part.
danielkorzekwa Nov 2, 2025
ee14792
Update docs
danielkorzekwa Nov 2, 2025
5dca0aa
Update tutorial llama config file.
danielkorzekwa Nov 2, 2025
5454c59
Update compress tutorial
danielkorzekwa Nov 2, 2025
b3fd9df
Update docs
danielkorzekwa Nov 2, 2025
d4ed34a
Update compress setting to increase the number of eval samples.
danielkorzekwa Nov 2, 2025
9979872
Update compress tutorial
danielkorzekwa Nov 2, 2025
8cb50d4
Update tutorial
danielkorzekwa Nov 2, 2025
2856ca1
Update tutorial.
danielkorzekwa Nov 2, 2025
553107a
Update compress tutorial.
danielkorzekwa Nov 2, 2025
3917a78
Add Dockerfile for the compress tutorial
danielkorzekwa Nov 3, 2025
6e1d910
Update compress tutorial
danielkorzekwa Nov 3, 2025
25b4aed
Merge branch 'feature/compress' into dkorzekwa/compress_tutorial
danielkorzekwa Nov 3, 2025
bb91d73
Update Puzzle Compression Tutorial (#493)
LianaMikael Nov 4, 2025
26 changes: 26 additions & 0 deletions examples/compress/Dockerfile
Reviewer comment: Why do we need a Dockerfile? If Puzzle is self-contained in modelopt and its dependencies are in setup.py, then installing modelopt will install everything needed, and users can just use a TRT-LLM docker image without any custom Dockerfile or docker build step.

@@ -0,0 +1,26 @@
# Dockerfile for the compress example

FROM nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5

# TODO: The MIP solver would not work with this torch version.
# Fix it, otherwise, mamba models will not be supported by the Compress algorithm.
# # Required for mamba_ssm to work (the default torch version in the 1.1.0rc5 does not work)
# RUN pip uninstall -y torch
# RUN pip uninstall -y torchvision
# RUN pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# # Mamba SSM
# RUN pip install causal-conv1d --no-build-isolation
# RUN pip install mamba_ssm --no-build-isolation

# Required for puzzletron calc_subblock_stats
RUN pip install hydra-core==1.3.2
RUN pip install wandb~=0.17.5
RUN pip install "frozendict>=2.4.4"
RUN pip install fire
RUN pip install mip
RUN pip install lru-dict

WORKDIR /workspace/

COPY . .
194 changes: 194 additions & 0 deletions examples/compress/README.md
@@ -0,0 +1,194 @@
# Compress Algorithm Tutorial

This tutorial demonstrates how to compress large language models using the compress algorithm based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
The goal of the algorithm is to find optimal modifications to the MLP and attention layers of the model, resulting in a heterogeneous model architecture.
The supported modifications are:

- `ffn_intermediate_size`: different FFN intermediate sizes
- `attention op/noop`: complete removal of attention layers
Reviewer comment (@kevalmorabia97, Nov 7, 2025): Didn't we decide to keep the PoC to just FFN pruning, with no attention module replacement?

To use the Puzzle algorithm effectively, we need to specify the target number of parameters and/or the target memory. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the combination of layer modifications that best satisfies the target requirements.
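
Conceptually, this selection step is a small integer program: pick exactly one variant per block so that the summed accuracy score is maximized while total memory stays under the budget. The sketch below only illustrates that idea, written against the `mip` package that the Dockerfile installs; the per-block variants, accuracy scores, and memory costs are invented placeholders, not values produced by the tool.

```python
# Toy sketch of the MIP selection step (not the modelopt implementation).
# All variant names, scores, and memory costs below are invented placeholders.
from mip import BINARY, Model, maximize, xsum

variants = {  # name -> (accuracy score, memory cost in MiB)
    "attention gqa_4, ffn intermediate_14336": (1.00, 3000),
    "attention no_op, ffn intermediate_14336": (0.97, 1200),
    "attention no_op, ffn intermediate_8704": (0.95, 900),
}
names = list(variants)
scores = [variants[n][0] for n in names]
memory = [variants[n][1] for n in names]

num_blocks = 4
target_memory_mib = 8500

m = Model()
# x[b][v] == 1  <=>  block b uses variant v
x = [[m.add_var(var_type=BINARY) for _ in names] for _ in range(num_blocks)]

for b in range(num_blocks):
    m += xsum(x[b]) == 1  # exactly one variant per block

# total memory of the chosen variants must stay under the target budget
m += (
    xsum(memory[v] * x[b][v] for b in range(num_blocks) for v in range(len(names)))
    <= target_memory_mib
)
# maximize the summed accuracy score of the chosen variants
m.objective = maximize(
    xsum(scores[v] * x[b][v] for b in range(num_blocks) for v in range(len(names)))
)
m.optimize()

for b in range(num_blocks):
    chosen = next(v for v in range(len(names)) if x[b][v].x >= 0.99)
    print(f"block_{b}: {names[chosen]}")
```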

In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (a 15% reduction) with less than 1% regression in the token_accuracy_top_10 metric.
Reviewer comment: Do you know what parameter reduction we see as well? That would be useful info to add here.


## Environment

- [Dockerfile](./Dockerfile) to use.
- 2x NVIDIA H100 80GB HBM3 (a single GPU works as well).

## Compress the Model

1. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file.

**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes that we wish to search over. It is recommended to choose several pruning sizes (e.g. 15%, 20%, 30%, etc. of the original). Note that the values must be hardware-friendly (divisible by a multiple of 2, ideally a multiple of 64) to avoid issues with tensor operations in subsequent steps.
Reviewer comment: Let's recommend divisible by 64 instead? The FFN value is 14336, so having only multiples of 64 in the search space should be more than enough, no?


Let's first shoot for a ~32% GPU memory reduction by setting `target_memory: 78_000` (MiB, i.e. about 78 GiB). This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirements.
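
As a rough illustration of the note above on choosing `intermediate_size_list`, a small helper like the following (hypothetical, not part of the tool) derives candidate sizes from the teacher FFN width and rounds them down to multiples of 64:

```python
# Hypothetical helper: derive candidate FFN intermediate sizes from the teacher
# width, rounded down to a multiple of 64 to keep the values hardware-friendly.
TEACHER_INTERMEDIATE_SIZE = 14336  # Llama-3.1-8B FFN intermediate size


def candidate_sizes(keep_fractions, multiple=64):
    sizes = set()
    for fraction in keep_fractions:
        sizes.add(int(TEACHER_INTERMEDIATE_SIZE * fraction) // multiple * multiple)
    return sorted(sizes)


# e.g. keep roughly 20%, 40%, 60%, 80% of the original FFN width
print(candidate_sizes([0.2, 0.4, 0.6, 0.8]))  # [2816, 5696, 8576, 11456]
```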

2. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

Dataset splits: "code", "math", "stem", "chat", excluding reasoning samples (2.62 GB).

```bash
python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

3. Run the compression script.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
```

This will save the full output to `log.txt` and display the following progress on screen:

```bash
[2025-11-02 12:06:34] Compress Progress 1/8: starting compression pipeline
[2025-11-02 12:06:45] Compress Progress 2/8: converting model from HF to DeciLM (single-gpu)
[2025-11-02 12:07:07] Compress Progress 3/8: scoring pruning activations (multi-gpu)
[2025-11-02 12:11:36] Compress Progress 4/8: pruning the model and saving pruned checkpoints (single-gpu)
[2025-11-02 12:12:20] Compress Progress 5/8: building replacement library and subblock statistics (single-gpu)
[2025-11-02 12:12:21] Compress Progress 6/8: calculating one block scores (multi-gpu)
[2025-11-02 12:50:41] Compress Progress 7/8: running MIP and realizing models (multi-gpu)
[2025-11-02 12:52:34] Compress Progress 8/8: compression pipeline completed (multi-gpu)
```

Once the process is complete, the resulting network architecture will be recorded in `log.txt` for your review:

```bash
...
block_0: attention gqa_4 ffn intermediate_14336
# Reviewer comment: GQA4 will only work with TP4 if training in Megatron-fw. Maybe deployment also, but I don't know for sure. Should we remove GQA pruning from the search space?
block_1: attention gqa_4 ffn intermediate_14336
# Reviewer comment: Why is no FFN being pruned here? Is it because we use a memory target and attention takes more memory, so it's pruned first by Puzzle?
block_2: attention gqa_4 ffn intermediate_14336
block_3: attention gqa_4 ffn intermediate_14336
block_4: attention gqa_4 ffn intermediate_14336
block_5: attention gqa_4 ffn intermediate_14336
block_6: attention gqa_4 ffn intermediate_14336
block_7: attention gqa_4 ffn intermediate_14336
block_8: attention gqa_4 ffn intermediate_14336
block_9: attention gqa_4 ffn intermediate_14336
block_10: attention gqa_4 ffn intermediate_14336
block_11: attention gqa_4 ffn intermediate_14336
block_12: attention gqa_4 ffn intermediate_14336
block_13: attention gqa_4 ffn intermediate_14336
block_14: attention gqa_4 ffn intermediate_14336
block_15: attention gqa_4 ffn intermediate_14336
block_16: attention gqa_4 ffn intermediate_14336
block_17: attention no_op ffn intermediate_14336
block_18: attention no_op ffn intermediate_14336
block_19: attention no_op ffn intermediate_14336
block_20: attention no_op ffn intermediate_14336
block_21: attention no_op ffn intermediate_14336
block_22: attention no_op ffn intermediate_14336
block_23: attention no_op ffn intermediate_14336
block_24: attention no_op ffn intermediate_14336
block_25: attention no_op ffn intermediate_14336
block_26: attention no_op ffn intermediate_14336
block_27: attention no_op ffn intermediate_14336
block_28: attention no_op ffn intermediate_14336
block_29: attention gqa_4 ffn intermediate_14336
block_30: attention gqa_4 ffn intermediate_14336
block_31: attention gqa_4 ffn intermediate_14336

[2025-11-02 04:53:11,332][rank-0][run_puzzle.py:295] Total costs: {'stats.memory_mib': 75796.4140625, 'stats.ffn_num_params': 5637275648, 'stats.num_kv_heads': 160, 'stats.kv_cache_memory_mib': 61440.0, 'stats.ffn_memory_mib': 10752.25, 'stats.attention_memory_mib': 63040.15625, 'stats.attention_num_params': 838942720, 'stats.num_params': 7526895616, 'stats.has_attention': 20, 'stats.has_ffn': 32}
...
################################################################
validate_model_and_extract_token_probs(model_name='teacher')
################################################################
...
Average losses = {'lm_loss': 1.118250765837729, 'token_accuracy_top_1': 0.7331905364990234, 'token_accuracy_top_5': 0.9094219207763672, 'token_accuracy_top_10': 0.9423646926879883}
...
################################################################
validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True)
################################################################
....
Average losses = {'lm_loss': 1.7577573340386152, 'token_accuracy_top_1': 0.6225490570068359, 'token_accuracy_top_5': 0.846257209777832, 'token_accuracy_top_10': 0.8987817764282227}
```

A ~30% GPU memory reduction leads to nearly a 5% regression in the token_accuracy_top_10 metric (0.898 vs. 0.942). Let's rerun the MIP search aiming for a 15% memory reduction.
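
For reference, the regression quoted above is simply the relative drop between the two `Average losses` lines, e.g.:

```python
# Relative drop of the compressed solution vs. the teacher, using the
# token_accuracy_top_10 values printed in the log above.
teacher_top10 = 0.9423646926879883
solution_top10 = 0.8987817764282227
print(f"token_accuracy_top_10 regression: {1 - solution_top10 / teacher_top10:.1%}")  # ~4.6%
```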

## Re-run MIP Search with different constraints

If you want to try different constraints without re-running the expensive pruning and scoring steps, use the `--mip-only` flag.
This assumes pruning, replacement library building, NAS scoring, and subblock stats calculation have already been completed.

For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.

```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Compress Progress"
```

This will generate the following network architecture (see `log.txt`):

```bash
block_0: attention gqa_4 ffn intermediate_14336
block_1: attention gqa_4 ffn intermediate_14336
block_2: attention gqa_4 ffn intermediate_14336
block_3: attention gqa_4 ffn intermediate_14336
block_4: attention gqa_4 ffn intermediate_14336
block_5: attention gqa_4 ffn intermediate_14336
block_6: attention gqa_4 ffn intermediate_14336
block_7: attention gqa_4 ffn intermediate_14336
block_8: attention gqa_4 ffn intermediate_14336
block_9: attention gqa_4 ffn intermediate_14336
block_10: attention gqa_4 ffn intermediate_14336
block_11: attention gqa_4 ffn intermediate_14336
block_12: attention gqa_4 ffn intermediate_14336
block_13: attention gqa_4 ffn intermediate_14336
block_14: attention gqa_4 ffn intermediate_14336
block_15: attention gqa_4 ffn intermediate_14336
block_16: attention gqa_4 ffn intermediate_14336
block_17: attention gqa_4 ffn intermediate_14336
block_18: attention no_op ffn intermediate_14336
block_19: attention no_op ffn intermediate_14336
block_20: attention no_op ffn intermediate_14336
block_21: attention gqa_4 ffn intermediate_14336
block_22: attention no_op ffn intermediate_14336
block_23: attention no_op ffn intermediate_14336
block_24: attention no_op ffn intermediate_14336
block_25: attention gqa_4 ffn intermediate_14336
block_26: attention gqa_4 ffn intermediate_14336
block_27: attention gqa_4 ffn intermediate_14336
block_28: attention gqa_4 ffn intermediate_14336
block_29: attention gqa_4 ffn intermediate_14336
block_30: attention gqa_4 ffn intermediate_14336
block_31: attention gqa_4 ffn intermediate_14336

[2025-11-02 12:50:42,024][rank-0][run_puzzle.py:295] Total costs: {'stats.memory_mib': 94708.4609375, 'stats.has_ffn': 32, 'stats.ffn_memory_mib': 10752.25, 'stats.kv_cache_memory_mib': 79872.0, 'stats.attention_num_params': 1090625536, 'stats.ffn_num_params': 5637275648, 'stats.has_attention': 26, 'stats.num_params': 7778578432, 'stats.attention_memory_mib': 81952.203125, 'stats.num_kv_heads': 208}
...
################################################################
validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True)
################################################################
Average losses = {'lm_loss': 1.2425934937782586, 'token_accuracy_top_1': 0.703862190246582, 'token_accuracy_top_5': 0.8954982757568359, 'token_accuracy_top_10': 0.9336576461791992}
```

On the other hand, if you set `target_memory: 28_000`, you'll observe that the intermediate FFN sizes are significantly reduced in certain layers (see `log.txt` for details):

```bash
block_5: attention no_op ffn intermediate_11520
block_6: attention no_op ffn intermediate_14336
block_7: attention no_op ffn intermediate_8704
block_8: attention no_op ffn intermediate_14336
block_9: attention no_op ffn intermediate_3072
block_10: attention no_op ffn intermediate_11520
block_11: attention no_op ffn intermediate_11520
block_12: attention no_op ffn intermediate_11520
block_13: attention no_op ffn intermediate_11520
block_14: attention no_op ffn intermediate_3072
```

## Evaluation

Once the model is ready, you can evaluate it using the [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on a subset of [MMLU](https://huggingface.co/datasets/cais/mmlu):

```bash
lm_eval --model hf \
--model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
--tasks mmlu_humanities \
--num_fewshot 5 \
--batch_size 4
```

Reviewer comment: Why mmlu_humanities instead of generic mmlu?

## Advanced usage

Modify the `path/to/Llama-3_1-8B.yaml` file for advanced compression scenarios.
@@ -0,0 +1,110 @@
defaults:
  - pruning: ffn_pruning
  - scoring: ../validate_solutions_defaults
  - realize_model: ../validate_solutions_defaults
  - bypass:
  - override hydra/hydra_logging: disabled
  - _self_

puzzle_dir: ???
teacher_dir: ${puzzle_dir}/ckpts/teacher/
replacement_library_path: ${puzzle_dir}/replacement_library.json
dataset_path: ??? # path to v0.4_mini

skip_realize_model: false

build_replacement_library:
  add_ffn_no_ops: true
  add_attention_no_ops: true

calc_subblock_stats:
  batch_sizes: [64, 96, 128]
  prefill_seq_len: 4096
  generation_seq_len: 4096
  num_active_tokens_override: # Optional override for sequence lengths
  prefill_queue_size: 0
  allocate_prefill_query: false
  benchmark_iterations: # Set to a number (e.g., 1000) to enable runtime benchmarking
  merge_with_existing_stats: false
  subblock_stats_filename: "subblock_stats.json"
  moe_stats_filename: "moe_stats.json"
  runtime_stats:
    backend: trt_torch

scoring:
  solutions_to_validate:
    skip_existing_solutions: true

  replacement_library_path: ${replacement_library_path}
  solutions_path: ${to_path:${puzzle_dir}/single_sequence_replacement_solutions.json}
  teacher_dir: ${to_path:${teacher_dir}}
  output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation

  eval_samples: 10 # default is 128
  micro_batch_size: 1
  seed: 42
  shuffle_seed: 444
  dataset_path: ${dataset_path}

mip:
  single_block_replacement_validation_dir: ${to_path:${scoring.output_dir}}
  subblock_stats_path: ${to_path:${puzzle_dir}/${calc_subblock_stats.subblock_stats_filename}}
  output_path: ${to_path:${puzzle_dir}/mip/puzzle_solutions}
  gathered_metrics_path:
  puzzle_profile:

  # puzzle_profile:
  objective: metrics.cosine_embedding_loss_hidden_states
  bigger_is_better: false
  num_solutions: 1
  minimal_diversity: 2

  subblock_stats_args:
    - batch_size: 96
      weights_dtype: torch.bfloat16
      activations_dtype: torch.bfloat16
      kv_cache_dtype: torch.bfloat16

  report_additional_costs:
    - stats.memory_mib
    - stats.num_params
    - stats.num_kv_heads
    - stats.has_attention
    - stats.has_ffn
    - stats.kv_cache_memory_mib
    - stats.attention_memory_mib
    - stats.ffn_memory_mib
    - stats.ffn_num_params
    - stats.attention_num_params

  human_constraints:
    target_memory: 78_000

  mip_constraints:
    use_greedy_search: false
    is_multi_layer_puzzle: true
    metric_overrides:
    constrain_search_func:
    max_seconds_per_solution: 60

realize_model:
  teacher_dir: ${to_path:${teacher_dir}}
  tokenizer_name: ${to_path:${teacher_dir}}
  replacement_library_path: ${replacement_library_path}
  save_models: true
  solutions_path: # Filled dynamically

  # Validate params
  skip_validation: false # To enable validation of the model solution set `skip_validation` as False
  eval_samples: 128
  micro_batch_size: 1
  seed: 42
  shuffle_seed: 444
  dataset_path: ${dataset_path}

nccl_timeout_minutes: ${timedelta_minutes:10}

# This section redirects Hydra outputs
hydra:
  run:
    dir: ${puzzle_dir}/hydra_logs/${now:%Y-%m-%d}/${now:%H-%M-%S}
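
For orientation, the `${...}` references in this config are standard OmegaConf/Hydra interpolations. Below is a minimal sketch of how they resolve; custom resolvers such as `${to_path:...}` and `${timedelta_minutes:...}` are registered by the compress tooling itself and are omitted here.

```python
# Minimal sketch of OmegaConf interpolation, mirroring the config keys above.
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    "puzzle_dir: /workspace/puzzle_dir\n"
    "teacher_dir: ${puzzle_dir}/ckpts/teacher/\n"
    "replacement_library_path: ${puzzle_dir}/replacement_library.json\n"
)
print(cfg.teacher_dir)               # /workspace/puzzle_dir/ckpts/teacher/
print(cfg.replacement_library_path)  # /workspace/puzzle_dir/replacement_library.json
```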
@@ -0,0 +1,21 @@
defaults:
  - Llama-3_1-8B
  - _self_

# Input Hugging Face model to compress
input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct

# Dataset path for pruning and NAS scoring
dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

# Working directory for compression outputs
puzzle_dir: /workspace/puzzle_dir

# MIP memory constraint (in MiB)
mip:
  human_constraints:
    target_memory: 96_000 # 96 GiB

# FFN intermediate sizes to search over (heterogeneous architecture)
pruning:
  intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
@@ -0,0 +1,16 @@
defaults:
  - pruning_defaults

activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/attn_${pruning.activation_hooks_kwargs.method}/${pruning.experiment_id}

activation_hooks_kwargs:
  method: independent_kv_head_contribution
  optimize_for: memory # IndependentKvHeadContributionHook implementation that consumes less memory
  target_layer: "self_attn.o_proj"
  layer_input_descriptors_path:

# n_heads_in_group: 4
# num_attention_heads: 32 # num query heads
# num_kv_heads: 32 / 4 = 8 # num_query_heads // n_heads_in_group
n_heads_in_group_list: [8, 16, 32] # num_kv_heads = [4, 2, 1]
gqa_init_mode: "PruneKVHeads"
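
For reference, the commented relationship above between group size and KV heads works out as follows (a small sketch assuming the 32 query heads of Llama-3.1-8B):

```python
# num_kv_heads = num_query_heads // n_heads_in_group (see the comments above)
num_attention_heads = 32  # query heads in Llama-3.1-8B

for n_heads_in_group in [8, 16, 32]:
    print(f"n_heads_in_group={n_heads_in_group} -> num_kv_heads={num_attention_heads // n_heads_in_group}")
# -> 4, 2, 1, matching the "num_kv_heads = [4, 2, 1]" comment
```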
@@ -0,0 +1,12 @@
defaults:
  - pruning_defaults

activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/ffn_${pruning.activation_hooks_kwargs.method}/${pruning.experiment_id}

activation_hooks_kwargs:
  method: iterative
  target_layer: "mlp.down_proj"
  layer_input_descriptors_path:

intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
mlp_init_mode: "PruneByActivationsLog"