Compress tutorial (PoC) #492
base: feature/compress
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| # Dockerfile for the compress example | ||
|
|
||
| FROM nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5 | ||
|
|
||
| # TODO: The MIP solver would not work with this torch version. | ||
| # Fix it, otherwise, mamba models will not be supported by the Compress algorithm. | ||
| # # Required for mamba_ssm to work (the default torch version in the 1.1.0rc5 does not work) | ||
| # RUN pip uninstall -y torch | ||
| # RUN pip uninstall -y torchvision | ||
| # RUN pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 | ||
|
|
||
| # # Mamba SSM | ||
| # RUN pip install causal-conv1d --no-build-isolation | ||
| # RUN pip install mamba_ssm --no-build-isolation | ||
|
|
||
| # Required for puzzletron calc_subblock_stats | ||
| RUN pip install hydra-core==1.3.2 | ||
| RUN pip install wandb~=0.17.5 | ||
| RUN pip install "frozendict>=2.4.4" | ||
| RUN pip install fire | ||
| RUN pip install mip | ||
| RUN pip install lru-dict | ||
|
|
||
| WORKDIR /workspace/ | ||
|
|
||
| COPY . . |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,194 @@ | ||
| # Compress Algorithm Tutorial | ||
|
|
||
| This tutorial demonstrates how to compress large language models using the compress algorithm based on the [Puzzle paper](https://arxiv.org/abs/2411.19146). | ||
| The goal of the algorithm is to find the optimal modifications to the MLP and attention layers of the model, resulting in a heterogeneous model architecture. | ||
| The supported modifications are: | ||
|
|
||
| - `ffn_intermediate_size`: different FFN intermediate sizes | ||
| - `attention op/noop`: complete removal of attention layers | ||
|
**Review comment:** Didn't we decide to keep the PoC to just FFN pruning, with no attention module replacement? |
||
|
|
||
| To use the Puzzle algorithm effectively, we need to specify the target number of parameters and/or the target memory. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the combination of layer modifications that best preserves accuracy while satisfying the target requirements. | ||
|
|
||
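Conceptually, the MIP stage picks exactly one variant per block (e.g., a particular FFN intermediate size, or an attention no-op) so that the summed block quality score is maximized while the total cost stays under the budget. The sketch below illustrates that selection problem with the `mip` package installed by the [Dockerfile](./Dockerfile); the variant names, scores, and memory numbers are invented for illustration, and this is not the actual Puzzle implementation.

```python
# Illustrative sketch of the per-block selection problem solved by the MIP stage.
# Scores and memory costs are made-up numbers, not real Puzzle statistics.
from mip import BINARY, Model, maximize, xsum

# For each block, candidate variants mapped to (quality_score, memory_mib).
blocks = [
    {"ffn_14336": (1.00, 3000), "ffn_8704": (0.97, 2100), "attn_noop": (0.95, 1200)},
    {"ffn_14336": (1.00, 3000), "ffn_8704": (0.96, 2100), "attn_noop": (0.90, 1200)},
]
target_memory_mib = 4500

m = Model()
choice = {
    (b, v): m.add_var(var_type=BINARY)
    for b, variants in enumerate(blocks)
    for v in variants
}

# Exactly one variant must be chosen per block.
for b, variants in enumerate(blocks):
    m += xsum(choice[b, v] for v in variants) == 1

# The summed memory of the chosen variants must fit the budget.
m += xsum(choice[b, v] * blocks[b][v][1] for (b, v) in choice) <= target_memory_mib

# Maximize the summed quality score.
m.objective = maximize(xsum(choice[b, v] * blocks[b][v][0] for (b, v) in choice))
m.optimize()

for b, variants in enumerate(blocks):
    picked = next(v for v in variants if choice[b, v].x >= 0.99)
    print(f"block_{b}: {picked}")
```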
| In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (a 15% reduction) with less than 1% regression in the token_accuracy_top_10 metric. | ||
|
**Review comment:** Do you know what parameter reduction we see as well? That would be useful info to add here. |
||
|
|
||
| ## Environment | ||
|
|
||
| - Use the provided [Dockerfile](./Dockerfile) (a build/run sketch is shown below). | ||
| - 2x NVIDIA H100 80GB HBM3 GPUs (a single GPU also works). | ||
|
|
||
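A typical way to build the image and start a container is sketched below; the `compress-example` tag is arbitrary, and the mounted host paths are placeholders that should be adjusted to where your models and datasets live.

```bash
# Build the image from the directory containing the Dockerfile (tag name is arbitrary).
docker build -t compress-example .

# Start an interactive container with GPU access; adjust the host paths to your setup.
docker run --gpus all -it --rm \
    -v /path/to/hf_models:/workspace/hf_models \
    -v /path/to/datasets:/workspace/datasets \
    compress-example
```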
| ## Compress the Model | ||
|
|
||
| 1. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file. | ||
|
|
||
| **_NOTE:_** | ||
| How to choose `intermediate_size_list`? | ||
| The list specifies the candidate FFN sizes to search over. It is recommended to choose several pruning ratios (e.g., 15%, 20%, 30% of the original size). Note that the values should be hardware-friendly (divisible by a multiple of 2) to avoid issues with tensor operations in subsequent steps; see the sketch below. | ||
|
**Review comment:** Let's recommend divisible by 64 instead? The FFN value is 14336, so having only multiples of 64 in the search space should be more than enough, no? |
||
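One way to generate such a list is to take a few retention ratios of the teacher's FFN size and round each down to a hardware-friendly multiple. The snippet below is only a sketch of that heuristic; the choice of 64 as the rounding multiple is an assumption (echoing the review suggestion above), not a requirement of the tool.

```python
# Sketch: derive candidate FFN sizes from the Llama-3.1-8B teacher size of 14336.
teacher_ffn = 14336
keep_ratios = [0.8, 0.6, 0.4, 0.2]  # i.e. pruning away 20%, 40%, 60%, 80% of the FFN

candidates = sorted({(int(teacher_ffn * r) // 64) * 64 for r in keep_ratios})
print(candidates)  # [2816, 5696, 8576, 11456]
```

The configuration shipped with this example uses the hand-picked list `[3072, 5888, 8704, 11520]`, which covers a similar range.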
|
|
||
| Let's first aim for a ~32% GPU memory reduction by setting `target_memory: 78_000` (the value is given in MiB, i.e., ~78 GiB). This means that the algorithm will choose the candidates with the highest accuracy that also meet the specified requirement. | ||
|
|
||
| 2. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2). | ||
|
|
||
| Dataset splits used: "code", "math", "stem", "chat", excluding reasoning samples (2.62 GB). | ||
|
|
||
| ```bash | ||
| python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2 | ||
| ``` | ||
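The output directory written by this step is what the `dataset_path` entry in the YAML configuration should point to.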
|
|
||
| 3. Run the compression script. | ||
|
|
||
| ```bash | ||
| torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress" | ||
| ``` | ||
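`--nproc_per_node` sets the number of worker processes (typically one per GPU); it is 2 here to match the environment above, and should be set to 1 when running on a single GPU.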
|
|
||
| This will save the full output to `log.txt` and display the following progress on screen: | ||
|
|
||
| ```bash | ||
| [2025-11-02 12:06:34] Compress Progress 1/8: starting compression pipeline | ||
| [2025-11-02 12:06:45] Compress Progress 2/8: converting model from HF to DeciLM (single-gpu) | ||
| [2025-11-02 12:07:07] Compress Progress 3/8: scoring pruning activations (multi-gpu) | ||
| [2025-11-02 12:11:36] Compress Progress 4/8: pruning the model and saving pruned checkpoints (single-gpu) | ||
| [2025-11-02 12:12:20] Compress Progress 5/8: building replacement library and subblock statistics (single-gpu) | ||
| [2025-11-02 12:12:21] Compress Progress 6/8: calculating one block scores (multi-gpu) | ||
| [2025-11-02 12:50:41] Compress Progress 7/8: running MIP and realizing models (multi-gpu) | ||
| [2025-11-02 12:52:34] Compress Progress 8/8: compression pipeline completed (multi-gpu) | ||
| ``` | ||
|
|
||
| Once the process is complete, the resulting network architecture will be recorded in `log.txt` for your review: | ||
|
|
||
| ```bash | ||
| ... | ||
| block_0: attention gqa_4 ffn intermediate_14336 | ||
|
**Review comment:** GQA4 will only work with TP4 if training in Megatron-fw. Maybe deployment also, but I don't know for sure. Should we remove GQA pruning from the search space? |
||
| block_1: attention gqa_4 ffn intermediate_14336 | ||
|
**Review comment:** Why is no FFN being pruned here? Is it because we use a memory target and attention takes more memory, so it's pruned first by Puzzle? |
||
| block_2: attention gqa_4 ffn intermediate_14336 | ||
| block_3: attention gqa_4 ffn intermediate_14336 | ||
| block_4: attention gqa_4 ffn intermediate_14336 | ||
| block_5: attention gqa_4 ffn intermediate_14336 | ||
| block_6: attention gqa_4 ffn intermediate_14336 | ||
| block_7: attention gqa_4 ffn intermediate_14336 | ||
| block_8: attention gqa_4 ffn intermediate_14336 | ||
| block_9: attention gqa_4 ffn intermediate_14336 | ||
| block_10: attention gqa_4 ffn intermediate_14336 | ||
| block_11: attention gqa_4 ffn intermediate_14336 | ||
| block_12: attention gqa_4 ffn intermediate_14336 | ||
| block_13: attention gqa_4 ffn intermediate_14336 | ||
| block_14: attention gqa_4 ffn intermediate_14336 | ||
| block_15: attention gqa_4 ffn intermediate_14336 | ||
| block_16: attention gqa_4 ffn intermediate_14336 | ||
| block_17: attention no_op ffn intermediate_14336 | ||
| block_18: attention no_op ffn intermediate_14336 | ||
| block_19: attention no_op ffn intermediate_14336 | ||
| block_20: attention no_op ffn intermediate_14336 | ||
| block_21: attention no_op ffn intermediate_14336 | ||
| block_22: attention no_op ffn intermediate_14336 | ||
| block_23: attention no_op ffn intermediate_14336 | ||
| block_24: attention no_op ffn intermediate_14336 | ||
| block_25: attention no_op ffn intermediate_14336 | ||
| block_26: attention no_op ffn intermediate_14336 | ||
| block_27: attention no_op ffn intermediate_14336 | ||
| block_28: attention no_op ffn intermediate_14336 | ||
| block_29: attention gqa_4 ffn intermediate_14336 | ||
| block_30: attention gqa_4 ffn intermediate_14336 | ||
| block_31: attention gqa_4 ffn intermediate_14336 | ||
|
|
||
| [2025-11-02 04:53:11,332][rank-0][run_puzzle.py:295] Total costs: {'stats.memory_mib': 75796.4140625, 'stats.ffn_num_params': 5637275648, 'stats.num_kv_heads': 160, 'stats.kv_cache_memory_mib': 61440.0, 'stats.ffn_memory_mib': 10752.25, 'stats.attention_memory_mib': 63040.15625, 'stats.attention_num_params': 838942720, 'stats.num_params': 7526895616, 'stats.has_attention': 20, 'stats.has_ffn': 32} | ||
| ... | ||
| ################################################################ | ||
| validate_model_and_extract_token_probs(model_name='teacher') | ||
| ################################################################ | ||
| ... | ||
| Average losses = {'lm_loss': 1.118250765837729, 'token_accuracy_top_1': 0.7331905364990234, 'token_accuracy_top_5': 0.9094219207763672, 'token_accuracy_top_10': 0.9423646926879883} | ||
| ... | ||
| ################################################################ | ||
| validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True) | ||
| ################################################################ | ||
| .... | ||
| Average losses = {'lm_loss': 1.7577573340386152, 'token_accuracy_top_1': 0.6225490570068359, 'token_accuracy_top_5': 0.846257209777832, 'token_accuracy_top_10': 0.8987817764282227} | ||
| ``` | ||
|
|
||
| A ~30% GPU memory reduction leads to nearly a 5% regression in the token_accuracy_top_10 metric (0.899 vs. 0.942). Let's rerun the MIP search aiming for a 15% memory reduction. | ||
|
|
||
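As a quick sanity check of the regression figures quoted here and in the introduction, the relative drop in token_accuracy_top_10 can be computed directly from the validation logs (the 0.9337 value comes from the 96 GiB run shown in the next section):

```python
# Relative token_accuracy_top_10 regression vs. the teacher (values from the validation logs).
teacher = 0.9424
print(1 - 0.8988 / teacher)  # ~0.046 -> ~5% regression at target_memory: 78_000
print(1 - 0.9337 / teacher)  # ~0.009 -> <1% regression at target_memory: 96_000
```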
| ## Re-run MIP Search with different constraints | ||
|
|
||
| If you want to try different constraints without re-running the expensive pruning and scoring steps, use the `--mip-only` flag. | ||
| This assumes pruning, replacement library building, NAS scoring, and subblock stats calculation have already been completed. | ||
|
|
||
| For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`. | ||
|
|
||
| ```bash | ||
| torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Compress Progress" | ||
| ``` | ||
|
|
||
| This will generate the following network architecture (see `log.txt`): | ||
|
|
||
| ```bash | ||
| block_0: attention gqa_4 ffn intermediate_14336 | ||
| block_1: attention gqa_4 ffn intermediate_14336 | ||
| block_2: attention gqa_4 ffn intermediate_14336 | ||
| block_3: attention gqa_4 ffn intermediate_14336 | ||
| block_4: attention gqa_4 ffn intermediate_14336 | ||
| block_5: attention gqa_4 ffn intermediate_14336 | ||
| block_6: attention gqa_4 ffn intermediate_14336 | ||
| block_7: attention gqa_4 ffn intermediate_14336 | ||
| block_8: attention gqa_4 ffn intermediate_14336 | ||
| block_9: attention gqa_4 ffn intermediate_14336 | ||
| block_10: attention gqa_4 ffn intermediate_14336 | ||
| block_11: attention gqa_4 ffn intermediate_14336 | ||
| block_12: attention gqa_4 ffn intermediate_14336 | ||
| block_13: attention gqa_4 ffn intermediate_14336 | ||
| block_14: attention gqa_4 ffn intermediate_14336 | ||
| block_15: attention gqa_4 ffn intermediate_14336 | ||
| block_16: attention gqa_4 ffn intermediate_14336 | ||
| block_17: attention gqa_4 ffn intermediate_14336 | ||
| block_18: attention no_op ffn intermediate_14336 | ||
| block_19: attention no_op ffn intermediate_14336 | ||
| block_20: attention no_op ffn intermediate_14336 | ||
| block_21: attention gqa_4 ffn intermediate_14336 | ||
| block_22: attention no_op ffn intermediate_14336 | ||
| block_23: attention no_op ffn intermediate_14336 | ||
| block_24: attention no_op ffn intermediate_14336 | ||
| block_25: attention gqa_4 ffn intermediate_14336 | ||
| block_26: attention gqa_4 ffn intermediate_14336 | ||
| block_27: attention gqa_4 ffn intermediate_14336 | ||
| block_28: attention gqa_4 ffn intermediate_14336 | ||
| block_29: attention gqa_4 ffn intermediate_14336 | ||
| block_30: attention gqa_4 ffn intermediate_14336 | ||
| block_31: attention gqa_4 ffn intermediate_14336 | ||
|
|
||
| [2025-11-02 12:50:42,024][rank-0][run_puzzle.py:295] Total costs: {'stats.memory_mib': 94708.4609375, 'stats.has_ffn': 32, 'stats.ffn_memory_mib': 10752.25, 'stats.kv_cache_memory_mib': 79872.0, 'stats.attention_num_params': 1090625536, 'stats.ffn_num_params': 5637275648, 'stats.has_attention': 26, 'stats.num_params': 7778578432, 'stats.attention_memory_mib': 81952.203125, 'stats.num_kv_heads': 208} | ||
| ... | ||
| ################################################################ | ||
| validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True) | ||
| ################################################################ | ||
| Average losses = {'lm_loss': 1.2425934937782586, 'token_accuracy_top_1': 0.703862190246582, 'token_accuracy_top_5': 0.8954982757568359, 'token_accuracy_top_10': 0.9336576461791992} | ||
| ``` | ||
|
|
||
| On the other hand, if you set `target_memory: 28_000`, you'll observe that the intermediate FFN sizes are significantly reduced in certain layers (see `log.txt` for details): | ||
|
|
||
| ```bash | ||
| block_5: attention no_op ffn intermediate_11520 | ||
| block_6: attention no_op ffn intermediate_14336 | ||
| block_7: attention no_op ffn intermediate_8704 | ||
| block_8: attention no_op ffn intermediate_14336 | ||
| block_9: attention no_op ffn intermediate_3072 | ||
| block_10: attention no_op ffn intermediate_11520 | ||
| block_11: attention no_op ffn intermediate_11520 | ||
| block_12: attention no_op ffn intermediate_11520 | ||
| block_13: attention no_op ffn intermediate_11520 | ||
| block_14: attention no_op ffn intermediate_3072 | ||
| ``` | ||
|
|
||
| ## Evaluation | ||
|
|
||
| Once the model is ready, you can evaluate it using the [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following command to evaluate the model on the humanities subset of [MMLU](https://huggingface.co/datasets/cais/mmlu): | ||
|
|
||
| ```bash | ||
| lm_eval --model hf \ | ||
| --model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \ | ||
| --tasks mmlu_humanities \ | ||
|
**Review comment:** why? |
||
| --num_fewshot 5 \ | ||
| --batch_size 4 | ||
| ``` | ||
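With the configuration used in this tutorial, the realized solution checkpoints are saved under `${puzzle_dir}/mip/puzzle_solutions` (see `mip.output_path` and `realize_model.save_models` in the YAML); point `pretrained=` at the solution directory you want to evaluate. The exact subdirectory name depends on the run.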
|
|
||
| ## Advanced usage | ||
|
|
||
| Modify the `path/to/Llama-3_1-8B.yaml` file for advanced compression scenarios. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| defaults: | ||
| - pruning: ffn_pruning | ||
| - scoring: ../validate_solutions_defaults | ||
| - realize_model: ../validate_solutions_defaults | ||
| - bypass: | ||
| - override hydra/hydra_logging: disabled | ||
| - _self_ | ||
|
|
||
| puzzle_dir: ??? | ||
| teacher_dir: ${puzzle_dir}/ckpts/teacher/ | ||
| replacement_library_path: ${puzzle_dir}/replacement_library.json | ||
| dataset_path: ??? # path to v0.4_mini | ||
|
|
||
| skip_realize_model: false | ||
|
|
||
| build_replacement_library: | ||
| add_ffn_no_ops: true | ||
| add_attention_no_ops: true | ||
|
|
||
| calc_subblock_stats: | ||
| batch_sizes: [64, 96, 128] | ||
| prefill_seq_len: 4096 | ||
| generation_seq_len: 4096 | ||
| num_active_tokens_override: # Optional override for sequence lengths | ||
| prefill_queue_size: 0 | ||
| allocate_prefill_query: false | ||
| benchmark_iterations: # Set to a number (e.g., 1000) to enable runtime benchmarking | ||
| merge_with_existing_stats: false | ||
| subblock_stats_filename: "subblock_stats.json" | ||
| moe_stats_filename: "moe_stats.json" | ||
| runtime_stats: | ||
| backend: trt_torch | ||
|
|
||
| scoring: | ||
| solutions_to_validate: | ||
| skip_existing_solutions: true | ||
|
|
||
| replacement_library_path: ${replacement_library_path} | ||
| solutions_path: ${to_path:${puzzle_dir}/single_sequence_replacement_solutions.json} | ||
| teacher_dir: ${to_path:${teacher_dir}} | ||
| output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation | ||
|
|
||
| eval_samples: 10 # default is 128 | ||
| micro_batch_size: 1 | ||
| seed: 42 | ||
| shuffle_seed: 444 | ||
| dataset_path: ${dataset_path} | ||
|
|
||
| mip: | ||
| single_block_replacement_validation_dir: ${to_path:${scoring.output_dir}} | ||
| subblock_stats_path: ${to_path:${puzzle_dir}/${calc_subblock_stats.subblock_stats_filename}} | ||
| output_path: ${to_path:${puzzle_dir}/mip/puzzle_solutions} | ||
| gathered_metrics_path: | ||
| puzzle_profile: | ||
|
|
||
| # puzzle_profile: | ||
| objective: metrics.cosine_embedding_loss_hidden_states | ||
| bigger_is_better: false | ||
| num_solutions: 1 | ||
| minimal_diversity: 2 | ||
|
|
||
| subblock_stats_args: | ||
| - batch_size: 96 | ||
| weights_dtype: torch.bfloat16 | ||
| activations_dtype: torch.bfloat16 | ||
| kv_cache_dtype: torch.bfloat16 | ||
|
|
||
| report_additional_costs: | ||
| - stats.memory_mib | ||
| - stats.num_params | ||
| - stats.num_kv_heads | ||
| - stats.has_attention | ||
| - stats.has_ffn | ||
| - stats.kv_cache_memory_mib | ||
| - stats.attention_memory_mib | ||
| - stats.ffn_memory_mib | ||
| - stats.ffn_num_params | ||
| - stats.attention_num_params | ||
|
|
||
| human_constraints: | ||
| target_memory: 78_000 | ||
|
|
||
| mip_constraints: | ||
| use_greedy_search: false | ||
| is_multi_layer_puzzle: true | ||
| metric_overrides: | ||
| constrain_search_func: | ||
| max_seconds_per_solution: 60 | ||
|
|
||
| realize_model: | ||
| teacher_dir: ${to_path:${teacher_dir}} | ||
| tokenizer_name: ${to_path:${teacher_dir}} | ||
| replacement_library_path: ${replacement_library_path} | ||
| save_models: true | ||
| solutions_path: # Filled dynamically | ||
|
|
||
| # Validate params | ||
| skip_validation: false # Set to false to enable validation of the model solution | ||
| eval_samples: 128 | ||
| micro_batch_size: 1 | ||
| seed: 42 | ||
| shuffle_seed: 444 | ||
| dataset_path: ${dataset_path} | ||
|
|
||
| nccl_timeout_minutes: ${timedelta_minutes:10} | ||
|
|
||
| # This section redirects Hydra outputs | ||
| hydra: | ||
| run: | ||
| dir: ${puzzle_dir}/hydra_logs/${now:%Y-%m-%d}/${now:%H-%M-%S} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| defaults: | ||
| - Llama-3_1-8B | ||
| - _self_ | ||
|
|
||
| # Input Hugging Face model to compress | ||
| input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct | ||
|
|
||
| # Dataset path for pruning and NAS scoring | ||
| dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2 | ||
|
|
||
| # Working directory for compression outputs | ||
| puzzle_dir: /workspace/puzzle_dir | ||
|
|
||
| # MIP memory constraint (in MiB) | ||
| mip: | ||
| human_constraints: | ||
| target_memory: 96_000 # 96 GiB | ||
|
|
||
| # FFN intermediate sizes to search over (heterogeneous architecture) | ||
| pruning: | ||
| intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336 |
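To reproduce the first (~32% reduction) run from the tutorial, the only change needed in this file should be the memory constraint; a sketch of that override:

```yaml
# Tighter memory budget used for the first run in the tutorial (value in MiB).
mip:
  human_constraints:
    target_memory: 78_000  # ~78 GiB
```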
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| defaults: | ||
| - pruning_defaults | ||
|
|
||
| activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/attn_${pruning.activation_hooks_kwargs.method}/${pruning.experiment_id} | ||
|
|
||
| activation_hooks_kwargs: | ||
| method: independent_kv_head_contribution | ||
| optimize_for: memory # IndependentKvHeadContributionHook implementation that consumes less memory | ||
| target_layer: "self_attn.o_proj" | ||
| layer_input_descriptors_path: | ||
|
|
||
| # n_heads_in_group: 4 | ||
| # num_attention_heads: 32 # num query heads | ||
| # num_kv_heads: 32 / 4 = 8 # num_query_heads // n_heads_in_group | ||
| n_heads_in_group_list: [8, 16, 32] # num_kv_heads = [4, 2, 1] | ||
| gqa_init_mode: "PruneKVHeads" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| defaults: | ||
| - pruning_defaults | ||
|
|
||
| activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/ffn_${pruning.activation_hooks_kwargs.method}/${pruning.experiment_id} | ||
|
|
||
| activation_hooks_kwargs: | ||
| method: iterative | ||
| target_layer: "mlp.down_proj" | ||
| layer_input_descriptors_path: | ||
|
|
||
| intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336 | ||
| mlp_init_mode: "PruneByActivationsLog" |
**Review comment:** Why do we need a Dockerfile? If Puzzle is self-contained in modelopt and its dependencies are in `setup.py`, then installing modelopt will install everything needed, and users would just need a TRT-LLM docker image without any custom Dockerfile or docker build step.