34 commits
7f81e00  Changes to native runner to run tt (jackzhxng, Oct 9, 2024)
0b5a9a7  Add kwarg example inputs to eager model base (jackzhxng, Sep 30, 2024)
a9647d2  Create create new method for example kwarg inputs instead (jackzhxng, Oct 7, 2024)
fa3b1d2  Add kwarg example inputs to eager model base (jackzhxng, Sep 30, 2024)
e8715ba  Lint (jackzhxng, Oct 8, 2024)
a6f96a2  Accept model type parameter in export_llama (jackzhxng, Oct 5, 2024)
328c72c  Remove future implementation (jackzhxng, Oct 5, 2024)
ec80bba  Lint (jackzhxng, Oct 15, 2024)
c9bbe12  Create create new method for example kwarg inputs instead (jackzhxng, Oct 7, 2024)
99d5bfb  Accept model type parameter in export_llama (jackzhxng, Oct 5, 2024)
1fb2236  Torchtune llama3_2_vision model in ET, no quantization (jackzhxng, Oct 5, 2024)
e0c4b8a  Fix vision model example input (jackzhxng, Oct 8, 2024)
e145bd1  Lint (jackzhxng, Oct 22, 2024)
ed906cb  Kv cache (jackzhxng, Oct 25, 2024)
6dd47e7  Merge branch 'main' into jz/tt-llama (jackzhxng, Oct 25, 2024)
1825972  Update READMEs (jackzhxng, Oct 25, 2024)
196499a  Change model default arg (jackzhxng, Oct 25, 2024)
96ba40b  Update eager runner and eval llama (jackzhxng, Oct 25, 2024)
18a82e1  Merge branch 'jz/tt-llama-rebased' into jz/tt-llama-2 (jackzhxng, Oct 25, 2024)
0f3035d  Fix tests (jackzhxng, Oct 25, 2024)
e677e14  Merge branch 'jz/tt-llama-rebased' into jz/tt-llama-2 (jackzhxng, Oct 25, 2024)
b1f6678  Fix tests again (jackzhxng, Oct 28, 2024)
13d004b  Merge branch 'jz/tt-llama-rebased' into jz/tt-llama-2 (jackzhxng, Oct 28, 2024)
c79b773  Strict = True (jackzhxng, Oct 31, 2024)
b8ff8e2  Things work (jackzhxng, Oct 31, 2024)
25ec7ce  Merge branch 'jz/tt-llama-rebased' into jz/native-runner-tt (jackzhxng, Oct 31, 2024)
6e38763  Clip logits if torchtune (jackzhxng, Oct 31, 2024)
7a7041d  Merge branch 'jz/tt-llama-2' into jz/native-runner-tt (jackzhxng, Oct 31, 2024)
96d5798  Fix (jackzhxng, Oct 31, 2024)
f275e2e  Kv cache by default is false (jackzhxng, Nov 1, 2024)
37011d3  Clean up (jackzhxng, Nov 1, 2024)
7d52002  Export model with KV cache + runner for Torchtune models (jackzhxng, Nov 4, 2024)
e44b259  Export with no kv cache + non-strict load checkpoint (jackzhxng, Nov 6, 2024)
b7c8315  Add Llama3.2 1B as an new example model (jackzhxng, Nov 7, 2024)
1 change: 1 addition & 0 deletions .ci/scripts/test_eval_llama_mmlu.sh
@@ -35,6 +35,7 @@ run_and_verify() {
exit 1
fi
$PYTHON_EXECUTABLE -m examples.models.llama.eval_llama \
--model llama2 \
-c stories110M.pt \
-p params.json \
-t tokenizer.model \
1 change: 1 addition & 0 deletions .ci/scripts/test_eval_llama_wikitext.sh
@@ -35,6 +35,7 @@ run_and_verify() {
exit 1
fi
$PYTHON_EXECUTABLE -m examples.models.llama.eval_llama \
--model llama2 \
-c stories110M.pt \
-p params.json \
-t tokenizer.model \
2 changes: 1 addition & 1 deletion .ci/scripts/test_llama.sh
@@ -206,7 +206,7 @@ if [[ "${QNN}" == "ON" ]]; then
EXPORT_ARGS="${EXPORT_ARGS} -kv -v --qnn --disable_dynamic_shape"
fi
# Add dynamically linked library location
$PYTHON_EXECUTABLE -m examples.models.llama.export_llama ${EXPORT_ARGS}
$PYTHON_EXECUTABLE -m examples.models.llama.export_llama --model llama3 ${EXPORT_ARGS}

# Create tokenizer.bin.
echo "Creating tokenizer.bin"
1 change: 1 addition & 0 deletions .ci/scripts/test_llama_runner_eager.sh
@@ -35,6 +35,7 @@ run_and_verify() {
exit 1
fi
$PYTHON_EXECUTABLE -m examples.models.llama.runner.eager \
--model llama2 \
-c stories110M.pt \
-p params.json \
-t tokenizer.model \
2 changes: 1 addition & 1 deletion .ci/scripts/test_model.sh
@@ -77,7 +77,7 @@ test_model() {
# Install requirements for export_llama
bash examples/models/llama/install_requirements.sh
# Test export_llama script: python3 -m examples.models.llama.export_llama
"${PYTHON_EXECUTABLE}" -m examples.models.llama.export_llama -c examples/models/llama/params/demo_rand_params.pth -p examples/models/llama/params/demo_config.json
"${PYTHON_EXECUTABLE}" -m examples.models.llama.export_llama --model llama2 -c examples/models/llama/params/demo_rand_params.pth -p examples/models/llama/params/demo_config.json
run_portable_executor_runner
rm "./${MODEL_NAME}.pte"
fi
1 change: 1 addition & 0 deletions backends/vulkan/docs/android_demo.md
@@ -58,6 +58,7 @@ partially lower the Llama model to Vulkan.
```shell
# The files will usually be downloaded to ~/.llama
python -m examples.models.llama.export_llama \
--model llama3_2 \
--disable_dynamic_shape --vulkan -kv --use_sdpa_with_kv_cache -d fp32 \
-c ~/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \
-p ~/.llama/checkpoints/Llama3.2-1B/params.json \
@@ -39,7 +39,7 @@ To export Llama 3 8B instruct with the Qualcomm AI Engine Direct Backend, ensure

```bash
# Please note that calibration_data must include the prompt template for special tokens.
python -m examples.models.llama.export_llama -t <path_to_tokenizer.model>
python -m examples.models.llama.export_llama --model llama3 -t <path_to_tokenizer.model>
llama3/Meta-Llama-3-8B-Instruct/tokenizer.model -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

@@ -101,12 +101,12 @@ We support PTQ by default. The entire export may take ~20 minutes (Llama 3.1 8B)
Examples:
```
# 4 bits weight only quantize
python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte”
python -m examples.models.llama.export_llama --model llama3 --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte”
```
If the model is really big, it may require model sharding because the Qualcomm DSP is a 32-bit system and has a 4GB size limit. For example, for Llama 3 8B models we need to split the model into 4 shards, but ExecuTorch still packages it into one PTE file. Here is an example:
```
# 8 bits quantization with 4 shards
python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 --num_sharding 4 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte”
python -m examples.models.llama.export_llama --model llama3 --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 --num_sharding 4 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte”
```
Note: if you encounter the issues below
```
@@ -158,7 +158,7 @@ To export Llama 3 8B instruct with the Qualcomm AI Engine Direct Backend, ensure
* 8B models might need 16GB RAM on the device to run.
```
# Please note that calibration_data must include the prompt template for special tokens.
python -m examples.models.llama.export_llama -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
python -m examples.models.llama.export_llama --model llama3 -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

## Pushing Model and Tokenizer
@@ -56,14 +56,14 @@ In this demo app, we support text-only inference with up-to-date Llama models and
Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
```

### For Llama 3.2 1B and 3B QAT+LoRA models
Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
```
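For context on the `-qat -lora 16` flags above: LoRA adds small low-rank adapter matrices (rank 16 here) on top of frozen base weights. The snippet below is a minimal toy sketch of that idea, not ExecuTorch's or torchtune's implementation; all names in it are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: y = W x + (alpha / r) * B(A(x)), with the base W frozen."""

    def __init__(self, in_features, out_features, rank=16, alpha=32.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                    # frozen pretrained weight
        self.lora_a = nn.Linear(in_features, rank, bias=False)    # down-projection (trainable)
        self.lora_b = nn.Linear(rank, out_features, bias=False)   # up-projection (trainable)
        nn.init.zeros_(self.lora_b.weight)                        # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinear(256, 256, rank=16)
print(layer(torch.randn(2, 256)).shape)  # torch.Size([2, 256])
```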

### For Llama 3.2 1B and 3B BF16 models
@@ -72,7 +72,7 @@ We have supported BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B
* Export Llama model and generate .pte file as below:

```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
```

For more detail on using Llama 3.2 lightweight models, including the prompt template, please go to our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
@@ -87,7 +87,7 @@ To safeguard your application, you can use our Llama Guard models for prompt cla
* We prepared this model using the following command

```
python -m examples.models.llama.export_llama --checkpoint <path-to-pruned-llama-guard-1b-checkpoint.pth> --params <path-to-your-params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <path-to-your-llama_guard-pruned-layers-map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-pruned-llama-guard-1b-checkpoint.pth> --params <path-to-your-params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <path-to-your-llama_guard-pruned-layers-map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
```


@@ -97,7 +97,7 @@ python -m examples.models.llama.export_llama --checkpoint <path-to-pruned-llama-
* Export Llama model and generate .pte file as below:

```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
```

You may wonder what the ‘--metadata’ flag is doing. This flag exports the model with the proper special tokens recorded so that the runner can easily detect EOS tokens.
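As a minimal illustration of how such metadata could be consumed (this is a toy sketch, not the actual ExecuTorch runner code), a runner can parse the exported `get_eos_ids` list and stop generation when any of those ids appears:

```python
import json

# Toy sketch only: parse the string passed via --metadata and stop generation
# once the model emits any of the recorded EOS ids.
metadata = json.loads('{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}')
eos_ids = set(metadata["get_eos_ids"])

def generate_until_eos(token_stream):
    out = []
    for tok in token_stream:
        if tok in eos_ids:
            break
        out.append(tok)
    return out

print(generate_until_eos([128000, 9906, 1917, 128009]))  # [128000, 9906, 1917]
```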
@@ -45,9 +45,9 @@ Install the required packages to export the model
sh examples/models/llama/install_requirements.sh
```

Export the model
Export the model (Llama 3 in this case)
```
python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" --params "${MODEL_DIR}/params.json" -kv --use_sdpa_with_kv_cache --mps -d fp32 --disable_dynamic_shape -qmode 8da4w -G 32
python -m examples.models.llama.export_llama --model llama3 --checkpoint "${MODEL_DIR}/consolidated.00.pth" --params "${MODEL_DIR}/params.json" -kv --use_sdpa_with_kv_cache --mps -d fp32 --disable_dynamic_shape -qmode 8da4w -G 32
```

## Pushing Model and Tokenizer
@@ -48,14 +48,14 @@ sh examples/models/llama/install_requirements.sh
Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
```

### For Llama 3.2 1B and 3B QAT+LoRA models
Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
```

### For Llama 3.2 1B and 3B BF16 models
@@ -64,7 +64,7 @@ We have supported BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B
* Export Llama model and generate .pte file as below:

```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
```

For more detail on using Llama 3.2 lightweight models, including the prompt template, please go to our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
@@ -73,7 +73,7 @@ For more detail using Llama 3.2 lightweight models including prompt template, pl

Export the model
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> -p <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
python -m examples.models.llama.export_llama --model llama3_2 --checkpoint <path-to-your-checkpoint.pth> -p <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
```

### For LLaVA model
18 changes: 16 additions & 2 deletions examples/models/llama/README.md
@@ -166,6 +166,7 @@ LLAMA_CHECKPOINT=path/to/checkpoint.pth
LLAMA_PARAMS=path/to/params.json

python -m examples.models.llama.export_llama \
--model llama3_2 \
--checkpoint "${LLAMA_CHECKPOINT:?}" \
--params "${LLAMA_PARAMS:?}" \
-kv \
@@ -187,6 +188,7 @@ LLAMA_QUANTIZED_CHECKPOINT=path/to/spinquant/checkpoint.pth
LLAMA_PARAMS=path/to/spinquant/params.json

python -m examples.models.llama.export_llama \
--model llama3_2 \
--checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" \
--params "${LLAMA_PARAMS:?}" \
--use_sdpa_with_kv_cache \
@@ -212,6 +214,7 @@ LLAMA_QUANTIZED_CHECKPOINT=path/to/qlora/checkpoint.pth
LLAMA_PARAMS=path/to/qlora/params.json

python -m examples.models.llama.export_llama \
--model llama3_2 \
--checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" \
--params "${LLAMA_PARAMS:?}" \
-qat \
@@ -237,9 +240,20 @@ You can export and run the original Llama 3 8B instruct model.

2. Export model and generate `.pte` file
```
python -m examples.models.llama.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
python -m examples.models.llama.export_llama \
--model llama3 \
--checkpoint <consolidated.00.pth> \
-p <params.json> \
-kv \
--use_sdpa_with_kv_cache \
-X \
-qmode 8da4w \
--group_size 128 \
-d fp32 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
--embedding-quantize 4,32 \
--output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
```

Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
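For intuition, `4,32` here means 4-bit weights with one scale per group of 32 values. The snippet below is a rough sketch of such a groupwise scheme under that assumption, not the quantizer ExecuTorch actually uses:

```python
import torch

def quantize_groupwise(weight: torch.Tensor, bits: int = 4, group_size: int = 32):
    """Symmetric groupwise quantization: one scale per group of `group_size` values."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    qmin = -(2 ** (bits - 1))                        # -8 for 4-bit
    rows, cols = weight.shape
    assert cols % group_size == 0
    grouped = weight.reshape(rows, cols // group_size, group_size)
    scale = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / qmax
    q = torch.clamp(torch.round(grouped / scale), qmin, qmax).to(torch.int8)
    return q.reshape(rows, cols), scale.squeeze(-1)

emb = torch.randn(1000, 64)                          # toy stand-in for an embedding table
q, scales = quantize_groupwise(emb)
print(q.dtype, q.shape, scales.shape)                # torch.int8, (1000, 64), (1000, 2)
```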

## Step 3: Run on your computer to validate
8 changes: 4 additions & 4 deletions examples/models/llama/UTILS.md
@@ -19,17 +19,17 @@ From `executorch` root:
```
3. Export model and generate `.pte` file.
```
python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
python -m examples.models.llama.export_llama --model llama3 -c stories110M.pt -p params.json -X -kv
```

## Smaller model delegated to other backends

Currently we support lowering the stories model to other backends, including CoreML, MPS and QNN. Please refer to the instructions
for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower them. After the backend library is installed, the script to export a lowered model is:

- Lower to CoreML: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json `
- MPS: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json `
- QNN: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json `
- Lower to CoreML: `python -m examples.models.llama.export_llama --model llama3 -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json `
- MPS: `python -m examples.models.llama.export_llama --model llama3 -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json `
- QNN: `python -m examples.models.llama.export_llama --model llama3 -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json `

The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it.

2 changes: 1 addition & 1 deletion examples/models/llama/eval_llama_lib.py
@@ -191,7 +191,7 @@ def gen_eval_wrapper(

pt2e_quant_params, quantizers, quant_dtype = get_quantizer_and_quant_params(args)
# GPTFastEvalWrapper: Create a wrapper around a pre-exported model
manager: LLMEdgeManager = _prepare_for_llama_export(model_name, args)
manager: LLMEdgeManager = _prepare_for_llama_export(args)

if len(quantizers) != 0:
manager = manager.export().pt2e_quantize(quantizers)
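The pattern running through this diff is that the export and eval entry points now take an explicit `--model` argument and read it from `args` downstream, which is why `_prepare_for_llama_export` no longer needs a separate `model_name` parameter. A minimal sketch of that plumbing is below; the flag choices, default, and helper names are illustrative assumptions, not the authoritative `export_llama` interface:

```python
import argparse

# Illustrative choices only; the real export_llama accepts its own set of model names.
MODEL_CHOICES = ["llama2", "llama3", "llama3_1", "llama3_2", "llama3_2_vision"]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Toy CLI mirroring the --model pattern.")
    parser.add_argument("--model", default="llama3", choices=MODEL_CHOICES,
                        help="Which eager model definition to export.")
    parser.add_argument("-c", "--checkpoint", help="Path to checkpoint .pth")
    parser.add_argument("-p", "--params", help="Path to params.json")
    return parser

def prepare_for_export(args: argparse.Namespace) -> None:
    # Downstream helpers read the model family from args instead of taking a
    # separately threaded model_name argument.
    print(f"Exporting {args.model} from {args.checkpoint}")

if __name__ == "__main__":
    prepare_for_export(build_parser().parse_args(["--model", "llama3_2", "-c", "ckpt.pth"]))
```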