
Commit 7fb6731

kirklandsign and lucylq (authored and committed)
Add Model type toString() (#85)
1 parent 9fc32b1 commit 7fb6731

File tree

3 files changed: +228, -59 lines changed

program-data-separation/cpp/lora_example/README.md

Lines changed: 93 additions & 58 deletions
@@ -3,7 +3,7 @@
This directory contains the C++ code for the LoRA demo.

You'll learn how to:
-1. Export two LoRA PTE files that share a single foundation weight file.
+1. Export LoRA PTE files that share a single foundation weight file.
2. Load and run the LoRA PTE files, and notice that the runtime memory is not doubled as the foundation weights are shared.

Note:
@@ -13,12 +13,43 @@ Note:

## Size savings.

Size results will vary depending on the model and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB of runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.

### XNNPACK weight sharing.

The XNNPACK backend is a singleton. Weight sharing is implemented via the XNNPACK weight cache. At delegate init time, XNNPACK checks the weight cache for the weights it needs. If they don't exist, XNNPACK fetches the weights from the NamedDataMap (the API that exposes weights in a PTD file), packs them, stores them in the weight cache, and frees the originals. This means we won't keep multiple copies of the same weights around.

+## [Quick Start](quick_start.md)
+Download the pre-trained dummy adapter to export and run alongside a regular Llama-3-2-1B model.
+
+## Fine-tune from scratch with Unsloth and Llama-3-2-1B.
+We can use [Unsloth](https://unsloth.ai/), a popular tool for fine-tuning and training LLMs, to create our LoRA adapters. Unsloth provides a [colab notebook](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/datasets-guide#synthetic-dataset-notebook) that shows how to generate training data using the Meta Synthetic Data Kit.
+
+The training notebook takes a few shortcuts to reduce latency/compute. You can change these settings for better results:
+1. Play around with the chunk sizes and overlap to see what works best for your dataset.
+2. The notebook trains on the last three data files generated; increase this for better coverage of your dataset.
+3. At the training step, the notebook uses max_steps=60 to speed things up. Setting num_train_epochs=1 (or greater) and max_steps=None for a full run gives better results.
+
+For this demo, we trained on two datasets:
+1. executorch/docs/source: an adapter with domain knowledge of ExecuTorch. Using the Meta Synthetic Data Kit, you can generate QA pairs based on the ExecuTorch documentation.
+2. Recent Nobel prize winners (2024-2025): an adapter with knowledge beyond the cutoff date of Llama-3-2-1B. This data was taken from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Nobel_laureates).
+
+Unsloth writes the adapter artifacts to the specified directory ('lora_model/' in the colab notebook). You will see files like these:
+```
+-rw-r--r-- 1 lfq users 1092 Oct 15 11:01 adapter_config.json
+-rw-r--r-- 1 lfq users 45118424 Oct 15 11:01 adapter_model.safetensors
+-rw-r--r-- 1 lfq users 3827 Oct 15 11:01 chat_template.jinja
+-rw-r--r-- 1 lfq users 5268 Oct 15 11:01 README.md
+-rw-r--r-- 1 lfq users 454 Oct 15 11:01 special_tokens_map.json
+-rw-r--r-- 1 lfq users 50642 Oct 15 11:01 tokenizer_config.json
+-rw-r--r-- 1 lfq users 17209920 Oct 15 11:01 tokenizer.json
+```
+
+The files we want are:
+- adapter_config.json
+- adapter_model.safetensors
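As a quick, optional sanity check (assuming the 'lora_model/' output directory from the notebook), confirm both files are present before exporting:
```bash
# List the two adapter artifacts the export step needs.
ls -l lora_model/adapter_config.json lora_model/adapter_model.safetensors
```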
## Virtual environment setup.
Create and activate a Python virtual environment:
```bash
@@ -30,11 +61,6 @@ conda create -yn executorch-ptd python=3.10.0 && conda activate executorch-ptd
```

## Install executorch
-Please install executorch. If you are using your own trained adapter (not the example one), please use a recent nightly build or install from source.
-
-```
-pip install executorch==1.0.0
-```

You can also install from the nightly build.
```
@@ -43,47 +69,63 @@ pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pyt

Or [install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).

+```
+# Clone the ExecuTorch repo from GitHub.
+git clone https://github.com/pytorch/executorch.git && cd executorch

-## Export the model/s.
-Change into the program-data-separation directory and create a directory to hold exported artifacts.
-```bash
-cd ~/executorch-examples/program-data-separation
-mkdir models
+# Install ExecuTorch pip package.
+./install_executorch.sh --editable
```

-Export models into the `models` directory.
-- The first command generates a regular llama_3_2_1B model.
-- The second command generates a llama_3_2_1B lora model.
+NOTE: some features are not available in executorch==1.0.0; use main or a recent nightly.

-```bash
-sh export_lora.sh
+## Download base model
+We're using https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct.
```
-Expect the files:
-- llama_3_2_1B.pte
-- llama_3_2_1B.ptd
-- llama_3_2_1B_lora.pte
-- foundation_weights.ptd
-- tokenizer.model
+pip install huggingface_hub

-llama_3_2_1B.ptd and foundation_weights.ptd contain the same contents, and you can remove llama_3_2_1B.ptd.
-tokenizer.model is copied from the temp directory where we downloaded the HF artifacts. It is used at runtime.
+# As this is a gated model, login.
+huggingface-cli login
+huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir ./Llama-3.2-1B-Instruct
+```
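The export step below reads the original Meta checkpoint layout from this download. A quick, optional check (directory name as used above):
```bash
# Expect consolidated.00.pth, params.json and tokenizer.model here.
ls Llama-3.2-1B-Instruct/original
```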

-Note:
-- PTE: contains the program execution logic.
-- PTD: contains the constant tensors used by the PTE. This format is similar to safetensors. It relies on flatbuffers instead of json for serde.
+## Export the adapter models.

-Sample file sizes:
+Set your paths and the model name.
```
--rw-r--r-- 1 lfq users 5994013600 Oct 17 14:31 foundation.ptd
--rw-r--r-- 1 lfq users 27628928 Oct 17 14:31 llama_3_2_1B_lora.pte
--rw-r--r-- 1 lfq users 317248 Oct 17 14:28 llama_3_2_1B.pte
+DOWNLOADED_PATH=Llama-3.2-1B-Instruct
+ADAPTER_PATH=lora_model
+MODEL_NAME=<model_name>
```

-Notice the lora - llama file size difference is about 27.3MB. This is the size of the adapter weights, and changes depending on the LoRA config. This demo is using the config from https://huggingface.co/lucylq/llama3_1B_lora/blob/main/adapter_config.json.
+Export command. Run this with a different MODEL_NAME for each adapter.
+```
+python -m executorch.extension.llm.export.export_llm \
+  base.checkpoint="${DOWNLOADED_PATH}/original/consolidated.00.pth" \
+  base.params="${DOWNLOADED_PATH}/original/params.json" \
+  base.tokenizer_path="${DOWNLOADED_PATH}/original/tokenizer.model" \
+  base.adapter_checkpoint="${ADAPTER_PATH}/adapter_model.safetensors" \
+  base.adapter_config="${ADAPTER_PATH}/adapter_config.json" \
+  model.use_kv_cache=true \
+  model.use_sdpa_with_kv_cache=true \
+  model.dtype_override="fp32" \
+  backend.xnnpack.enabled=true \
+  backend.xnnpack.extended_ops=true \
+  export.output_name="${MODEL_NAME}.pte" \
+  export.foundation_weights_file="foundation.ptd"
+```
+
+Expect to see two files: '<model_name>.pte' and 'foundation.ptd'. Run the command again to generate more adapter PTE files. For example:
+
```
-{"r": 64, "lora_alpha": 128, "target_modules": ["q_proj", "v_proj", "o_proj"], "peft_type": "LORA", "base_model_name_or_path": "meta-llama/Llama-3.2-1B-Instruct"}
+-rw-r--r-- 1 lfq users 45555712 Oct 17 18:05 et.pte
+-rw-r--r-- 1 lfq users 5994013600 Oct 17 18:05 foundation.ptd
+-rw-r--r-- 1 lfq users 45555712 Oct 17 18:00 nobel.pte
```

+The `foundation.ptd` file should be the same regardless of the adapter.
+Notice the adapter PTE files are about the size of the adapter_model.safetensors file generated during training. Each PTE contains the adapter weights (which are not shared) and the program.

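If you want to script the repeated export, here is a minimal sketch (the adapter directory names lora_model_et/ and lora_model_nobel/ are hypothetical; substitute your own Unsloth output directories):
```bash
DOWNLOADED_PATH=Llama-3.2-1B-Instruct
for NAME in et nobel; do
  python -m executorch.extension.llm.export.export_llm \
    base.checkpoint="${DOWNLOADED_PATH}/original/consolidated.00.pth" \
    base.params="${DOWNLOADED_PATH}/original/params.json" \
    base.tokenizer_path="${DOWNLOADED_PATH}/original/tokenizer.model" \
    base.adapter_checkpoint="lora_model_${NAME}/adapter_model.safetensors" \
    base.adapter_config="lora_model_${NAME}/adapter_config.json" \
    model.use_kv_cache=true \
    model.use_sdpa_with_kv_cache=true \
    model.dtype_override="fp32" \
    backend.xnnpack.enabled=true \
    backend.xnnpack.extended_ops=true \
    export.output_name="${NAME}.pte" \
    export.foundation_weights_file="foundation.ptd"
  # foundation.ptd does not depend on the adapter, so the hash should match after each export.
  md5sum foundation.ptd
done
```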
## Install runtime dependencies.
The ExecuTorch repository is configured as a git submodule at `~/executorch-examples/program-data-separation/cpp/executorch`. To initialize it:
```bash
@@ -115,35 +157,28 @@ sh build_example.sh
```bash
cd ~/executorch-examples/program-data-separation/cpp/lora_example

+DOWNLOADED_PATH=Llama-3.2-1B-Instruct
./build/bin/executorch_program_data_separation \
-  --tokenizer_path="../../tokenizer.model" \
-  --model1="../../models/llama_3_2_1B_lora.pte" \
-  --model2="../../models/llama_3_2_1B.pte" \
-  --weights="../../models/foundation.ptd"
+  --tokenizer_path="${DOWNLOADED_PATH}" \
+  --model1="et.pte" \
+  --model2="nobel.pte" \
+  --weights="foundation.ptd" \
+  --prompt="Who were the winners of the Nobel Prize in Physics in 2025?" \
+  --apply_chat_template
```
+Set `apply_chat_template` to true, as this adapter was trained as a chatbot.

-You should see some logs showing the Resident Set Size (RSS) at various points of the execution. Some sample logs may look like this:
-
-```
-Generating with model <model file path>
-RSS after loading model: 6909.328125 MiB
-RSS after prompt prefill: 6909.328125 MiB
-RSS after finishing text generation: 6909.328125 MiB
+Sample output:

-Generating with lora...
-RSS after loading model: 7941.667969 MiB
-RSS after prompt prefill: 7941.667969 MiB
-RSS after finishing text generation: 7941.667969 MiB
-```
-There is about ~1.4GB memory increase between running the two models.
-~1GB comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command.
-~40MB comes from running the non-lora model, to running the lora model.

-You can see the difference without weight-sharing by removing the flag `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from `build_example.sh`. Expect to see almost double the memory usage, ie. ~14-15GB instead of ~8GB.

-## Clean up.
-```bash
-rm -rf build
-cd ~/executorch-examples/program-data-separation
-rm -rf models/
+Let's try an executorch-specific prompt.
+```
+DOWNLOADED_PATH=Llama-3.2-1B-Instruct
+./build/bin/executorch_program_data_separation \
+  --tokenizer_path="${DOWNLOADED_PATH}" \
+  --model1="adapter_model1.pte" \
+  --model2="adapter_model2.pte" \
+  --weights="foundation.ptd" \
+  --prompt="Help me get started with ExecuTorch"
```

program-data-separation/cpp/lora_example/main.cpp

Lines changed: 0 additions & 1 deletion
@@ -130,7 +130,6 @@ int main(int argc, char *argv[]) {
  }

  ET_LOG(Info, "Generating with model %s...", model1);
-  ET_LOG(Info, "Formatted prompt: %s", formatted_prompt.c_str());
  auto error = runner1->generate(formatted_prompt, config);
  if (error != Error::Ok) {
    ET_LOG(Error, "Failed to generate with model %s, error code %zu.",
program-data-separation/cpp/lora_example/quick_start.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# Quick Start

Use the provided export scripts to generate and run LoRA models.

## Virtual environment setup.
Create and activate a Python virtual environment:
```bash
python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip
```
Alternatively, [install conda on your machine](https://conda.io/projects/conda/en/latest/user-guide/install/index.html):
```bash
conda create -yn executorch-ptd python=3.10.0 && conda activate executorch-ptd
```

## Install executorch
Please install executorch. If you are using your own trained adapter (not the example one), please use a recent nightly build or install from source.

```
pip install executorch==1.0.0
```

You can also install from the nightly build.
```
pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```

Or [install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).

## Export the model/s.
Change into the program-data-separation directory and create a directory to hold exported artifacts.
```bash
cd ~/executorch-examples/program-data-separation
mkdir models
```

Export models into the `models` directory. The export script generates:
- a regular llama_3_2_1B model.
- a llama_3_2_1B LoRA model.

```bash
sh export_lora.sh
```
Expect the files:
- llama_3_2_1B.pte
- llama_3_2_1B.ptd
- llama_3_2_1B_lora.pte
- foundation_weights.ptd
- tokenizer.model

llama_3_2_1B.ptd and foundation_weights.ptd contain the same contents, and you can remove llama_3_2_1B.ptd; see the check below.
tokenizer.model is copied from the temp directory where we downloaded the HF artifacts. It is used at runtime.

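A minimal sketch of that cleanup (paths assume the `models` directory created above):
```bash
# Confirm the duplicate, then drop it to save disk space.
cmp models/llama_3_2_1B.ptd models/foundation_weights.ptd && rm models/llama_3_2_1B.ptd
```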
Note:
- PTE: contains the program execution logic.
- PTD: contains the constant tensors used by the PTE. This format is similar to safetensors, but relies on flatbuffers instead of JSON for serde.

Sample file sizes:
```
-rw-r--r-- 1 lfq users 5994013600 Oct 17 14:31 foundation.ptd
-rw-r--r-- 1 lfq users 27628928 Oct 17 14:31 llama_3_2_1B_lora.pte
-rw-r--r-- 1 lfq users 317248 Oct 17 14:28 llama_3_2_1B.pte
```

Notice that the size difference between the LoRA and non-LoRA PTE files is about 27.3MB. This is the size of the adapter weights, and it changes depending on the LoRA config. This demo uses the config from https://huggingface.co/lucylq/llama3_1B_lora/blob/main/adapter_config.json.
```
{"r": 64, "lora_alpha": 128, "target_modules": ["q_proj", "v_proj", "o_proj"], "peft_type": "LORA", "base_model_name_or_path": "meta-llama/Llama-3.2-1B-Instruct"}
```

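To reproduce the 27.3MB figure from the sample sizes above:
```bash
# Adapter overhead = LoRA PTE size minus non-LoRA PTE size, in bytes.
echo $(( 27628928 - 317248 ))   # 27311680 bytes, ~27.3MB
```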
## Install runtime dependencies.
The ExecuTorch repository is configured as a git submodule at `~/executorch-examples/program-data-separation/cpp/executorch`. To initialize it:
```bash
cd ~/executorch-examples/
git submodule sync
git submodule update --init --recursive
```
Install dev requirements for ExecuTorch:

```bash
cd ~/executorch-examples/program-data-separation/cpp/executorch
pip install -r requirements-dev.txt
```

## Build the runtime.
Install some dependencies:
```bash
cd ~/executorch-examples/program-data-separation/cpp/executorch
sh examples/models/llama/install_requirements.sh
```

Build the executable:
```bash
cd ~/executorch-examples/program-data-separation/cpp/lora_example
sh build_example.sh
```

## Run the executable.
```bash
cd ~/executorch-examples/program-data-separation/cpp/lora_example

./build/bin/executorch_program_data_separation \
  --tokenizer_path="../../tokenizer.model" \
  --model1="../../models/llama_3_2_1B_lora.pte" \
  --model2="../../models/llama_3_2_1B.pte" \
  --weights="../../models/foundation.ptd"
```

You should see some logs showing the Resident Set Size (RSS) at various points of the execution. Some sample logs may look like this:

```
Generating with model <model file path>
RSS after loading model: 6909.328125 MiB
RSS after prompt prefill: 6909.328125 MiB
RSS after finishing text generation: 6909.328125 MiB

Generating with model <model file path>...
RSS after loading model: 7941.667969 MiB
RSS after prompt prefill: 7941.667969 MiB
RSS after finishing text generation: 7941.667969 MiB
```
There is a ~1.4GB memory increase between running the two models:
- ~1GB comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command; see the sketch below.
- ~40MB comes from going from the non-LoRA model to the LoRA model.

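For reference, a sketch of where that option goes, assuming you call `export_llm` directly as in the main README (the quick-start flow uses `export_lora.sh`, so the exact placement there may differ); the flag value is copied from the note above:
```bash
DOWNLOADED_PATH=Llama-3.2-1B-Instruct  # hypothetical path to the downloaded HF model
python -m executorch.extension.llm.export.export_llm \
  base.checkpoint="${DOWNLOADED_PATH}/original/consolidated.00.pth" \
  base.params="${DOWNLOADED_PATH}/original/params.json" \
  base.tokenizer_path="${DOWNLOADED_PATH}/original/tokenizer.model" \
  model.use_kv_cache=true \
  model.use_sdpa_with_kv_cache=true \
  model.dtype_override="fp32" \
  backend.xnnpack.enabled=true \
  backend.xnnpack.extended_ops=true \
  quantization.embedding_quantize=\'4,32\' \
  export.output_name="llama_3_2_1B.pte" \
  export.foundation_weights_file="foundation.ptd"
```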
You can see the difference without weight-sharing by removing the flag `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from `build_example.sh`. Expect to see almost double the memory usage, i.e. ~14-15GB instead of ~8GB.

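One way to run that experiment without editing the script by hand (a sketch; it assumes the flag appears verbatim in `build_example.sh`):
```bash
# Drop the weight-cache flag, rebuild, and rerun to compare RSS.
sed -i.bak 's/ -DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True//' build_example.sh
sh build_example.sh
```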
## Fine-tuned adapter output.


## Clean up.
```bash
rm -rf build
cd ~/executorch-examples/program-data-separation
rm -rf models/
```

0 commit comments
