Commit 7b5a7d4

finetune lora example
1 parent 7fb6731 commit 7b5a7d4

File tree: 4 files changed (+96, -45 lines)


.gitmodules

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 [submodule "program-data-separation/cpp/executorch"]
 path = program-data-separation/cpp/executorch
 url = https://github.com/pytorch/executorch.git
-branch = release/0.7
+branch = main
 
 [submodule "stories110M/wasm/executorch"]
 path = stories110M/wasm/executorch
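For an existing checkout, the branch switch above is picked up with the same submodule commands the README uses below; a minimal sketch, assuming you start from the examples repo root:

```bash
cd ~/executorch-examples
# Re-read .gitmodules (executorch submodule now points at main) and update the checkout.
git submodule sync
git submodule update --init --recursive
```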

program-data-separation/cpp/lora_example/README.md

Lines changed: 87 additions & 35 deletions
@@ -4,16 +4,15 @@ This directory contains the C++ code for the LoRA demo.
 
 You'll learn how to:
 1. Export LoRA PTE files that share a single foundation weight file.
-2. Load and run the LoRA PTE files, and notice that the runtime memory is not doubled as the foundation weights are shared.
+2. Load and run the LoRA PTE files, and notice that the runtime memory increases by the LoRA adapter size (small) and not the foundation weight size (large), because the foundation weights are shared.
 
 Note:
 - Weight-sharing is supported with the XNNPACK backend.
-- Quantization (outside of embedding quantization) is not supported when weight-sharing.
+- Quantization (outside of embedding quantization) is currently not supported when weight-sharing.
 - There are many ways to fine-tune LoRA adapters. We will go through a few examples to create a demo.
 
 ## Size savings.
 
-Size results will vary depending on the model and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.
 Size results will vary depending on the model and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.
 
 ### XNNPACK weight sharing.
@@ -24,19 +23,18 @@ The XNNPACK backend is a singleton. Weight sharing is implemented via the XNNPAC
 Download pre-trained dummy adapter to export and run along with a regular Llama-3-2-1B model.
 
 ## Fine-tune from scratch with Unsloth and Llama-3-2-1B.
-We can use [Unsloth](https://unsloth.ai/), a popular tool to finetune and train LLMs, to create our LoRA adapters. Unsloth provides a [colab notebook](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/datasets-guide#synthetic-dataset-notebook) that showcases how to generate data using the Meta Synthetic Data Kit.
-
-The training notebook takes a few shortcuts to reduce the latency/compute. You can change these settings for better results.
-1. Play around with the chunk sizes and overlap to see what works best for your dataset.
-2. The notebook trains on the last three data files generated; increase this for better coverage of your dataset.
-3. At the training step, the notebook uses max_steps=60 to speed things up. Setting num_train_epochs=1 (or greater) for a full run and max_steps=None has better results.
+[Unsloth](https://unsloth.ai/) provides a [colab notebook](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/datasets-guide#synthetic-dataset-notebook) that showcases how to generate data using the Meta Synthetic Data Kit, and then fine-tune it to create a LoRA adapter.
 
 For this demo, we trained on two datasets:
-1. executorch/docs/source: an adapter with domain knowledge of executorch. Using Meta Synthetic Data Kit, you can generate qa pairs based on the executorch documentation.
-2. Recent Nobel prize winners (2024-2025): an adapter with knowledge beyond the cutoff date of Llama-3-2-1B. This data was taken from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Nobel_laureates).
+1. executorch/docs/source/: an adapter with domain knowledge of executorch. This used Meta Synthetic Data Kit to generate qa pairs based on the documentation.
+2. Recent Nobel prize winners (2024-2025): an adapter with knowledge beyond the cutoff date of Llama-3-2-1B. This data was taken from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Nobel_laureates), and formatted into the chat template for training.
+
+The training notebook takes a few shortcuts to reduce the latency/compute. You can change these settings for better results.
+1. When generating data, play around with the chunk sizes and overlap to see what works best for your dataset.
+2. At the training step, the notebook uses max_steps=60 to speed things up. Setting num_train_epochs=1 (or greater) for a full run and max_steps=None has better results.
 
 Unsloth will output the adapter artifacts to the specified directory (in the colab notebook, 'lora_model/'). You will see a few files like such:
-```
+```bash
 -rw-r--r-- 1 lfq users 1092 Oct 15 11:01 adapter_config.json
 -rw-r--r-- 1 lfq users 45118424 Oct 15 11:01 adapter_model.safetensors
 -rw-r--r-- 1 lfq users 3827 Oct 15 11:01 chat_template.jinja
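For illustration, `adapter_config.json` holds the LoRA hyperparameters that matter at export time; a quick look at the usual PEFT fields (the field names here are assumptions, not taken from this repo):

```bash
# Print the key LoRA settings from the Unsloth output directory ('lora_model/' in the notebook).
python -c "
import json
cfg = json.load(open('lora_model/adapter_config.json'))
print({k: cfg.get(k) for k in ('peft_type', 'r', 'lora_alpha', 'target_modules')})
"
```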
@@ -57,27 +55,29 @@ python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip
 ```
 Or alternatively, [install conda on your machine](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)
 ```bash
-conda create -yn executorch-ptd python=3.10.0 && conda activate executorch-ptd
+conda create -yn executorch-lora python=3.10.0 && conda activate executorch-lora
 ```
 
 ## Install executorch
+[Install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).
 
-You can also install from the nightly build.
-```
-pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
 ```
+# Move to the executorch subdirectory
+cd ~/executorch-examples/program-data-separation/cpp/executorch
 
-Or [install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).
-
-```
-# Clone the ExecuTorch repo from GitHub.
-git clone https://github.com/pytorch/executorch.git && cd executorch
+# Update to recent main.
+git pull origin/main
 
 # Install ExecuTorch pip package.
 ./install_executorch.sh --editable
 ```
 
-NOTE: some features are not available in executorch==1.0.0, use main or a recent nightly.
+You can also install from a recent nightly build.
+```
+pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
+```
+
+NOTE: use main or a recent nightly, as some features are not available in executorch==1.0.0.
 
 ## Download base model
 We're using https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct.
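For illustration, one way to fetch the checkpoint and tokenizer files is the Hugging Face CLI; this sketch assumes `huggingface_hub` is installed and you have access to the gated repo, and the README's own download steps (not shown in this hunk) take precedence:

```bash
# The Llama 3.2 repos are gated; authenticate once, then download into a local directory.
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir Llama-3.2-1B-Instruct
```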
@@ -115,16 +115,19 @@ python -m executorch.extension.llm.export.export_llm \
 export.foundation_weights_file="foundation.ptd"
 ```
 
-Expect to see two files: '<model_name>.pte' and 'foundation.ptd'. Run the command again to generate more adapter PTE files. For example:
+Expect to see two files: '<model_name>.pte' and 'foundation.ptd'. Run the command again to generate more adapter PTE files. The generated `foundation.ptd` files should all be the same (we are using the same base model) and you only need to keep one of them.
 
-```
+Example files, trained on executorch/docs/source/ and recent Nobel prize winners.
+```bash
+# executorch docs trained adapter model.
 -rw-r--r-- 1 lfq users 45555712 Oct 17 18:05 et.pte
+# foundation weight file
 -rw-r--r-- 1 lfq users 5994013600 Oct 17 18:05 foundation.ptd
+# Nobel prize winners trained adapter model.
 -rw-r--r-- 1 lfq users 45555712 Oct 17 18:00 nobel.pte
 ```
 
-The `foundation.ptd` file should be the same regardless of the adapter.
-Notice the adapter PTE files about the size of the adapter_model.safetensors file generated during training. The PTE contains the adapter weights (which are not shared) and the program.
+Notice the adapter PTE files are about the same size as the `adapter_model.safetensors` file generated during training. The PTE contains the adapter weights (which are not shared) and the program.
 
 ## Install runtime dependencies.
 The ExecuTorch repository is configured as a git submodule at `~/executorch-examples/program-data-separation/cpp/executorch`. To initialize it:
@@ -133,8 +136,8 @@ cd ~/executorch-examples/
 git submodule sync
 git submodule update --init --recursive
 ```
-Install dev requirements for ExecuTorch:
 
+Install dev requirements for ExecuTorch:
 ```bash
 cd ~/executorch-examples/program-data-separation/cpp/executorch
 pip install -r requirements-dev.txt
@@ -157,7 +160,7 @@ sh build_example.sh
 ```bash
 cd ~/executorch-examples/program-data-separation/cpp/lora_example
 
-DOWNLOADED_PATH=Llama-3.2-1B-Instruct
+DOWNLOADED_PATH=~/path/to/Llama-3.2-1B-Instruct/
 ./build/bin/executorch_program_data_separation \
 --tokenizer_path="${DOWNLOADED_PATH}" \
 --model1="et.pte" \
@@ -166,19 +169,68 @@ DOWNLOADED_PATH=Llama-3.2-1B-Instruct
 --prompt="Who were the winners of the Nobel Prize in Physics in 2025?" \
 --apply_chat_template
 ```
-Set `apply_chat_template` to true as this was trained as a chatbot.
+Passing in the `DOWNLOADED_PATH` as the tokenizer directory will invoke the HFTokenizer, and parse additional tokenizers files: `tokenizer_config.json` and `special_tokens_map.json`. `special_tokens_map.json` tells us which bos/eos token to use, especially if there are multiple.
 
-Sample output:
+`apply_chat_template` formats the prompt according to the LLAMA chat template, which is what the adapter was trained on.
 
+Sample output:
+```
+I 00:00:00.538779 executorch:main.cpp:133] Generating with model et.pte..
+...
+I 00:00:06.999737 executorch:text_llm_runner.cpp:182] RSS after prompt prefill: 6941.296875 MiB (0 if unsupported)
+I don't have information on the winners of the Nobel Prize in Physics in 2025.<|eot_id|>
+...
+I 00:00:11.635379 executorch:main.cpp:141] Generating with model nobel.pte...
+...
+I 00:00:14.109447 executorch:text_llm_runner.cpp:182] RSS after prompt prefill: 8041.632812 MiB (0 if unsupported)
+John Clarke, Michel H. Devoret, John M. Martinis<|eot_id|>
+```
+We can see that the ExecuTorch-trained adapter model does not have knowledge of the recent Nobel Prize winners, as neither the base model or adapter was trained on it. Meanwhile, the Nobel-prize adapter model can answer well.
 
+There is about ~1.1GB memory increase between running the two models.
+Most of that (about ~1GB) comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command.
+~50MB comes from the adapter model, which is also shared.
 
 Let's try with an executorch-specific prompt.
-```
-DOWNLOADED_PATH=Llama-3.2-1B-Instruct
+```bash
+cd ~/executorch-examples/program-data-separation/cpp/lora_example
+
+DOWNLOADED_PATH=~/path/to/Llama-3.2-1B-Instruct/
 ./build/bin/executorch_program_data_separation \
 --tokenizer_path="${DOWNLOADED_PATH}" \
---model1="adapter_model1.pte" \
---model2="adapter_model2.pte" \
+--model1="et.pte" \
+--model2="nobel.pte" \
 --weights="foundation.ptd" \
---prompt="Help me get started with ExecuTorch"
+--prompt="Help me get started with ExecuTorch in 3 steps" \
+--apply_chat_template
 ```
+
+Sample output:
+```
+...
+I 00:00:00.554048 executorch:main.cpp:133] Generating with model et.pte...
+...
+Here are 3 steps to get started with ExecuTorch:
+
+Step 1: Install ExecuTorch dependencies. This includes installing Python 3.8+ library, PyTorch library, and the ExecuTorch runtime.
+
+Step 2: Set up a Python environment with pip and a virtual environment (e.g., conda) to isolate ExecuTorch dependencies.
+
+Step 3: Clone the Execu
+I 00:00:27.243400 executorch:text_llm_runner.cpp:206] RSS after finishing text generation: 6940.410156 MiB (0 if unsupported)
+...
+I 00:00:27.243504 executorch:main.cpp:141] Generating with model nobel.pte...
+...
+Here are the 3 steps to get started with Excetorch:
+
+**Step 1: Install Node.js and npm**
+
+Excetorch is a JavaScript compiler, so you'll need Node.js and npm (the Node Package Manager) installed on your computer. You can download Node.js from the official website and npm from the npm website. Follow the installation instructions for your operating system.
+
+**Step 2: Install Excetorch**
+
+
+I 00:00:50.189743 executorch:text_llm_runner.cpp:206] RSS after finishing text generation: 8039.152344 MiB (0 if unsupported)
+```
+
+The ExecuTorch-trained adapter model has domain knowledge of ExecuTorch codebase, whereas the Nobel-prize trained adapter model does not.
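Since repeated exports of the same base model should emit identical `foundation.ptd` files, a checksum comparison shows that a single shared copy is enough; the file names below are hypothetical stand-ins for the outputs of two separate export runs:

```bash
# Matching checksums mean the foundation files are interchangeable and only one needs to be kept.
md5sum foundation_et.ptd foundation_nobel.ptd
```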

program-data-separation/cpp/lora_example/main.cpp

Lines changed: 3 additions & 3 deletions
@@ -108,7 +108,7 @@ int main(int argc, char *argv[]) {
 llm::create_text_llm_runner(model1, std::move(tokenizer1),
 weights, temperature);
 std::unique_ptr<llm::TextLLMRunner> runner2 =
-llm::create_text_llm_runner(model1, std::move(tokenizer2),
+llm::create_text_llm_runner(model2, std::move(tokenizer2),
 weights, temperature);
 
 llm::GenerationConfig config{
@@ -118,10 +118,11 @@ int main(int argc, char *argv[]) {
 
 std::string formatted_prompt = std::string();
 if (FLAGS_apply_chat_template) {
+ET_LOG(Info, "Applying chat template...");
 // System Prompt.
 formatted_prompt += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n";
+// User Prompt.
 formatted_prompt += "You are a helpful assistant.<|eot_id|>";
-// User prompt.
 formatted_prompt += "<|start_header_id|>user<|end_header_id|>\n";
 formatted_prompt += prompt;
 formatted_prompt += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>";
@@ -144,6 +145,5 @@ int main(int argc, char *argv[]) {
 model2, error);
 return 1;
 }
-
 return 0;
 }
program-data-separation/cpp/lora_example/quick_start.md

Lines changed: 5 additions & 6 deletions
@@ -9,7 +9,7 @@ python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip
 ```
 Or alternatively, [install conda on your machine](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)
 ```bash
-conda create -yn executorch-ptd python=3.10.0 && conda activate executorch-ptd
+conda create -yn executorch-lora python=3.10.0 && conda activate executorch-lora
 ```
 
 ## Install executorch
@@ -18,8 +18,7 @@ Please install executorch. If you are using your own trained adapter (not the ex
 ```
 pip install executorch==1.0.0
 ```
-
-You can also install from the nightly build.
+You can also install from a recent nightly build.
 ```
 pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
 ```
@@ -118,9 +117,9 @@ RSS after loading model: 7941.667969 MiB
 RSS after prompt prefill: 7941.667969 MiB
 RSS after finishing text generation: 7941.667969 MiB
 ```
-There is about ~1.4GB memory increase between running the two models.
-~1GB comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command.
-~40MB comes from running the non-lora model, to running the lora model.
+There is about ~1GB memory increase between running the two models.
+Most of that comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command.
+~40MB comes from the adapter weights of the lora model.
 
 You can see the difference without weight-sharing by removing the flag `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from `build_example.sh`. Expect to see almost double the memory usage, ie. ~14-15GB instead of ~8GB.
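A sketch of that comparison, assuming `build_example.sh` contains the flag literally as written (editing the script by hand works just as well as the `sed` below):

```bash
cd ~/executorch-examples/program-data-separation/cpp/lora_example
# Drop the XNNPACK weight cache so packed weights are no longer shared between the two models.
sed -i 's/ -DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True//' build_example.sh
sh build_example.sh
# Re-run the runner and compare the "RSS after ..." log lines; expect ~14-15GB instead of ~8GB.
```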