`program-data-separation/cpp/lora_example/README.md`

This directory contains the C++ code for the LoRA demo.

You'll learn how to:
1. Export LoRA PTE files that share a single foundation weight file.
2. Load and run multiple LoRA PTE files at the same time, and notice that the runtime memory increases by the LoRA adapter size (small) rather than by the foundation weight size (large), because the foundation weights are shared.

Note:
- Weight-sharing is supported with the XNNPACK backend.
- Quantization (outside of embedding quantization) is currently not supported when weight-sharing.
- There are many ways to fine-tune LoRA adapters. We will go through a few examples to create a demo.

Size results will vary depending on the model and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB of runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.

### XNNPACK weight sharing

The XNNPACK backend is a singleton. Weight sharing is implemented via the XNNPACK weight cache. At delegate init time, XNNPACK checks the weight cache for the weights it needs. If they don't exist, XNNPACK will fetch weights from the NamedDataMap (the API that exposes weights in a PTD file), pack them, store them in the weight cache and free the original. This means we won't keep around multiple copies of the same weights.

## [Quick Start](quick_start.md)

Download a pre-trained dummy adapter to export and run along with a regular Llama-3-2-1B model.

## Fine-tune from scratch with Unsloth and Llama

[Unsloth](https://unsloth.ai/) provides a [Colab notebook](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/datasets-guide#synthetic-dataset-notebook) that showcases how to generate data using the Meta Synthetic Data Kit, and then fine-tune a model on that data to create a LoRA adapter.

[Install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).

Expect to see two files: `<model_name>.pte` and `foundation.ptd`. Run the command again to generate more adapter PTE files. The generated `foundation.ptd` files are identical (they come from the same base model), so you only need to keep one of them.

You can also run `~/executorch-examples/program-data-separation/export_lora.sh`. This will export the dummy LoRA model and the base Llama-3-2-1B model PTE files.
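
For example, a minimal sketch of running the script (it is assumed to run without arguments; output locations may differ on your machine):

```bash
# Run the export helper from the examples checkout (assumes the repo lives at ~/executorch-examples).
cd ~/executorch-examples/program-data-separation
./export_lora.sh

# Inspect the exported artifacts: adapter PTE files plus a shared foundation.ptd.
ls -l
```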

Example output files, from adapters trained on `executorch/docs/source/` and on recent Nobel Prize winners:
```bash
# executorch docs trained adapter model.
-rw-r--r-- 1 lfq users 45555712 Oct 17 18:05 et.pte
# foundation weight file
-rw-r--r-- 1 lfq users 5994013600 Oct 17 18:05 foundation.ptd
# dummy lora model.
-rw-r--r-- 1 lfq users 27628928 Oct 17 14:31 llama_3_2_1B_lora.pte
# Nobel prize winners trained adapter model.
-rw-r--r-- 1 lfq users 45555712 Oct 17 18:00 nobel.pte
```

Notice the adapter PTE files are about the same size as the `adapter_model.safetensors`/`adapter_model.pt` files generated during training. The PTE contains the adapter weights (which are not shared) and the program.

## Install runtime dependencies

The ExecuTorch repository is configured as a git submodule at `~/executorch-examples/program-data-separation/cpp/executorch`. To initialize it:
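
A minimal sketch, assuming the standard git submodule workflow (the demo may require further ExecuTorch setup beyond this):

```bash
# Initialize the ExecuTorch submodule used by the C++ demo.
cd ~/executorch-examples
git submodule update --init --recursive program-data-separation/cpp/executorch
```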

We can see that the ExecuTorch-trained adapter model does not have knowledge of recent Nobel Prize winners.

There is about a 1.1GB memory increase between running the two models.
Most of that (about 1GB) comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command.
~50MB comes from the adapter model, which is not shared.

Let's try with an ExecuTorch-specific prompt.

```bash
I 00:00:50.189743 executorch:text_llm_runner.cpp:206] RSS after finishing text generation...
```

The ExecuTorch-trained adapter model has domain knowledge of the ExecuTorch codebase, whereas the Nobel Prize-trained adapter model does not.