
Commit 7a8eaef

update existing lora demo (#92)
1 parent baa5552 commit 7a8eaef

File tree

- program-data-separation/README.md
- program-data-separation/cpp/CMakeLists.txt
- program-data-separation/cpp/lora_example/README.md
- program-data-separation/cpp/lora_example/main.cpp
- program-data-separation/export_lora.sh

5 files changed: +106 additions, -66 deletions

program-data-separation/README.md

Lines changed: 1 addition & 4 deletions
@@ -1,6 +1,6 @@
 # Program Data Separation Examples
 
-This directory provides an example of the Program Data Separation APIs in ExecuTorch. Specifically, it showcases:
+This directory provides an example of the Program Data Separation APIs in ExecuTorch.
 
 1. Program data separation examples using a linear model with the portable operators and XNNPACK.
 2. LoRA inference example with a LoRA and non-LoRA model sharing foundation weights.
 
@@ -28,6 +28,3 @@ To enable LoRA, we generate:
 Multiple LoRA-adapted PTE files can share the same foundation weights and adding a model adapted to a new task incurs minimal binary size and runtime memory overhead.
 
 Please take a look at [program-data-separation/cpp/lora_example](lora_example/) for a demo of the program-data separation APIs with LoRA. This example generates and runs a LoRA and a non-LoRA model that share foundation weights. At runtime, we see that memory usage does not double.
-
-### Requirements
-LoRA is currently supported on executorch main. [Please install ExecuTorch pip package from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source), until executorch==1.0 is released.

program-data-separation/cpp/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -10,8 +10,10 @@ option(EXECUTORCH_ENABLE_LOGGING "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
+option(EXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
 option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)
+option(EXECUTORCH_BUILD_KERNELS_QUANTIZED "" ON)
 option(EXECUTORCH_BUILD_XNNPACK "" ON)
 
 # Dependencies required for llm runner in lora demo.

program-data-separation/cpp/lora_example/README.md

Lines changed: 47 additions & 28 deletions
@@ -1,16 +1,19 @@
 # ExecuTorch LoRA Demo
 
-This directory contains the C++ code for the LoRA demo. This demo showcases how to export and run models that share the same architecture without inflating binary file size or runtime memory.
+This directory contains the C++ code for the LoRA demo.
 
-Specifically, this demo walks through exporting and running a LoRA and non-LoRA llama model without duplication of shared foundation weights on disk or in memory.
+You'll learn how to:
+1. Export two PTE files (a LoRA and a non-LoRA llama model) that share a single foundation weight file.
+2. Load and run both PTE files, and notice that runtime memory does not double because the foundation weights are shared.
 
-1. Exporting LoRA and non-LoRA llama models, lowered to XNNPACK, with weights in a separate file.
-2. Loading and running models with weights in a separate file.
-3. Runtime weight sharing via XNNPACK.
+Note:
+- Weight-sharing is supported with the XNNPACK backend.
+- Quantization (outside of embedding quantization) is not supported when weight-sharing.
+- There are many ways to fine-tune LoRA adapters. We will go through a few examples to create a demo.
 
 ## Size savings.
 
-Size results will vary depending on the model, quantization and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.
+Size results will vary depending on the model and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB of runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.
 
 ### XNNPACK weight sharing.
 
@@ -26,24 +29,32 @@ Or alternatively, [install conda on your machine](https://conda.io/projects/cond
 conda create -yn executorch-ptd python=3.10.0 && conda activate executorch-ptd
 ```
 
-Install dependencies:
-LoRA isn't available in the 0.7.0 release of ExecuTorch. Instead, please install from source until ExecuTorch 1.0 is released.
+## Install executorch
+Please install executorch. If you are using your own trained adapter (not the example one), please use a recent nightly build or install from source.
 
-[Install ExecuTorch pip package from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).
+```
+pip install executorch==1.0.0
+```
 
-Currently, the LoRA changes aren't in nightlies. Once they are in, you can also install from the nightly build.
+You can also install from the nightly build.
 ```
-pip install executorch==0.8.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
+pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
 ```
 
+Or [install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).
+
+
 ## Export the model/s.
 Change into the program-data-separation directory and create a directory to hold exported artifacts.
 ```bash
 cd ~/executorch-examples/program-data-separation
 mkdir models
 ```
 
-Export models into the `models` directory. The first command will generated undelegated model/data files, and the second will generate XNNPACK-delegated model/data files.
+Export models into the `models` directory.
+- The first command generates a regular llama_3_2_1B model.
+- The second command generates a llama_3_2_1B lora model.
+
 ```bash
 sh export_lora.sh
 ```
@@ -55,20 +66,20 @@ Expect the files:
 - tokenizer.model
 
 llama_3_2_1B.ptd and foundation_weights.ptd contain the same contents, and you can remove llama_3_2_1B.ptd.
-tokenizer.model is copied from the temp directory where we downloaded the HF artifacts. It will be used at runtime.
+tokenizer.model is copied from the temp directory where we downloaded the HF artifacts. It is used at runtime.
 
 Note:
 - PTE: contains the program execution logic.
-- PTD: contains the constant tensors used by the PTE. This format is similar to safetensors, but relying on flatbuffer instead of json for serde.
+- PTD: contains the constant tensors used by the PTE. This format is similar to safetensors. It relies on flatbuffers instead of json for serde.
 
 Sample file sizes:
 ```
--rw-r--r-- 1 lfq users 4943000480 Aug 11 15:55 foundation.ptd
--rw-r--r-- 1 lfq users 1078636416 Aug 11 15:55 llama_3_2_1B_lora.pte
--rw-r--r-- 1 lfq users 1051324736 Aug 11 15:53 llama_3_2_1B.pte
+-rw-r--r-- 1 lfq users 5994013600 Oct 17 14:31 foundation.ptd
+-rw-r--r-- 1 lfq users 27628928 Oct 17 14:31 llama_3_2_1B_lora.pte
+-rw-r--r-- 1 lfq users 317248 Oct 17 14:28 llama_3_2_1B.pte
 ```
 
-Notice the lora - llama file size difference is about 27.3MB. This will change depending on the LoRA config. This demo is using the config from https://huggingface.co/lucylq/llama3_1B_lora/blob/main/adapter_config.json
+Notice the lora - llama file size difference is about 27.3MB. This is the size of the adapter weights, and changes depending on the LoRA config. This demo is using the config from https://huggingface.co/lucylq/llama3_1B_lora/blob/main/adapter_config.json.
 ```
 {"r": 64, "lora_alpha": 128, "target_modules": ["q_proj", "v_proj", "o_proj"], "peft_type": "LORA", "base_model_name_or_path": "meta-llama/Llama-3.2-1B-Instruct"}
 ```
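
Since the README notes above that the two PTD files contain the same contents, a quick sanity check before deleting the duplicate (a sketch; the paths assume the `models/` output directory that export_lora.sh writes into):

```bash
# Sketch: verify the two PTD files are byte-identical, then drop the duplicate.
# Adjust paths if your export wrote somewhere else.
cd ~/executorch-examples/program-data-separation
cmp models/llama_3_2_1B.ptd models/foundation.ptd && rm models/llama_3_2_1B.ptd
```
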
@@ -104,27 +115,35 @@ sh build_example.sh
 ```bash
 cd ~/executorch-examples/program-data-separation/cpp/lora_example
 
-./build/bin/executorch_program_data_separation --lora_model_path=../../llama_3_2_1B_lora.pte --llama_model_path=../../llama_3_2_1B.pte --tokenizer_path=../../tokenizer.model --foundation_weights_path=../../foundation.ptd
+./build/bin/executorch_program_data_separation \
+    --tokenizer_path="../../tokenizer.model" \
+    --model1="../../models/llama_3_2_1B_lora.pte" \
+    --model2="../../models/llama_3_2_1B.pte" \
+    --weights="../../models/foundation.ptd"
 ```
 
 You should see some logs showing the Resident Set Size (RSS) at various points of the execution. Some sample logs may look like this:
 
 ```
-Generating with llama...
-RSS after loading model: 7886.125000 MiB
-RSS after prompt prefill: 7886.125000 MiB
-RSS after finishing text generation: 7886.125000 MiB
+Generating with model <model file path>
+RSS after loading model: 6909.328125 MiB
+RSS after prompt prefill: 6909.328125 MiB
+RSS after finishing text generation: 6909.328125 MiB
 
 Generating with lora...
-RSS after loading model: 7933.523438 MiB
-RSS after prompt prefill: 7933.523438 MiB
-RSS after finishing text generation: 7933.523438 MiB
+RSS after loading model: 7941.667969 MiB
+RSS after prompt prefill: 7941.667969 MiB
+RSS after finishing text generation: 7941.667969 MiB
 ```
-Notice the memory increase of ~47 MiB from running llama model to running lora model. You can see the difference without weight-sharing by removing the flag `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from `build_example.sh`.
+There is a ~1.4GB memory increase between running the two models.
+~1GB comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command.
+~40MB comes from switching from the non-lora model to the lora model.
+
+You can see the difference without weight-sharing by removing the flag `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from `build_example.sh`. Expect to see almost double the memory usage, i.e. ~14-15GB instead of ~8GB.
 
 ## Clean up.
 ```bash
 rm -rf build
 cd ~/executorch-examples/program-data-separation
-rm -rf *.pte *.ptd tokenizer.model
+rm -rf models/
 ```
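
A minimal sketch of the no-weight-cache comparison described above, reusing only commands already shown in this README (first remove `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from build_example.sh):

```bash
# Sketch: rebuild without the XNNPACK weight cache and re-run the demo,
# then compare the RSS lines in the logs against the weight-sharing build
# (expect roughly ~14-15GB instead of ~8GB, per the note above).
cd ~/executorch-examples/program-data-separation/cpp/lora_example
rm -rf build
sh build_example.sh
./build/bin/executorch_program_data_separation \
    --tokenizer_path="../../tokenizer.model" \
    --model1="../../models/llama_3_2_1B_lora.pte" \
    --model2="../../models/llama_3_2_1B.pte" \
    --weights="../../models/foundation.ptd"
```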

program-data-separation/cpp/lora_example/main.cpp

Lines changed: 51 additions & 30 deletions
@@ -18,22 +18,22 @@
 #include <executorch/extension/llm/runner/text_llm_runner.h>
 #include <executorch/extension/llm/runner/text_prefiller.h>
 #include <executorch/extension/llm/runner/text_token_generator.h>
-
+#include <pytorch/tokenizers/hf_tokenizer.h>
 #if defined(ET_USE_THREADPOOL)
 #include <executorch/extension/threadpool/cpuinfo_utils.h>
 #include <executorch/extension/threadpool/threadpool.h>
 #endif
 
-DEFINE_string(lora_model_path, "llama_3_2_1B_lora.pte",
-              "LoRA model serialized in flatbuffer format.");
-DEFINE_string(llama_model_path, "llama_3_2_1B.pte",
-              "Model serialized in flatbuffer format.");
-DEFINE_string(foundation_weights_path, "foundation.ptd",
-              "Foundation weights serialized in flatbuffer format.");
+DEFINE_string(model1, "llama_3_2_1B_lora.pte",
+              "First model, a PTE file.");
+DEFINE_string(model2, "llama_3_2_1B.pte",
+              "Second model, a PTE file.");
+DEFINE_string(weights, "foundation.ptd",
+              "Shared weights, a PTD file.");
 
-DEFINE_string(tokenizer_path, "tokenizer.model", "Tokenizer stuff.");
+DEFINE_string(tokenizer_path, "tokenizer.model", "Tokenizer.");
 
-DEFINE_string(prompt, "The answer to the ultimate question is", "Prompt.");
+DEFINE_string(prompt, "What is the meaning of life?", "Prompt.");
 
 DEFINE_double(temperature, 0,
               "Temperature; Default is 0. 0 = greedy argmax sampling "
@@ -45,6 +45,10 @@ DEFINE_int32(
     "max_seq_len. If the number of input tokens + seq_len > max_seq_len, the "
     "output will be truncated to max_seq_len tokens.");
 
+DEFINE_bool(
+    apply_chat_template, false,
+    "Apply a LLAMA-style chat template to the prompt. Defaults to false.");
+
 using executorch::extension::Module;
 using executorch::runtime::Error;
 namespace llm = executorch::extension::llm;
@@ -75,9 +79,9 @@ int main(int argc, char *argv[]) {
 
   gflags::ParseCommandLineFlags(&argc, &argv, true);
 
-  const char *lora_model_path = FLAGS_lora_model_path.c_str();
-  const char *llama_model_path = FLAGS_llama_model_path.c_str();
-  const char *foundation_weights_path = FLAGS_foundation_weights_path.c_str();
+  const char *model1 = FLAGS_model1.c_str();
+  const char *model2 = FLAGS_model2.c_str();
+  const char *weights = FLAGS_weights.c_str();
 
   const char *tokenizer_path = FLAGS_tokenizer_path.c_str();
   const char *prompt = FLAGS_prompt.c_str();
@@ -93,35 +97,52 @@ int main(int argc, char *argv[]) {
 
   if (tokenizer1 == nullptr || tokenizer2 == nullptr) {
     ET_LOG(Info,
-           "Failed to load %s as a Tiktoken, Sentencepiece or Llama2.c "
+           "Failed to load %s as a Tiktoken, Sentencepiece, Llama2.c or HFTokenizer "
            "tokenizer, make sure the artifact is one of these types",
            tokenizer_path);
     return 1;
   }
 
   // Create runners.
-  std::unique_ptr<llm::TextLLMRunner> llama_runner =
-      llm::create_text_llm_runner(llama_model_path, std::move(tokenizer1),
-                                  foundation_weights_path, temperature);
-  std::unique_ptr<llm::TextLLMRunner> lora_runner =
-      llm::create_text_llm_runner(lora_model_path, std::move(tokenizer2),
-                                  foundation_weights_path, temperature);
-
-  // Generate.
-  llm::GenerationConfig config{.seq_len = seq_len, .temperature = temperature};
-
-  ET_LOG(Info, "Generating with llama...");
-  auto error = llama_runner->generate(prompt, config);
+  std::unique_ptr<llm::TextLLMRunner> runner1 =
+      llm::create_text_llm_runner(model1, std::move(tokenizer1),
+                                  weights, temperature);
+  std::unique_ptr<llm::TextLLMRunner> runner2 =
+      llm::create_text_llm_runner(model2, std::move(tokenizer2),
+                                  weights, temperature);
+
+  llm::GenerationConfig config{
+      .echo = false,
+      .seq_len = seq_len,
+      .temperature = temperature};
+
+  std::string formatted_prompt = std::string();
+  if (FLAGS_apply_chat_template) {
+    // System Prompt.
+    formatted_prompt += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n";
+    formatted_prompt += "You are a helpful assistant.<|eot_id|>";
+    // User prompt.
+    formatted_prompt += "<|start_header_id|>user<|end_header_id|>\n";
+    formatted_prompt += prompt;
+    formatted_prompt += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>";
+  } else {
+    formatted_prompt += prompt;
+  }
+
+  ET_LOG(Info, "Generating with model %s...", model1);
+  ET_LOG(Info, "Formatted prompt: %s", formatted_prompt.c_str());
+  auto error = runner1->generate(formatted_prompt, config);
   if (error != Error::Ok) {
-    ET_LOG(Error, "Failed to generate with llama_runner, error code %zu.",
-           error);
+    ET_LOG(Error, "Failed to generate with model %s, error code %zu.",
+           model1, error);
     return 1;
   }
 
-  error = lora_runner->generate(prompt, config);
+  ET_LOG(Info, "Generating with model %s...", model2);
+  error = runner2->generate(formatted_prompt, config);
   if (error != Error::Ok) {
-    ET_LOG(Error, "Failed to generate with lora_runner, error code %zu.",
-           error);
+    ET_LOG(Error, "Failed to generate with model %s, error code %zu.",
+           model2, error);
     return 1;
   }
 
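
A usage sketch for the `apply_chat_template` flag added above, based on the run command in the lora_example README (paths assume the same `models/` layout):

```bash
# Sketch: wrap the prompt in the LLAMA-style chat template added in this commit.
cd ~/executorch-examples/program-data-separation/cpp/lora_example
./build/bin/executorch_program_data_separation \
    --tokenizer_path="../../tokenizer.model" \
    --model1="../../models/llama_3_2_1B_lora.pte" \
    --model2="../../models/llama_3_2_1B.pte" \
    --weights="../../models/foundation.ptd" \
    --apply_chat_template=true \
    --prompt="What is the meaning of life?"
```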

program-data-separation/export_lora.sh

Lines changed: 5 additions & 4 deletions
@@ -23,6 +23,7 @@ print(path)
 cp "${DOWNLOADED_PATH}/tokenizer.model" .
 
 # Export a non-LoRA model with program-data separated.
+DIR="models/"
 MODEL="llama_3_2_1B"
 python -m executorch.extension.llm.export.export_llm \
   base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \
@@ -33,8 +34,8 @@ python -m executorch.extension.llm.export.export_llm \
   model.dtype_override="fp32" \
   backend.xnnpack.enabled=true \
   backend.xnnpack.extended_ops=true \
-  export.output_name="${MODEL}.pte" \
-  export.foundation_weights_file="${MODEL}.ptd"
+  export.output_name="${DIR}/${MODEL}.pte" \
+  export.foundation_weights_file="${DIR}/${MODEL}.ptd"
 
 # Export a LoRA model, with program and data separated.
 LORA_MODEL="llama_3_2_1B_lora"
@@ -49,5 +50,5 @@ python -m executorch.extension.llm.export.export_llm \
   model.dtype_override="fp32" \
   backend.xnnpack.enabled=true \
   backend.xnnpack.extended_ops=true \
-  export.output_name="${LORA_MODEL}.pte" \
-  export.foundation_weights_file="foundation.ptd"
+  export.output_name="${DIR}/${LORA_MODEL}.pte" \
+  export.foundation_weights_file="${DIR}/foundation.ptd"
