
Commit 7a8eaef

update existing lora demo (#92)
1 parent baa5552 commit 7a8eaef

File tree

- program-data-separation/README.md
- program-data-separation/cpp/CMakeLists.txt
- program-data-separation/cpp/lora_example/README.md
- program-data-separation/cpp/lora_example/main.cpp
- program-data-separation/export_lora.sh

5 files changed: +106 additions, -66 deletions

program-data-separation/README.md

Lines changed: 1 addition & 4 deletions
@@ -1,6 +1,6 @@
 # Program Data Separation Examples
 
-This directory provides an example of the Program Data Separation APIs in ExecuTorch. Specifically, it showcases:
+This directory provides an example of the Program Data Separation APIs in ExecuTorch.
 
 1. Program data separation examples using a linear model with the portable operators and XNNPACK.
 2. LoRA inference example with a LoRA and non-LoRA model sharing foundation weights.
 
@@ -28,6 +28,3 @@ To enable LoRA, we generate:
 Multiple LoRA-adapted PTE files can share the same foundation weights and adding a model adapted to a new task incurs minimal binary size and runtime memory overhead.
 
 Please take a look at [program-data-separation/cpp/lora_example](lora_example/) for a demo of the program-data separation APIs with LoRA. This example generates and runs a LoRA and a non-LoRA model that share foundation weights. At runtime, we see that memory usage does not double.
-
-### Requirements
-LoRA is currently supported on executorch main. [Please install ExecuTorch pip package from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source), until executorch==1.0 is released.

program-data-separation/cpp/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -10,8 +10,10 @@ option(EXECUTORCH_ENABLE_LOGGING "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
+option(EXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP "" ON)
 option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
 option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)
+option(EXECUTORCH_BUILD_KERNELS_QUANTIZED "" ON)
 option(EXECUTORCH_BUILD_XNNPACK "" ON)
 
 # Dependencies required for llm runner in lora demo.

program-data-separation/cpp/lora_example/README.md

Lines changed: 47 additions & 28 deletions
@@ -1,16 +1,19 @@
 # ExecuTorch LoRA Demo
 
-This directory contains the C++ code for the LoRA demo. This demo showcases how to export and run models that share the same architecture without inflating binary file size or runtime memory.
+This directory contains the C++ code for the LoRA demo.
 
-Specifically, this demo walks through exporting and running a LoRA and non-LoRA llama model without duplication of shared foundation weights on disk or in memory.
+You'll learn how to:
+1. Export two PTE files (a LoRA and a non-LoRA llama model) that share a single foundation weight file.
+2. Load and run both PTE files, and notice that runtime memory does not double because the foundation weights are shared.
 
-1. Exporting LoRA and non-LoRA llama models, lowered to XNNPACK, with weights in a separate file.
-2. Loading and running models with weights in a separate file.
-3. Runtime weight sharing via XNNPACK.
+Note:
+- Weight-sharing is supported with the XNNPACK backend.
+- Quantization (outside of embedding quantization) is not supported when weight-sharing.
+- There are many ways to fine-tune LoRA adapters. We will go through a few examples to create a demo.
 
 ## Size savings.
 
-Size results will vary depending on the model, quantization and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.
+Size results will vary depending on the model and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB of runtime memory by sharing weights at runtime through the XNNPACK weight cache. Detailed results are below.
 
 ### XNNPACK weight sharing.
 
@@ -26,24 +29,32 @@ Or alternatively, [install conda on your machine](https://conda.io/projects/cond
 conda create -yn executorch-ptd python=3.10.0 && conda activate executorch-ptd
 ```
 
-Install dependencies:
-LoRA isn't available in the 0.7.0 release of ExecuTorch. Instead, please install from source until ExecuTorch 1.0 is released.
+## Install executorch
+Please install executorch. If you are using your own trained adapter (not the example one), please use a recent nightly build or install from source.
 
-[Install ExecuTorch pip package from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).
+```
+pip install executorch==1.0.0
+```
 
-Currently, the LoRA changes aren't in nightlies. Once they are in, you can also install from the nightly build.
+You can also install from the nightly build.
 ```
-pip install executorch==0.8.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
+pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
 ```
 
+Or [install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source).
+
+
 ## Export the model/s.
 Change into the program-data-separation directory and create a directory to hold exported artifacts.
 ```bash
 cd ~/executorch-examples/program-data-separation
 mkdir models
 ```
 
-Export models into the `models` directory. The first command will generated undelegated model/data files, and the second will generate XNNPACK-delegated model/data files.
+Export models into the `models` directory.
+- The first command generates a regular llama_3_2_1B model.
+- The second command generates a llama_3_2_1B lora model.
+
 ```bash
 sh export_lora.sh
 ```
@@ -55,20 +66,20 @@ Expect the files:
 - tokenizer.model
 
 llama_3_2_1B.ptd and foundation_weights.ptd contain the same contents, and you can remove llama_3_2_1B.ptd.
-tokenizer.model is copied from the temp directory where we downloaded the HF artifacts. It will be used at runtime.
+tokenizer.model is copied from the temp directory where we downloaded the HF artifacts. It is used at runtime.
 
 Note:
 - PTE: contains the program execution logic.
-- PTD: contains the constant tensors used by the PTE. This format is similar to safetensors, but relying on flatbuffer instead of json for serde.
+- PTD: contains the constant tensors used by the PTE. This format is similar to safetensors. It relies on flatbuffers instead of json for serde.
 
 Sample file sizes:
 ```
--rw-r--r-- 1 lfq users 4943000480 Aug 11 15:55 foundation.ptd
--rw-r--r-- 1 lfq users 1078636416 Aug 11 15:55 llama_3_2_1B_lora.pte
--rw-r--r-- 1 lfq users 1051324736 Aug 11 15:53 llama_3_2_1B.pte
+-rw-r--r-- 1 lfq users 5994013600 Oct 17 14:31 foundation.ptd
+-rw-r--r-- 1 lfq users 27628928 Oct 17 14:31 llama_3_2_1B_lora.pte
+-rw-r--r-- 1 lfq users 317248 Oct 17 14:28 llama_3_2_1B.pte
 ```
 
-Notice the lora - llama file size difference is about 27.3MB. This will change depending on the LoRA config. This demo is using the config from https://huggingface.co/lucylq/llama3_1B_lora/blob/main/adapter_config.json
+Notice the lora - llama file size difference is about 27.3MB. This is the size of the adapter weights, and changes depending on the LoRA config. This demo is using the config from https://huggingface.co/lucylq/llama3_1B_lora/blob/main/adapter_config.json.
 ```
 {"r": 64, "lora_alpha": 128, "target_modules": ["q_proj", "v_proj", "o_proj"], "peft_type": "LORA", "base_model_name_or_path": "meta-llama/Llama-3.2-1B-Instruct"}
 ```
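
Since the README notes above that the two PTD files contain the same contents, a quick sanity check before deleting the duplicate (a sketch; the paths assume the `models/` output directory that export_lora.sh writes into):

```bash
# Sketch: verify the two PTD files are byte-identical, then drop the duplicate.
# Adjust paths if your export wrote somewhere else.
cd ~/executorch-examples/program-data-separation
cmp models/llama_3_2_1B.ptd models/foundation.ptd && rm models/llama_3_2_1B.ptd
```
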
@@ -104,27 +115,35 @@ sh build_example.sh
 ```bash
 cd ~/executorch-examples/program-data-separation/cpp/lora_example
 
-./build/bin/executorch_program_data_separation --lora_model_path=../../llama_3_2_1B_lora.pte --llama_model_path=../../llama_3_2_1B.pte --tokenizer_path=../../tokenizer.model --foundation_weights_path=../../foundation.ptd
+./build/bin/executorch_program_data_separation \
+    --tokenizer_path="../../tokenizer.model" \
+    --model1="../../models/llama_3_2_1B_lora.pte" \
+    --model2="../../models/llama_3_2_1B.pte" \
+    --weights="../../models/foundation.ptd"
 ```
 
 You should see some logs showing the Resident Set Size (RSS) at various points of the execution. Some sample logs may look like this:
 
 ```
-Generating with llama...
-RSS after loading model: 7886.125000 MiB
-RSS after prompt prefill: 7886.125000 MiB
-RSS after finishing text generation: 7886.125000 MiB
+Generating with model <model file path>
+RSS after loading model: 6909.328125 MiB
+RSS after prompt prefill: 6909.328125 MiB
+RSS after finishing text generation: 6909.328125 MiB
 
 Generating with lora...
-RSS after loading model: 7933.523438 MiB
-RSS after prompt prefill: 7933.523438 MiB
-RSS after finishing text generation: 7933.523438 MiB
+RSS after loading model: 7941.667969 MiB
+RSS after prompt prefill: 7941.667969 MiB
+RSS after finishing text generation: 7941.667969 MiB
 ```
-Notice the memory increase of ~47 MiB from running llama model to running lora model. You can see the difference without weight-sharing by removing the flag `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from `build_example.sh`.
+There is a ~1.4GB memory increase between running the two models.
+~1GB comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command.
+~40MB comes from switching from the non-lora model to the lora model.
+
+You can see the difference without weight-sharing by removing the flag `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from `build_example.sh`. Expect to see almost double the memory usage, i.e. ~14-15GB instead of ~8GB.
 
 ## Clean up.
 ```bash
 rm -rf build
 cd ~/executorch-examples/program-data-separation
-rm -rf *.pte *.ptd tokenizer.model
+rm -rf models/
 ```
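
A minimal sketch of the no-weight-cache comparison described above, reusing only commands already shown in this README (first remove `-DEXECUTORCH_XNNPACK_ENABLE_WEIGHT_CACHE=True` from build_example.sh):

```bash
# Sketch: rebuild without the XNNPACK weight cache and re-run the demo,
# then compare the RSS lines in the logs against the weight-sharing build
# (expect roughly ~14-15GB instead of ~8GB, per the note above).
cd ~/executorch-examples/program-data-separation/cpp/lora_example
rm -rf build
sh build_example.sh
./build/bin/executorch_program_data_separation \
    --tokenizer_path="../../tokenizer.model" \
    --model1="../../models/llama_3_2_1B_lora.pte" \
    --model2="../../models/llama_3_2_1B.pte" \
    --weights="../../models/foundation.ptd"
```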

program-data-separation/cpp/lora_example/main.cpp

Lines changed: 51 additions & 30 deletions
@@ -18,22 +18,22 @@
 #include <executorch/extension/llm/runner/text_llm_runner.h>
 #include <executorch/extension/llm/runner/text_prefiller.h>
 #include <executorch/extension/llm/runner/text_token_generator.h>
-
+#include <pytorch/tokenizers/hf_tokenizer.h>
 #if defined(ET_USE_THREADPOOL)
 #include <executorch/extension/threadpool/cpuinfo_utils.h>
 #include <executorch/extension/threadpool/threadpool.h>
 #endif
 
-DEFINE_string(lora_model_path, "llama_3_2_1B_lora.pte",
-              "LoRA model serialized in flatbuffer format.");
-DEFINE_string(llama_model_path, "llama_3_2_1B.pte",
-              "Model serialized in flatbuffer format.");
-DEFINE_string(foundation_weights_path, "foundation.ptd",
-              "Foundation weights serialized in flatbuffer format.");
+DEFINE_string(model1, "llama_3_2_1B_lora.pte",
+              "First model, a PTE file.");
+DEFINE_string(model2, "llama_3_2_1B.pte",
+              "Second model, a PTE file.");
+DEFINE_string(weights, "foundation.ptd",
+              "Shared weights, a PTD file.");
 
-DEFINE_string(tokenizer_path, "tokenizer.model", "Tokenizer stuff.");
+DEFINE_string(tokenizer_path, "tokenizer.model", "Tokenizer.");
 
-DEFINE_string(prompt, "The answer to the ultimate question is", "Prompt.");
+DEFINE_string(prompt, "What is the meaning of life?", "Prompt.");
 
 DEFINE_double(temperature, 0,
               "Temperature; Default is 0. 0 = greedy argmax sampling "
@@ -45,6 +45,10 @@ DEFINE_int32(
     "max_seq_len. If the number of input tokens + seq_len > max_seq_len, the "
     "output will be truncated to max_seq_len tokens.");
 
+DEFINE_bool(
+    apply_chat_template, false,
+    "Apply a LLAMA-style chat template to the prompt. Defaults to false.");
+
 using executorch::extension::Module;
 using executorch::runtime::Error;
 namespace llm = executorch::extension::llm;
@@ -75,9 +79,9 @@ int main(int argc, char *argv[]) {
 
   gflags::ParseCommandLineFlags(&argc, &argv, true);
 
-  const char *lora_model_path = FLAGS_lora_model_path.c_str();
-  const char *llama_model_path = FLAGS_llama_model_path.c_str();
-  const char *foundation_weights_path = FLAGS_foundation_weights_path.c_str();
+  const char *model1 = FLAGS_model1.c_str();
+  const char *model2 = FLAGS_model2.c_str();
+  const char *weights = FLAGS_weights.c_str();
 
   const char *tokenizer_path = FLAGS_tokenizer_path.c_str();
   const char *prompt = FLAGS_prompt.c_str();
@@ -93,35 +97,52 @@ int main(int argc, char *argv[]) {
 
   if (tokenizer1 == nullptr || tokenizer2 == nullptr) {
     ET_LOG(Info,
-           "Failed to load %s as a Tiktoken, Sentencepiece or Llama2.c "
+           "Failed to load %s as a Tiktoken, Sentencepiece, Llama2.c or HFTokenizer "
            "tokenizer, make sure the artifact is one of these types",
            tokenizer_path);
     return 1;
   }
 
   // Create runners.
-  std::unique_ptr<llm::TextLLMRunner> llama_runner =
-      llm::create_text_llm_runner(llama_model_path, std::move(tokenizer1),
-                                  foundation_weights_path, temperature);
-  std::unique_ptr<llm::TextLLMRunner> lora_runner =
-      llm::create_text_llm_runner(lora_model_path, std::move(tokenizer2),
-                                  foundation_weights_path, temperature);
-
-  // Generate.
-  llm::GenerationConfig config{.seq_len = seq_len, .temperature = temperature};
-
-  ET_LOG(Info, "Generating with llama...");
-  auto error = llama_runner->generate(prompt, config);
+  std::unique_ptr<llm::TextLLMRunner> runner1 =
+      llm::create_text_llm_runner(model1, std::move(tokenizer1),
+                                  weights, temperature);
+  std::unique_ptr<llm::TextLLMRunner> runner2 =
+      llm::create_text_llm_runner(model2, std::move(tokenizer2),
+                                  weights, temperature);
+
+  llm::GenerationConfig config{
+      .echo = false,
+      .seq_len = seq_len,
+      .temperature = temperature};
+
+  std::string formatted_prompt = std::string();
+  if (FLAGS_apply_chat_template) {
+    // System Prompt.
+    formatted_prompt += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n";
+    formatted_prompt += "You are a helpful assistant.<|eot_id|>";
+    // User prompt.
+    formatted_prompt += "<|start_header_id|>user<|end_header_id|>\n";
+    formatted_prompt += prompt;
+    formatted_prompt += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>";
+  } else {
+    formatted_prompt += prompt;
+  }
+
+  ET_LOG(Info, "Generating with model %s...", model1);
+  ET_LOG(Info, "Formatted prompt: %s", formatted_prompt.c_str());
+  auto error = runner1->generate(formatted_prompt, config);
   if (error != Error::Ok) {
-    ET_LOG(Error, "Failed to generate with llama_runner, error code %zu.",
-           error);
+    ET_LOG(Error, "Failed to generate with model %s, error code %zu.",
+           model1, error);
     return 1;
   }
 
-  error = lora_runner->generate(prompt, config);
+  ET_LOG(Info, "Generating with model %s...", model2);
+  error = runner2->generate(formatted_prompt, config);
   if (error != Error::Ok) {
-    ET_LOG(Error, "Failed to generate with lora_runner, error code %zu.",
-           error);
+    ET_LOG(Error, "Failed to generate with model %s, error code %zu.",
+           model2, error);
     return 1;
   }
 
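
A usage sketch for the `apply_chat_template` flag added above, based on the run command in the lora_example README (paths assume the same `models/` layout):

```bash
# Sketch: wrap the prompt in the LLAMA-style chat template added in this commit.
cd ~/executorch-examples/program-data-separation/cpp/lora_example
./build/bin/executorch_program_data_separation \
    --tokenizer_path="../../tokenizer.model" \
    --model1="../../models/llama_3_2_1B_lora.pte" \
    --model2="../../models/llama_3_2_1B.pte" \
    --weights="../../models/foundation.ptd" \
    --apply_chat_template=true \
    --prompt="What is the meaning of life?"
```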

program-data-separation/export_lora.sh

Lines changed: 5 additions & 4 deletions
@@ -23,6 +23,7 @@ print(path)
 cp "${DOWNLOADED_PATH}/tokenizer.model" .
 
 # Export a non-LoRA model with program-data separated.
+DIR="models/"
 MODEL="llama_3_2_1B"
 python -m executorch.extension.llm.export.export_llm \
   base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \
@@ -33,8 +34,8 @@ python -m executorch.extension.llm.export.export_llm \
   model.dtype_override="fp32" \
   backend.xnnpack.enabled=true \
   backend.xnnpack.extended_ops=true \
-  export.output_name="${MODEL}.pte" \
-  export.foundation_weights_file="${MODEL}.ptd"
+  export.output_name="${DIR}/${MODEL}.pte" \
+  export.foundation_weights_file="${DIR}/${MODEL}.ptd"
 
 # Export a LoRA model, with program and data separated.
 LORA_MODEL="llama_3_2_1B_lora"
@@ -49,5 +50,5 @@ python -m executorch.extension.llm.export.export_llm \
   model.dtype_override="fp32" \
   backend.xnnpack.enabled=true \
   backend.xnnpack.extended_ops=true \
-  export.output_name="${LORA_MODEL}.pte" \
-  export.foundation_weights_file="foundation.ptd"
+  export.output_name="${DIR}/${LORA_MODEL}.pte" \
+  export.foundation_weights_file="${DIR}/foundation.ptd"
