
Commit 6fda1ef

Add run with c++
1 parent 6af121b commit 6fda1ef

1 file changed: 240 additions, 1 deletion (the new guide below replaces a single `# TODO(mengwei)` placeholder line)
# Running LLMs with C++

This guide explains how to use ExecuTorch's C++ runner library to run LLM models that have been exported to the `.pte` format. The runner library provides a high-level API for text generation with LLMs, handling tokenization, inference, and token generation.

## Prerequisites

Before you begin, make sure you have:

1. A model exported to `.pte` format using the `export_llm` API, as described in [Exporting popular LLMs out of the box](export-llm.md)
2. A tokenizer file compatible with your model
3. CMake and a C++ compiler installed

## Building the Runner Library

The ExecuTorch LLM runner library can be built using CMake. To integrate it into your project:

1. Add ExecuTorch as a dependency in your CMake project
2. Enable the required components (`extension_module`, `extension_tensor`, etc.)
3. Link your application against the `extension_llm_runner` library

Here's a simplified example of the CMake configuration:

```cmake
# Enable required components
set_overridable_option(EXECUTORCH_BUILD_EXTENSION_MODULE ON)
set_overridable_option(EXECUTORCH_BUILD_EXTENSION_TENSOR ON)
set_overridable_option(EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER ON)

# Add ExecuTorch as a dependency
add_subdirectory(executorch)

# Link against the LLM runner library
target_link_libraries(your_app PRIVATE extension_llm_runner)
```

## The Runner API Architecture

The ExecuTorch LLM runner library is designed with a modular architecture that separates concerns between different components of the text generation pipeline.

### IRunner Interface

The `IRunner` interface (`irunner.h`) defines the core functionality for LLM text generation. This interface serves as the primary abstraction for interacting with LLM models:

```cpp
class IRunner {
 public:
  virtual ~IRunner() = default;
  virtual bool is_loaded() const = 0;
  virtual runtime::Error load() = 0;
  virtual runtime::Error generate(...) = 0;
  virtual runtime::Error generate_from_pos(...) = 0;
  virtual void stop() = 0;
};
```

Let's examine each method in detail:

#### `bool is_loaded() const`

Checks if the model and all necessary resources have been loaded into memory and are ready for inference. This method is useful for verifying the runner's state before attempting to generate text.

#### `runtime::Error load()`

Loads the model and prepares it for inference. This includes:

- Loading the model weights from the `.pte` file
- Initializing any necessary buffers or caches
- Preparing the execution environment

This method should be called before any generation attempts. It returns an `Error` object indicating success or failure.
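
For example, a minimal loading sequence might look like the sketch below. It assumes `runner` was created with `create_text_llm_runner()` as shown in the Basic Usage Example later in this guide:

```cpp
// Sketch: load the model once, up front, and bail out on failure.
if (!runner->is_loaded()) {
  const auto err = runner->load();
  if (err != executorch::runtime::Error::Ok) {
    std::cerr << "Failed to load the model" << std::endl;
    return 1; // or propagate the error to the caller
  }
}
```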

#### `runtime::Error generate(const std::string& prompt, const GenerationConfig& config, std::function<void(const std::string&)> token_callback, std::function<void(const Stats&)> stats_callback)`

The primary method for text generation. It takes:

- `prompt`: The input text to generate from
- `config`: Configuration parameters controlling the generation process
- `token_callback`: A callback function that receives each generated token as a string
- `stats_callback`: A callback function that receives performance statistics after generation completes

The token callback fires for each token as it is generated, allowing for streaming output; the stats callback then reports detailed performance metrics once generation finishes.
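
A typical call streams tokens to the console while also keeping the full response. The snippet below is a sketch under the same assumptions as the Basic Usage Example at the end of this guide (`runner` created via `create_text_llm_runner()`, `using namespace executorch::extension::llm`):

```cpp
// Sketch: stream tokens as they arrive and accumulate the full response.
std::string response;

GenerationConfig config;
config.max_new_tokens = 128;

runner->generate(
    "Tell me a joke.",
    config,
    [&response](const std::string& token) {
      std::cout << token << std::flush; // stream to the console immediately
      response += token;                // keep the full text for later use
    },
    [](const Stats& stats) {
      // Called once after generation completes; timing and token-count
      // metrics can be inspected or logged here.
    });
```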
81+
82+
#### `runtime::Error generate_from_pos(const std::string& prompt, int64_t start_pos, const GenerationConfig& config, std::function<void(const std::string&)> token_callback, std::function<void(const Stats&)> stats_callback)`
83+
84+
An advanced version of `generate()` that allows starting generation from a specific position in the KV cache. This is useful for continuing generation from a previous state.
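
For example, a follow-up turn could be issued against the existing cache state as in the sketch below. How `start_pos` is tracked is an assumption of this example; it must equal the number of tokens already processed into the KV cache:

```cpp
// Sketch: continue a conversation without re-processing the earlier context.
// The caller is responsible for tracking how many tokens are already in the
// KV cache from previous generate()/generate_from_pos() calls.
int64_t start_pos = 42; // hypothetical count of tokens already in the cache

GenerationConfig config;
config.max_new_tokens = 64;

runner->generate_from_pos(
    "And what about Germany?", // follow-up prompt appended to the cached context
    start_pos,
    config,
    [](const std::string& token) { std::cout << token << std::flush; },
    nullptr);
```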

#### `void stop()`

Immediately stops the generation loop. This is typically called from another thread to interrupt a long-running generation.
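
For instance, generation can be cut off after a timeout by calling `stop()` from a watchdog thread. This is only a sketch (it needs `<thread>` and `<chrono>`, and assumes `runner` and `config` outlive both threads):

```cpp
// Sketch: interrupt a long-running generation after five seconds.
std::thread watchdog([&runner]() {
  std::this_thread::sleep_for(std::chrono::seconds(5));
  runner->stop(); // generate() returns shortly after this call
});

runner->generate("Write a very long story about the ocean.", config,
    [](const std::string& token) { std::cout << token << std::flush; },
    nullptr);

watchdog.join();
```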

### GenerationConfig Structure

The `GenerationConfig` struct controls various aspects of the generation process:

```cpp
struct GenerationConfig {
  bool echo = true;             // Whether to echo the input prompt in the output
  int32_t max_new_tokens = -1;  // Maximum number of new tokens to generate
  bool warming = false;         // Whether this is a warmup run
  int32_t seq_len = -1;         // Maximum number of total tokens
  float temperature = 0.8f;     // Temperature for sampling
  int32_t num_bos = 0;          // Number of BOS tokens to add
  int32_t num_eos = 0;          // Number of EOS tokens to add

  // Helper method to resolve the actual max_new_tokens based on constraints
  int32_t resolve_max_new_tokens(int32_t max_context_len, int32_t num_prompt_tokens) const;
};
```
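
For example, a configuration that suppresses prompt echo and caps the reply length could look like this (the values are illustrative):

```cpp
GenerationConfig config;
config.echo = false;          // do not repeat the prompt in the output stream
config.max_new_tokens = 256;  // stop after at most 256 generated tokens
config.temperature = 0.7f;    // slightly less random sampling than the default
```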

The `resolve_max_new_tokens` method handles the complex logic of determining how many tokens can be generated based on:

- The model's maximum context length
- The number of tokens in the prompt
- The user-specified maximum sequence length and maximum new tokens
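
The exact resolution lives inside the runner library; the stand-alone sketch below only approximates how these constraints might combine and is not the actual implementation:

```cpp
#include <algorithm>
#include <cstdint>

// Approximation only: the real resolve_max_new_tokens() may differ in details.
int32_t resolve_max_new_tokens_sketch(
    int32_t max_context_len,   // model / KV-cache limit
    int32_t num_prompt_tokens, // tokens consumed by the prompt
    int32_t seq_len,           // user cap on total tokens (-1 = unset)
    int32_t max_new_tokens) {  // user cap on generated tokens (-1 = unset)
  // Tokens still available in the context window after the prompt.
  int32_t budget = max_context_len - num_prompt_tokens;
  // A total-sequence-length cap also bounds prompt + generated tokens.
  if (seq_len > 0) {
    budget = std::min(budget, seq_len - num_prompt_tokens);
  }
  // An explicit max_new_tokens cap bounds the generated tokens directly.
  if (max_new_tokens > 0) {
    budget = std::min(budget, max_new_tokens);
  }
  return std::max(budget, 0);
}
```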

### Implementation Components

The runner library consists of several specialized components that work together:

#### TextLLMRunner

The main implementation of the `IRunner` interface that orchestrates the text generation process. It manages:

1. Tokenization of input text
2. Prefilling the KV cache with prompt tokens
3. Generating new tokens one by one
4. Collecting performance statistics

#### TextPrefiller

Responsible for processing the initial prompt tokens and filling the KV cache. Key features:

- Efficiently processes large prompts
- Handles dynamic sequence lengths
- Supports parallel prefilling for performance optimization

#### TextTokenGenerator

Generates new tokens one by one in an autoregressive manner. It:

- Manages the token generation loop
- Applies temperature-based sampling
- Detects end-of-sequence conditions
- Streams tokens as they're generated

#### TextDecoderRunner

Interfaces with the ExecuTorch Module to run the model forward pass. It:

- Manages inputs and outputs to the model
- Handles KV cache updates
- Converts logits to tokens via sampling
## Model Metadata
153+
154+
The metadata includes several important configuration parameters:
155+
156+
1. **`enable_dynamic_shape`**: Whether the model supports dynamic input shapes
157+
2. **`max_seq_len`**: Maximum sequence length the model can handle
158+
3. **`max_context_len`**: Maximum context length for KV cache
159+
4. **`use_kv_cache`**: Whether the model uses KV cache for efficient generation
160+
5. **`use_sdpa_with_kv_cache`**: Whether the model uses the custom op [`torch.ops.llama.sdpa_with_kv_cache.default`](https://github.com/pytorch/executorch/blob/release/0.7/extension/llm/custom_ops/op_sdpa.cpp#L611-L614)
161+
6. **`get_bos_id`**: Beginning-of-sequence token ID
162+
7. **`get_eos_ids`**: End-of-sequence token IDs
163+
8. **`get_vocab_size`**: Size of the model's vocabulary
164+
165+
### Adding Metadata During Export
166+
167+
To ensure your model has the necessary metadata, you can specify it during export using the `metadata` parameter in the export configuration:
168+
169+
```python
170+
# export_llm
171+
python -m extension.llm.export.export_llm \
172+
--config path/to/config.yaml \
173+
+base.metadata='{"get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_max_context_len":4096}'
174+
```

## Tokenizer Support

The runner library supports multiple tokenizer formats through a unified interface:

```cpp
std::unique_ptr<tokenizers::Tokenizer> tokenizer = load_tokenizer(
    tokenizer_path, // Path to tokenizer file
    nullptr,        // Optional special tokens
    std::nullopt,   // Optional regex pattern (for TikToken)
    0,              // BOS token index
    0               // EOS token index
);
```

Supported tokenizer formats include:

1. **HuggingFace Tokenizers**: JSON format tokenizers
2. **SentencePiece**: `.model` format tokenizers
3. **TikToken**: BPE tokenizers
4. **Llama2c**: BPE tokenizers in the Llama2.c format

For custom tokenizers, you can find implementations in the [pytorch-labs/tokenizers](https://github.com/pytorch-labs/tokenizers) repository.

## Basic Usage Example

Here's a simplified example of using the runner:

```cpp
#include <iostream>

#include <executorch/extension/llm/runner/text_llm_runner.h>

using namespace executorch::extension::llm;

int main() {
  // Load tokenizer and create runner
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer));

  // Load the model
  runner->load();

  // Configure generation
  GenerationConfig config;
  config.max_new_tokens = 100;
  config.temperature = 0.8f;

  // Generate text with streaming output
  runner->generate("Hello, world!", config,
      [](const std::string& token) { std::cout << token << std::flush; },
      nullptr);

  return 0;
}
```

## Other APIs

1. **Warmup**: For more accurate timing, perform a warmup run before measuring performance:

   ```cpp
   runner->warmup("Hello world", 10); // Generate 10 tokens as warmup
   ```

2. **Memory Usage**: Monitor memory usage with the `get_rss_bytes()` helper:

   ```cpp
   std::cout << "RSS after loading: " << get_rss_bytes() / 1024.0 / 1024.0 << " MiB" << std::endl;
   ```
