# Running LLMs on iOS

ExecuTorch’s LLM-specific runtime components provide experimental Objective-C and Swift APIs around the core C++ LLM runtime.

## Prerequisites

Make sure you have the model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide.

## Runtime API

Once linked against the [`executorch_llm`](../using-executorch-ios.md) framework, you can import the necessary components.

### Importing

Objective-C:
```objectivec
#import <ExecuTorchLLM/ExecuTorchLLM.h>
```

Swift:
```swift
import ExecuTorchLLM
```

### TextLLMRunner

The `ExecuTorchTextLLMRunner` class (bridged to Swift as `TextLLMRunner`) provides a simple Objective-C/Swift interface for loading a text-generation model, configuring its tokenizer with custom special tokens, generating token streams, and stopping execution.
This API is experimental and subject to change.

#### Initialization

Create a runner by specifying paths to your serialized model (`.pte`) and tokenizer data, plus an array of special tokens to use during tokenization.
Initialization itself is lightweight and doesn’t load the program data immediately.

Objective-C:
```objectivec
NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"llama-3.2-instruct" ofType:@"pte"];
NSString *tokenizerPath = [[NSBundle mainBundle] pathForResource:@"tokenizer" ofType:@"model"];
NSArray<NSString *> *specialTokens = @[ @"<|bos|>", @"<|eos|>" ];

ExecuTorchTextLLMRunner *runner = [[ExecuTorchTextLLMRunner alloc] initWithModelPath:modelPath
                                                                        tokenizerPath:tokenizerPath
                                                                        specialTokens:specialTokens];
```

Swift:
```swift
let modelPath = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte")!
let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")!
let specialTokens = ["<|bos|>", "<|eos|>"]

let runner = TextLLMRunner(
  modelPath: modelPath,
  tokenizerPath: tokenizerPath,
  specialTokens: specialTokens
)
```
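
The Swift example above force-unwraps the bundle lookups for brevity. In an app you may prefer to fail gracefully when a resource is missing; here is a minimal sketch using standard Foundation APIs, with the same placeholder resource names as above:

```swift
// A minimal sketch: resolve bundled resources without force-unwrapping.
// The resource names mirror the example above and are placeholders.
guard
  let modelPath = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte"),
  let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")
else {
  fatalError("Model or tokenizer resource is missing from the app bundle")
}

let runner = TextLLMRunner(
  modelPath: modelPath,
  tokenizerPath: tokenizerPath,
  specialTokens: ["<|bos|>", "<|eos|>"]
)
```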

#### Loading

Explicitly load the model before generation to avoid paying the load cost during your first `generate` call.

Objective-C:
```objectivec
NSError *error = nil;
BOOL success = [runner loadWithError:&error];
if (!success) {
  NSLog(@"Failed to load: %@", error);
}
```

Swift:
```swift
do {
  try runner.load()
} catch {
  print("Failed to load: \(error)")
}
```
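
Loading can take noticeable time for larger models, so consider doing it off the main thread. A minimal sketch, assuming the `runner` from the initialization example and using Grand Central Dispatch:

```swift
// A minimal sketch: load the model on a background queue so the UI
// stays responsive, then hop back to the main queue to update state.
DispatchQueue.global(qos: .userInitiated).async {
  do {
    try runner.load()
    DispatchQueue.main.async {
      // Update UI state here, e.g. enable a "Generate" button.
    }
  } catch {
    print("Failed to load: \(error)")
  }
}
```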

#### Generating

Generate up to a given number of tokens from an initial prompt. The callback block is invoked once per token as it’s produced.

Objective-C:
```objectivec
NSError *error = nil;
BOOL success = [runner generate:@"Once upon a time"
                 sequenceLength:50
              withTokenCallback:^(NSString *token) {
                NSLog(@"Generated token: %@", token);
              }
                          error:&error];
if (!success) {
  NSLog(@"Generation failed: %@", error);
}
```

Swift:
```swift
do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    print("Generated token:", token)
  }
} catch {
  print("Generation failed:", error)
}
```
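
Since the callback fires once per token, accumulating the tokens yourself is the usual way to build up the full response. A minimal sketch, assuming the runner has been loaded as shown above:

```swift
// A minimal sketch: concatenate streamed tokens into the full response.
var response = ""
do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    response += token
    // Append the token to your UI here for a streaming effect.
  }
  print("Full response:", response)
} catch {
  print("Generation failed:", error)
}
```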

#### Stopping Generation

If you need to interrupt a long-running generation, call:

Objective-C:
```objectivec
[runner stop];
```

Swift:
```swift
runner.stop()
```
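
Because `stop` is meant to interrupt a generation already in progress, it is typically called from a different thread than the one running `generate`, for example from a button action while generation runs on a background queue. A minimal sketch under that assumption:

```swift
// A minimal sketch: run generation on a background queue so that
// stop() can be called from the main thread (e.g. a "Stop" button).
DispatchQueue.global(qos: .userInitiated).async {
  do {
    try runner.generate("Once upon a time", sequenceLength: 256) { token in
      print(token, terminator: "")
    }
  } catch {
    print("Generation failed: \(error)")
  }
}

// Later, from the main thread:
runner.stop()
```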

## Demo

Get hands-on with our [LLaMA iOS Demo App](llama-demo-ios.md) to see the LLM runtime APIs in action.