# Running LLMs on iOS

ExecuTorch’s LLM-specific runtime components provide experimental Objective-C and Swift wrappers around the core C++ LLM runtime.

## Prerequisites

Make sure you have the model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide.
## Runtime API

Once linked against the [`executorch_llm`](../using-executorch-ios.md) framework, you can import the necessary components.

### Importing

Objective-C:
```objectivec
#import <ExecuTorchLLM/ExecuTorchLLM.h>
```

Swift:
```swift
import ExecuTorchLLM
```

### TextLLMRunner

The `ExecuTorchTextLLMRunner` class (bridged to Swift as `TextLLMRunner`) provides a simple Objective-C/Swift interface for loading a text-generation model, configuring its tokenizer with custom special tokens, generating token streams, and stopping execution.
This API is experimental and subject to change.

#### Initialization

Create a runner by specifying paths to your serialized model (`.pte`) and tokenizer data, plus an array of special tokens to use during tokenization.
Initialization itself is lightweight and doesn’t load the program data immediately.

Objective-C:
```objectivec
NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"llama-3.2-instruct" ofType:@"pte"];
NSString *tokenizerPath = [[NSBundle mainBundle] pathForResource:@"tokenizer" ofType:@"model"];
NSArray<NSString *> *specialTokens = @[ @"<|bos|>", @"<|eos|>" ];

ExecuTorchTextLLMRunner *runner = [[ExecuTorchTextLLMRunner alloc] initWithModelPath:modelPath
                                                                        tokenizerPath:tokenizerPath
                                                                        specialTokens:specialTokens];
```

Swift:
```swift
let modelPath = Bundle.main.path(forResource: "llama-3.2-instruct", ofType: "pte")!
let tokenizerPath = Bundle.main.path(forResource: "tokenizer", ofType: "model")!
let specialTokens = ["<|bos|>", "<|eos|>"]

let runner = TextLLMRunner(
  modelPath: modelPath,
  tokenizerPath: tokenizerPath,
  specialTokens: specialTokens
)
```

#### Loading

Explicitly load the model before generation to avoid paying the load cost during your first `generate` call.

Objective-C:
```objectivec
NSError *error = nil;
BOOL success = [runner loadWithError:&error];
if (!success) {
  NSLog(@"Failed to load: %@", error);
}
```

Swift:
```swift
do {
  try runner.load()
} catch {
  print("Failed to load: \(error)")
}
```
#### Generating

Generate up to a given number of tokens from an initial prompt. The callback block is invoked once per token as it’s produced.

Objective-C:
```objectivec
NSError *error = nil;
BOOL success = [runner generate:@"Once upon a time"
                 sequenceLength:50
              withTokenCallback:^(NSString *token) {
                NSLog(@"Generated token: %@", token);
              }
                          error:&error];
if (!success) {
  NSLog(@"Generation failed: %@", error);
}
```

Swift:
```swift
do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    print("Generated token:", token)
  }
} catch {
  print("Generation failed:", error)
}
```
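
Because the callback delivers one token at a time, assembling the full completion is just a matter of concatenating them. A minimal sketch (the prompt and the `output` buffer are illustrative):

```swift
// Accumulate streamed tokens into a single response string.
var output = ""
do {
  try runner.generate("Once upon a time", sequenceLength: 50) { token in
    output += token  // each callback carries the next decoded token
  }
  print("Full response:", output)
} catch {
  print("Generation failed:", error)
}
```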
#### Stopping Generation

If you need to interrupt a long-running generation, call:

Objective-C:
```objectivec
[runner stop];
```

Swift:
```swift
runner.stop()
```
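
Since `generate` streams tokens until it finishes, `stop()` is typically called from a different thread or task than the one running generation, for example from a Stop button handler. A minimal sketch, assuming generation was started in a detached task (the task structure here is illustrative, not part of the API):

```swift
// Start generation off the main actor so the UI stays responsive.
Task.detached {
  do {
    try runner.generate("Tell me a story", sequenceLength: 256) { token in
      print(token, terminator: "")
    }
  } catch {
    print("Generation failed:", error)
  }
}

// Later, e.g. from a Stop button action:
runner.stop()
```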
## Demo

Get hands-on with our [LLaMA iOS Demo App](llama-demo-ios.md) to see the LLM runtime APIs in action.