---
outline: deep
description: Learn how to use the low-level API of node-llama-cpp
---
# Low Level API
`node-llama-cpp` provides high-level APIs for the most common use cases to make it easy to use.
However, it also provides low-level APIs for more advanced use cases.

There are various low-level APIs that you can use - the higher the level of API you can use, the more optimizations and features you can leverage.

## Background {#background}
Before you can use the low-level API, here are a few concepts you should be familiar with:

### Context Sequence {#context-sequence}
A [`LlamaContextSequence`](../api/classes/LlamaContextSequence.md) is an isolated component that holds an inference state.

The state is built from the tokens you evaluate onto it ("appending" them to the state), and you can access the current state tokens using [`.contextTokens`](../api/classes/LlamaContextSequence.md#contexttokens).

When evaluating input (tokens) onto a context sequence, you can choose to generate a "next token" for each of the input tokens you evaluate.
When choosing to generate a "next token" for a given token,
the model will "see" all the tokens up to it (both the input tokens and the current context sequence state tokens),
and the generated token will be returned in the generation result you get from the API, but won't be appended to the context sequence state.
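
For example, here's a minimal sketch of this behavior; it assumes a `model` and a `sequence` were already created, as shown in the full examples below:
```typescript
// assumes `model` and `sequence` were created as in the examples below
const inputTokens = model.tokenize("The best way to");

for await (const generatedToken of sequence.evaluate(inputTokens)) {
    // at this point, the input tokens are part of the context sequence state,
    // but `generatedToken` itself isn't appended to it yet;
    // it'll only be appended if the iteration continues
    console.log("State tokens:", sequence.contextTokens.length);
    console.log("Generated token:", generatedToken);
    break;
}
```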

### Probabilities List {#probabilities-list}
When generating a token, the model actually produces a probability for each token in the vocabulary to be the next token.

It then uses these probabilities to choose the next token based on the heuristics you provide (like [`temperature`](../api/type-aliases/SequenceEvaluateOptions#temperature), for example).

The operation of applying such heuristics to choose the next token is also called _sampling_.

When you pass sampling options (like [`temperature`](../api/type-aliases/SequenceEvaluateOptions#temperature), for example) for the generation of a token,
the probabilities list may be adjusted so that the next token can be chosen based on the heuristics you provide.

The sampling is done on the native side of `node-llama-cpp` for performance reasons.
However, you can still opt to get the full probabilities list after the sampling is done,
and you can pass no sampling options to avoid making any adjustments to the probabilities list.

It's best to avoid getting the full probabilities list unless you really need it,
as passing it to the JavaScript side can be slow.

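To make the idea of sampling more concrete, here's a simplified, conceptual sketch of temperature sampling over a probabilities list.
It's only an illustration of the concept; it's not how `node-llama-cpp` performs sampling natively, and the `sampleNextToken` helper shown here is hypothetical:
```typescript
// a conceptual illustration only - not the native sampling implementation.
// `probabilities` maps token values to their probability of being the next token
function sampleNextToken(
    probabilities: Map<number, number>,
    temperature: number
): number {
    const entries = [...probabilities.entries()];

    if (temperature <= 0) {
        // a temperature of 0 means greedy sampling:
        // always pick the most probable token
        entries.sort(([, probA], [, probB]) => probB - probA);
        return entries[0]![0];
    }

    // raising each probability to the power of `1 / temperature` sharpens the
    // distribution when the temperature is below 1 and flattens it when it's above 1
    const weights = entries.map(([token, probability]) => (
        [token, Math.pow(probability, 1 / temperature)] as const
    ));
    const totalWeight = weights.reduce((sum, [, weight]) => sum + weight, 0);

    // pick a token at random, weighted by the adjusted probabilities
    let random = Math.random() * totalWeight;
    for (const [token, weight] of weights) {
        random -= weight;
        if (random <= 0)
            return token;
    }

    return weights[weights.length - 1]![0];
}
```
In practice, you just pass sampling options (like [`temperature`](../api/type-aliases/SequenceEvaluateOptions#temperature)) to the evaluation APIs shown below, and the native side takes care of adjusting the probabilities and choosing the next token.
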
## Simple Evaluation {#simple-evaluation}
You can evaluate the given input tokens onto a context sequence using [`.evaluate`](../api/classes/LlamaContextSequence.md#evaluate)
and generate the next token for the last input token.

On each iteration of the returned iterator, the generated token is then added to the context sequence state and the next token is generated for it, and so on.

When using [`.evaluate`](../api/classes/LlamaContextSequence.md#evaluate), the configured [token predictor](./token-prediction.md) is used to speed up the generation process.

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, SequenceEvaluateOptions} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
const maxTokens = 10;
const res: Token[] = [];
const options: SequenceEvaluateOptions = {
    temperature: 0.8
};

for await (const generatedToken of sequence.evaluate(tokens, options)) {
    res.push(generatedToken);
    if (res.length >= maxTokens)
        break;
}

const resText = model.detokenize(res);
console.log("Result: " + resText);
```
> For generating text completion, it's better to use [`LlamaCompletion`](./text-completion.md) instead of manually evaluating input,
> since it supports all models, and provides many more features and optimizations.

### Replacement Token(s) {#replacement-tokens}
You can manually iterate over the evaluation iterator and provide a replacement to the generated token.
If you provide replacement token(s), they'll be appended to the context sequence state instead of the generated token.

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, SequenceEvaluateOptions} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
const options: SequenceEvaluateOptions = {
    temperature: 0.8
};
const maxTokens = 10;
const res: Token[] = [];

// fill this with tokens to replace
const replacementMap = new Map<Token, Token>();

const iterator = sequence.evaluate(tokens, options);
let replacementToken: Token | undefined;

while (true) {
    // passing a replacement token to `.next()` appends it to the state
    // instead of the token that was generated on the previous iteration
    const {value: token, done} = await iterator.next(replacementToken);
    replacementToken = undefined;
    if (done || token == null)
        break;

    replacementToken = replacementMap.get(token);

    res.push(replacementToken ?? token);
    if (res.length >= maxTokens)
        break;
}

const resText = model.detokenize(res);
console.log("Result: " + resText);
```
> If you want to adjust the token probabilities when generating output, consider using [token bias](./token-bias.md) instead.

### No Generation {#evaluation-without-generation}
To evaluate the input tokens onto a context sequence without generating new tokens,
you can use [`.evaluateWithoutGeneratingNewTokens`](../api/classes/LlamaContextSequence.md#evaluatewithoutgeneratingnewtokens).

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
await sequence.evaluateWithoutGeneratingNewTokens(tokens);
```

## Controlled Evaluation {#controlled-evaluation}
To manually control which of the input tokens to generate output for, you can use [`.controlledEvaluate`](../api/classes/LlamaContextSequence.md#controlledevaluate).

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, ControlledEvaluateInputItem} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
const evaluateInput: ControlledEvaluateInputItem[] = tokens.slice();

// generate output for the last token only
const lastToken = evaluateInput.pop() as Token | undefined;
if (lastToken != null)
    evaluateInput.push([lastToken, {
        generateNext: {
            singleToken: true,
            probabilitiesList: true,
            options: {
                temperature: 0.8
            }
        }
    }]);

const res = await sequence.controlledEvaluate(evaluateInput);
const lastTokenResult = res[evaluateInput.length - 1];
if (lastTokenResult != null) {
    const {next} = lastTokenResult;

    if (next.token != null)
        console.log(
            "next token",
            next.token,
            model.detokenize([next.token], true)
        );

    if (next.probabilities != null)
        console.log(
            "next probabilities",
            [...next.probabilities.entries()]
                .slice(0, 5) // top 5 probabilities
                .map(([token, probability]) => (
                    [model.detokenize([token], true), probability]
                ))
        );

    // next: evaluate `next.token` onto the context sequence
    // and generate the next token for it
}
```

## State Manipulation {#state-manipulation}
You can manipulate the context sequence state by erasing tokens from it or shifting tokens in it.

Make sure that you don't attempt to manipulate the state while waiting for a generation result from an evaluation operation,
as it may lead to unexpected results.

### Erase State Ranges {#erase-state-ranges}
To erase a range of tokens from the context sequence state,
you can use [`.eraseContextTokenRanges`](../api/classes/LlamaContextSequence.md#erasecontexttokenranges).

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
await sequence.evaluateWithoutGeneratingNewTokens(tokens);

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);

// erase the last token from the state
if (sequence.nextTokenIndex > 0)
    await sequence.eraseContextTokenRanges([{
        start: sequence.nextTokenIndex - 1,
        end: sequence.nextTokenIndex
    }]);

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);
```

### Adapt State to Tokens {#adapt-state-to-tokens}
You can adapt the existing context state to a new input to avoid re-evaluating some of the tokens you've already evaluated.

::: tip NOTE
All the high-level APIs provided by `node-llama-cpp` automatically do this to improve efficiency and performance.
:::

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
await sequence.evaluateWithoutGeneratingNewTokens(tokens);

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);

const newInput = "The best method to";
const newTokens = model.tokenize(newInput);

// only align the current state if the length
// of the new tokens won't incur a context shift
if (newTokens.length < sequence.contextSize && newTokens.length > 0) {
    // ensure we have at least one token to evaluate
    const lastToken = newTokens.pop()!;

    await sequence.adaptStateToTokens(newTokens);
    newTokens.push(lastToken);

    // remove the tokens that already exist in the state
    newTokens.splice(0, sequence.nextTokenIndex);
}

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);
console.log(
    "New tokens:",
    model.detokenize(newTokens, true),
    newTokens
);
```