
Commit 6a13bbf

feat: experimentalChunkDocument
1 parent 07bbc4e commit 6a13bbf

12 files changed: +899 -69 lines

.vitepress/config.ts

Lines changed: 1 addition & 0 deletions
@@ -491,6 +491,7 @@ export default defineConfig({
 {text: "Chat Context Shift", link: "/chat-context-shift"},
 {text: "Batching", link: "/batching"},
 {text: "Token Prediction", link: "/token-prediction"},
+{text: "Low Level API", link: "/low-level-api"},
 {text: "Awesome List", link: "/awesome"},
 {text: "Troubleshooting", link: "/troubleshooting"},
 {text: "Tips and Tricks", link: "/tips-and-tricks"}

docs/guide/index.md

Lines changed: 4 additions & 0 deletions
@@ -264,6 +264,10 @@ console.log("AI: " + a1);
 ```
 
 ### Raw
+::: tip NOTE
+To learn more about using low level APIs, read the [low level API guide](./low-level-api.md).
+:::
+
 ```typescript
 import {fileURLToPath} from "url";
 import path from "path";

docs/guide/low-level-api.md

Lines changed: 322 additions & 0 deletions
@@ -0,0 +1,322 @@
---
outline: deep
description: Learn how to use the low-level API of node-llama-cpp
---
# Low Level API
`node-llama-cpp` provides high-level APIs for the most common use cases to make it easy to use.
However, it also provides low-level APIs for more advanced use cases.

There are various low-level APIs you can use - the higher level you go, the more optimizations and features you can leverage.

## Background {#background}
Before you can use the low-level API, here are a few concepts you should be familiar with:

### Context Sequence {#context-sequence}
A [`LlamaContextSequence`](../api/classes/LlamaContextSequence.md) is an isolated component that holds an inference state.

The state is constructed from the tokens you evaluate to "append" to it, and you can access the current state tokens using [`.contextTokens`](../api/classes/LlamaContextSequence.md#contexttokens).

When evaluating input tokens onto a context sequence, you can choose to generate a "next token" for each of the input tokens you evaluate.
When you choose to generate a "next token" for a given token,
the model will "see" all the tokens up to it (both the input tokens and the current context sequence state tokens);
the generated token will be part of the generation result you get from the API, but won't be appended to the context sequence state.

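For example, here's a minimal sketch (using the same model file as the rest of the examples in this guide) that inspects a sequence's state tokens before and after evaluating input onto it:
```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

// a fresh sequence starts with an empty state
console.log("State tokens:", sequence.contextTokens);

// append tokens to the state without generating anything
const tokens = model.tokenize("The best way to");
await sequence.evaluateWithoutGeneratingNewTokens(tokens);

// the evaluated tokens are now part of the state
console.log("State tokens:", sequence.contextTokens);
console.log("State text:", model.detokenize(sequence.contextTokens, true));
```
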
### Probabilities List {#probabilities-list}
When generating a token, the model actually generates a list of probabilities for each token in the vocabulary to be the next token.

It then uses these probabilities to choose the next token based on the heuristics you provide (like [`temperature`](../api/type-aliases/SequenceEvaluateOptions#temperature), for example).

The operation of applying such heuristics to choose the next token is also called _sampling_.

When you pass sampling options (like [`temperature`](../api/type-aliases/SequenceEvaluateOptions#temperature), for example) for the generation of a token,
they may adjust the probabilities list so that the next token can be chosen based on those heuristics.

The sampling is done on the native side of `node-llama-cpp` for performance reasons.
However, you can still opt to get the full probabilities list after the sampling is done,
and you can pass no sampling options to avoid making any adjustments to the probabilities list.

It's best to avoid getting the full probabilities list unless you really need it,
as passing it to the JavaScript side can be slow.

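As a rough sketch of inspecting that list (using the [`.controlledEvaluate`](../api/classes/LlamaContextSequence.md#controlledevaluate) API covered in the [Controlled Evaluation](#controlled-evaluation) section below), you can request the probabilities for the next token while passing no sampling options, so the list is left unmodified:
```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, ControlledEvaluateInputItem} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const tokens = model.tokenize("The best way to");

// request the probabilities list for the last input token only,
// with no sampling options, so the probabilities aren't adjusted
const lastToken = tokens.pop() as Token;
const evaluateInput: ControlledEvaluateInputItem[] = [
    ...tokens,
    [lastToken, {
        generateNext: {
            singleToken: true,
            probabilitiesList: true
        }
    }]
];

const res = await sequence.controlledEvaluate(evaluateInput);
const lastTokenResult = res[evaluateInput.length - 1];
const probabilities = lastTokenResult?.next?.probabilities;

if (probabilities != null)
    console.log(
        "Top 5 candidates:",
        [...probabilities.entries()]
            .slice(0, 5)
            .map(([token, probability]) => (
                [model.detokenize([token], true), probability]
            ))
    );
```
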
## Simple Evaluation {#simple-evaluation}
You can evaluate the given input tokens onto a context sequence using [`.evaluate`](../api/classes/LlamaContextSequence.md#evaluate)
and generate the next token for the last input token.

On each iteration of the returned iterator, the generated token is added to the context sequence state and the next token is generated for it, and so on.

When using [`.evaluate`](../api/classes/LlamaContextSequence.md#evaluate), the configured [token predictor](./token-prediction.md) is used to speed up the generation process.

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, SequenceEvaluateOptions} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
const maxTokens = 10;
const res: Token[] = [];
const options: SequenceEvaluateOptions = {
    temperature: 0.8
};

for await (const generatedToken of sequence.evaluate(tokens, options)) {
    res.push(generatedToken);
    if (res.length >= maxTokens)
        break;
}

const resText = model.detokenize(res);
console.log("Result: " + resText);
```
> For generating text completion, it's better to use [`LlamaCompletion`](./text-completion.md) instead of manually evaluating input,
> since it supports all models and provides many more features and optimizations.

### Replacement Token(s) {#replacement-tokens}
You can manually iterate over the evaluation iterator and provide a replacement for the generated token.
If you provide replacement token(s), they'll be appended to the context sequence state instead of the generated token.

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, SequenceEvaluateOptions} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
const options: SequenceEvaluateOptions = {
    temperature: 0.8
};
const maxTokens = 10;
const res: Token[] = [];

// fill this with tokens to replace
const replacementMap = new Map<Token, Token>();

const iterator = sequence.evaluate(tokens, options);
let replacementToken: Token | undefined;

while (true) {
    const {value: token, done} = await iterator.next(replacementToken);
    replacementToken = undefined;
    if (done || token == null)
        break;

    replacementToken = replacementMap.get(token);

    res.push(replacementToken ?? token);
    if (res.length >= maxTokens)
        break;
}

const resText = model.detokenize(res);
console.log("Result: " + resText);
```
> If you want to adjust the token probabilities when generating output, consider using [token bias](./token-bias.md) instead.

### No Generation {#evaluation-without-generation}
To evaluate the input tokens onto a context sequence without generating new tokens,
you can use [`.evaluateWithoutGeneratingNewTokens`](../api/classes/LlamaContextSequence.md#evaluatewithoutgeneratingnewtokens).

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
await sequence.evaluateWithoutGeneratingNewTokens(tokens);
```

## Controlled Evaluation {#controlled-evaluation}
To manually control which of the input tokens to generate output for, you can use [`.controlledEvaluate`](../api/classes/LlamaContextSequence.md#controlledevaluate).

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, ControlledEvaluateInputItem} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
const evaluateInput: ControlledEvaluateInputItem[] = tokens.slice();

// generate output for the last token only
const lastToken = evaluateInput.pop() as Token;
if (lastToken != null)
    evaluateInput.push([lastToken, {
        generateNext: {
            singleToken: true,
            probabilitiesList: true,
            options: {
                temperature: 0.8
            }
        }
    }]);

const res = await sequence.controlledEvaluate(evaluateInput);
const lastTokenResult = res[evaluateInput.length - 1];
if (lastTokenResult != null) {
    const {next} = lastTokenResult;

    if (next.token != null)
        console.log(
            "next token",
            next.token,
            model.detokenize([next.token], true)
        );

    if (next.probabilities != null)
        console.log(
            "next probabilities",
            [...next.probabilities.entries()]
                .slice(0, 5) // top 5 probabilities
                .map(([token, probability]) => (
                    [model.detokenize([token], true), probability]
                ))
        );

    // next: evaluate `next.token` onto the context sequence
    // and generate the next token for it
}
```

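Building on the comment at the end of the example above, here's a rough sketch of one way to run a full generation loop with [`.controlledEvaluate`](../api/classes/LlamaContextSequence.md#controlledevaluate), feeding each generated token back in to evaluate it and generate the next one:
```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, Token, ControlledEvaluateInputItem} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const tokens = model.tokenize("The best way to");
const maxTokens = 10;
const res: Token[] = [];

// evaluate the whole input first, generating a token for the last input token only
const lastInputToken = tokens.pop() as Token;
let evaluateInput: ControlledEvaluateInputItem[] = [
    ...tokens,
    [lastInputToken, {
        generateNext: {
            singleToken: true,
            options: {temperature: 0.8}
        }
    }]
];

while (res.length < maxTokens) {
    const result = await sequence.controlledEvaluate(evaluateInput);
    const generatedToken = result[evaluateInput.length - 1]?.next?.token;
    if (generatedToken == null)
        break;

    res.push(generatedToken);

    // the generated token isn't appended to the state automatically,
    // so evaluate it next and request another generated token for it
    evaluateInput = [
        [generatedToken, {
            generateNext: {
                singleToken: true,
                options: {temperature: 0.8}
            }
        }]
    ];
}

console.log("Result: " + model.detokenize(res));
```
For a simple loop like this, [`.evaluate`](#simple-evaluation) already does all of the above for you; `controlledEvaluate` is mainly useful when different input tokens need different generation settings.
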
## State Manipulation {#state-manipulation}
You can manipulate the context sequence state by erasing tokens from it or shifting tokens in it.

Make sure that you don't attempt to manipulate the state while waiting for a generation result from an evaluation operation,
as it may lead to unexpected results.

### Erase State Ranges {#erase-state-ranges}
To erase a range of tokens from the context sequence state,
you can use [`.eraseContextTokenRanges`](../api/classes/LlamaContextSequence.md#erasecontexttokenranges).

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
await sequence.evaluateWithoutGeneratingNewTokens(tokens);

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);

// erase the last token from the state
if (sequence.nextTokenIndex > 0)
    await sequence.eraseContextTokenRanges([{
        start: sequence.nextTokenIndex - 1,
        end: sequence.nextTokenIndex
    }]);

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);
```

### Adapt State to Tokens {#adapt-state-to-tokens}
You can adapt the existing context state to a new input to avoid re-evaluating some of the tokens you've already evaluated.

::: tip NOTE
All the high-level APIs provided by `node-llama-cpp` automatically do this to improve efficiency and performance.
:::

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

const input = "The best way to";
const tokens = model.tokenize(input);
await sequence.evaluateWithoutGeneratingNewTokens(tokens);

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);

const newInput = "The best method to";
const newTokens = model.tokenize(newInput);

// only align the current state if the length
// of the new tokens won't incur a context shift
if (newTokens.length < sequence.contextSize && newTokens.length > 0) {
    // ensure we have at least one token to evaluate
    const lastToken = newTokens.pop()!;

    await sequence.adaptStateToTokens(newTokens);
    newTokens.push(lastToken);

    // remove the tokens that already exist in the state
    newTokens.splice(0, sequence.nextTokenIndex);
}

console.log(
    "Current state:",
    model.detokenize(sequence.contextTokens, true),
    sequence.contextTokens
);
console.log(
    "New tokens:",
    model.detokenize(newTokens, true),
    newTokens
);
```

eslint.config.js

Lines changed: 2 additions & 1 deletion
@@ -55,7 +55,8 @@ export default tseslint.config({
 exemptDestructuredRootsFromChecks: true,
 
 tagNamePreference: {
-    hidden: "hidden"
+    hidden: "hidden",
+    experimental: "experimental"
 }
 }
 },
