
Commit 30eaa23

fix: gpt-oss segment budgets (#489)
* feat: `comment` segment budget
* feat(Electron template): comment segments
* feat(Electron template): improve completions speed when using functions
* feat(Electron template): start with inspect script
* feat(Electron template): add a link to download `gpt-oss`
* fix: using segment budgets with `gpt-oss` models
* fix: detect more variations of Harmony chat template
* fix: use a model message for user prompt completion on unsupported models by default
* fix(Electron template): don't render `<br/>` in a thought excerpt
1 parent: 722e29d

20 files changed (+1138, -124 lines)

docs/blog/v3.12-gpt-oss.md

Lines changed: 60 additions & 0 deletions

````diff
@@ -46,6 +46,30 @@ npx -y node-llama-cpp inspect estimate <model URI>
 :::
 
 
+## `MXFP4` Quantization
+You might be used to looking for a `Q4_K_M` quantization because of its good balance between quality and size,
+and thus be looking for a `Q4_K_M` quantization of `gpt-oss` models.
+You don't have to, because these models are already natively provided in a similar quantization format called `MXFP4`.
+
+Let's break down what `MXFP4` is:
+* `MXFP4` stands for Microscaling FP4 (Floating Point, 4-bit). `Q4_K_M` is also a 4-bit quantization.
+* It's a format that was created and standardized by the Open Compute Project (OCP) in early 2024.
+  OCP is backed by big players like OpenAI, NVIDIA, AMD, Microsoft, and Meta,
+  with the goal of lowering the hardware and compute barriers to running AI models.
+* It's designed to dramatically reduce the memory and compute requirements for training and running AI models,
+  while preserving as much precision as possible.
+
+This format was used to train the `gpt-oss` models, so the most precise format of these models is `MXFP4`.
+<br/>
+Since this is a 4-bit precision format, its size footprint is similar to a `Q4_K_M` quantization,
+but it provides better precision and thus better quality.
+First-class support for `MXFP4` in `llama.cpp` was introduced as part of the `gpt-oss` release.
+
+The bottom line is that you don't have to find a `Q4_K_M` quantization of `gpt-oss` models,
+because the `MXFP4` format is as small, efficient, and fast as `Q4_K_M`,
+but offers better precision and thus better quality.
+
+
 ### Try It Using the CLI
 To quickly try out [`gpt-oss-20b`](https://huggingface.co/giladgd/gpt-oss-20b-GGUF), you can use the [CLI `chat` command](../cli/chat.md):
 
@@ -54,6 +78,42 @@ npx -y node-llama-cpp chat --ef --prompt "Hi there" hf:giladgd/gpt-oss-20b-GGUF/
 ```
 
 
+## `thought` Segments
+Since `gpt-oss` models are reasoning models, they generate thoughts as part of their response.
+These thoughts are useful for debugging and understanding the model's reasoning process,
+and can be used to iterate on the system prompt and inputs you provide to the model to improve its responses.
+
+However, OpenAI [emphasizes](https://openai.com/index/chain-of-thought-monitoring/#:~:text=leaving%20CoTs%20unrestricted%20may%20make%20them%20unfit%20to%20be%20shown%20to%20end%2Dusers%2C%20as%20they%20might%20violate%20some%20misuse%20policies)
+that the thoughts generated by these models may not be safe to show to end users, as they are unrestricted
+and might include sensitive information, uncontained language, hallucinations, or other issues.
+Thus, OpenAI recommends not showing them to users without further filtering, moderation, or summarization.
+
+Check out the [segment streaming example](../guide/chat-session.md#stream-response-segments) to learn how to use segments.
+
+
+## `comment` Segments
+`gpt-oss` models output "preamble" messages in their response;
+these are segmented as a new `comment` segment type.
+
+The model might choose to generate those segments to inform the user about the functions it's about to call.
+For example, when it plans to use multiple functions, it may generate a plan in advance.
+
+These are intended for the user to see, but not as part of the main response.
+
+Check out the [segment streaming example](../guide/chat-session.md#stream-response-segments) to learn how to use segments.
+
+::: info Experiment with `comment` segments
+The [Electron app template](../guide/electron.md) has been updated to properly segment comments in the response.
+
+Try it out by downloading the latest build [from GitHub](https://github.com/withcatai/node-llama-cpp/releases/latest),
+or by [scaffolding a new project](../guide/index.md#scaffold-new-project) based on the Electron template:
+
+```shell
+npm create node-llama-cpp@latest
+```
+:::
+
+
 ## Customizing gpt-oss
 You can adjust `gpt-oss`'s responses by configuring the options of [`HarmonyChatWrapper`](../api/classes/HarmonyChatWrapper.md):
 ```typescript
````
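The headline fix in this commit concerns segment budgets, which cap how many tokens the model may spend on a given segment type (such as `thought`, and now `comment`). Below is a minimal sketch of setting budgets from a script, assuming a local `gpt-oss` GGUF file; the `budgets` prompt option with `thoughtTokens` is the documented shape, while the key name for the new `comment` budget is an assumption and is left commented out:

```typescript
// Minimal sketch: capping segment output with prompt budgets.
// Assumes a gpt-oss GGUF file is available locally.
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/gpt-oss-20b.gguf"});
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

const answer = await session.prompt("Summarize the plan", {
    budgets: {
        thoughtTokens: 512 // cap chain-of-thought generation at 512 tokens
        // commentTokens: 128 // assumed key for the new `comment` budget; verify against the API docs
    }
});
console.log(answer);
```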

docs/guide/chat-session.md

Lines changed: 5 additions & 1 deletion

```diff
@@ -833,7 +833,8 @@ console.log("AI: " + a1);
 
 ## Stream Response Segments {#stream-response-segments}
 The raw model response is automatically segmented into different types of segments.
-The main response is not segmented, but other kinds of sections, like thoughts (chain of thought), are segmented.
+The main response is not segmented, but other kinds of sections,
+like thoughts (chain of thought) and comments (on relevant models, like [`gpt-oss`](../blog/v3.12-gpt-oss.md#comment-segments)), are segmented.
 
 To stream response segments you can use the [`onResponseChunk`](../api/type-aliases/LLamaChatPromptOptions.md#onresponsechunk) option.
 
@@ -862,6 +863,8 @@ const a1 = await session.promptWithMeta(q1, {
     onResponseChunk(chunk) {
         const isThoughtSegment = chunk.type === "segment" &&
             chunk.segmentType === "thought";
+        const isCommentSegment = chunk.type === "segment" &&
+            chunk.segmentType === "comment";
 
         if (chunk.type === "segment" && chunk.segmentStartTime != null)
             process.stdout.write(` [segment start: ${chunk.segmentType}] `);
@@ -879,6 +882,7 @@ const fullResponse = a1.response
             return item;
         else if (item.type === "segment") {
             const isThoughtSegment = item.segmentType === "thought";
+            const isCommentSegment = item.segmentType === "comment";
             let res = "";
 
             if (item.startTime != null)
```
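The fragments above belong to the guide's segment streaming example. For reference, a self-contained sketch assembling them might look as follows; the model path is a placeholder, and dimming the side-channel segments with ANSI escape codes is an illustrative choice, not part of the documented example:

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/gpt-oss-20b.gguf"});
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

const a1 = await session.promptWithMeta("Hi there", {
    onResponseChunk(chunk) {
        const isThoughtSegment = chunk.type === "segment" &&
            chunk.segmentType === "thought";
        const isCommentSegment = chunk.type === "segment" &&
            chunk.segmentType === "comment";

        if (chunk.type === "segment" && chunk.segmentStartTime != null)
            process.stdout.write(` [segment start: ${chunk.segmentType}] `);

        // dim thought/comment segments so they stand apart from the main response
        if (isThoughtSegment || isCommentSegment)
            process.stdout.write(`\x1b[2m${chunk.text}\x1b[0m`);
        else
            process.stdout.write(chunk.text);

        if (chunk.type === "segment" && chunk.segmentEndTime != null)
            process.stdout.write(` [segment end: ${chunk.segmentType}] `);
    }
});
// a1.response holds the structured response items (plain strings and segment objects)
```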

src/chatWrappers/HarmonyChatWrapper.ts

Lines changed: 28 additions & 19 deletions

```diff
@@ -1,7 +1,6 @@
 import {ChatWrapper, ChatWrapperJinjaMatchConfiguration} from "../ChatWrapper.js";
 import {
-    ChatModelFunctions, ChatModelResponse, ChatWrapperGenerateContextStateOptions, ChatWrapperGeneratedContextState,
-    ChatWrapperGeneratedPrefixTriggersContextState, ChatWrapperSettings
+    ChatModelFunctions, ChatModelResponse, ChatWrapperGenerateContextStateOptions, ChatWrapperGeneratedContextState, ChatWrapperSettings
 } from "../types.js";
 import {SpecialToken, LlamaText, SpecialTokensText} from "../utils/LlamaText.js";
 import {ChatModelFunctionsDocumentationGenerator} from "./utils/ChatModelFunctionsDocumentationGenerator.js";
@@ -282,23 +281,21 @@ export class HarmonyChatWrapper extends ChatWrapper {
                     ],
                     inject: LlamaText(new SpecialTokensText("<|message|>"))
                 },
-                ...(
-                    !hasFunctions ? [] : [{
-                        type: "functionCall",
-                        triggers: [
-                            LlamaText(new SpecialTokensText("<|channel|>commentary to="))
-                        ],
-                        replaceTrigger: true,
-                        inject: LlamaText(new SpecialTokensText("<|channel|>commentary"))
-                    }, {
-                        type: "functionCall",
-                        triggers: [
-                            LlamaText(new SpecialTokensText("<|channel|>analysis to="))
-                        ],
-                        replaceTrigger: true,
-                        inject: LlamaText(new SpecialTokensText("<|channel|>analysis"))
-                    }] satisfies ChatWrapperGeneratedPrefixTriggersContextState["prefixTriggers"]
-                )
+                {
+                    type: "functionCall",
+                    triggers: [
+                        LlamaText(new SpecialTokensText("<|channel|>commentary to="))
+                    ],
+                    replaceTrigger: true,
+                    inject: LlamaText(new SpecialTokensText("<|channel|>commentary"))
+                }, {
+                    type: "functionCall",
+                    triggers: [
+                        LlamaText(new SpecialTokensText("<|channel|>analysis to="))
+                    ],
+                    replaceTrigger: true,
+                    inject: LlamaText(new SpecialTokensText("<|channel|>analysis"))
+                }
             ],
             noPrefixTrigger: {
                 type: "response",
@@ -669,6 +666,18 @@ export class HarmonyChatWrapper extends ChatWrapper {
                 {},
                 {additionalRenderParameters: jinjaParameters}
             ],
+            [
+                {
+                    _jinjaFlags: {
+                        emptyLastModelResponseIsFinalMessage: true,
+                        useSpecialTokensForFullSystemMessage: true,
+                        useNonFinalFinalMessage: false,
+                        noFinalMessages: false
+                    }
+                },
+                {},
+                {additionalRenderParameters: jinjaParameters}
+            ],
             [
                 {
                     _jinjaFlags: {
```
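`HarmonyChatWrapper` is normally picked up automatically when a Harmony chat template is detected (this commit broadens that detection). When experimenting with the trigger changes above, the wrapper can also be attached explicitly. A minimal sketch, assuming a local `gpt-oss` GGUF file and default constructor options (the configurable ones are listed on the `HarmonyChatWrapper` API page):

```typescript
import {getLlama, LlamaChatSession, HarmonyChatWrapper} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/gpt-oss-20b.gguf"});
const context = await model.createContext();

// bypass chat-template detection and use the Harmony wrapper directly
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new HarmonyChatWrapper()
});

console.log(await session.prompt("Hi there"));
```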

src/chatWrappers/utils/isJinjaTemplateEquivalentToSpecializedChatWrapper.ts

Lines changed: 11 additions & 2 deletions

```diff
@@ -139,13 +139,22 @@ function checkEquivalence(
     if (!compareContextTexts(jinjaRes.contextText, specializedWrapperRes.contextText, tokenizer))
         return false;
 
+    const specializedStopGenerationTriggers = [
+        ...specializedWrapperRes.stopGenerationTriggers,
+        ...(
+            specializedWrapperRes.rerender?.triggers == null
+                ? []
+                : specializedWrapperRes.rerender.triggers
+        )
+    ];
+
     const jinjaHasAllSpecializedStopGenerationTriggers = jinjaRes.stopGenerationTriggers
         .every((trigger) => {
             return [trigger, trigger.trimEnd(), trigger.trimStart(), trigger.trimStart().trimEnd()].some((normalizedJinjaTrigger) => {
                 if (normalizedJinjaTrigger.values.length === 0)
                     return true;
 
-                const foundSimilarTriggers = specializedWrapperRes.stopGenerationTriggers.some((specializedTrigger) => (
+                const foundSimilarTriggers = specializedStopGenerationTriggers.some((specializedTrigger) => (
                     normalizedJinjaTrigger.includes(specializedTrigger)
                 ));
 
@@ -158,7 +167,7 @@ function checkEquivalence(
                 tokenizer
             );
 
-            const foundSimilarOrShorterTokenizedTriggers = specializedWrapperRes.stopGenerationTriggers
+            const foundSimilarOrShorterTokenizedTriggers = specializedStopGenerationTriggers
                 .some((specializedTrigger) => {
                     const resolvedSpecializedTrigger = StopGenerationDetector.resolveLlamaTextTrigger(
                         specializedTrigger,
```
