60 changes: 60 additions & 0 deletions docs/blog/v3.12-gpt-oss.md
@@ -46,6 +46,30 @@ npx -y node-llama-cpp inspect estimate <model URI>
:::


## `MXFP4` Quantization
You might be used to looking for a `Q4_K_M` quantization because of its good balance between quality and size,
and expect to need one for the `gpt-oss` models as well.
You don't have to look for one, because these models are already natively provided in a similar quantization format called `MXFP4`.

Let's break down what `MXFP4` is:
* `MXFP4` stands for Microscaling FP4 (Floating Point, 4-bit). `Q4_K_M` is also a 4-bit quantization.
* It's a format that was created and standardized by the Open Compute Project (OCP) in early 2024.
OCP is backed by big players like OpenAI, NVIDIA, AMD, Microsoft, and Meta,
with the goal of lowering the hardware and compute barriers to running AI models.
* It's designed to dramatically reduce the memory and compute requirements for training and running AI models,
while preserving as much precision as possible.

This format was used to train the `gpt-oss` models, so the most precise format of these models is `MXFP4`.
<br/>
Since this is a 4-bit precision format, its size footprint is similar to `Q4_K_M` quantization,
but it provides better precision and thus better quality.
First-class support for `MXFP4` in `llama.cpp` was introduced as part of the `gpt-oss` release.

The bottom line is that you don't have to find a `Q4_K_M` quantization of `gpt-oss` models,
because the `MXFP4` format is as small, efficient, and fast as `Q4_K_M`,
but offers better precision and thus better quality.


### Try It Using the CLI
To quickly try out [`gpt-oss-20b`](https://huggingface.co/giladgd/gpt-oss-20b-GGUF), you can use the [CLI `chat` command](../cli/chat.md):

Expand All @@ -54,6 +78,42 @@ npx -y node-llama-cpp chat --ef --prompt "Hi there" hf:giladgd/gpt-oss-20b-GGUF/
```
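
You can also load the model from code. Here's a minimal sketch using the `node-llama-cpp` API (assuming a recent version that exports `resolveModelFile`);
the model URI below is a placeholder, so substitute the `gpt-oss` GGUF URI you'd pass to the CLI command above, or a local file path if you've already downloaded the model:
```typescript
import {getLlama, resolveModelFile, LlamaChatSession} from "node-llama-cpp";

// placeholder URI: replace it with the actual gpt-oss GGUF URI you want to use,
// e.g. the one from the CLI command above
const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/<file>";

const llama = await getLlama();
const model = await llama.loadModel({
    // `resolveModelFile` downloads the model if it isn't already available locally
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const response = await session.prompt("Hi there");
console.log("AI: " + response);
```
From the API's perspective this is a regular GGUF model file, so no `MXFP4`-specific options should be needed.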


## `thought` Segments
Since `gpt-oss` models are reasoning models, they generate thoughts as part of their response.
These thoughts are useful for debugging and understanding the model's reasoning process,
and can be used to iterate on the system prompt and inputs you provide to the model to improve its responses.

However, OpenAI [emphasizes](https://openai.com/index/chain-of-thought-monitoring/#:~:text=leaving%20CoTs%20unrestricted%20may%20make%20them%20unfit%20to%20be%20shown%20to%20end%2Dusers%2C%20as%20they%20might%20violate%20some%20misuse%20policies)
that the thoughts generated by these models may not be safe to show to end users as they are unrestricted
and might include sensitive information, uncontained language, hallucinations, or other issues.
Thus, OpenAI recommends not showing these thoughts to users without further filtering, moderation, or summarization.

Check out the [segment streaming example](../guide/chat-session.md#stream-response-segments) to learn how to use segments.
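
For example, here's a minimal sketch of streaming `thought` segments separately from the main response.
It assumes a `session` created like in the loading sketch above, and uses the same chunk fields as the segment streaming example:
```typescript
// stream `thought` segments separately from the main response;
// assumes `session` is an existing `LlamaChatSession`
const response = await session.prompt("Write a haiku about llamas", {
    onResponseChunk(chunk) {
        const isThoughtSegment = chunk.type === "segment" &&
            chunk.segmentType === "thought";

        if (isThoughtSegment) {
            // log thoughts for debugging rather than showing them to end users as-is
            process.stdout.write(`[thought] ${chunk.text}`);
        } else if (chunk.type !== "segment") {
            // chunks that aren't part of a segment belong to the main response
            process.stdout.write(chunk.text);
        }
    }
});

console.log("\nFull response: " + response);
```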


## `comment` Segments
`gpt-oss` models output "preamble" messages as part of their response;
these are segmented into a new `comment` segment type.

The model might choose to generate those segments to inform the user about the functions it's about to call.
For example, when it plans to use multiple functions, it may generate a plan in advance.

These are intended for the user to see, but not as part of the main response.

Check out the [segment streaming example](../guide/chat-session.md#stream-response-segments) to learn how to use segments.
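
As a rough illustration, here's a minimal sketch that collects the `comment` segments from a completed response.
It assumes an existing `session`, and that the model actually chose to emit comments (which it typically does only when it's about to call functions):
```typescript
// collect `comment` segments from a completed response;
// assumes `session` is an existing `LlamaChatSession`
const {response} = await session.promptWithMeta("Hi there");

const commentTexts: string[] = [];
for (const item of response) {
    if (typeof item !== "string" && item.type === "segment" && item.segmentType === "comment")
        commentTexts.push(item.text);
}

if (commentTexts.length > 0)
    console.log("Comments:\n" + commentTexts.join("\n"));
```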

::: info Experiment with `comment` segments
The [Electron app template](../guide/electron.md) has been updated to properly segment comments in the response.

Try it out by downloading the latest build [from GitHub](https://github.com/withcatai/node-llama-cpp/releases/latest),
or by [scaffolding a new project](../guide/index.md#scaffold-new-project) based on the Electron template:

```shell
npm create node-llama-cpp@latest
```
:::


## Customizing `gpt-oss`
You can adjust `gpt-oss`'s responses by configuring the options of [`HarmonyChatWrapper`](../api/classes/HarmonyChatWrapper.md):
```typescript
6 changes: 5 additions & 1 deletion docs/guide/chat-session.md
@@ -833,7 +833,8 @@ console.log("AI: " + a1);

## Stream Response Segments {#stream-response-segments}
The raw model response is automatically segmented into different types of segments.
The main response is not segmented, but other kinds of sections, like thoughts (chain of thought), are segmented.
The main response is not segmented, but other kinds of sections,
like thoughts (chain of thought) and comments (on relevant models, like [`gpt-oss`](../blog/v3.12-gpt-oss.md#comment-segments)), are segmented.

To stream response segments you can use the [`onResponseChunk`](../api/type-aliases/LLamaChatPromptOptions.md#onresponsechunk) option.

@@ -862,6 +863,8 @@ const a1 = await session.promptWithMeta(q1, {
    onResponseChunk(chunk) {
        const isThoughtSegment = chunk.type === "segment" &&
            chunk.segmentType === "thought";
        const isCommentSegment = chunk.type === "segment" &&
            chunk.segmentType === "comment";

        if (chunk.type === "segment" && chunk.segmentStartTime != null)
            process.stdout.write(` [segment start: ${chunk.segmentType}] `);
@@ -879,6 +882,7 @@ const fullResponse = a1.response
            return item;
        else if (item.type === "segment") {
            const isThoughtSegment = item.segmentType === "thought";
            const isCommentSegment = item.segmentType === "comment";
            let res = "";

            if (item.startTime != null)
47 changes: 28 additions & 19 deletions src/chatWrappers/HarmonyChatWrapper.ts
@@ -1,7 +1,6 @@
import {ChatWrapper, ChatWrapperJinjaMatchConfiguration} from "../ChatWrapper.js";
import {
    ChatModelFunctions, ChatModelResponse, ChatWrapperGenerateContextStateOptions, ChatWrapperGeneratedContextState,
    ChatWrapperGeneratedPrefixTriggersContextState, ChatWrapperSettings
    ChatModelFunctions, ChatModelResponse, ChatWrapperGenerateContextStateOptions, ChatWrapperGeneratedContextState, ChatWrapperSettings
} from "../types.js";
import {SpecialToken, LlamaText, SpecialTokensText} from "../utils/LlamaText.js";
import {ChatModelFunctionsDocumentationGenerator} from "./utils/ChatModelFunctionsDocumentationGenerator.js";
@@ -282,23 +281,21 @@ export class HarmonyChatWrapper extends ChatWrapper {
                    ],
                    inject: LlamaText(new SpecialTokensText("<|message|>"))
                },
                ...(
                    !hasFunctions ? [] : [{
                        type: "functionCall",
                        triggers: [
                            LlamaText(new SpecialTokensText("<|channel|>commentary to="))
                        ],
                        replaceTrigger: true,
                        inject: LlamaText(new SpecialTokensText("<|channel|>commentary"))
                    }, {
                        type: "functionCall",
                        triggers: [
                            LlamaText(new SpecialTokensText("<|channel|>analysis to="))
                        ],
                        replaceTrigger: true,
                        inject: LlamaText(new SpecialTokensText("<|channel|>analysis"))
                    }] satisfies ChatWrapperGeneratedPrefixTriggersContextState["prefixTriggers"]
                )
                {
                    type: "functionCall",
                    triggers: [
                        LlamaText(new SpecialTokensText("<|channel|>commentary to="))
                    ],
                    replaceTrigger: true,
                    inject: LlamaText(new SpecialTokensText("<|channel|>commentary"))
                }, {
                    type: "functionCall",
                    triggers: [
                        LlamaText(new SpecialTokensText("<|channel|>analysis to="))
                    ],
                    replaceTrigger: true,
                    inject: LlamaText(new SpecialTokensText("<|channel|>analysis"))
                }
            ],
            noPrefixTrigger: {
                type: "response",
@@ -669,6 +666,18 @@ export class HarmonyChatWrapper extends ChatWrapper {
                {},
                {additionalRenderParameters: jinjaParameters}
            ],
            [
                {
                    _jinjaFlags: {
                        emptyLastModelResponseIsFinalMessage: true,
                        useSpecialTokensForFullSystemMessage: true,
                        useNonFinalFinalMessage: false,
                        noFinalMessages: false
                    }
                },
                {},
                {additionalRenderParameters: jinjaParameters}
            ],
            [
                {
                    _jinjaFlags: {
@@ -139,13 +139,22 @@ function checkEquivalence(
    if (!compareContextTexts(jinjaRes.contextText, specializedWrapperRes.contextText, tokenizer))
        return false;

    const specializedStopGenerationTriggers = [
        ...specializedWrapperRes.stopGenerationTriggers,
        ...(
            specializedWrapperRes.rerender?.triggers == null
                ? []
                : specializedWrapperRes.rerender.triggers
        )
    ];

    const jinjaHasAllSpecializedStopGenerationTriggers = jinjaRes.stopGenerationTriggers
        .every((trigger) => {
            return [trigger, trigger.trimEnd(), trigger.trimStart(), trigger.trimStart().trimEnd()].some((normalizedJinjaTrigger) => {
                if (normalizedJinjaTrigger.values.length === 0)
                    return true;

                const foundSimilarTriggers = specializedWrapperRes.stopGenerationTriggers.some((specializedTrigger) => (
                const foundSimilarTriggers = specializedStopGenerationTriggers.some((specializedTrigger) => (
                    normalizedJinjaTrigger.includes(specializedTrigger)
                ));

@@ -158,7 +167,7 @@
                        tokenizer
                    );

                const foundSimilarOrShorterTokenizedTriggers = specializedWrapperRes.stopGenerationTriggers
                const foundSimilarOrShorterTokenizedTriggers = specializedStopGenerationTriggers
                    .some((specializedTrigger) => {
                        const resolvedSpecializedTrigger = StopGenerationDetector.resolveLlamaTextTrigger(
                            specializedTrigger,