60 changes: 60 additions & 0 deletions docs/blog/v3.12-gpt-oss.md
@@ -46,6 +46,30 @@ npx -y node-llama-cpp inspect estimate <model URI>
:::


## `MXFP4` Quantization
You might be used to looking for a `Q4_K_M` quantization because of its good balance between quality and size,
and expect to need one for the `gpt-oss` models as well.
You don't have to look for one, because these models are already natively provided in a similar quantization format called `MXFP4`.

Let's break down what `MXFP4` is:
* `MXFP4` stands for Microscaling FP4 (Floating Point, 4-bit). `Q4_K_M` is also a 4-bit quantization.
* It's a format that was created and standardized by the Open Compute Project (OCP) in early 2024.
OCP is backed by big players like OpenAI, NVIDIA, AMD, Microsoft, and Meta,
with the goal of lowering the hardware and compute barriers to running AI models.
* It's designed to dramatically reduce the memory and compute requirements for training and running AI models,
while preserving as much precision as possible.

This format was used to train the `gpt-oss` models, so the most precise format of these models is `MXFP4`.
<br/>
Since this is a 4-bit precision format, its size footprint is similar to `Q4_K_M` quantization,
but it provides better precision and thus better quality.
First-class support for `MXFP4` in `llama.cpp` was introduced as part of the `gpt-oss` release.

The bottom line is that you don't have to find a `Q4_K_M` quantization of `gpt-oss` models,
because the `MXFP4` format is as small, efficient, and fast as `Q4_K_M`,
but offers better precision and thus better quality.


### Try It Using the CLI
To quickly try out [`gpt-oss-20b`](https://huggingface.co/giladgd/gpt-oss-20b-GGUF), you can use the [CLI `chat` command](../cli/chat.md):

Expand All @@ -54,6 +78,42 @@ npx -y node-llama-cpp chat --ef --prompt "Hi there" hf:giladgd/gpt-oss-20b-GGUF/
```
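
You can also load the model from code. Here's a minimal sketch using the `node-llama-cpp` API (assuming a recent version that exports `resolveModelFile`);
the model URI below is a placeholder, so substitute the `gpt-oss` GGUF URI you'd pass to the CLI command above, or a local file path if you've already downloaded the model:
```typescript
import {getLlama, resolveModelFile, LlamaChatSession} from "node-llama-cpp";

// placeholder URI: replace it with the actual gpt-oss GGUF URI you want to use,
// e.g. the one from the CLI command above
const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/<file>";

const llama = await getLlama();
const model = await llama.loadModel({
    // `resolveModelFile` downloads the model if it isn't already available locally
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const response = await session.prompt("Hi there");
console.log("AI: " + response);
```
From the API's perspective this is a regular GGUF model file, so no `MXFP4`-specific options should be needed.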


## `thought` Segments
Since `gpt-oss` models are reasoning models, they generate thoughts as part of their response.
These thoughts are useful for debugging and understanding the model's reasoning process,
and can be used to iterate on the system prompt and inputs you provide to the model to improve its responses.

However, OpenAI [emphasizes](https://openai.com/index/chain-of-thought-monitoring/#:~:text=leaving%20CoTs%20unrestricted%20may%20make%20them%20unfit%20to%20be%20shown%20to%20end%2Dusers%2C%20as%20they%20might%20violate%20some%20misuse%20policies)
that the thoughts generated by these models may not be safe to show to end users as they are unrestricted
and might include sensitive information, uncontained language, hallucinations, or other issues.
Thus, OpenAI recommends not showing these thoughts to users without further filtering, moderation, or summarization.

Check out the [segment streaming example](../guide/chat-session.md#stream-response-segments) to learn how to use segments.
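
For example, here's a minimal sketch of streaming `thought` segments separately from the main response.
It assumes a `session` created like in the loading sketch above, and uses the same chunk fields as the segment streaming example:
```typescript
// stream `thought` segments separately from the main response;
// assumes `session` is an existing `LlamaChatSession`
const response = await session.prompt("Write a haiku about llamas", {
    onResponseChunk(chunk) {
        const isThoughtSegment = chunk.type === "segment" &&
            chunk.segmentType === "thought";

        if (isThoughtSegment) {
            // log thoughts for debugging rather than showing them to end users as-is
            process.stdout.write(`[thought] ${chunk.text}`);
        } else if (chunk.type !== "segment") {
            // chunks that aren't part of a segment belong to the main response
            process.stdout.write(chunk.text);
        }
    }
});

console.log("\nFull response: " + response);
```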


## `comment` Segments
`gpt-oss` models output "preamble" messages as part of their response;
these are segmented into a new `comment` segment type.

The model might choose to generate those segments to inform the user about the functions it's about to call.
For example, when it plans to use multiple functions, it may generate a plan in advance.

These are intended for the user to see, but not as part of the main response.

Check out the [segment streaming example](../guide/chat-session.md#stream-response-segments) to learn how to use segments.
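
As a rough illustration, here's a minimal sketch that collects the `comment` segments from a completed response.
It assumes an existing `session`, and that the model actually chose to emit comments (which it typically does only when it's about to call functions):
```typescript
// collect `comment` segments from a completed response;
// assumes `session` is an existing `LlamaChatSession`
const {response} = await session.promptWithMeta("Hi there");

const commentTexts: string[] = [];
for (const item of response) {
    if (typeof item !== "string" && item.type === "segment" && item.segmentType === "comment")
        commentTexts.push(item.text);
}

if (commentTexts.length > 0)
    console.log("Comments:\n" + commentTexts.join("\n"));
```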

::: info Experiment with `comment` segments
The [Electron app template](../guide/electron.md) has been updated to properly segment comments in the response.

Try it out by downloading the latest build [from GitHub](https://github.com/withcatai/node-llama-cpp/releases/latest),
or by [scaffolding a new project](../guide/index.md#scaffold-new-project) based on the Electron template:

```shell
npm create node-llama-cpp@latest
```
:::


## Customizing `gpt-oss`
You can adjust `gpt-oss`'s responses by configuring the options of [`HarmonyChatWrapper`](../api/classes/HarmonyChatWrapper.md):
```typescript
6 changes: 5 additions & 1 deletion docs/guide/chat-session.md
@@ -833,7 +833,8 @@ console.log("AI: " + a1);

## Stream Response Segments {#stream-response-segments}
The raw model response is automatically segmented into different types of segments.
The main response is not segmented, but other kinds of sections, like thoughts (chain of thought), are segmented.
The main response is not segmented, but other kinds of sections,
like thoughts (chain of thought) and comments (on relevant models, like [`gpt-oss`](../blog/v3.12-gpt-oss.md#comment-segments)), are segmented.

To stream response segments you can use the [`onResponseChunk`](../api/type-aliases/LLamaChatPromptOptions.md#onresponsechunk) option.

@@ -862,6 +863,8 @@ const a1 = await session.promptWithMeta(q1, {
    onResponseChunk(chunk) {
        const isThoughtSegment = chunk.type === "segment" &&
            chunk.segmentType === "thought";
        const isCommentSegment = chunk.type === "segment" &&
            chunk.segmentType === "comment";

        if (chunk.type === "segment" && chunk.segmentStartTime != null)
            process.stdout.write(` [segment start: ${chunk.segmentType}] `);
@@ -879,6 +882,7 @@ const fullResponse = a1.response
            return item;
        else if (item.type === "segment") {
            const isThoughtSegment = item.segmentType === "thought";
            const isCommentSegment = item.segmentType === "comment";
            let res = "";

            if (item.startTime != null)
47 changes: 28 additions & 19 deletions src/chatWrappers/HarmonyChatWrapper.ts
@@ -1,7 +1,6 @@
import {ChatWrapper, ChatWrapperJinjaMatchConfiguration} from "../ChatWrapper.js";
import {
    ChatModelFunctions, ChatModelResponse, ChatWrapperGenerateContextStateOptions, ChatWrapperGeneratedContextState,
    ChatWrapperGeneratedPrefixTriggersContextState, ChatWrapperSettings
    ChatModelFunctions, ChatModelResponse, ChatWrapperGenerateContextStateOptions, ChatWrapperGeneratedContextState, ChatWrapperSettings
} from "../types.js";
import {SpecialToken, LlamaText, SpecialTokensText} from "../utils/LlamaText.js";
import {ChatModelFunctionsDocumentationGenerator} from "./utils/ChatModelFunctionsDocumentationGenerator.js";
@@ -282,23 +281,21 @@ export class HarmonyChatWrapper extends ChatWrapper {
                    ],
                    inject: LlamaText(new SpecialTokensText("<|message|>"))
                },
                ...(
                    !hasFunctions ? [] : [{
                        type: "functionCall",
                        triggers: [
                            LlamaText(new SpecialTokensText("<|channel|>commentary to="))
                        ],
                        replaceTrigger: true,
                        inject: LlamaText(new SpecialTokensText("<|channel|>commentary"))
                    }, {
                        type: "functionCall",
                        triggers: [
                            LlamaText(new SpecialTokensText("<|channel|>analysis to="))
                        ],
                        replaceTrigger: true,
                        inject: LlamaText(new SpecialTokensText("<|channel|>analysis"))
                    }] satisfies ChatWrapperGeneratedPrefixTriggersContextState["prefixTriggers"]
                )
                {
                    type: "functionCall",
                    triggers: [
                        LlamaText(new SpecialTokensText("<|channel|>commentary to="))
                    ],
                    replaceTrigger: true,
                    inject: LlamaText(new SpecialTokensText("<|channel|>commentary"))
                }, {
                    type: "functionCall",
                    triggers: [
                        LlamaText(new SpecialTokensText("<|channel|>analysis to="))
                    ],
                    replaceTrigger: true,
                    inject: LlamaText(new SpecialTokensText("<|channel|>analysis"))
                }
            ],
            noPrefixTrigger: {
                type: "response",
@@ -669,6 +666,18 @@ export class HarmonyChatWrapper extends ChatWrapper {
                {},
                {additionalRenderParameters: jinjaParameters}
            ],
            [
                {
                    _jinjaFlags: {
                        emptyLastModelResponseIsFinalMessage: true,
                        useSpecialTokensForFullSystemMessage: true,
                        useNonFinalFinalMessage: false,
                        noFinalMessages: false
                    }
                },
                {},
                {additionalRenderParameters: jinjaParameters}
            ],
            [
                {
                    _jinjaFlags: {
@@ -139,13 +139,22 @@ function checkEquivalence(
    if (!compareContextTexts(jinjaRes.contextText, specializedWrapperRes.contextText, tokenizer))
        return false;

    const specializedStopGenerationTriggers = [
        ...specializedWrapperRes.stopGenerationTriggers,
        ...(
            specializedWrapperRes.rerender?.triggers == null
                ? []
                : specializedWrapperRes.rerender.triggers
        )
    ];

    const jinjaHasAllSpecializedStopGenerationTriggers = jinjaRes.stopGenerationTriggers
        .every((trigger) => {
            return [trigger, trigger.trimEnd(), trigger.trimStart(), trigger.trimStart().trimEnd()].some((normalizedJinjaTrigger) => {
                if (normalizedJinjaTrigger.values.length === 0)
                    return true;

                const foundSimilarTriggers = specializedWrapperRes.stopGenerationTriggers.some((specializedTrigger) => (
                const foundSimilarTriggers = specializedStopGenerationTriggers.some((specializedTrigger) => (
                    normalizedJinjaTrigger.includes(specializedTrigger)
                ));

@@ -158,7 +167,7 @@
                        tokenizer
                    );

                const foundSimilarOrShorterTokenizedTriggers = specializedWrapperRes.stopGenerationTriggers
                const foundSimilarOrShorterTokenizedTriggers = specializedStopGenerationTriggers
                    .some((specializedTrigger) => {
                        const resolvedSpecializedTrigger = StopGenerationDetector.resolveLlamaTextTrigger(
                            specializedTrigger,