Skip to content

Conversation

@DajanaV
Copy link
Contributor

@DajanaV DajanaV commented Nov 2, 2025

Mirrored from ggml-org/llama.cpp#16934

This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.

TBH I'm not sure this is the correct approach as I'm not familiar with the code. I've simply made the necessary changes for llama.cpp no longer error out when receiving reasoning_content back from the client.

I've been using GPT-OSS 120B locally with a codex fork that sends reasoning_content back, and it seems to work quite well.

It also requires a slightly modified jinja chat template that replaces "thinking" with "reasoning_content".

If this is the way to go and is merged, I will follow up with a codex PR that makes this configurable so that codex can be used correctly with llama-server.

I've also looked at Minimax M2's chat template and it seems to use reasoning_content to render <think> blocks, which is compatible to how it is done here.

In case someone wants to try my codex fork with this, here's the config you can drop to ~/.codex/config.toml:

profile = "llama_server"

[model_providers.llama_server]
name = "llama-server"
base_url = "http://localhost:8080/v1"
query_params = {"reasoning_effort" = "high"} # doesn't seem like this is currently working, still need to debug

[profiles.oss]
model_provider = "llama_server"
model = "gpt-oss-120b"

This is the llama-server command I use (adjust for what your hardware can handle):

llama-server --no-mmap --no-warmup --model gpt-oss-120b-mxfp4-00001-of-00003.gguf
 -a gpt-oss-120b --ctx-size 524280 -np 4 --jinja -fa on --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 --swa-full --host 0.0.0.0 --chat-template-kwargs '{"reasoning_effort":"low"}' --chat-template-file gptoss.j2

cc @pwilkin

ggerganov and others added 30 commits October 3, 2025 19:18
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
Only dst buffer is guaranteed to be an RPC buffer. Add check for the src
one.
…ers (#16418)

* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* add test model

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration
, there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
This commit removes jina-reranker-v1-tiny-en model files that are no
longer present on Hugging Face.

The motivation for this that it clears up the CI logs from 404 errors
which can be a little confusing when looking at the logs the first time.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649
* refactor sdk caching to minimize storage

* use correct action

* add myself as owner to /.github/actions/ [no ci]
* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <[email protected]>
* metal : ssm_scan minor opts

* metal : get_rows optimize

* metal : cpy optimize

* metal : ssm_conv opt

* metal : ssm_scan simplify

* metal : ssm_Scan opt
* tests : add -INF blocks to the KQ mask in the FA tests

* cont : bump -INF block size to 64

Co-authored-by: Jeff Bolz <[email protected]>

* ggml : prevent division by zero in FA CPU op

---------

Co-authored-by: Jeff Bolz <[email protected]>
* metal : pad K, V and Mask when needed

* cont : simplify

* cuda : add TODO about KV padding requirement

* metal : add comments

* metal : remove mask padding requirement
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <[email protected]>
* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* webui : Updated static build output

---------

Co-authored-by: Aleksander Grygier <[email protected]>
* server : add /v1/health endpoint

* cont : update readme
* llama : support LiquidAI LFM2-MoE hybrid model

Add support for [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B) model.
For more information about models, please read [the blog post](https://www.liquid.ai/company/news).

[HF PR](huggingface/transformers#41401)
[GGUFs](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF)

* Do not use defaultdict

* Address PR feedback
…#16452)

* Add profiling

* More detailed profiling

* Rework command submission to avoid global locks

* Update wait handling

* try new method of waiting on futures

* Add serializing of command submission in some cases

* Add new pool for timestamp queries and clean up logging

* Serialize command submission in CI and leave a TODO note

* Update webgpu CI

* Add myself as WebGPU codeowner

* Deadlock avoidance

* Leave WebGPU/Vulkan CI serialized

* Fix divide by 0

* Fix logic in division by inflight_threads

* Update CODEOWNERS and remove serialize submit option
* metal : better unroll in the FA kernels

* metal : index FA blocks

* tests : restore [no ci]

* metal : prevent division by zero in FA kernels

* metal : fix -INF detection logic
mnehete32 and others added 3 commits November 2, 2025 11:12
…(#16920)

commit 5fb5e24 (llama : minor
sampling refactor (2) (#9386)) moved the llama_sampler_accept call
into llama_sampler_sample, but the sampling sample usage in llama.h
was forgotten to be updated accordingly.
This makes it possible for reasoning_content to be passed back to llama-server,
which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.
@loci-agentic-ai
Copy link

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Critical Functions

Based on the performance data analysis, the observed degradations are minimal and do not affect the core inference pipeline or critical performance functions.

Critical Function Performance Status

Core Inference Functions - No Impact

  • llama_decode() - No performance changes detected
  • llama_encode() - No performance changes detected
  • llama_tokenize() - No performance changes detected
  • llama_model_load_from_file() - No performance changes detected
  • llama_batch_init() - No performance changes detected

Affected Functions - Non-Critical

  • _RegexMask constructor: +0.082% response time (+0.018 ns)
  • make_unique<llm_graph_input_pos_bucket>: +0.117% throughput (+0.122 ns)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No degradation in inference performance

  • Core tokenization functions (llama_tokenize, llama_detokenize) show no performance changes
  • Primary inference function (llama_decode) maintains baseline performance
  • Batch processing functions remain unaffected

Reference Impact: With the baseline that 2ms slower llama_decode reduces tokens/second by 7% on the test configuration, the observed changes would have zero impact on tokens per second as no inference-critical functions show measurable degradation.

2. Power Consumption - Negligible Binary Impact

Affected Binaries:

  • build.bin.libllama.so: +0.8 nJ (+0.0003% increase)
  • build.bin.libggml-base.so: No change
  • build.bin.libggml-cpu.so: No change
  • build.bin.libggml.so: No change

Analysis: The power consumption increase is within measurement noise and does not affect core computation binaries.

3. Quantization Efficiency - No Impact

Status: No changes detected

  • llama_model_quantize() function shows no performance degradation
  • Quantization support functions maintain baseline performance
  • GGML quantization operations remain unaffected

4. Memory Usage - No Impact

Status: Memory management functions unaffected

  • KV cache operations (llama_memory_clear, llama_memory_seq_rm) show no changes
  • Memory allocation functions (ggml_gallocr_new, ggml_tallocr_alloc) maintain performance
  • Batch memory management remains efficient

5. Batch Processing - No Impact

Status: Batch processing pipeline unaffected

  • llama_batch_init(), llama_batch_get_one() show no performance changes
  • Dynamic batching logic remains efficient
  • Parallel token processing maintains baseline performance

Root Cause Analysis

Grammar Processing Degradation

The observed degradations affect auxiliary systems rather than core inference:

  • Template instantiation overhead: make_unique specialization shows minor compiler optimization differences
  • Regex initialization: _RegexMask constructor experiences minimal initialization cost increase
  • Compiler optimization variance: Changes likely reflect minor differences in instruction scheduling rather than algorithmic issues

Action Items

Immediate Code Optimizations

  1. Template Specialization Review

    • Examine llm_graph_input_pos_bucket constructor for potential pre-allocation opportunities
    • Consider compile-time initialization for frequently used graph components
  2. Memory Allocation Patterns

    • Review memory allocation patterns in graph optimization pipeline
    • Evaluate if position buckets can be pre-allocated during graph construction

Build System Optimizations

  1. Compiler Flag Verification

    • Ensure consistent optimization flags across template instantiations
    • Verify profile-guided optimization (PGO) coverage for affected functions
  2. Template Optimization

    • Consider explicit template instantiation for frequently used specializations
    • Evaluate template parameter optimization for graph components

Performance Impact Assessment

Overall System Impact: Minimal

  • Core inference pipeline maintains full performance
  • Auxiliary system degradations are within acceptable variance
  • No impact on primary LLaMA.cpp performance metrics

Monitoring Focus:

  • Track template instantiation performance in future builds
  • Monitor for consistency in compiler optimization across releases

The analysis confirms that LLaMA.cpp's critical performance functions remain unaffected, with observed degradations limited to non-essential auxiliary systems that do not impact inference throughput, power efficiency, or memory utilization.

* server : support unified context across slots

* cont : fix speculative decoding initialization

* context : fix n_ctx_per_seq computation

* server : purge slots one by one

* tests : add unified cache server tests

* llama : update per-seq context computation

* test-thread-safety : handle tiny training context of the input model

* server : fix server_tokens clear()

* server : use 4 slots + unified KV by default

* llama : add note about context size queries

* cont : update todos [no ci]

* context : do not cap the size of the context

* tests : adjust parameters to be CI friendlier

* context : add warning
ggerganov and others added 7 commits November 2, 2025 21:21
* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
…mode and coverage (#16936)

* tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage

* tests: init gf and filter out fusion tests for support mode

* tests: filter out fusion cases before calling eval_support

* tests: filter out fusion cases from show_test_coverage as well, fix lint
* webui : Revised LaTeX formula recognition

* webui : Further examples containg amounts

* webui : vitest for maskInlineLaTeX

* webui: Moved preprocessLaTeX to lib/utils

* webui: LaTeX in table-cells

* chore: update webui build output (use theirs)

* webui: backslash in LaTeX-preprocessing

* chore: update webui build output

* webui: look-behind backslash-check

* chore: update webui build output

* Apply suggestions from code review

Code maintenance (variable names, code formatting, string handling)

Co-authored-by: Aleksander Grygier <[email protected]>

* webui: Moved constants to lib/constants.

* webui: package woff2 inside base64 data

* webui: LaTeX-line-break in display formula

* chore: update webui build output

* webui: Bugfix (font embedding)

* webui: Bugfix (font embedding)

* webui: vite embeds assets

* webui: don't suppress 404 (fonts)

* refactor: KaTeX integration with SCSS

Moves KaTeX styling to SCSS for better customization and font embedding.

This change includes:
- Adding `sass` as a dev dependency.
- Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding.
- Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables.

* fix: LaTeX processing within blockquotes

* webui: update webui build output

---------

Co-authored-by: Aleksander Grygier <[email protected]>
@loci-agentic-ai
Copy link

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Project

Critical Function Performance Changes

Based on the analysis of version 60af0985-b86f-4455-89da-7b1a290af1e6 compared to base version 6a6ce734-560c-4bf5-8d96-8f68e7716a0f, the performance degradations are concentrated in C++ standard library functions rather than core LLaMA.cpp inference functions.

Functions with Performance Degradation

Response Time: std::__codecvt_abstract_base<wchar_t, char, __mbstate_t>::in()

  • Change: +0.068% (+0.02 ns, from 29.41 ns to 29.43 ns)
  • Location: Standard library codecvt implementation
  • Impact: Character encoding conversion operations

Throughput: std::__detail::_Scanner<wchar_t>::~_Scanner()

  • Change: +0.079% (+0.01 ns, from 18.88 ns to 18.89 ns)
  • Location: C++ regex scanner destructor
  • Impact: Regular expression cleanup operations

Bottleneck: std::__detail::_Scanner<char>::~_Scanner()

  • Change: +0.105% (+0.02 ns, from 14.32 ns to 14.33 ns)
  • Location: C++ regex scanner destructor
  • Impact: Internal cleanup bottlenecks

Core LLaMA.cpp Functions Status

No Performance Impact Detected in critical inference functions:

  • llama_decode() - No changes detected
  • llama_encode() - No changes detected
  • llama_tokenize() - No changes detected
  • llama_model_load_from_file() - No changes detected
  • Memory management functions - No changes detected
  • Batch processing functions - No changes detected

KPI Impact Analysis

1. Tokens Per Second

Impact: None

  • No changes detected in core tokenization/inference functions (llama_decode, llama_encode, llama_tokenize)
  • The observed degradations are in standard library utility functions unrelated to token processing
  • Expected tokens per second performance remains unchanged

2. Power Consumption

Impact: Negligible

  • build.bin.libllama.so: -0.0% change (306,893.80 nJ vs 306,894.13 nJ base)
  • build.bin.libggml-base.so: 0.0% change
  • build.bin.libggml-cpu.so: 0.0% change
  • build.bin.libggml.so: 0.0% change
  • Overall power efficiency remains stable across all binaries

3. Quantization Efficiency

Impact: None

  • No changes detected in quantization-related functions
  • llama_model_quantize() performance unchanged
  • Quantization format handling unaffected

4. Memory Usage

Impact: None

  • Memory management functions show no performance changes
  • KV cache operations (llama_memory_clear, llama_memory_seq_rm) unaffected
  • GGML memory allocation functions unchanged

5. Batch Processing

Impact: None

  • Batch processing functions show no performance degradation
  • llama_batch_init(), llama_batch_get_one() performance unchanged
  • Parallel token processing efficiency maintained

Root Cause Analysis

Standard Library Function Degradations

The performance changes stem from C++ standard library functions used for:

  • Character encoding conversion (codecvt functions)
  • Regular expression processing (regex scanner destructors)

Control Flow Analysis

CFG analysis reveals identical assembly code between versions for the affected functions, indicating the degradation sources are:

  • Microarchitectural factors: Instruction cache alignment changes
  • Memory layout differences: Function placement affecting cache locality
  • System-level variations: CPU frequency scaling or memory subsystem timing

Action Items

Immediate Actions

  1. Code Layout Investigation

    • Analyze object file layout changes using objdump to identify function placement differences
    • Compare instruction cache miss rates between versions using performance counters
  2. Build Configuration Review

    • Verify compiler optimization flags consistency between builds
    • Check for differences in link-time optimization settings

Code-Specific Optimizations

  1. Standard Library Usage Optimization

    • Review usage patterns of codecvt functions in text processing paths
    • Consider alternative character conversion methods if codecvt is performance-critical
  2. Regex Usage Analysis

    • Audit regex scanner usage in parsing operations
    • Evaluate opportunities to replace regex with more efficient string processing

Build System Improvements

  1. Reproducible Builds

    • Implement deterministic build ordering to ensure consistent function layout
    • Add build flags for consistent code generation across environments
  2. Performance Regression Detection

    • Integrate micro-benchmarks for standard library function performance
    • Add automated performance comparison in CI pipeline

Conclusion

The observed performance changes are minimal and concentrated in standard library utility functions rather than core LLaMA.cpp inference operations. The primary inference KPIs (tokens per second, quantization efficiency, memory usage, batch processing) remain unaffected. The negligible power consumption changes confirm that computational workload has not increased meaningfully.

The identical assembly code analysis confirms these are environmental performance variations rather than algorithmic regressions, making them suitable for build system and deployment environment optimizations rather than code logic changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.