UPSTREAM PR #16934: chat: Allow reasoning_content to be passed back #43

DajanaV · 2025-11-02T13:08:40Z

This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.

TBH I'm not sure this is the correct approach as I'm not familiar with the code. I've simply made the necessary changes for llama.cpp no longer error out when receiving reasoning_content back from the client.

I've been using GPT-OSS 120B locally with a codex fork that sends reasoning_content back, and it seems to work quite well.

It also requires a slightly modified jinja chat template that replaces "thinking" with "reasoning_content".

If this is the way to go and is merged, I will follow up with a codex PR that makes this configurable so that codex can be used correctly with llama-server.

I've also looked at Minimax M2's chat template and it seems to use reasoning_content to render <think> blocks, which is compatible to how it is done here.

In case someone wants to try my codex fork with this, here's the config you can drop to ~/.codex/config.toml:

profile = "llama_server"

[model_providers.llama_server]
name = "llama-server"
base_url = "http://localhost:8080/v1"
query_params = {"reasoning_effort" = "high"} # doesn't seem like this is currently working, still need to debug

[profiles.oss]
model_provider = "llama_server"
model = "gpt-oss-120b"

This is the llama-server command I use (adjust for what your hardware can handle):

llama-server --no-mmap --no-warmup --model gpt-oss-120b-mxfp4-00001-of-00003.gguf
 -a gpt-oss-120b --ctx-size 524280 -np 4 --jinja -fa on --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 --swa-full --host 0.0.0.0 --chat-template-kwargs '{"reasoning_effort":"low"}' --chat-template-file gptoss.j2

cc @pwilkin

* initial commit for branch 3 * generalize `swa_checkpoint` to `ctx_checkpoint` this extends `llama-server`'s SWA checkpointing logic to include hybrid/recurrent models such as Jamba, Granite * oops * disable debug prints * keep backwards compat with `--swa-checkpoints` Co-authored-by: Georgi Gerganov <[email protected]> * update prompt re-processing message * fix off-by-one error per GG * keep `seq_rm` log per GG Co-authored-by: Georgi Gerganov <[email protected]> * server : fix checkpoint logic to support recurrent caches * server : cleanup and fixes --------- Co-authored-by: Georgi Gerganov <[email protected]>

* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls * feat: new flow in the chat template test suite for Magistral

* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <[email protected]>

* rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order

Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.

…ers (#16418) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax

* implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit

* feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add tiling support for idefices3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a conicidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * add test model --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]>

ggml-org/llama.cpp#15361 added new metric exported, but I've missed this doc.

This commit updates the leftover handling in ggml_vec_scale_f32. The motivation for this is that the code currently incorrectly assumes there would be fewer than ggml_f32_epr leftover elements. However, since the main loop processes 2*ggml_f32_epr elements per iteration , there can be up to (2*ggml_f32_epr - 1) leftover elements. The original single-pass leftover code could only process ggml_f32_epr elements, leaving some elements unscaled. Example scenario with 256-bit SVE: ``` ggml_f32_epr = 8 (elements per register) ggml_f32_step = 16 (two registers per iteration) n = 25 np = 16 leftovers = 9 elements (16-24) Original : processes only elements 16-23, misses element 24 This commit : loop processes elements 16-23, then element 24 ``` Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630

This commit removes jina-reranker-v1-tiny-en model files that are no longer present on Hugging Face. The motivation for this that it clears up the CI logs from 404 errors which can be a little confusing when looking at the logs the first time. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649

* refactor sdk caching to minimize storage * use correct action * add myself as owner to /.github/actions/ [no ci]

* fix: Fix duplicate fake image before token on first slice Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use double-newline before overview image Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove incorrect newline at the end of granite chat template gen prompt There should not be one, even for the language models. Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> * tests: Remove bad newline from granite chat template test (legacy) Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: Gabe Goodhart <[email protected]>

* implement --no-host to disable host buffer * fix equal_mparams * move no-host enumeration order together with other model params --------- Co-authored-by: slaren <[email protected]>

* metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt

* tests : add -INF blocks to the KQ mask in the FA tests * cont : bump -INF block size to 64 Co-authored-by: Jeff Bolz <[email protected]> * ggml : prevent division by zero in FA CPU op --------- Co-authored-by: Jeff Bolz <[email protected]>

* metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement

Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <[email protected]>

* webui : added download action (#13552) * webui : import and export (for all conversations) * webui : fixed download-format, import of one conversation * webui : add ExportedConversations type for chat import/export * feat: Update naming & order * chore: Linting * webui : Updated static build output --------- Co-authored-by: Aleksander Grygier <[email protected]>

* server : add /v1/health endpoint * cont : update readme

* llama : support LiquidAI LFM2-MoE hybrid model Add support for [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B) model. For more information about models, please read [the blog post](https://www.liquid.ai/company/news). [HF PR](huggingface/transformers#41401) [GGUFs](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) * Do not use defaultdict * Address PR feedback

…#16452) * Add profiling * More detailed profiling * Rework command submission to avoid global locks * Update wait handling * try new method of waiting on futures * Add serializing of command submission in some cases * Add new pool for timestamp queries and clean up logging * Serialize command submission in CI and leave a TODO note * Update webgpu CI * Add myself as WebGPU codeowner * Deadlock avoidance * Leave WebGPU/Vulkan CI serialized * Fix divide by 0 * Fix logic in division by inflight_threads * Update CODEOWNERS and remove serialize submit option

* metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic

Co-authored-by: DevAI <[email protected]>

…(#16920) commit 5fb5e24 (llama : minor sampling refactor (2) (#9386)) moved the llama_sampler_accept call into llama_sampler_sample, but the sampling sample usage in llama.h was forgotten to be updated accordingly.

This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.

loci-agentic-ai · 2025-11-02T14:34:40Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Critical Functions

Based on the performance data analysis, the observed degradations are minimal and do not affect the core inference pipeline or critical performance functions.

Critical Function Performance Status

Core Inference Functions - No Impact

llama_decode() - No performance changes detected
llama_encode() - No performance changes detected
llama_tokenize() - No performance changes detected
llama_model_load_from_file() - No performance changes detected
llama_batch_init() - No performance changes detected

Affected Functions - Non-Critical

_RegexMask constructor: +0.082% response time (+0.018 ns)
make_unique<llm_graph_input_pos_bucket>: +0.117% throughput (+0.122 ns)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No degradation in inference performance

Core tokenization functions (llama_tokenize, llama_detokenize) show no performance changes
Primary inference function (llama_decode) maintains baseline performance
Batch processing functions remain unaffected

Reference Impact: With the baseline that 2ms slower llama_decode reduces tokens/second by 7% on the test configuration, the observed changes would have zero impact on tokens per second as no inference-critical functions show measurable degradation.

2. Power Consumption - Negligible Binary Impact

Affected Binaries:

build.bin.libllama.so: +0.8 nJ (+0.0003% increase)
build.bin.libggml-base.so: No change
build.bin.libggml-cpu.so: No change
build.bin.libggml.so: No change

Analysis: The power consumption increase is within measurement noise and does not affect core computation binaries.

3. Quantization Efficiency - No Impact

Status: No changes detected

llama_model_quantize() function shows no performance degradation
Quantization support functions maintain baseline performance
GGML quantization operations remain unaffected

4. Memory Usage - No Impact

Status: Memory management functions unaffected

KV cache operations (llama_memory_clear, llama_memory_seq_rm) show no changes
Memory allocation functions (ggml_gallocr_new, ggml_tallocr_alloc) maintain performance
Batch memory management remains efficient

5. Batch Processing - No Impact

Status: Batch processing pipeline unaffected

llama_batch_init(), llama_batch_get_one() show no performance changes
Dynamic batching logic remains efficient
Parallel token processing maintains baseline performance

Root Cause Analysis

Grammar Processing Degradation

The observed degradations affect auxiliary systems rather than core inference:

Template instantiation overhead: make_unique specialization shows minor compiler optimization differences
Regex initialization: _RegexMask constructor experiences minimal initialization cost increase
Compiler optimization variance: Changes likely reflect minor differences in instruction scheduling rather than algorithmic issues

Action Items

Immediate Code Optimizations

Template Specialization Review
- Examine llm_graph_input_pos_bucket constructor for potential pre-allocation opportunities
- Consider compile-time initialization for frequently used graph components
Memory Allocation Patterns
- Review memory allocation patterns in graph optimization pipeline
- Evaluate if position buckets can be pre-allocated during graph construction

Build System Optimizations

Compiler Flag Verification
- Ensure consistent optimization flags across template instantiations
- Verify profile-guided optimization (PGO) coverage for affected functions
Template Optimization
- Consider explicit template instantiation for frequently used specializations
- Evaluate template parameter optimization for graph components

Performance Impact Assessment

Overall System Impact: Minimal

Core inference pipeline maintains full performance
Auxiliary system degradations are within acceptable variance
No impact on primary LLaMA.cpp performance metrics

Monitoring Focus:

Track template instantiation performance in future builds
Monitor for consistency in compiler optimization across releases

The analysis confirms that LLaMA.cpp's critical performance functions remain unaffected, with observed degradations limited to non-essential auxiliary systems that do not impact inference throughput, power efficiency, or memory utilization.

* server : support unified context across slots * cont : fix speculative decoding initialization * context : fix n_ctx_per_seq computation * server : purge slots one by one * tests : add unified cache server tests * llama : update per-seq context computation * test-thread-safety : handle tiny training context of the input model * server : fix server_tokens clear() * server : use 4 slots + unified KV by default * llama : add note about context size queries * cont : update todos [no ci] * context : do not cap the size of the context * tests : adjust parameters to be CI friendlier * context : add warning

* clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <[email protected]>

* Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <[email protected]> * Add JANUS_PRO constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <[email protected]> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <[email protected]> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <[email protected]> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <[email protected]> Co-authored-by: Xuan-Son Nguyen <[email protected]> Co-authored-by: Xuan-Son Nguyen <[email protected]>

…mode and coverage (#16936) * tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage * tests: init gf and filter out fusion tests for support mode * tests: filter out fusion cases before calling eval_support * tests: filter out fusion cases from show_test_coverage as well, fix lint

* webui : Revised LaTeX formula recognition * webui : Further examples containg amounts * webui : vitest for maskInlineLaTeX * webui: Moved preprocessLaTeX to lib/utils * webui: LaTeX in table-cells * chore: update webui build output (use theirs) * webui: backslash in LaTeX-preprocessing * chore: update webui build output * webui: look-behind backslash-check * chore: update webui build output * Apply suggestions from code review Code maintenance (variable names, code formatting, string handling) Co-authored-by: Aleksander Grygier <[email protected]> * webui: Moved constants to lib/constants. * webui: package woff2 inside base64 data * webui: LaTeX-line-break in display formula * chore: update webui build output * webui: Bugfix (font embedding) * webui: Bugfix (font embedding) * webui: vite embeds assets * webui: don't suppress 404 (fonts) * refactor: KaTeX integration with SCSS Moves KaTeX styling to SCSS for better customization and font embedding. This change includes: - Adding `sass` as a dev dependency. - Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding. - Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables. * fix: LaTeX processing within blockquotes * webui: update webui build output --------- Co-authored-by: Aleksander Grygier <[email protected]>

loci-agentic-ai · 2025-11-03T03:05:41Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Project

Critical Function Performance Changes

Based on the analysis of version 60af0985-b86f-4455-89da-7b1a290af1e6 compared to base version 6a6ce734-560c-4bf5-8d96-8f68e7716a0f, the performance degradations are concentrated in C++ standard library functions rather than core LLaMA.cpp inference functions.

Functions with Performance Degradation

Response Time: std::__codecvt_abstract_base<wchar_t, char, __mbstate_t>::in()

Change: +0.068% (+0.02 ns, from 29.41 ns to 29.43 ns)
Location: Standard library codecvt implementation
Impact: Character encoding conversion operations

Throughput: std::__detail::_Scanner<wchar_t>::~_Scanner()

Change: +0.079% (+0.01 ns, from 18.88 ns to 18.89 ns)
Location: C++ regex scanner destructor
Impact: Regular expression cleanup operations

Bottleneck: std::__detail::_Scanner<char>::~_Scanner()

Change: +0.105% (+0.02 ns, from 14.32 ns to 14.33 ns)
Location: C++ regex scanner destructor
Impact: Internal cleanup bottlenecks

Core LLaMA.cpp Functions Status

No Performance Impact Detected in critical inference functions:

llama_decode() - No changes detected
llama_encode() - No changes detected
llama_tokenize() - No changes detected
llama_model_load_from_file() - No changes detected
Memory management functions - No changes detected
Batch processing functions - No changes detected

KPI Impact Analysis

1. Tokens Per Second

Impact: None

No changes detected in core tokenization/inference functions (llama_decode, llama_encode, llama_tokenize)
The observed degradations are in standard library utility functions unrelated to token processing
Expected tokens per second performance remains unchanged

2. Power Consumption

Impact: Negligible

build.bin.libllama.so: -0.0% change (306,893.80 nJ vs 306,894.13 nJ base)
build.bin.libggml-base.so: 0.0% change
build.bin.libggml-cpu.so: 0.0% change
build.bin.libggml.so: 0.0% change
Overall power efficiency remains stable across all binaries

3. Quantization Efficiency

Impact: None

No changes detected in quantization-related functions
llama_model_quantize() performance unchanged
Quantization format handling unaffected

4. Memory Usage

Impact: None

Memory management functions show no performance changes
KV cache operations (llama_memory_clear, llama_memory_seq_rm) unaffected
GGML memory allocation functions unchanged

5. Batch Processing

Impact: None

Batch processing functions show no performance degradation
llama_batch_init(), llama_batch_get_one() performance unchanged
Parallel token processing efficiency maintained

Root Cause Analysis

Standard Library Function Degradations

The performance changes stem from C++ standard library functions used for:

Character encoding conversion (codecvt functions)
Regular expression processing (regex scanner destructors)

Control Flow Analysis

CFG analysis reveals identical assembly code between versions for the affected functions, indicating the degradation sources are:

Microarchitectural factors: Instruction cache alignment changes
Memory layout differences: Function placement affecting cache locality
System-level variations: CPU frequency scaling or memory subsystem timing

Action Items

Immediate Actions

Code Layout Investigation
- Analyze object file layout changes using objdump to identify function placement differences
- Compare instruction cache miss rates between versions using performance counters
Build Configuration Review
- Verify compiler optimization flags consistency between builds
- Check for differences in link-time optimization settings

Code-Specific Optimizations

Standard Library Usage Optimization
- Review usage patterns of codecvt functions in text processing paths
- Consider alternative character conversion methods if codecvt is performance-critical
Regex Usage Analysis
- Audit regex scanner usage in parsing operations
- Evaluate opportunities to replace regex with more efficient string processing

Build System Improvements

Reproducible Builds
- Implement deterministic build ordering to ensure consistent function layout
- Add build flags for consistent code generation across environments
Performance Regression Detection
- Integrate micro-benchmarks for standard library function performance
- Add automated performance comparison in CI pipeline

Conclusion

The observed performance changes are minimal and concentrated in standard library utility functions rather than core LLaMA.cpp inference operations. The primary inference KPIs (tokens per second, quantization efficiency, memory usage, batch processing) remain unaffected. The negligible power consumption changes confirm that computational workload has not increased meaningfully.

The identical assembly code analysis confirms these are environmental performance variations rather than algorithmic regressions, making them suitable for build system and deployment environment optimizations rather than code logic changes.

ggerganov and others added 30 commits October 3, 2025 19:18

metal : fix loop bound in ggml_mem_ranges (#16412)

606a73f

chat : support Magistral thinking (#16413)

128d522

* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls * feat: new flow in the chat template test suite for Magistral

rpc : check src buffer when copying tensor (#16421)

f392839

Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.

vulkan: use a more appropriate amount of threads when generating shad…

86df2c9

…ers (#16418) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax

ggml webgpu: actually add softmax, fix rms_norm offset (#16400)

3526657

* implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit

server: update readme to mention n_past_max metric (#16436)

c5fef0f

ggml-org/llama.cpp#15361 added new metric exported, but I've missed this doc.

nix : removed metal for nix (#16118)

1d49ca3

ggml : fix unaligned access in AMX code (#16315)

a23b9bd

ci : refactor sdk caching to minimize storage (#16414)

3a002af

* refactor sdk caching to minimize storage * use correct action * add myself as owner to /.github/actions/ [no ci]

llama : add --no-host to disable host buffers (#16310)

3df2244

* implement --no-host to disable host buffer * fix equal_mparams * move no-host enumeration order together with other model params --------- Co-authored-by: slaren <[email protected]>

metal : various optimizations + refactoring (#16446)

8ae32dc

* metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt

metal : add support for non-padded FA KV (#16148)

0a319bb

* metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement

memory : use sequential equal splits for recurrent modules (#16442)

0123ff3

rpc : update documentation (#16441)

c61ae20

Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <[email protected]>

presets : fix pooling param for embedding models (#16455)

ef4c5b8

server : add /v1/health endpoint (#16461)

df1b612

* server : add /v1/health endpoint * cont : update readme

server : improve context checkpoint logic (#16440)

7fdd16b

metal : mark FA blocks (#16372)

b2c08c9

* metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic

server : fix cancel pending task (#16467)

d2ee056

Co-authored-by: DevAI <[email protected]>

mnehete32 and others added 3 commits November 2, 2025 11:12

CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (#16917)

7db35a7

chat: Allow reasoning_content to be passed back

de4343a

This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.

DajanaV temporarily deployed to PROD__AL_DEMO November 2, 2025 13:08 — with GitHub Actions Inactive

common : move gpt-oss reasoning processing to init params (#16937)

87c9efc

DajanaV force-pushed the main branch from f24281b to 2ee4526 Compare November 2, 2025 16:07

DajanaV force-pushed the main branch from 2ee4526 to 33a49ec Compare November 2, 2025 20:07

ggerganov and others added 7 commits November 2, 2025 21:21

ci : disable failing riscv cross build (#16952)

dd52868

Merge branch 'master' into allow-passing-back-reasoning-content

d6e2094

Add test for checking if reasoning_content is accepted

48237c2

DajanaV force-pushed the main branch from 33a49ec to 8afd3b9 Compare November 3, 2025 00:33

DajanaV temporarily deployed to PROD__AL_DEMO November 3, 2025 01:33 — with GitHub Actions Inactive

DajanaV force-pushed the main branch 9 times, most recently from b655780 to 94ec54d Compare November 3, 2025 20:09

DajanaV closed this Nov 3, 2025

DajanaV force-pushed the main branch from 94ec54d to 92c0c2f Compare November 3, 2025 23:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UPSTREAM PR #16934: chat: Allow reasoning_content to be passed back #43

UPSTREAM PR #16934: chat: Allow reasoning_content to be passed back #43

Uh oh!

DajanaV commented Nov 2, 2025

Uh oh!

loci-agentic-ai bot commented Nov 2, 2025

Uh oh!

loci-agentic-ai bot commented Nov 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

91 participants