@DajanaV DajanaV commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16784

  • Add no-cache headers to /props and /slots

  • Throttle slot checks to 30s

  • Prevent concurrent fetches with promise guard

  • Trigger refresh from chat streaming for legacy and ModelSelector

  • Show dynamic serverWarning when using cached data

  • Show each assistant message's stored model in its bubble when available,
    falling back to the current server model only when the per-message value is missing

  • When the model selector is disabled, fetch /props and prioritize that model name
    over chunk metadata, then persist it with the streamed message so legacy mode properly
    reflects the backend configuration
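
The throttle, promise guard, and model-name priority described in the list above can be sketched roughly as follows. This is a minimal illustration, not the webui's actual store code: `createPropsRefresher`, `displayedModel`, `modelToPersist`, the `Props` shape, and the injected `fetchProps` and `now` parameters are all hypothetical names chosen for the example. Only the 30-second interval and the fallback order come from the description.

```typescript
// Hypothetical names throughout; this is not the webui's actual store code.
type Props = { modelName: string };

const SLOT_CHECK_INTERVAL_MS = 30000; // "Throttle slot checks to 30s"

function createPropsRefresher(
  fetchProps: () => Promise<Props>,
  now: () => number = Date.now
): () => Promise<Props> {
  let inFlight: Promise<Props> | null = null; // promise guard
  let lastFetchAt = Number.NEGATIVE_INFINITY;
  let cached: Props | null = null;

  return function refresh(): Promise<Props> {
    // Concurrent callers share the request that is already running.
    if (inFlight) return inFlight;

    // Inside the throttle window, serve the cached value without a request.
    if (cached && now() - lastFetchAt < SLOT_CHECK_INTERVAL_MS) {
      return Promise.resolve(cached);
    }

    const request = fetchProps().then(
      (props) => {
        cached = props;
        lastFetchAt = now();
        inFlight = null; // allow the next refresh
        return props;
      },
      (err) => {
        inFlight = null; // a failed request must not block later refreshes
        throw err;
      }
    );
    inFlight = request;
    return request;
  };
}

// Per-message model first, current server model only as a fallback.
function displayedModel(
  message: { model?: string | null },
  serverModel: string | null
): string | null {
  return message.model || serverModel;
}

// With the model selector disabled, the /props name wins over chunk metadata.
function modelToPersist(
  propsModel: string | null,
  chunkModel: string | null
): string | null {
  return propsModel || chunkModel;
}
```

Passing the clock in as a parameter is only a convenience for testing the throttle window; a real store would more likely read `Date.now()` directly.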

Command lines used in legacy mode (Raspberry Pi 5):

/root/llama.cpp/build/bin/llama-server \
 -m /root/ia/models/mradermacher/OLMoE-1B-7B-0125-Instruct-i1-GGUF/OLMoE-1B-7B-0125-Instruct.i1-Q6_K.gguf \
 -ctk q8_0 -ctv q8_0 -fa on \
 --jinja --ctx-size 8192 --mlock --port 8081

/root/llama.cpp/build/bin/llama-server \
 -m /root/ia/models/mradermacher/OLMoE-1B-7B-0125-SFT-i1-GGUF/OLMoE-1B-7B-0125-SFT.i1-Q6_K.gguf \
 -ctk q8_0 -ctv q8_0 -fa on \
 --jinja --ctx-size 8192 --mlock --port 8081

/root/llama.cpp/build/bin/llama-server \
 -m /root/ia/models/mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q4_K_M.gguf \
 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
 -ctk q8_0 -ctv q8_0 -fa on \
 --jinja --ctx-size 4096 --port 8081

Fixes ggml-org/llama.cpp#16771

EDIT: I've recorded a video that specifically targets the original issue.
Test video (Raspberry Pi 5, master branch plus this PR):

CmdLineSwap-RaspberryPi5.mp4

And another one to show there's no regression when the model selector is enabled,
also demonstrating the multimodal function updates:

NonReg-FullSetup.mp4

NeoZhangJianyu and others added 30 commits October 2, 2025 10:16
* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool

* update to 2025.2 use deeplearn essi to replace base toolkit

* add missed dll

* add deep learning essentials

* add sycl-ls

---------

Co-authored-by: Zhang Jianyu <[email protected]>
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…389)

* do not use more threads than physically available

* ensure n_threads > 0

Co-authored-by: Jeff Bolz <[email protected]>

---------

Co-authored-by: Jeff Bolz <[email protected]>
…rolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <[email protected]>
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
…quest (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are scheduled
to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
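
To illustrate the condition in the commit message above: comparing totals can miss the case where one chunk grows while another shrinks. The sketch below is an assumption-laden example, not the allocator's real code. `needsRealloc` and the per-chunk size arrays are hypothetical, and treating "more chunks than before" as a reallocation is an extra assumption of this example.

```typescript
// Hypothetical helper: sizes are per-chunk byte counts, indexed by chunk.
function needsRealloc(currentSizes: number[], requiredSizes: number[]): boolean {
  // Assumption of this sketch: needing more chunks always forces a reallocation.
  if (requiredSizes.length > currentSizes.length) return true;
  // Check every chunk individually instead of comparing the totals.
  return requiredSizes.some((size, i) => size > currentSizes[i]);
}

// Total drops from 150 to 140, but chunk 1 grows from 50 to 80.
const grewOneChunk = needsRealloc([100, 50], [60, 80]); // true
```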
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
Only dst buffer is guaranteed to be an RPC buffer. Add check for the src
one.
…ers (#16418)

* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add tiling support for idefics3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* add test model

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
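
The scenario from the commit message above can be reproduced with plain arrays. This is only a model of the loop structure, assuming each "register" operation touches at most `EPR` elements. It is not the SVE implementation of ggml_vec_scale_f32, and `scaleRegister`, `scale`, and the `loopLeftovers` flag are hypothetical names.

```typescript
const EPR = 8;        // elements per register (256-bit SVE, f32)
const STEP = 2 * EPR; // the main loop handles two registers per iteration

// Scales at most EPR elements starting at `start`, like one predicated vector op.
function scaleRegister(y: number[], start: number, n: number, v: number): void {
  const end = Math.min(start + EPR, n);
  for (let i = start; i < end; i++) y[i] *= v;
}

function scale(y: number[], v: number, loopLeftovers: boolean): void {
  const n = y.length;
  const np = n - (n % STEP); // largest multiple of STEP that fits in n
  for (let i = 0; i < np; i += STEP) {
    scaleRegister(y, i, n, v);
    scaleRegister(y, i + EPR, n, v);
  }
  if (loopLeftovers) {
    // Fixed behaviour: iterate until every leftover element is handled.
    for (let i = np; i < n; i += EPR) scaleRegister(y, i, n, v);
  } else if (np < n) {
    // Original behaviour: a single pass covers at most EPR leftovers.
    scaleRegister(y, np, n, v);
  }
}

function ones(n: number): number[] {
  const a: number[] = [];
  for (let i = 0; i < n; i++) a.push(1);
  return a;
}

const original = ones(25);
const fixed = ones(25);
scale(original, 2, false); // element 24 is left unscaled
scale(fixed, 2, true);     // all 25 elements are scaled
```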
@DajanaV DajanaV force-pushed the upstream-PR16784-branch_ServeurpersoCom-webui-props-auto-refresh branch from 290a9d9 to 1ee8d6b Compare November 1, 2025 02:34
@loci-agentic-ai
Access the complete analysis in the LOCI Dashboard

Based on my analysis of the performance data for project 2621b8c0-b5ce-11f0-b333-453f42058aa1 comparing version 69466bd6-9a46-44f0-aa6c-d9f4f4737e08 against base version 8306f911-f47a-4005-9b4c-b62b55eeb2e9, here's the comprehensive performance summary:

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Changes

Response Time Degradation

  • Function: deallocate in llama-model-loader.cpp
  • Binary: build.bin.libllama.so
  • Change: +0.033% (56 ns vs 56 ns base)
  • Status: No code modifications detected between versions

Throughput and Bottleneck Degradation

  • Function: _Construct (std::vector constructor)
  • Binary: build.bin.libllama.so
  • Throughput Change: +0.050% (23 ns vs 23 ns base)
  • Bottleneck Change: +0.109% (11 ns vs 11 ns base)
  • Status: No code modifications detected between versions

KPI Impact Analysis

1. Tokens Per Second Impact

Status: No measurable impact on inference throughput

Critical Functions Analysis:

  • llama_decode(): No performance changes detected
  • llama_encode(): No performance changes detected
  • llama_tokenize(): No performance changes detected
  • llama_detokenize(): No performance changes detected

Assessment: The observed degradations are in memory management functions (deallocate, _Construct) rather than core inference functions. Based on the reference that 2ms slower llama_decode results in 7% fewer tokens/second, the current 0.018ns degradation in unrelated functions will not impact tokens per second performance.
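
As a check on the figures in the assessment above, the absolute change follows directly from the reported base time and percentage. The snippet below only restates that arithmetic; the variable names are hypothetical, and it does not assume any particular relationship between these functions and tokens per second.

```typescript
const baseNs = 56;            // reported deallocate response time
const changePercent = 0.033;  // reported relative change
const deltaNs = baseNs * (changePercent / 100); // about 0.018 ns

const referenceDeltaNs = 2 * 1000000; // the 2 ms reference slowdown, in ns
const fractionOfReference = deltaNs / referenceDeltaNs; // about 9.2e-9
```

The change is roughly eight orders of magnitude smaller than the 2 ms reference slowdown, which is consistent with the report treating it as noise.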

2. Power Consumption Impact

Binary-Level Analysis:

  • build.bin.libllama.so: -0.0% change (306,978 nJ vs 306,978 nJ base)
  • build.bin.libggml-base.so: 0.0% change
  • build.bin.libggml-cpu.so: 0.0% change
  • build.bin.libggml.so: 0.0% change

Assessment: No measurable power consumption changes across all binaries.

3. Quantization Efficiency

Critical Functions Analysis:

  • llama_model_quantize(): No performance changes detected
  • ggml_quantize_free(): No performance changes detected
  • Quantization format handling: No changes detected

Assessment: No impact on quantization efficiency.

4. Memory Usage Impact

Affected Functions:

  • deallocate function: +0.033% response time degradation
  • _Construct (vector): +0.050% throughput, +0.109% bottleneck degradation

Memory Management Functions:

  • llama_memory_clear(): No changes detected
  • llama_memory_seq_rm(): No changes detected
  • llama_memory_seq_cp(): No changes detected
  • ggml_gallocr_new(): No changes detected
  • ggml_tallocr_alloc(): No changes detected

Assessment: Minor degradation in standard library memory allocation functions, but core LLaMA memory management functions remain unaffected.

5. Batch Processing Impact

Critical Functions Analysis:

  • llama_batch_init(): No performance changes detected
  • llama_batch_get_one(): No performance changes detected
  • llama_batch_free(): No performance changes detected
  • llama_decode() (batch processing): No performance changes detected

Assessment: No impact on batch processing efficiency.

Root Cause Analysis

Assembly Code Investigation: The deallocate function shows identical assembly code between versions, indicating the performance difference stems from environmental factors rather than code changes.

Control Flow Analysis: No structural changes detected in any critical functions. The CFGs for affected functions are identical between versions.

Action Items

Immediate Actions

  1. Build Environment Consistency: Verify compiler version, optimization flags (-O2, -O3), and link-time optimization (LTO) settings are identical between builds
  2. Binary Layout Verification: Check for differences in section alignment or memory layout that could affect cache behavior
  3. Measurement Validation: The 0.033% degradation falls within typical profiling variance margins

Code-Focused Optimizations

  1. Size Calculation Optimization in deallocate: The current 7-instruction sequence for size calculation (size * 88) could be optimized to 2 instructions using direct multiplication
  2. Vector Constructor: Consider alternatives to std::vector<bool> if bit-packing overhead becomes significant in hot paths

Build System Recommendations

  1. Compiler Flag Alignment: Ensure consistent optimization flags across build environments
  2. Link-Time Optimization: Verify LTO configuration consistency
  3. Profile-Guided Optimization: Consider PGO for performance-critical builds

Conclusion

The observed performance changes are minimal (all under 0.11%) and appear to be measurement variance rather than functional regressions. No critical inference functions show degradation, and power consumption remains unchanged. The core LLaMA.cpp inference pipeline maintains its performance characteristics with no impact on tokens per second, quantization efficiency, or batch processing capabilities.
