DajanaV (Collaborator) commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16901

Close #16179

Added a setting to display generation statistics for each assistant message: tokens/s, the number of tokens in the message, and generation time.
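A minimal sketch of how the three displayed values relate, assuming the token count and the generation time in milliseconds are available for each message. The function and field names are hypothetical and are not taken from the webui code.

```python
def generation_stats(predicted_tokens: int, generation_ms: float) -> dict:
    """Derive display statistics for one assistant message.

    Hypothetical helper: the names and units are assumptions,
    not the webui's actual API.
    """
    seconds = generation_ms / 1000.0
    # Guard against division by zero for empty or instantaneous generations.
    tokens_per_second = predicted_tokens / seconds if seconds > 0 else 0.0
    return {
        "tokens": predicted_tokens,
        "seconds": seconds,
        "tokens_per_second": tokens_per_second,
    }


stats = generation_stats(predicted_tokens=128, generation_ms=4000.0)
print(
    f"{stats['tokens_per_second']:.2f} tokens/s, "
    f"{stats['tokens']} tokens, {stats['seconds']:.2f} s"
)
```

The zero-time guard keeps the display from failing on messages that produced no tokens.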

New Setting in the General section

Screenshot 2025-10-31 at 19 49 20

Statistics at the bottom of the assistant message

Screenshot 2025-10-31 at 19 33 46

allozaur and others added 30 commits October 1, 2025 12:08
* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <[email protected]>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
* CI: Properly install rocwmma for hip builds

On Windows we now install rocwmma from Ubuntu packages

* CI: update linux rocm docker build to use rocm 7.0
…16075)

* Fix to use hidden_size_per_head

* Fix num heads

* Fix array

* Fix loading weights

* Support old GGUF converted by the previous version of llama.cpp

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Move shared parameter definitions to the outside of loop

* Not calculating n_embd_head_k,v by n_embd / n_head

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…0 (#16221)

* HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0

rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA

* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool

* update to 2025.2, use deep-learning-essentials to replace base toolkit

* add missed dll

* add deep learning essentials

* add sycl-ls

---------

Co-authored-by: Zhang Jianyu <[email protected]>
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…389)

* do not use more threads than physically available

* ensure n_threads > 0

Co-authored-by: Jeff Bolz <[email protected]>

---------

Co-authored-by: Jeff Bolz <[email protected]>
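The two thread-count fixes above amount to clamping a requested value into a valid range. A minimal sketch, assuming the requested count and the available count are both known; the function name is hypothetical, and `os.cpu_count()` reports logical CPUs, so it is only a stand-in for the physical count the commit refers to.

```python
import os
from typing import Optional


def clamp_thread_count(requested: int, available: Optional[int] = None) -> int:
    """Clamp a requested thread count to the range [1, available].

    Hypothetical helper illustrating the two fixes: never exceed the
    available count, and never return zero or a negative value.
    """
    if available is None:
        # Assumption: logical CPU count used as a fallback.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


print(clamp_thread_count(16, available=8))
print(clamp_thread_count(0, available=8))
```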
…rolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <[email protected]>
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
…quest (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are scheduled
to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.
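A sketch of what a manually computed descriptor range could look like, under the assumption that the range is the remaining bytes after the binding offset, capped at the device's `maxStorageBufferRange`. The function name and parameters are illustrative and this is not the code from the commit.

```python
def descriptor_range(buffer_size: int, offset: int, max_storage_buffer_range: int) -> int:
    """Return an explicit byte range for a storage-buffer binding.

    Illustrative only: replaces an implicit "whole size" range with the
    remaining bytes after `offset`, capped at the device limit.
    """
    if not 0 <= offset <= buffer_size:
        raise ValueError("offset must lie within the buffer")
    remaining = buffer_size - offset
    return min(remaining, max_storage_buffer_range)


GIB = 1 << 30
# A 6 GiB buffer bound on a device whose storage-buffer range limit is just under 4 GiB.
print(descriptor_range(6 * GIB, 0, 4 * GIB - 1))
```

With an implicit whole-size range, the same binding would cover 6 GiB and exceed the limit, which is the invalid usage the commit message describes.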

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
ggerganov and others added 5 commits October 31, 2025 13:50
* CUDA: add expert reduce kernel

* contiguous checks, better formatting, use std::vector instead of array

* use vector empty instead of size

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* CUDA: Volta tensor core support for MMF

* more generic checks for hardware support

* Update ggml/src/ggml-cuda/mmf.cuh

Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Aman Gupta <[email protected]>
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Critical Functions

Critical Function Performance Status

Core Inference Functions - No Performance Changes

  • llama_decode: 48,432,464 ns response time (0 ns change)
  • llama_encode: 12,186,673 ns response time (0 ns change)
  • llama_tokenize: 832,591 ns response time (0 ns change)
  • llama_model_load_from_file: 330,045,630 ns response time (0 ns change)

Memory and Batch Processing Functions - Stable

  • llama_batch_init: 257 ns response time (0 ns change)
  • llama_memory_clear: 49 ns response time (0 ns change)

All critical functions show identical performance metrics between versions, with no modifications detected in the codebase.

Key Performance Indicator Impact Analysis

1. Tokens Per Second - No Impact

Status: No changes detected in tokenization/inference functions

  • llama_decode: 0 ns change (primary inference function)
  • llama_encode: 0 ns change (encoder processing)
  • llama_tokenize: 0 ns change (text-to-token conversion)

Reference Impact: Based on the provided benchmark (ollama://smollm:135m on 12th Gen Intel i7-1255U), a 2 ms increase in llama_decode results in a 7% reduction in tokens/second. Since llama_decode shows a 0 ns change, tokens per second remains unaffected.
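The relationship between decode latency and throughput can be sketched as follows, assuming tokens/second is inversely proportional to the time of one llama_decode call. That proportionality is an assumption made for illustration; the report does not state its model or the benchmark's baseline latency.

```python
def throughput_change(base_ms: float, delta_ms: float) -> float:
    """Fractional change in tokens/s when decode time grows by delta_ms.

    Assumes throughput is proportional to 1 / decode time (illustrative model).
    """
    return base_ms / (base_ms + delta_ms) - 1.0


def implied_baseline_ms(delta_ms: float, reduction: float) -> float:
    """Baseline decode time for which delta_ms causes the given fractional reduction."""
    return delta_ms * (1.0 - reduction) / reduction


# Baseline implied by "2 ms causes a 7% reduction" under this model.
baseline = implied_baseline_ms(delta_ms=2.0, reduction=0.07)
print(f"implied baseline: {baseline:.2f} ms")

# The reported 0 ns change in llama_decode gives no change in throughput.
print(throughput_change(base_ms=48.432464, delta_ms=0.0))
```

Under this model, a 0 ns change in decode time always yields a 0% change in tokens/second, which matches the report's conclusion.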

2. Power Consumption - Minimal Change

Impacted Binary: build.bin.libllama.so

  • Power consumption: 305,211.87 nJ (current) vs 305,212.44 nJ (base)
  • Change: approximately -0.0002% (a negligible reduction of 0.57 nJ)

Other Binaries: No change

  • build.bin.libggml-base.so: 0.0% change
  • build.bin.libggml-cpu.so: 0.0% change
  • build.bin.libggml.so: 0.0% change
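The libllama.so figures above can be checked with a short calculation. This is a worked example using only the two reported values; the function name is illustrative.

```python
def relative_change_percent(base: float, current: float) -> float:
    """Percentage change of `current` relative to `base`."""
    return (current - base) / base * 100.0


base_nj = 305_212.44      # reported base power consumption
current_nj = 305_211.87   # reported current power consumption

delta_nj = current_nj - base_nj
pct = relative_change_percent(base_nj, current_nj)
print(f"delta: {delta_nj:.2f} nJ, change: {pct:.5f}%")
```

The change is on the order of two ten-thousandths of a percent, so it rounds to the -0.0% shown in the report.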

3. Quantization Efficiency - No Impact

Status: No changes in quantization-related functions

  • llama_model_quantize: Function not modified
  • Quantization format handling remains unchanged
  • GGML quantization operations stable

4. Memory Usage - No Impact

Status: Memory management functions show no performance changes

  • llama_memory_clear: 49 ns (0 ns change)
  • KV cache operations unmodified
  • Memory allocation patterns unchanged

5. Batch Processing - No Impact

Status: Batch processing functions maintain identical performance

  • llama_batch_init: 257 ns (0 ns change)
  • Batch allocation and management unchanged
  • Parallel processing efficiency maintained

Performance Degradation Source Analysis

The observed degradations are limited to C++ standard library functions:

  • std::stack constructor: +0.054% response time increase
  • std::_Construct template: +0.131% bottleneck increase

These functions are used in grammar parsing components, not core inference paths.

Action Items

Code-Level Optimizations

  1. Template Instantiation Optimization

    • Pre-instantiate common regex trait combinations for grammar parsing
    • Reduce template compilation overhead in std::stack constructor
  2. Memory Allocation Efficiency

    • Implement custom allocators for grammar-related containers
    • Optimize vector initialization patterns in std::_Construct

Build System Improvements

  1. Link-Time Optimization

    • Enable LTO to inline trivial constructors
    • Eliminate PLT overhead through static resolution
  2. Compiler Optimization Flags

    • Review optimization settings between builds
    • Ensure consistent -march and -mtune configurations

Conclusion

The performance analysis reveals stable core inference functionality with no impact on critical KPIs. The minimal degradations observed (0.054-0.131%) are isolated to auxiliary grammar parsing components and do not affect the primary inference pipeline. Power consumption shows a negligible improvement, and all tokenization, memory management, and batch processing functions maintain identical performance characteristics.
