
Conversation


@DajanaV DajanaV commented Nov 1, 2025

Mirrored from ggml-org/llama.cpp#14891

Following up from #9400 and #12718, I've started tinkering with activation-based statistics, in addition to what's currently available via --show-statistics.

At the moment, I'm exploring three options, ranging from easy to implement with an OK approximation, to some assembly required but fairly accurate:

  1. L2 norm of the activation difference: larger values would suggest the tensor has significantly transformed the input relative to the previous layer.
  2. KL divergence reduction using a pre-computed logit file: using an approach similar to the one described by nostalgebraist in logit lens, based on a pre-computed logit file (e.g. from a previous llama-perplexity --save-all-logits run)
  3. Given that llama-imatrix already generates the actual logits to compute PPL, use Thông T. Nguyễn's logit prism approach to calculate the exact contribution of each layer to the final logit scores

Sharing with the readers, and in particular @compilade and @jukofyork, in case anyone's willing to double check assumptions and/or suggest alternative approaches I haven't considered.

jeffbolznv and others added 30 commits October 3, 2025 11:52
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
Only dst buffer is guaranteed to be an RPC buffer. Add check for the src
one.
…ers (#16418)

* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add tiling support for idefics3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* add test model

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
This commit removes jina-reranker-v1-tiny-en model files that are no
longer present on Hugging Face.

The motivation for this is that it clears up the CI logs from 404 errors
which can be a little confusing when looking at the logs the first time.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649
* refactor sdk caching to minimize storage

* use correct action

* add myself as owner to /.github/actions/ [no ci]
* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <[email protected]>
* metal : ssm_scan minor opts

* metal : get_rows optimize

* metal : cpy optimize

* metal : ssm_conv opt

* metal : ssm_scan simplify

* metal : ssm_Scan opt
* tests : add -INF blocks to the KQ mask in the FA tests

* cont : bump -INF block size to 64

Co-authored-by: Jeff Bolz <[email protected]>

* ggml : prevent division by zero in FA CPU op

---------

Co-authored-by: Jeff Bolz <[email protected]>
* metal : pad K, V and Mask when needed

* cont : simplify

* cuda : add TODO about KV padding requirement

* metal : add comments

* metal : remove mask padding requirement
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <[email protected]>
* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* webui : Updated static build output

---------

Co-authored-by: Aleksander Grygier <[email protected]>
ggerganov and others added 6 commits October 31, 2025 16:26
* CUDA: Volta tensor core support for MMF

* more generic checks for hardware support

* Update ggml/src/ggml-cuda/mmf.cuh

Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Aman Gupta <[email protected]>
* Model: Minimax M2

* Cleanup

* Cleanup pt. 2

* Cleanup pt. 3

* Update convert_hf_to_gguf_update.py - merge catch blocks

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Remove vocab models and test

* Remove all redundant hparam settings covered by TextModel

* Move super to start, don't set block_count

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Squashed: llama-model.cpp refactoring

* Fix formatting of attn / ffn / ffn_moe calls

* Fix import regression / unify spacing in models.h

* totally DID NOT miss those!

* Add missing qwen3vl(moe) models

* Add missing new .cpp files to build

* Remove extra semicolons

* Editor checker

* Update src/models/models.h

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #33 Impact on LLaMA.cpp

Critical Function Performance Analysis

Core Inference Functions - No Performance Impact

Primary Inference Pipeline:

  • llama_decode(): 49,003,788 ns (no change) - Core token processing function
  • llama_tokenize(): 834,830 ns (no change) - Text-to-token conversion
  • llama_model_load_from_file(): 333,127,900 ns (no change) - Model loading
  • llama_batch_init(): 257 ns (no change) - Batch initialization

Key Finding: All critical inference functions show zero performance degradation, indicating the imatrix statistical enhancements do not impact core inference performance.

Affected Functions

Standard Library Components:

  • _RegexMask constructor: +0.082% Response Time (+0.018 ns)
  • _Construct template: +0.109% Bottleneck (+0.011 ns)

These functions are part of C++ standard library components used in grammar processing, not core inference paths.

Key Performance Indicators Impact Analysis

1. Tokens Per Second - No Impact

Analysis: Core tokenization and inference functions show no performance changes.

Critical Functions Status:

  • llama_decode(): 0% change (49,003,788 ns)
  • llama_tokenize(): 0% change (834,830 ns)
  • llama_encode(): Not measured but no code modifications

Inference: Based on the reference that 2ms slower llama_decode() results in 7% fewer tokens per second, the zero change in llama_decode() indicates no impact on tokens per second performance.

2. Power Consumption - Negligible Impact

Binary-Level Analysis:

  • build.bin.libllama.so: <0.001% increase (306,978.34 nJ vs 306,978.33 nJ)
  • build.bin.libggml.so: 0% change (6,339.24 nJ)
  • build.bin.libggml-cpu.so: 0% change (151,692.17 nJ)
  • build.bin.libggml-base.so: 0% change (90,434.19 nJ)

Impact: Minimal power consumption increase limited to the main library binary.

3. Quantization Efficiency - No Direct Impact

Analysis: Quantization-related functions show no performance changes.

Critical Functions Status:

  • llama_model_quantize(): Not directly measured but no code modifications in quantization paths
  • Model loading functions: No performance impact

Note: The imatrix enhancements improve quantization quality through better statistical analysis but do not affect quantization performance.

4. Memory Usage - Potential Increase

Code Analysis Findings:

  • New data structures: Added activations vector to Stats struct
  • Memory doubling: Each tensor now stores both activations and squared values
  • Conditional allocation: Memory increase only when activation_statistics = true

Impact: Memory usage increases when activation statistics are enabled, but no impact on core inference memory patterns.

5. Batch Processing - No Impact

Critical Functions Status:

  • llama_batch_init(): 0% change (257 ns)
  • llama_decode() with batches: 0% change
  • Batch allocation functions: No code modifications

Analysis: Batch processing performance remains unchanged.

Action Items for Performance Optimization

Code-Level Optimizations

Memory Management:

  • Implement conditional compilation for activation statistics to eliminate overhead when disabled
  • Use memory pools for statistical computations to reduce allocation overhead
  • Optimize data structure alignment to minimize cache misses

Template Optimization:

  • Provide explicit template specializations for commonly used statistical functions
  • Move template implementations to source files to reduce compilation dependencies
  • Use SIMD instructions for statistical calculations

Build System Improvements

Compilation Optimization:

  • Enable link-time optimization (LTO) to recover performance from increased code complexity
  • Use profile-guided optimization (PGO) for statistical computation paths
  • Implement conditional compilation flags for statistical features

Binary Size Management:

  • Separate statistical functionality into optional shared libraries
  • Use function attribute optimization for hot paths
  • Implement dead code elimination for unused statistical features

Summary

PR #33 successfully implements enhanced imatrix statistical capabilities with minimal performance impact on core inference functions. The negligible degradation in standard library components (0.082-0.109%) does not affect primary inference performance metrics. The changes primarily impact memory usage when statistical features are enabled, with no measurable effect on tokens per second, quantization performance, or batch processing efficiency.
