@DajanaV DajanaV commented Nov 1, 2025

Mirrored from ggml-org/llama.cpp#16907

./build/bin/Release/test-backend-ops.exe perf -o MUL_MAT -p type_a=iq1_m

Tested on AMD 8845HS 780M iGPU

| n | PR: μs/run | PR: GFLOPS | Main: μs/run | Main: GFLOPS | Speedup vs Main |
|---|---|---|---|---|---|
| 1 | 224.28 | 523.63 | 282.44 | 415.80 | 1.26x |
| 2 | 310.53 | 756.38 | 385.04 | 610.01 | 1.24x |
| 3 | 408.65 | 862.15 | 515.79 | 683.08 | 1.26x |
| 4 | 589.40 | 797.02 | 1244.08 | 377.60 | 2.11x |
| 5 | 1075.96 | 545.75 | 4427.85 | 132.62 | 4.11x |
| 8 | 2576.61 | 364.64 | 4985.43 | 188.45 | 1.94x |
| 512 | 11601.05 | 5180.00 | 11948.15 | 5030.00 | 1.03x |
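
The speedup column is the main-branch time per run divided by the PR time per run. A minimal Python sketch that recomputes it from the posted figures (the helper name `speedup` is illustrative; because the inputs are already rounded, a recomputed value can differ from the posted column in the last digit):

```python
# (n, PR μs/run, Main μs/run) copied from the benchmark figures above.
ROWS = [
    (1, 224.28, 282.44),
    (2, 310.53, 385.04),
    (3, 408.65, 515.79),
    (4, 589.40, 1244.08),
    (5, 1075.96, 4427.85),
    (8, 2576.61, 4985.43),
    (512, 11601.05, 11948.15),
]


def speedup(main_us: float, pr_us: float) -> float:
    """Return how many times faster the PR is than main (ratio of run times)."""
    return main_us / pr_us


for n, pr_us, main_us in ROWS:
    print(f"n={n:<3} speedup={speedup(main_us, pr_us):.2f}x")
```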

allozaur and others added 30 commits October 1, 2025 18:18
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <[email protected]>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
* CI: Properly install rocwmma for hip builds

on windows we now install rocwmma from ubuntu packages

* CI: update linux rocm docker build to use rocm 7.0
…16075)

* Fix to use hidden_size_per_head

* Fix num heads

* Fix array

* Fix loading weights

* Support old GGUF converted by the previous version of llama.cpp

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Move shared parameter definitions to the outside of loop

* Not calculating n_embd_head_k,v by n_embd / n_head

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…0 (#16221)

* HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0

rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA

* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool

* update to 2025.2, use deep-learning-essentials to replace base toolkit

* add missed dll

* add deep learning essentials

* add sycl-ls

---------

Co-authored-by: Zhang Jianyu <[email protected]>
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…389)

* do not use more threads than physically available

* ensure n_threads > 0

Co-authored-by: Jeff Bolz <[email protected]>

---------

Co-authored-by: Jeff Bolz <[email protected]>
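
The two bullets above amount to clamping a requested thread count into the range [1, available]. A minimal sketch of that idea, not the code from the commit: `clamp_threads` is a hypothetical helper, and `os.cpu_count()` reports logical CPUs, which is only an approximation of "physically available".

```python
import os
from typing import Optional


def clamp_threads(requested: int, available: Optional[int] = None) -> int:
    """Clamp a requested thread count to the range [1, available]."""
    if available is None:
        # Assumption: logical CPU count stands in for the hardware limit.
        # os.cpu_count() may return None, so fall back to 1.
        available = os.cpu_count() or 1
    # Upper bound: never exceed what the machine offers.
    # Lower bound: always keep at least one thread.
    return max(1, min(requested, available))
```

For example, `clamp_threads(16, 8)` returns 8 and `clamp_threads(0, 8)` returns 1.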
…rolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <[email protected]>
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
…quest (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are scheduled
to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
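
The descriptor-range point above can be sketched as follows. The commit text only says the range is computed manually; clamping it to the limit is this sketch's assumption, and `descriptor_range` and its parameters are hypothetical names rather than identifiers from ggml-vulkan.

```python
def descriptor_range(buffer_size: int, offset: int,
                     max_storage_buffer_range: int) -> int:
    """Compute an explicit descriptor range instead of relying on VK_WHOLE_SIZE.

    VK_WHOLE_SIZE means "from offset to the end of the buffer". For a large
    buffer that implicit range can exceed maxStorageBufferRange, which is
    invalid usage, so the range is computed and bounded explicitly here.
    """
    if offset < 0 or offset > buffer_size:
        raise ValueError("offset must lie inside the buffer")
    remaining = buffer_size - offset
    # Assumption: clamp to the device limit rather than fail.
    return min(remaining, max_storage_buffer_range)


# A 10 GiB buffer bound at offset 0 on a device whose limit is 2**32 - 1 bytes.
print(descriptor_range(10 * 2**30, 0, 2**32 - 1))
```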
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
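
A minimal sketch of why the check has to be per chunk rather than on the total. `needs_realloc` and the list-of-sizes representation are hypothetical and do not mirror the allocator's actual data structures.

```python
from typing import Sequence


def needs_realloc(current_sizes: Sequence[int], new_sizes: Sequence[int]) -> bool:
    """Return True if any chunk must grow, regardless of the total size."""
    # More chunks than are currently allocated always requires new memory.
    if len(new_sizes) > len(current_sizes):
        return True
    # Compare chunk by chunk: one chunk growing is enough.
    return any(new > cur for new, cur in zip(new_sizes, current_sizes))


# Total shrinks from 150 to 140, but the second chunk grows from 50 to 60.
print(needs_realloc([100, 50], [80, 60]))
```

Comparing only `sum(new_sizes)` against `sum(current_sizes)` would miss this case.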
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
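
As a rough illustration of separating reasoning from the rest of a response, the sketch below pulls out `[THINK]` spans before anything else is parsed. It is not the parser added in the commit: the closing tag `[/THINK]` and the helper name `split_reasoning` are assumptions, and tool-call parsing is left out.

```python
import re
from typing import Tuple

# Assumption: reasoning is delimited by [THINK] ... [/THINK].
THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, remaining_content) for a model response."""
    # Collect the inside of every reasoning span, in order.
    reasoning = "".join(m.group(1) for m in THINK_RE.finditer(text)).strip()
    # What remains (regular content, possibly tool calls) is parsed afterwards.
    content = THINK_RE.sub("", text).strip()
    return reasoning, content


print(split_reasoning("[THINK]check the units[/THINK]The answer is 42."))
```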
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
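
The "only write when they change" step above is a common way to avoid needless rebuilds: if a generated file's content is identical, leaving it untouched keeps its timestamp and lets the build system skip dependents. A minimal sketch of the pattern with a hypothetical `write_if_changed` helper, not the shader generator's code (a later bullet in the same list notes the embedded shader cpp output ended up being written unconditionally):

```python
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content to path only if it differs; return True if written."""
    if path.exists() and path.read_text(encoding="utf-8") == content:
        # Identical content: leave the file and its mtime untouched.
        return False
    path.write_text(content, encoding="utf-8")
    return True


# Usage: the second call with identical content does not rewrite the file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "shader.hpp"
    first = write_if_changed(target, "// generated\n")
    second = write_if_changed(target, "// generated\n")
    print(first, second)
```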
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
ggerganov and others added 6 commits October 31, 2025 16:26
* CUDA: Volta tensor core support for MMF

* more generic checks for hardware support

* Update ggml/src/ggml-cuda/mmf.cuh

Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Aman Gupta <[email protected]>
* Model: Minimax M2

* Cleanup

* Cleanup pt. 2

* Cleanup pt. 3

* Update convert_hf_to_gguf_update.py - merge catch blocks

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Remove vocab models and test

* Remove all redundant hparam settings covered by TextModel

* Move super to start, don't set block_count

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Squashed: llama-model.cpp refactoring

* Fix formatting of attn / ffn / ffn_moe calls

* Fix import regression / unify spacing in models.h

* totally DID NOT miss those!

* Add missing qwen3vl(moe) models

* Add missing new .cpp files to build

* Remove extra semicolons

* Editor checker

* Update src/models/models.h

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions - No Performance Impact

All critical inference functions show zero measurable performance degradation:

llama_decode: 49,004,028 ns (no change) - Primary inference function remains stable
llama_encode: 12,329,253 ns (no change) - Encoder processing unaffected
llama_tokenize: 834,828 ns (no change) - Tokenization performance maintained
llama_model_load_from_file: 333,129,500 ns (no change) - Model loading stable
llama_batch_init: 257 ns (no change) - Batch initialization unchanged
llama_model_quantize: 6,891,676 ns (no change) - Quantization performance stable

Non-Critical Function Degradations

The identified performance regressions are limited to utility functions:

std::make_pair (llama-vocab.cpp): +0.13 ns (+0.058% response time)
std::make_unique (llama-graph.h): +0.12 ns (+0.117% throughput)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No degradation in inference throughput
llama_decode: 0% change (49,004,028 ns baseline maintained)
llama_encode: 0% change (12,329,253 ns baseline maintained)
llama_tokenize: 0% change (834,828 ns baseline maintained)

Reference Context: llama_decode shows no change from the baseline, so inference throughput is unaffected. For comparison, the reference case is a 2 ms llama_decode degradation, which costs about 7% tokens/sec.

2. Power Consumption - Negligible Impact

Binary-Level Analysis:
libllama.so: +0.0003% increase (306,979.28 nJ vs 306,978.33 nJ base)
libggml-base.so: 0% change (90,434.19 nJ)
libggml-cpu.so: 0% change (151,692.17 nJ)
libggml.so: 0% change (6,339.24 nJ)
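
The libllama.so percentage can be checked directly from the two energy figures. A small sketch (the helper name `percent_change` is illustrative):

```python
def percent_change(new: float, base: float) -> float:
    """Relative change of new versus base, expressed in percent."""
    return (new - base) / base * 100.0


# libllama.so: 306,979.28 nJ (this PR) vs 306,978.33 nJ (base)
print(f"{percent_change(306_979.28, 306_978.33):.4f}%")  # prints 0.0003%
```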

3. Quantization Efficiency - No Impact

Status: Quantization performance maintained
llama_model_quantize: 0% change in execution time
Quantization formats: No changes to Q4_0, Q4_1, Q8_0 implementations
GGUF loading: Model loading performance stable

4. Memory Usage - No Impact

Status: Memory management functions unchanged
KV Cache operations: No performance degradation detected
Memory allocation: GGML allocator performance stable
Batch memory: llama_batch_init shows no regression

5. Batch Processing - No Impact

Status: Batch processing efficiency maintained
llama_batch_init: 0% change (257 ns baseline)
Parallel processing: No degradation in batch execution paths
Dynamic batching: Batch size management unaffected

Root Cause Analysis

Template Utility Function Overhead

The performance regressions are isolated to C++ standard library template functions:
Branch prediction changes: Control flow analysis revealed reordered branch targets in std::make_pair
Compiler optimization variation: Different optimization passes between builds
PLT overhead: Dynamic linking costs in template instantiation

Build System Factors

Non-deterministic compilation: Compiler optimization decisions vary between builds
Template instantiation: Standard library function generation inconsistencies
Link-time optimization: Potential differences in LTO application

Action Items

Immediate Build Optimizations

Deterministic compilation flags: Standardize compiler optimization settings to prevent template function regression
Link-time optimization: Enable -flto to eliminate PLT overhead in template functions
Branch prediction hints: Add __builtin_expect annotations for stack protection branches

Template Performance Improvements

Explicit template specialization: Pre-instantiate common std::make_pair combinations used in vocabulary processing
Inline optimization: Configure build to inline standard library utility functions
Stack protection tuning: Consider selective -fno-stack-protector for performance-critical leaf functions

Build Process Enhancements

Assembly regression testing: Implement automated assembly comparison in CI/CD pipeline
Compiler version control: Lock compiler version to prevent optimization variations
Performance baseline tracking: Add micro-benchmarks for template utility functions

Conclusion

The analysis shows no performance impact on critical inference functions. The identified regressions are limited to standard library utility functions with negligible impact on overall system performance. The core LLaMA.cpp inference pipeline maintains stable performance across all measured KPIs.
