
Conversation

@DajanaV (Collaborator) commented Nov 1, 2025

Mirrored from ggml-org/llama.cpp#16922

revert: #16872

fix: #16860

The earlier fix (#16872) addressed this issue by simply disabling optimization for the affected shader, which is unnecessary for glslc versions that do not have the bug. This PR instead adds a new macro that checks whether the data types are the same and skips the cast in that case, so optimization stays enabled and performance is not affected.

The original failure: rope_norm_f16 cannot be compiled:

/bin/glslc -fshader-stage=compute --target-env=vulkan1.2 -O /home/cix/llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp -o /home/cix/llama.cpp/build/ggml/src/ggml-vulkan/vulkan-shaders.spv/rope_norm_f16.spv -DA_TYPE=float16_t -DD_TYPE=float16_t 

shaderc: internal error: compilation succeeded but failed to optimize: Expected input to have different bit width from Result Type: FConvert
%212 = OpFConvert %half %211

When glslc compiles the shader to SPIR-V with optimization, it emits an OpFConvert between two values of the same type. shaderc's optimizer considers this invalid, resulting in the compilation error above.

@jeffbolznv

reeselevine and others added 30 commits October 2, 2025 11:00
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…389)

* do not use more threads than physically available

* ensure n_threads > 0

Co-authored-by: Jeff Bolz <[email protected]>

---------

Co-authored-by: Jeff Bolz <[email protected]>
…rolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <[email protected]>
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
…quest (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are scheduled
to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
Only dst buffer is guaranteed to be an RPC buffer. Add check for the src
one.
…ers (#16418)

* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* add test model

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
This commit removes jina-reranker-v1-tiny-en model files that are no
longer present on Hugging Face.

The motivation for this is that it clears up the CI logs from 404 errors,
which can be a little confusing when looking at the logs for the first time.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649
* refactor sdk caching to minimize storage

* use correct action

* add myself as owner to /.github/actions/ [no ci]
* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <[email protected]>
* metal : ssm_scan minor opts

* metal : get_rows optimize

* metal : cpy optimize

* metal : ssm_conv opt

* metal : ssm_scan simplify

* metal : ssm_Scan opt
CISC and others added 10 commits November 1, 2025 11:01
* webui: recognize AsciiDoc files as valid text files

* webui: add an updated static webui build

* webui: add the updated dependency list

* webui: re-add an updated static webui build

This also reverts commit 742dbb837939c176a813868c268d28ebd3fafb7c.
* feat: Add setting to display message generation statistics

* chore: build static webui output
* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement mix/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
…iframe (#16757)

* webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog

Extended MarkdownContent to flag previewable code languages,
add a preview button alongside copy controls, manage preview
dialog state, and share styling for the new button group

Introduced CodePreviewDialog.svelte, a sandboxed iframe modal
for rendering HTML/JS previews with consistent dialog controls

* webui: fullscreen HTML preview dialog using bits-ui

* Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* webui: pedantic style tweak for CodePreviewDialog close button

* webui: remove overengineered preview language logic

* chore: update webui static build

---------

Co-authored-by: Aleksander Grygier <[email protected]>
# Conflicts:
#	ggml/src/ggml-vulkan/vulkan-shaders/rope_multi.comp
#	ggml/src/ggml-vulkan/vulkan-shaders/rope_neox.comp
#	ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp
@loci-agentic-ai commented

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions - No Performance Impact

  • llama_decode(): 49,003,816 ns response time (no change from base version)
  • llama_encode(): 12,329,201 ns response time (no change from base version)
  • llama_tokenize(): 834,830 ns response time (no change from base version)

All primary inference functions show identical performance metrics between versions, indicating no functional regressions in core processing paths.

Vocabulary Module - Minimal Degradation

  • std::make_pair in llama-vocab.cpp:922:928:
    • Response Time: 228 ns (+0.06% over base)
    • Throughput: 125 ns (+0.11% over base)
    • Bottleneck: 78 ns (+0.17% over base)
    • (the deltas are below the 1 ns display precision, so base and current round to the same values)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No measurable impact on inference throughput

  • Core inference functions (llama_decode, llama_encode, llama_tokenize) show zero performance change
  • The 0.11% degradation in the vocabulary make_pair function is negligible compared to the reference case of a 2 ms llama_decode slowdown, which causes a 7% tokens/second reduction
  • Estimated Impact: <0.01% change in tokens per second

2. Power Consumption - No Impact

Binary-Level Analysis:

  • build.bin.libllama.so: 306,979 nJ (0.0% change)
  • build.bin.libggml-base.so: 90,434 nJ (0.0% change)
  • build.bin.libggml-cpu.so: 151,692 nJ (0.0% change)
  • build.bin.libggml.so: 6,339 nJ (0.0% change)

All binaries maintain identical power consumption profiles.

3. Quantization Efficiency - No Impact

Status: No changes detected in quantization-related functions

  • llama_model_quantize() function shows no performance variations
  • Quantization format handling remains unchanged
  • GGML quantization backends maintain consistent performance

4. Memory Usage - No Impact

Status: Memory management functions show no performance changes

  • KV cache operations (llama_memory_* functions) maintain baseline performance
  • Memory allocation patterns unchanged in GGML allocators
  • Batch memory management shows no degradation

5. Batch Processing - No Impact

Status: Batch processing efficiency maintained

  • llama_batch_* functions show no performance variations
  • Dynamic batching algorithms unchanged
  • Parallel processing capabilities preserved

Root Cause Analysis

Vocabulary Module Changes

The minimal degradation in std::make_pair template instantiation within llama-vocab.cpp stems from:

  • Control Flow: Complex branching pattern with 12 basic blocks including PLT calls
  • Stack Operations: 80-byte stack frame with security checks (__stack_chk_fail)
  • Template Overhead: Multiple std::forward calls and pair construction
  • No Code Modifications: Function unchanged between versions, indicating compiler optimization variance

Action Items

Immediate Actions

  1. Monitor Vocabulary Performance: Track make_pair usage patterns in tokenization workflows
  2. Compiler Optimization Review: Evaluate template instantiation efficiency in vocabulary module
  3. Build Consistency: Ensure reproducible builds to minimize optimization variance

Code-Focused Optimizations

  1. Template Specialization: Consider explicit specialization for common make_pair usage patterns in vocabulary code
  2. Inline Optimization: Review compiler inlining decisions for vocabulary helper functions
  3. Stack Frame Reduction: Evaluate stack usage in vocabulary template functions

Build System Enhancements

  1. Optimization Flags: Review template-specific optimization settings
  2. Link-Time Optimization: Enable LTO for vocabulary module if not already active
  3. Profile-Guided Optimization: Consider PGO for frequently used vocabulary functions

Conclusion

The performance analysis reveals minimal impact on LLaMA.cpp inference capabilities. Core inference functions maintain identical performance profiles, with only sub-nanosecond variations in vocabulary utility functions. The changes represent measurement variance rather than functional regressions, ensuring stable inference performance for the ollama://smollm:135m model and similar workloads.
