
Conversation

@DajanaV (Collaborator) commented Nov 1, 2025

Mirrored from ggml-org/llama.cpp#16922

revert: #16872

fix: #16860

The earlier fix (#16872) addressed this issue by simply disabling optimization for the affected shader, which is unnecessary for glslc versions that do not have the bug. This PR instead adds a new macro that checks whether the data types are the same and skips the cast in that case, so optimization stays enabled and performance is not affected.

The original failure: rope_norm_f16 cannot be compiled:

/bin/glslc -fshader-stage=compute --target-env=vulkan1.2 -O /home/cix/llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp -o /home/cix/llama.cpp/build/ggml/src/ggml-vulkan/vulkan-shaders.spv/rope_norm_f16.spv -DA_TYPE=float16_t -DD_TYPE=float16_t 

shaderc: internal error: compilation succeeded but failed to optimize: Expected input to have different bit width from Result Type: FConvert
%212 = OpFConvert %half %211

When glslc compiles the shader to SPIR-V with optimization, it emits an OpFConvert between two values of the same type. shaderc's optimizer considers this invalid, resulting in the compilation error above.

@jeffbolznv

reeselevine and others added 30 commits October 2, 2025 11:00
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…389)

* do not use more threads than physically available

* ensure n_threads > 0

Co-authored-by: Jeff Bolz <[email protected]>

---------

Co-authored-by: Jeff Bolz <[email protected]>
…rolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <[email protected]>
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
…quest (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are scheduled
to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
Only dst buffer is guaranteed to be an RPC buffer. Add check for the src
one.
…ers (#16418)

* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* add test model

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
This commit removes jina-reranker-v1-tiny-en model files that are no
longer present on Hugging Face.

The motivation for this is that it clears up the CI logs from 404 errors,
which can be a little confusing when looking at the logs for the first time.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649
* refactor sdk caching to minimize storage

* use correct action

* add myself as owner to /.github/actions/ [no ci]
* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <[email protected]>
* metal : ssm_scan minor opts

* metal : get_rows optimize

* metal : cpy optimize

* metal : ssm_conv opt

* metal : ssm_scan simplify

* metal : ssm_Scan opt
CISC and others added 10 commits November 1, 2025 11:01
* webui: recognize AsciiDoc files as valid text files

* webui: add an updated static webui build

* webui: add the updated dependency list

* webui: re-add an updated static webui build

This also reverts commit 742dbb837939c176a813868c268d28ebd3fafb7c.
* feat: Add setting to display message generation statistics

* chore: build static webui output
* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement mix/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
…iframe (#16757)

* webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog

Extended MarkdownContent to flag previewable code languages,
add a preview button alongside copy controls, manage preview
dialog state, and share styling for the new button group

Introduced CodePreviewDialog.svelte, a sandboxed iframe modal
for rendering HTML/JS previews with consistent dialog controls

* webui: fullscreen HTML preview dialog using bits-ui

* Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* webui: pedantic style tweak for CodePreviewDialog close button

* webui: remove overengineered preview language logic

* chore: update webui static build

---------

Co-authored-by: Aleksander Grygier <[email protected]>
# Conflicts:
#	ggml/src/ggml-vulkan/vulkan-shaders/rope_multi.comp
#	ggml/src/ggml-vulkan/vulkan-shaders/rope_neox.comp
#	ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp
@loci-agentic-ai commented

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions - No Performance Impact

  • llama_decode(): 49,003,816 ns response time (no change from base version)
  • llama_encode(): 12,329,201 ns response time (no change from base version)
  • llama_tokenize(): 834,830 ns response time (no change from base version)

All primary inference functions show identical performance metrics between versions, indicating no functional regressions in core processing paths.

Vocabulary Module - Minimal Degradation

  • std::make_pair in llama-vocab.cpp:922:928:
    • Response Time: 228 ns (+0.06% over base)
    • Throughput: 125 ns (+0.11% over base)
    • Bottleneck: 78 ns (+0.17% over base)
    • (the deltas are below the 1 ns display precision, so base and current round to the same values)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No measurable impact on inference throughput

  • Core inference functions (llama_decode, llama_encode, llama_tokenize) show zero performance change
  • The 0.11% degradation in the vocabulary make_pair function is negligible compared to the reference case of a 2 ms llama_decode slowdown, which causes a 7% tokens/second reduction
  • Estimated Impact: <0.01% change in tokens per second

2. Power Consumption - No Impact

Binary-Level Analysis:

  • build.bin.libllama.so: 306,979 nJ (0.0% change)
  • build.bin.libggml-base.so: 90,434 nJ (0.0% change)
  • build.bin.libggml-cpu.so: 151,692 nJ (0.0% change)
  • build.bin.libggml.so: 6,339 nJ (0.0% change)

All binaries maintain identical power consumption profiles.

3. Quantization Efficiency - No Impact

Status: No changes detected in quantization-related functions

  • llama_model_quantize() function shows no performance variations
  • Quantization format handling remains unchanged
  • GGML quantization backends maintain consistent performance

4. Memory Usage - No Impact

Status: Memory management functions show no performance changes

  • KV cache operations (llama_memory_* functions) maintain baseline performance
  • Memory allocation patterns unchanged in GGML allocators
  • Batch memory management shows no degradation

5. Batch Processing - No Impact

Status: Batch processing efficiency maintained

  • llama_batch_* functions show no performance variations
  • Dynamic batching algorithms unchanged
  • Parallel processing capabilities preserved

Root Cause Analysis

Vocabulary Module Changes

The minimal degradation in std::make_pair template instantiation within llama-vocab.cpp stems from:

  • Control Flow: Complex branching pattern with 12 basic blocks including PLT calls
  • Stack Operations: 80-byte stack frame with security checks (__stack_chk_fail)
  • Template Overhead: Multiple std::forward calls and pair construction
  • No Code Modifications: Function unchanged between versions, indicating compiler optimization variance

Action Items

Immediate Actions

  1. Monitor Vocabulary Performance: Track make_pair usage patterns in tokenization workflows
  2. Compiler Optimization Review: Evaluate template instantiation efficiency in vocabulary module
  3. Build Consistency: Ensure reproducible builds to minimize optimization variance

Code-Focused Optimizations

  1. Template Specialization: Consider explicit specialization for common make_pair usage patterns in vocabulary code
  2. Inline Optimization: Review compiler inlining decisions for vocabulary helper functions
  3. Stack Frame Reduction: Evaluate stack usage in vocabulary template functions

Build System Enhancements

  1. Optimization Flags: Review template-specific optimization settings
  2. Link-Time Optimization: Enable LTO for vocabulary module if not already active
  3. Profile-Guided Optimization: Consider PGO for frequently used vocabulary functions

Conclusion

The performance analysis reveals minimal impact on LLaMA.cpp inference capabilities. Core inference functions maintain identical performance profiles, with only sub-nanosecond variations in vocabulary utility functions. The changes represent measurement variance rather than functional regressions, ensuring stable inference performance for the ollama://smollm:135m model and similar workloads.
