UPSTREAM PR #16981: mtmd: improve struct initialization #54

DajanaV · 2025-11-03T21:34:16Z

WIP

* implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit

* feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add tiling support for idefices3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a conicidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]> * add test model --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]>

ggml-org/llama.cpp#15361 added new metric exported, but I've missed this doc.

This commit updates the leftover handling in ggml_vec_scale_f32. The motivation for this is that the code currently incorrectly assumes there would be fewer than ggml_f32_epr leftover elements. However, since the main loop processes 2*ggml_f32_epr elements per iteration , there can be up to (2*ggml_f32_epr - 1) leftover elements. The original single-pass leftover code could only process ggml_f32_epr elements, leaving some elements unscaled. Example scenario with 256-bit SVE: ``` ggml_f32_epr = 8 (elements per register) ggml_f32_step = 16 (two registers per iteration) n = 25 np = 16 leftovers = 9 elements (16-24) Original : processes only elements 16-23, misses element 24 This commit : loop processes elements 16-23, then element 24 ``` Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630

This commit removes jina-reranker-v1-tiny-en model files that are no longer present on Hugging Face. The motivation for this that it clears up the CI logs from 404 errors which can be a little confusing when looking at the logs the first time. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649

* refactor sdk caching to minimize storage * use correct action * add myself as owner to /.github/actions/ [no ci]

* fix: Fix duplicate fake image before token on first slice Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use double-newline before overview image Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove incorrect newline at the end of granite chat template gen prompt There should not be one, even for the language models. Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> * tests: Remove bad newline from granite chat template test (legacy) Branch: GraniteDoclingStopping Signed-off-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: Gabe Goodhart <[email protected]>

* implement --no-host to disable host buffer * fix equal_mparams * move no-host enumeration order together with other model params --------- Co-authored-by: slaren <[email protected]>

* metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt

* tests : add -INF blocks to the KQ mask in the FA tests * cont : bump -INF block size to 64 Co-authored-by: Jeff Bolz <[email protected]> * ggml : prevent division by zero in FA CPU op --------- Co-authored-by: Jeff Bolz <[email protected]>

* metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement

Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <[email protected]>

* webui : added download action (#13552) * webui : import and export (for all conversations) * webui : fixed download-format, import of one conversation * webui : add ExportedConversations type for chat import/export * feat: Update naming & order * chore: Linting * webui : Updated static build output --------- Co-authored-by: Aleksander Grygier <[email protected]>

* server : add /v1/health endpoint * cont : update readme

* llama : support LiquidAI LFM2-MoE hybrid model Add support for [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B) model. For more information about models, please read [the blog post](https://www.liquid.ai/company/news). [HF PR](huggingface/transformers#41401) [GGUFs](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) * Do not use defaultdict * Address PR feedback

…#16452) * Add profiling * More detailed profiling * Rework command submission to avoid global locks * Update wait handling * try new method of waiting on futures * Add serializing of command submission in some cases * Add new pool for timestamp queries and clean up logging * Serialize command submission in CI and leave a TODO note * Update webgpu CI * Add myself as WebGPU codeowner * Deadlock avoidance * Leave WebGPU/Vulkan CI serialized * Fix divide by 0 * Fix logic in division by inflight_threads * Update CODEOWNERS and remove serialize submit option

* metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic

Co-authored-by: DevAI <[email protected]>

* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing - Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing - Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops - Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic - Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages * refactor: implement streaming-aware universal reasoning parser Remove the streaming mode limitation from --reasoning-format by refactoring try_parse_reasoning() to handle incremental parsing of <think> tags across all formats. - Rework try_parse_reasoning() to track whitespace, partial tags, and multiple reasoning segments, allowing proper separation of reasoning_content and content in streaming mode - Parse reasoning tags before tool call handling in content-only and Llama 3.x formats to ensure inline <think> blocks are captured correctly - Change default reasoning_format from 'auto' to 'deepseek' for consistent behavior - Add 'deepseek-legacy' option to preserve old inline behavior when needed - Update CLI help and documentation to reflect streaming support - Add parser tests for inline <think>...</think> segments The parser now continues processing content after </think> closes instead of stopping, enabling proper message.reasoning_content and message.content separation in both streaming and non-streaming modes. Fixes the issue where streaming responses would dump everything (including post-thinking content) into reasoning_content while leaving content empty. * refactor: address review feedback from allozaur - Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component - Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse - Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples Co-authored-by: Aleksander Grygier <[email protected]> * refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed) - store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block - inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication - repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows * refactor: address review feedback from ngxson * debug: say goodbye to curl -N, hello one-click raw stream - adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte Co-authored-by: Aleksander Grygier <[email protected]> * webui: add Storybook example for raw LLM output and scope reasoning format toggle per story - Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample - Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example * npm run format * chat-parser: address review feedback from ngxson Co-authored-by: Xuan Son Nguyen <[email protected]> --------- Co-authored-by: Aleksander Grygier <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]>

…odules (#16367) * model: EmbeddingGemma sentence-transformers dense linear projections support * model: add support for EmbeddingGemma SentenceTransformers dense linear projections Adding support for the Dense modules used in EmbeddingGemma models. EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone. See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/ * model: add support for EmbeddingGemma SentenceTransformers dense linear projections - converting model with dense-layers is optional - introduced dense config params * Update convert_hf_to_gguf.py Co-authored-by: Daniel Bevenius <[email protected]> * fixed formatting issues * Update src/llama-graph.cpp Co-authored-by: Georgi Gerganov <[email protected]> * - removed pooling_type_opt, always allow overriding pooling_type - asserts checking dense features dims * fix python lint * fix ubuntu gcc build warning * - fixed thread-safety test - moved asserts to load_hparams * - tidying up code - simplifying graph-context expecting both dense weights * minor : add TODO --------- Co-authored-by: Daniel Bevenius <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>

* refactor to support soft_max_ext * fix error and support soft_max_back * rm unused functions * fix format issue --------- Co-authored-by: Zhang Jianyu <[email protected]>

* CANN: improve ACL graph matching Record `ne` and `nb` information for src tensors and include them in the graph matching check. This enhances the robustness of ACL graph matching by preventing incorrect matches when src tensors share the same data address but differ in shape or stride. * CANN: add op_params match

* Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <[email protected]> * Add JANUS_PRO constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <[email protected]> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <[email protected]> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <[email protected]> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <[email protected]> Co-authored-by: Xuan-Son Nguyen <[email protected]> Co-authored-by: Xuan-Son Nguyen <[email protected]>

…mode and coverage (#16936) * tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage * tests: init gf and filter out fusion tests for support mode * tests: filter out fusion cases before calling eval_support * tests: filter out fusion cases from show_test_coverage as well, fix lint

* webui : Revised LaTeX formula recognition * webui : Further examples containg amounts * webui : vitest for maskInlineLaTeX * webui: Moved preprocessLaTeX to lib/utils * webui: LaTeX in table-cells * chore: update webui build output (use theirs) * webui: backslash in LaTeX-preprocessing * chore: update webui build output * webui: look-behind backslash-check * chore: update webui build output * Apply suggestions from code review Code maintenance (variable names, code formatting, string handling) Co-authored-by: Aleksander Grygier <[email protected]> * webui: Moved constants to lib/constants. * webui: package woff2 inside base64 data * webui: LaTeX-line-break in display formula * chore: update webui build output * webui: Bugfix (font embedding) * webui: Bugfix (font embedding) * webui: vite embeds assets * webui: don't suppress 404 (fonts) * refactor: KaTeX integration with SCSS Moves KaTeX styling to SCSS for better customization and font embedding. This change includes: - Adding `sass` as a dev dependency. - Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding. - Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables. * fix: LaTeX processing within blockquotes * webui: update webui build output --------- Co-authored-by: Aleksander Grygier <[email protected]>

…ter)Feature/sycl repeat back opt (#16869) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * SYCL: optimize repeat_back kernel * Remove Hebrew comment from repeat_back.cpp * Remove comments for code clarity Removed comments to clean up the code. * Fix formatting in ggml-sycl.cpp * Formatted lambda according to legacy style. No logic changes * Remove blank line in repeat_back.cpp Remove unnecessary blank line before assigning acc to dst_dd.

* sync: minja * Sync ochafik/minja#7 (MinMax M2)

* Fix test-quantize-fns f16 and q4_0 failed when use LSX * Fix LoongArch set float intrinsic when use LSX/LASX

* mtmd: pad mask for qwen2.5vl * improve

* server : add props.model_alias * webui : npm run format

This commit modifies the script `run-org-model.py` to ensure that the model configuration is explicitly passed to the `from_pretrained` method when loading the model. It also removes a duplicate configuration loading which was a mistake. The motivation for this change is that enables the config object to be modified and then passed to the model loading function, which can be useful when testing new models.

loci-agentic-ai · 2025-11-03T22:21:07Z

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions - No Performance Impact

All critical LLaMA.cpp functions show zero measurable performance changes between versions:

Primary Inference Pipeline:

llama_decode: 43,183,416 ns (no change) - Core token processing function
llama_encode: 10,861,845 ns (no change) - Encoder model processing
llama_tokenize: 876,476 ns (no change) - Text-to-token conversion

Model Management:

llama_model_load_from_file: 364,760,600 ns (no change) - Model loading
llama_batch_init: 250 ns (no change) - Batch initialization
llama_memory_clear: 49 ns (no change) - Memory management

Function Modification Status: None of the critical functions were modified in this version.

Key Performance Indicator Impact Analysis

1. Tokens Per Second - No Impact

Status: No changes detected in inference-critical functions

llama_decode (primary inference): 0% change in response time and throughput
llama_tokenize (tokenization): 0% change in response time and throughput
llama_encode (encoder processing): 0% change in response time and throughput

Conclusion: Token processing throughput remains unchanged. No impact on the 7% tokens/second degradation reference metric.

2. Power Consumption - Minimal Impact

Affected Binaries:

libllama.so: Negligible change (280,661.86 vs 280,661.81 nJ, <0.001%)
llama-run: Minor decrease (-0.55 nJ from 266,867.64 nJ, <0.001%)
llama-cvector-generator: Minor decrease (-0.45 nJ from 314,116.33 nJ, <0.001%)

Analysis: Power consumption remains stable across all binaries with sub-nanojoule variations.

3. Quantization Efficiency - No Impact

Status: No changes in quantization-related functions

llama_model_quantize: Not analyzed (function not present in performance data)
Core inference functions maintain identical performance profiles
No modifications to quantization pathways detected

4. Memory Usage - No Impact

Status: Memory management functions unchanged

llama_memory_clear: 49 ns (0% change)
llama_batch_init: 250 ns (0% change)
KV cache and memory allocation functions show no performance variations

5. Batch Processing - No Impact

Status: Batch processing pipeline unchanged

llama_batch_init: 250 ns (0% change)
llama_decode (batch processing): 43,183,416 ns (0% change)
Batch allocation and management functions maintain identical performance

Root Cause Analysis

Primary Changes: The detected performance variations stem from:

MTMD module struct initialization improvements (PR UPSTREAM PR #16981: mtmd: improve struct initialization #54)
Compiler optimization cascading effects on template-heavy code
Standard library function optimizations (_RegexMask constructor: -0.082%)

Impact Scope: Changes are isolated to:

Standard library template instantiation patterns
MTMD multimodal processing components
Compiler optimization improvements in non-critical paths

Action Items

Code Optimization

Struct initialization patterns: Apply aggregate initialization improvements to other modules following MTMD example
Template optimization: Review template-heavy code for similar compiler optimization opportunities

Build System

Compiler flags: Maintain current optimization settings that enabled the positive cascading effects
Link-time optimization: Current LTO settings appear effective for template optimization

Performance Validation

Inference pipeline integrity: Core LLaMA functions maintain stable performance profiles
Multimodal functionality: Verify MTMD improvements don't affect primary inference paths

Summary

The version comparison reveals exceptional stability in LLaMA.cpp's core inference pipeline. All critical functions for tokenization, model processing, memory management, and batch processing show zero performance degradation. The minor improvements detected are beneficial side effects of code modernization in auxiliary modules, with no negative impact on primary inference capabilities.

reeselevine and others added 30 commits October 4, 2025 20:59

ggml webgpu: actually add softmax, fix rms_norm offset (#16400)

3526657

* implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit

server: update readme to mention n_past_max metric (#16436)

c5fef0f

ggml-org/llama.cpp#15361 added new metric exported, but I've missed this doc.

nix : removed metal for nix (#16118)

1d49ca3

ggml : fix unaligned access in AMX code (#16315)

a23b9bd

ci : refactor sdk caching to minimize storage (#16414)

3a002af

* refactor sdk caching to minimize storage * use correct action * add myself as owner to /.github/actions/ [no ci]

llama : add --no-host to disable host buffers (#16310)

3df2244

* implement --no-host to disable host buffer * fix equal_mparams * move no-host enumeration order together with other model params --------- Co-authored-by: slaren <[email protected]>

metal : various optimizations + refactoring (#16446)

8ae32dc

* metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt

metal : add support for non-padded FA KV (#16148)

0a319bb

* metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement

memory : use sequential equal splits for recurrent modules (#16442)

0123ff3

rpc : update documentation (#16441)

c61ae20

Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <[email protected]>

presets : fix pooling param for embedding models (#16455)

ef4c5b8

server : add /v1/health endpoint (#16461)

df1b612

* server : add /v1/health endpoint * cont : update readme

server : improve context checkpoint logic (#16440)

7fdd16b

metal : mark FA blocks (#16372)

b2c08c9

* metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic

server : fix cancel pending task (#16467)

d2ee056

Co-authored-by: DevAI <[email protected]>

Disable CUDA host buffers on integrated GPUs (#16308)

9d08828

[SYCL] refactor soft_max, add soft_max_back (#16472)

b260213

* refactor to support soft_max_ext * fix error and support soft_max_back * rm unused functions * fix format issue --------- Co-authored-by: Zhang Jianyu <[email protected]>

kleidiai: kernel interface refactoring (#16460)

d80d6d2

ci: add ARM64 Kleidiai build and test support (#16462)

2c0d875

ravenouse and others added 14 commits November 2, 2025 22:08

ci : disable failing riscv cross build (#16952)

dd52868

sync: minja (glm 4.6 & minmax m2 templates) (#16949)

ee3a5a1

* sync: minja * Sync ochafik/minja#7 (MinMax M2)

ggml : LoongArch fixes (#16958)

fcfce04

* Fix test-quantize-fns f16 and q4_0 failed when use LSX * Fix LoongArch set float intrinsic when use LSX/LASX

mtmd: pad mask for qwen2.5vl (#16954)

bf7b0c9

* mtmd: pad mask for qwen2.5vl * improve

mtmd: add --image-min/max-tokens (#16921)

070ff4d

ggml: CUDA: add head size 72 for flash-attn (#16962)

622cd01

server : add props.model_alias (#16943)

48bd265

* server : add props.model_alias * webui : npm run format

fix: Viewing multiple PDF attachments (#16974)

e7da30b

mtmd: improve struct initialization

587af77

DajanaV temporarily deployed to PROD__AL_DEMO November 3, 2025 21:34 — with GitHub Actions Inactive

DajanaV closed this Nov 3, 2025

DajanaV deleted the upstream-PR16981-branch_ngxson-xsn/mtmd_better_init_struct branch November 3, 2025 22:57

DajanaV restored the upstream-PR16981-branch_ngxson-xsn/mtmd_better_init_struct branch November 3, 2025 22:58

DajanaV had a problem deploying to PROD__AL_DEMO November 3, 2025 22:59 — with GitHub Actions Failure

DajanaV had a problem deploying to PROD__AL_DEMO November 3, 2025 23:08 — with GitHub Actions Error

DajanaV deleted the upstream-PR16981-branch_ngxson-xsn/mtmd_better_init_struct branch November 3, 2025 23:11

DajanaV restored the upstream-PR16981-branch_ngxson-xsn/mtmd_better_init_struct branch November 3, 2025 23:12

DajanaV temporarily deployed to PROD__AL_DEMO November 3, 2025 23:12 — with GitHub Actions Inactive

DajanaV reopened this Nov 3, 2025

DajanaV closed this Nov 3, 2025

DajanaV deleted the upstream-PR16981-branch_ngxson-xsn/mtmd_better_init_struct branch November 4, 2025 00:15

DajanaV restored the upstream-PR16981-branch_ngxson-xsn/mtmd_better_init_struct branch November 4, 2025 00:16

DajanaV temporarily deployed to PROD__AL_DEMO November 4, 2025 00:16 — with GitHub Actions Inactive

DajanaV temporarily deployed to PROD__AL_DEMO November 4, 2025 00:53 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UPSTREAM PR #16981: mtmd: improve struct initialization #54

UPSTREAM PR #16981: mtmd: improve struct initialization #54

Uh oh!

DajanaV commented Nov 3, 2025

Uh oh!

loci-agentic-ai bot commented Nov 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

90 participants