
Conversation

@DajanaV (Collaborator) commented Nov 1, 2025

Mirrored from ggml-org/llama.cpp#15550

This PR introduces a new option, --target-bpw, implementing an optimised quant-type selection algorithm that automatically determines per-tensor quantisation types to achieve a target bits-per-weight (bpw) with minimal estimated quality loss.

The selection algorithm (a simplified sketch follows this list):

  • builds a candidate set of quant types (K or IQ types)
  • for each layer/tensor, it simulates quantise→dequantise per candidate type and estimates the error with a weighted MSE error function. If the imatrix includes activations, it adds a bias penalty term to better reflect forward-pass impact, making both the error estimation and the resulting quant-type selection more accurate
  • it filters candidates to the Pareto frontier (lowest error for a given size), then starts from the smallest-bpw mix and upgrades to larger formats based on the best error reduction per added bit, until the global bpw budget is reached
  • returns a map of tensor name → ggml_type overrides, which the main quantisation pass uses. If the minimum achievable bpw already exceeds the target, it returns that minimum.
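
A hedged sketch of the greedy mix selection described above; the names and helper structures are illustrative and are not the actual target_bpw_type() implementation:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct candidate {
    int    type;   // stands in for a ggml_type value
    double bpw;    // bits per weight this type costs for the tensor
    double error;  // estimated weighted-MSE (+ bias penalty) after quantise -> dequantise
};

struct tensor_info {
    std::string            name;
    int64_t                n_weights;
    std::vector<candidate> cands; // assumed Pareto-filtered, sorted by ascending bpw
};

std::map<std::string, int> select_types(const std::vector<tensor_info> & tensors, double target_bpw) {
    std::vector<size_t> choice(tensors.size(), 0); // current candidate index per tensor

    int64_t n_total   = 0;
    double  used_bits = 0.0;
    for (size_t i = 0; i < tensors.size(); ++i) {
        n_total   += tensors[i].n_weights;
        used_bits += tensors[i].cands[0].bpw * tensors[i].n_weights; // start from the smallest mix
    }
    const double budget_bits = target_bpw * n_total;

    // greedily upgrade the tensor offering the best error reduction per added bit
    while (true) {
        double best_gain = 0.0;
        size_t best_i    = tensors.size();
        double best_cost = 0.0;
        for (size_t i = 0; i < tensors.size(); ++i) {
            if (choice[i] + 1 >= tensors[i].cands.size()) {
                continue; // already at the largest candidate
            }
            const candidate & cur = tensors[i].cands[choice[i]];
            const candidate & nxt = tensors[i].cands[choice[i] + 1];
            const double extra_bits = (nxt.bpw - cur.bpw) * tensors[i].n_weights;
            if (extra_bits <= 0.0 || used_bits + extra_bits > budget_bits) {
                continue; // would blow the global bpw budget
            }
            const double gain = (cur.error - nxt.error) / extra_bits;
            if (gain > best_gain) {
                best_gain = gain;
                best_i    = i;
                best_cost = extra_bits;
            }
        }
        if (best_i == tensors.size()) {
            break; // no affordable upgrade left
        }
        ++choice[best_i];
        used_bits += best_cost;
    }

    std::map<std::string, int> overrides; // tensor name -> chosen type
    for (size_t i = 0; i < tensors.size(); ++i) {
        overrides[tensors[i].name] = tensors[i].cands[choice[i]].type;
    }
    return overrides;
}
```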

The target_bpw_type() function will consider all quantisable tensors (embeddings, output, etc.) unless --output-tensor-type, --token-embedding-type, and/or --tensor-type are also used, in which case those options take precedence.

--prune-layers can also be used in the same run, in which case target_bpw_type() will skip the pruned layers and only consider the remaining tensors against the total bpw budget.

Important note:

An imatrix that includes activations is required for the algorithm to work. At the time of writing, this is only available by generating the file using #14891 with the --output-format gguf option.

Typical usage: `llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m`

Special thanks to @ddh0, @AesSedai and @compilade for their contributions during the development of this PR.

PR created in draft until testing is completed

EAddario and others added 30 commits October 5, 2025 20:16
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```
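
A scalar stand-in sketch of the two-pass leftover handling described above; ggml_f32_epr and the 2*epr main-loop stride come from the commit message, while the scalar tail here merely approximates the predicated vector operation the real SVE kernel would use:

```cpp
// Scalar approximation of the fixed leftover handling: the main loop consumes
// 2*epr elements per iteration, so up to (2*epr - 1) elements can remain and
// need a second epr-wide pass plus a final tail.
void vec_scale_f32_sketch(int n, float * y, float s, int epr) {
    const int step = 2 * epr;           // elements handled per main-loop iteration
    const int np   = (n / step) * step; // elements covered by the main loop

    // main loop (the real kernel scales two SVE registers per iteration)
    for (int i = 0; i < np; i += step) {
        for (int j = 0; j < step; ++j) {
            y[i + j] *= s;
        }
    }

    // leftover pass 1: a full register's worth, e.g. elements 16-23 in the example above
    int i = np;
    if (n - i >= epr) {
        for (int j = 0; j < epr; ++j) {
            y[i + j] *= s;
        }
        i += epr;
    }

    // leftover pass 2: the remaining tail, e.g. element 24 in the example above
    for (; i < n; ++i) {
        y[i] *= s;
    }
}
```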

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
This commit removes jina-reranker-v1-tiny-en model files that are no
longer present on Hugging Face.

The motivation for this is that it clears up the 404 errors in the CI logs,
which can be a little confusing when looking at the logs for the first time.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649
* refactor sdk caching to minimize storage

* use correct action

* add myself as owner to /.github/actions/ [no ci]
* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <[email protected]>
* metal : ssm_scan minor opts

* metal : get_rows optimize

* metal : cpy optimize

* metal : ssm_conv opt

* metal : ssm_scan simplify

* metal : ssm_scan opt
* tests : add -INF blocks to the KQ mask in the FA tests

* cont : bump -INF block size to 64

Co-authored-by: Jeff Bolz <[email protected]>

* ggml : prevent division by zero in FA CPU op

---------

Co-authored-by: Jeff Bolz <[email protected]>
* metal : pad K, V and Mask when needed

* cont : simplify

* cuda : add TODO about KV padding requirement

* metal : add comments

* metal : remove mask padding requirement
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <[email protected]>
* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* webui : Updated static build output

---------

Co-authored-by: Aleksander Grygier <[email protected]>
* server : add /v1/health endpoint

* cont : update readme
* llama : support LiquidAI LFM2-MoE hybrid model

Add support for the [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B) model.
For more information about the models, please read [the blog post](https://www.liquid.ai/company/news).

[HF PR](huggingface/transformers#41401)
[GGUFs](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF)

* Do not use defaultdict

* Address PR feedback
…#16452)

* Add profiling

* More detailed profiling

* Rework command submission to avoid global locks

* Update wait handling

* try new method of waiting on futures

* Add serializing of command submission in some cases

* Add new pool for timestamp queries and clean up logging

* Serialize command submission in CI and leave a TODO note

* Update webgpu CI

* Add myself as WebGPU codeowner

* Deadlock avoidance

* Leave WebGPU/Vulkan CI serialized

* Fix divide by 0

* Fix logic in division by inflight_threads

* Update CODEOWNERS and remove serialize submit option
* metal : better unroll in the FA kernels

* metal : index FA blocks

* tests : restore [no ci]

* metal : prevent division by zero in FA kernels

* metal : fix -INF detection logic
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
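
A minimal, hypothetical sketch of the incremental <think> handling described above (the class and method names are invented; the real logic lives in try_parse_reasoning()): buffer the streamed chunks, hold back any partial tag at a chunk boundary, route text inside the tags to reasoning_content, and keep appending everything after </think> to content:

```cpp
#include <algorithm>
#include <string>

// Hypothetical incremental parser: feed() receives streamed chunks and splits them
// into reasoning vs. regular content, keeping back any partial "<think>"/"</think>"
// tag that may be completed by the next chunk.
struct reasoning_splitter {
    std::string reasoning_content;
    std::string content;

    void feed(const std::string & chunk) {
        buf += chunk;
        for (;;) {
            if (!in_think) {
                size_t open = buf.find("<think>");
                if (open == std::string::npos) {
                    flush_safely("<think>", content);
                    return;
                }
                content += buf.substr(0, open);
                buf.erase(0, open + 7);   // strlen("<think>")
                in_think = true;
            } else {
                size_t close = buf.find("</think>");
                if (close == std::string::npos) {
                    flush_safely("</think>", reasoning_content);
                    return;
                }
                reasoning_content += buf.substr(0, close);
                buf.erase(0, close + 8);  // strlen("</think>")
                in_think = false;         // keep parsing: content after </think> still flows
            }
        }
    }

private:
    std::string buf;
    bool        in_think = false;

    // Emit everything except a suffix that could be the start of `tag`,
    // so a tag split across two chunks is not lost.
    void flush_safely(const std::string & tag, std::string & out) {
        size_t keep = 0;
        for (size_t len = std::min(buf.size(), tag.size() - 1); len > 0; --len) {
            if (buf.compare(buf.size() - len, len, tag, 0, len) == 0) {
                keep = len;
                break;
            }
        }
        out += buf.substr(0, buf.size() - keep);
        buf.erase(0, buf.size() - keep);
    }
};
```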

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <[email protected]>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <[email protected]>

---------

Co-authored-by: Aleksander Grygier <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
ggerganov and others added 6 commits October 31, 2025 16:26
* CUDA: Volta tensor core support for MMF

* more generic checks for hardware support

* Update ggml/src/ggml-cuda/mmf.cuh

Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Aman Gupta <[email protected]>
* Model: Minimax M2

* Cleanup

* Cleanup pt. 2

* Cleanup pt. 3

* Update convert_hf_to_gguf_update.py - merge catch blocks

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Remove vocab models and test

* Remove all redundant hparam settings covered by TextModel

* Move super to start, don't set block_count

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Squashed: llama-model.cpp refactoring

* Fix formatting of attn / ffn / ffn_moe calls

* Fix import regression / unify spacing in models.h

* totally DID NOT miss those!

* Add missing qwen3vl(moe) models

* Add missing new .cpp files to build

* Remove extra semicolons

* Editor checker

* Update src/models/models.h

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Critical Functions

Critical Function Performance Status

Core Inference Functions - No Performance Impact

  • llama_decode: Response Time: 49,003,776 ns (no change)
  • llama_tokenize: Response Time: 834,825 ns (no change)
  • llama_model_quantize: Response Time: 6,891,639 ns (no change)
  • llama_batch_init: Response Time: 257 ns (no change)
  • llama_memory_clear: Response Time: 49 ns (no change)

Affected Functions - Minimal Impact

  • __copy_move_b: +0.084 ns response time (+0.042%)
  • _M_default_append: +0.128 ns bottleneck (+0.113%)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No degradation in inference performance
Analysis: Core tokenization and inference functions (llama_decode, llama_encode, llama_tokenize) show no measurable performance changes. The reference case of a 7% tokens-per-second reduction from a 2 ms llama_decode slowdown does not apply here.

Affected Functions: None of the critical inference path functions show performance degradation.

2. Power Consumption - Negligible Impact

Binary-Level Analysis:

  • build.bin.libllama.so: -0.298 nJ reduction (-0.0% change)
  • build.bin.libggml-base.so: No change (0.0%)
  • build.bin.libggml-cpu.so: No change (0.0%)
  • build.bin.libggml.so: No change (0.0%)

Total Power: ~556,443 nJ across all binaries with negligible overall change.

3. Quantization Efficiency - Enhanced Capability

Status: Significant improvement in quantization capabilities
New Features:

  • Target BPW Algorithm: Automatic quantization type selection for optimal bits-per-weight
  • Pareto Optimization: Convex hull algorithm for error-size trade-offs
  • Multi-threaded Processing: Parallel tensor analysis with signal handling

Performance Impact: The llama_model_quantize function maintains identical performance (6,891,639 ns) despite substantial algorithmic enhancements.

4. Memory Usage - Increased During Quantization

Status: Additional memory structures for enhanced quantization
New Data Structures:

  • Activations Data: std::unordered_map<std::string, std::vector<float>>
  • Statistics Data: std::unordered_map<std::string, std::vector<float>>
  • BPW State Management: Checkpoint and resume functionality

Runtime Impact: Memory overhead occurs only during the quantization process, with no impact on inference.
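
A rough sketch of what those containers might look like on the quantisation side; the struct and field names are illustrative, derived from the description above rather than read from the source:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative containers, keyed by tensor name, matching the description above.
struct bpw_quant_state {
    // per-tensor activation data from the GGUF imatrix (needed for the bias penalty)
    std::unordered_map<std::string, std::vector<float>> activations;
    // per-tensor importance statistics used by the weighted-MSE error estimate
    std::unordered_map<std::string, std::vector<float>> statistics;
    // hypothetical checkpoint/resume bookkeeping for the candidate evaluation pass
    std::string                                          last_tensor_done;
    std::unordered_map<std::string, std::vector<double>> candidate_errors;
};
```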

5. Batch Processing - No Impact

Status: Batch processing functions maintain identical performance
Analysis:

  • llama_batch_init: 257 ns (no change)
  • Batch allocation and management: No performance degradation
  • Parallel processing: Enhanced through new multi-threaded quantization algorithm

Root Cause Analysis

Performance Degradation Source

The minimal performance impact (0.084 ns in __copy_move_b) stems from:

1. Code Size Inflation

  • 1,487 line additions increase binary size
  • Instruction cache pressure affects unrelated functions
  • Branch prediction changes due to altered code layout

2. Memory Allocation Patterns
The _M_default_append function shows increased bottleneck time due to:

  • Enhanced vector operations for token data structures
  • Additional memory allocations for new data containers
  • Complex object construction in vocabulary management

Control Flow Analysis

The _M_default_append CFG shows:

  • 27 basic blocks with complex exception handling
  • Multiple PLT calls for memory allocation (185-279 ns each)
  • Exception safety mechanisms adding overhead to vector operations

Action Items

Immediate Optimizations

1. Binary Size Management

  • Enable Link-Time Optimization (LTO) to reduce code size impact
  • Profile instruction cache behavior during compilation
  • Consider function placement optimization for critical paths

2. Memory Allocation Optimization

  • Implement memory pool allocation for frequent vector operations
  • Pre-allocate vocabulary containers to reduce dynamic allocation (see the sketch after this list)
  • Optimize token data structure layout for cache efficiency
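
A small, generic illustration of the pre-allocation point above (not tied to the actual llama.cpp containers): reserving capacity up front replaces repeated reallocations in a hot loop with a single allocation:

```cpp
#include <string>
#include <vector>

// Hypothetical token record; the real llama.cpp structures differ.
struct token_entry {
    std::string text;
    float       score;
};

std::vector<token_entry> build_vocab(size_t n_tokens) {
    std::vector<token_entry> vocab;
    vocab.reserve(n_tokens); // one allocation instead of repeated regrowth
    for (size_t i = 0; i < n_tokens; ++i) {
        vocab.push_back({"token_" + std::to_string(i), 0.0f});
    }
    return vocab;
}
```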

3. Code Organization

  • Modularize the 1,136-line target_bpw_type() function into smaller components
  • Extract algorithmic utilities (Pareto optimization, Lagrangian relaxation)
  • Separate data structures from processing logic

Build System Enhancements

1. Compilation Optimization

  • Profile-Guided Optimization (PGO) for branch prediction improvement
  • Function section placement to minimize cache conflicts
  • Template instantiation optimization for STL containers

2. Memory Layout Optimization

  • Align data structures for vectorized operations
  • Optimize token data padding to reduce memory footprint
  • Implement custom allocators for performance-critical containers

Conclusion

The changes introduce substantial quantization enhancements with minimal performance impact on core inference functions. The 0.084 ns degradation represents a negligible trade-off for significant algorithmic improvements. The enhanced quantization capabilities provide substantial value while maintaining inference performance integrity.
