
Conversation

@DajanaV (Collaborator) commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16896

Make sure to read the contributing guidelines before submitting a PR

Signed-off-by: Giuseppe Scrivano <[email protected]>
@DajanaV added the invalid label Oct 31, 2025
@DajanaV closed this Oct 31, 2025
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Assessment

Core Inference Functions

All critical inference functions show no measurable performance changes between versions:

Primary Inference Pipeline:

  • llama_decode: 48,432,960 ns (no change) - Core token processing function
  • llama_encode: 12,186,798 ns (no change) - Encoder model processing
  • llama_tokenize: 832,585 ns (no change) - Text-to-token conversion

Model Management Functions:

  • llama_model_load_from_file: 330,048,640 ns (no change) - Model loading
  • llama_model_quantize: 6,860,279 ns (no change) - Model compression
  • llama_batch_init: 257 ns (no change) - Batch initialization

Backend Computation:

  • ggml_backend_graph_compute: 148 ns (no change) - Core computation execution
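For orientation, here is a minimal sketch of a CPU-backend graph evaluation that ends in ggml_backend_graph_compute. The tiny add graph, buffer sizing, and header names are illustrative assumptions only (header layout differs between ggml versions), not code from this PR:

```cpp
// Minimal sketch of a CPU-backend graph evaluation ending in
// ggml_backend_graph_compute. The graph (c = a + b) and buffer sizing are
// illustrative; header layout differs between ggml versions.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"

int main() {
    // context that holds only tensor/graph metadata (no_alloc = true)
    ggml_init_params ip = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    ggml_context * ctx = ggml_init(ip);

    ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_tensor * c = ggml_add(ctx, a, b);

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // allocate tensor data on the CPU backend and run the graph
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_gallocr_t galloc  = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    const float av[4] = {1, 2, 3, 4}, bv[4] = {10, 20, 30, 40};
    ggml_backend_tensor_set(a, av, 0, ggml_nbytes(a));
    ggml_backend_tensor_set(b, bv, 0, ggml_nbytes(b));

    ggml_backend_graph_compute(backend, gf);   // the 148 ns hot path in this report

    ggml_gallocr_free(galloc);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```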

Function Modification Status

All analyzed critical functions report is_modified: false, indicating no code changes between versions.

Key Performance Indicator Impact Analysis

1. Tokens Per Second

Impact: No Change

  • Critical Functions Status: llama_decode, llama_encode, and llama_tokenize show no performance degradation
  • Reference Baseline: per the provided reference (ollama://smollm:135m on a 12th Gen Intel i7-1255U), a 2 ms increase in llama_decode corresponds to a 7% reduction in tokens per second
  • Current Analysis: llama_decode changes by only +508 ns (48,432,960 ns vs 48,432,452 ns), effectively a 0% change; see the sensitivity sketch after this list
  • Conclusion: No impact on tokens per second throughput
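To make the throughput claim concrete, the sketch below maps the measured decode-time delta to a relative tokens-per-second change; it only assumes that throughput for a fixed batch is inversely proportional to llama_decode wall time, and uses the two timings quoted above:

```cpp
// Sensitivity sketch: tokens/sec for a fixed batch is inversely proportional
// to llama_decode wall time, so a timing delta maps directly to a relative
// throughput change. Numbers are the two measurements quoted in this report.
#include <cstdio>

int main() {
    const double baseline_ns = 48432452.0;  // previous llama_decode time
    const double current_ns  = 48432960.0;  // current  llama_decode time

    const double delta_ns   = current_ns - baseline_ns;        // +508 ns
    const double tps_change = baseline_ns / current_ns - 1.0;  // relative tokens/sec change

    std::printf("delta = %+.0f ns -> tokens/sec change = %+.5f%%\n",
                delta_ns, 100.0 * tps_change);                 // ~ -0.001%
    return 0;
}
```

The resulting ~0.001% throughput change is more than three orders of magnitude below the 2 ms reference delta, consistent with the "no impact" conclusion.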

2. Power Consumption

Impact: Negligible

  • Affected Binary: build.bin.libllama.so (+0.0003% increase)
  • Other Binaries: No change in libggml-base.so, libggml-cpu.so, libggml.so
  • Total Power Change: 305.21 nJ vs 305.21 nJ (effectively no change)
  • Root Cause: Minor overhead in standard library template instantiation (std::_Construct function)

3. Quantization Efficiency

Impact: No Change

  • Key Function: llama_model_quantize shows no performance change (6,860,279 ns); its API entry point is sketched after this list
  • Quantization Pipeline: No modifications to quantization algorithms or data paths
  • Memory Layout: No changes affecting quantized model storage or access patterns
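For reference, a minimal sketch of the C API entry point this measurement covers; the file names, target quantization type, and thread count below are placeholder assumptions, not values from this PR:

```cpp
// Sketch of the llama.cpp quantization entry point measured here.
// File names, target type, and thread count are placeholders.
#include "llama.h"
#include <cstdint>
#include <cstdio>

int main() {
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M;  // target quantization format
    params.nthread = 8;                          // worker threads

    // returns 0 on success
    const uint32_t rc = llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params);
    std::printf("llama_model_quantize rc = %u\n", rc);
    return (int) rc;
}
```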

4. Memory Usage

Impact: No Change

  • Memory Management Functions: No changes detected in KV cache or memory allocation functions
  • Key Functions Status:
    • llama_memory_clear: Not analyzed (likely unchanged based on pattern)
    • ggml_gallocr_new: Not analyzed (likely unchanged based on pattern)
    • Memory allocation patterns remain consistent

5. Batch Processing

Impact: No Change

  • Batch Functions: llama_batch_init shows no performance change (257 ns); the batch path it covers is sketched after this list
  • Parallel Processing: llama_decode batch processing performance unchanged
  • Dynamic Batching: No modifications to batch size management or parallel execution
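As a rough sketch of the path these numbers cover, the example below tokenizes a prompt, fills a llama_batch, and submits it with llama_decode; the model path and prompt are hypothetical and error handling is omitted for brevity:

```cpp
// Rough sketch of the batch path: tokenize, fill a llama_batch, submit via
// llama_decode. Model path and prompt are hypothetical; no error handling.
#include "llama.h"
#include <cstring>
#include <vector>

int main() {
    llama_backend_init();

    llama_model   * model = llama_model_load_from_file("model-q4_k_m.gguf",
                                                       llama_model_default_params());
    llama_context * ctx   = llama_init_from_model(model, llama_context_default_params());

    // tokenize: first call sizes the buffer, second call fills it
    const llama_vocab * vocab  = llama_model_get_vocab(model);
    const char        * prompt = "Hello";
    const int           len    = (int) std::strlen(prompt);

    int n = -llama_tokenize(vocab, prompt, len, nullptr, 0, /*add_special=*/true, /*parse_special=*/false);
    std::vector<llama_token> toks(n);
    llama_tokenize(vocab, prompt, len, toks.data(), n, true, false);

    // llama_batch_init: the 257 ns allocation measured above
    llama_batch batch = llama_batch_init(n, /*embd=*/0, /*n_seq_max=*/1);
    for (int i = 0; i < n; ++i) {
        batch.token   [i]    = toks[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i == n - 1);  // logits only for the last token
    }
    batch.n_tokens = n;

    llama_decode(ctx, batch);  // the batch-processing path tracked in this report

    llama_batch_free(batch);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```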

Root Cause Analysis

The only measurable performance changes occur in standard library functions:

Primary Degradation Source:

  • std::_Construct template function: +0.282% bottleneck increase
  • std::pow mathematical function: +0.066% response time increase

Technical Explanation:

  • Changes stem from C++ standard library template instantiation overhead (illustrated after this list)
  • No modifications to core LLaMA.cpp algorithms or data structures
  • Performance variations within normal compiler and system-level fluctuations
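For context on why std::_Construct appears in a profile at all: it is the libstdc++ helper behind the std::uninitialized_* algorithms, so ordinary container construction or growth instantiates it for the element type. The toy example below is not llama.cpp code, and the exact call chain depends on the libstdc++ version:

```cpp
// Toy illustration (not llama.cpp code): plain container construction or
// growth goes through the std::uninitialized_* algorithms, which instantiate
// std::_Construct for the element type.
#include <cstddef>
#include <string>
#include <vector>

struct tensor_meta {          // hypothetical stand-in for a small metadata record
    std::string name;
    std::size_t n_bytes;
};

int main() {
    std::vector<tensor_meta> metas = {
        {"blk.0.attn_q.weight", 4096},
        {"blk.0.attn_k.weight", 4096},
    };
    // range construction copies elements through uninitialized_copy,
    // which calls std::_Construct for tensor_meta
    std::vector<tensor_meta> snapshot(metas.begin(), metas.end());
    return (int) snapshot.size();
}
```

Instantiation overhead of this kind shifts with compiler and library versions even when application code is untouched, which matches the is_modified: false status of the analyzed functions.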

Action Items

Code-Level Optimizations

  1. Standard Library Optimization:

    • Review async task creation in llama_model_loader::load_all_data
    • Consider static linking to eliminate PLT overhead in mathematical functions
    • Evaluate compiler optimization flags for template instantiation
  2. Build System Review:

    • Verify consistent compiler versions and optimization flags between builds
    • Check for differences in C++ standard library versions
    • Ensure identical build environment configuration

Performance Monitoring

  1. Function-Level Tracking:

    • Continue monitoring llama_decode performance as the primary inference bottleneck (a timing sketch follows this list)
    • Track llama_model_load_from_file for model loading efficiency
    • Monitor batch processing functions for parallel execution optimization
  2. Binary-Level Analysis:

    • Focus on build.bin.libllama.so for future performance analysis
    • Track power consumption changes in core inference binaries
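A hedged sketch of the kind of per-call timer that yields the nanosecond figures tracked above; it assumes a ctx and batch prepared as in the earlier batch sketch, and timed_decode is a placeholder name:

```cpp
// Sketch of a per-call timer around llama_decode, matching the nanosecond
// granularity of the figures in this report. Assumes ctx and batch were
// prepared as in the earlier batch sketch; timed_decode is a placeholder.
#include "llama.h"
#include <chrono>
#include <cstdint>
#include <cstdio>

static int64_t timed_decode(llama_context * ctx, const llama_batch & batch) {
    const auto t0 = std::chrono::steady_clock::now();
    const int  rc = llama_decode(ctx, batch);
    const auto t1 = std::chrono::steady_clock::now();

    const int64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("llama_decode rc=%d, %lld ns\n", rc, (long long) ns);
    return ns;
}
```

Averaging such samples over repeated batches gives a stable baseline for comparing future builds of build.bin.libllama.so.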

Conclusion

The performance analysis reveals no significant changes in critical LLaMA.cpp functions. The minor degradations (under 0.3%) occur exclusively in standard library components and do not impact core inference performance, tokens per second throughput, or other key performance indicators. The changes represent normal variation in complex C++ template-heavy codebases rather than functional regressions requiring immediate attention.
