
Conversation


@DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16999

This patch implements tiled GEMM for large blocks: it packs 64x64 blocks and performs the matmul on the packed tiles (see the sketch below).

30-50% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0).

Signed-off-by: Shalini Salomi Bodapati <[email protected]>
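
Below is a minimal sketch of the pack-then-multiply pattern the description refers to, written as plain scalar C++ for clarity. `TILE`, `pack_tile`, and `gemm_tiled` are illustrative names, and the inner loops stand in for the PowerPC MMA intrinsics and quantized block types used in the actual patch.

```cpp
// Illustrative tiled GEMM: pack 64x64 blocks into contiguous buffers,
// then multiply tile by tile. Layout and names are hypothetical; the
// real ggml-cpu path uses MMA outer products and quantized formats.
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 64;

// Copy one TILE x TILE block of src (row-major, leading dimension ld)
// into a contiguous buffer so the inner loops walk linear memory.
static void pack_tile(const float *src, std::size_t ld, float *dst) {
    for (std::size_t i = 0; i < TILE; ++i)
        for (std::size_t j = 0; j < TILE; ++j)
            dst[i * TILE + j] = src[i * ld + j];
}

// C (MxN) += A (MxK) * B (KxN), all row-major; M, N, K are assumed to
// be multiples of TILE to keep the sketch short (real code handles tails).
void gemm_tiled(const float *A, const float *B, float *C,
                std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> at(TILE * TILE), bt(TILE * TILE);
    for (std::size_t i0 = 0; i0 < M; i0 += TILE)
        for (std::size_t j0 = 0; j0 < N; j0 += TILE)
            for (std::size_t k0 = 0; k0 < K; k0 += TILE) {
                pack_tile(A + i0 * K + k0, K, at.data());
                pack_tile(B + k0 * N + j0, N, bt.data());
                // Micro-kernel: plain loops here; MMA outer products
                // in the actual PowerPC implementation.
                for (std::size_t i = 0; i < TILE; ++i)
                    for (std::size_t k = 0; k < TILE; ++k) {
                        float a = at[i * TILE + k];
                        for (std::size_t j = 0; j < TILE; ++j)
                            C[(i0 + i) * N + j0 + j] += a * bt[k * TILE + j];
                    }
            }
}
```

Packing pays for itself because each 64x64 tile is loaded once into a contiguous, cache-resident buffer and then reused across the entire inner loop nest.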
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

The analysis reveals minor performance changes in the PowerPC MMA implementation within the SGEMM optimization module, specifically affecting the mnpack<4,1,2> function in build.bin.libggml-cpu.so.

Key Findings

Performance Metrics:

  • Highest Response Time Change: the mnpack<4,1,2> function shows a +0.33% increase (7 ns absolute, from 2008 ns to 2015 ns)
  • Highest Throughput Degradation: the mnpack<4,1,1> function shows a +12.15% increase in self-time (9 ns, from 69 ns to 78 ns)

Core Function Impact:
The changes do not affect primary inference functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second performance. The affected functions are low-level matrix multiplication routines in the GGML backend, which have minimal impact on overall inference throughput.

Power Consumption Analysis:
System-wide power consumption remains stable with only build.bin.libggml-cpu.so showing a negligible +0.002% increase. All other binaries show 0.0% change, indicating the optimizations offset any overhead.

Flame Graph and CFG Analysis:
The flame graph reveals a 4-level call stack with 95.4% of execution time concentrated in the gemm function. CFG comparison shows identical control flow structures with only cosmetic differences in error reporting line numbers (392 to 416), indicating 24 lines of additional code for enhanced error handling and validation.

Code Review Insights:
PR #81 introduces tiled GEMM optimization for 64x64 blocks, adding thread-local storage management and conditional execution paths. The implementation includes pthread-based memory management and enhanced template flexibility. While the changes add complexity, they provide performance benefits for large matrices with minimal overhead for smaller operations.
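
As a rough illustration of the pthread-based, thread-local buffer management mentioned above (this is an assumption about the shape of the code, not the actual implementation): each worker thread lazily allocates one reusable pack buffer via a pthread key, and the key's destructor frees it at thread exit.

```cpp
// Hypothetical sketch: one reusable pack buffer per worker thread,
// created lazily and released by the pthread key destructor at
// thread exit. Names are illustrative, not taken from the patch.
#include <pthread.h>
#include <cstddef>
#include <cstdlib>

static pthread_key_t  tile_key;
static pthread_once_t tile_once = PTHREAD_ONCE_INIT;

static void tile_free(void *p) { std::free(p); }
static void tile_key_init()    { pthread_key_create(&tile_key, tile_free); }

// Returns this thread's tile workspace of at least `size` bytes.
static void *tile_buffer(std::size_t size) {
    pthread_once(&tile_once, tile_key_init);
    void *buf = pthread_getspecific(tile_key);
    if (buf == nullptr) {
        size = (size + 63) & ~std::size_t{63};  // round up for aligned_alloc
        buf  = std::aligned_alloc(64, size);    // cache-line aligned tiles
        pthread_setspecific(tile_key, buf);
    }
    return buf;
}
```

A real implementation would also track the buffer's capacity and reallocate when a larger workspace is requested; the sketch omits that to stay short.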

Impact Assessment:
The changes represent infrastructure improvements in the matrix multiplication backend without affecting core inference performance. The measured overhead is within acceptable bounds and is offset by optimizations for larger computational workloads.

@DajanaV force-pushed the main branch 27 times, most recently from 44faeaa to d7421a0 on November 8, 2025 at 09:08
@loci-dev force-pushed the main branch 30 times, most recently from 3e4b499 to e81a7eb on December 5, 2025 at 13:17