UPSTREAM PR #16999: Q4/Q8 Tiled Gemm Optimization. #81
base: main
Conversation
This patch implements tiled GEMM for large blocks: we pack 64x64 blocks and perform the matmul on them. 30-50% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0). Signed-off-by: Shalini Salomi Bodapati <[email protected]>
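The description is terse, so here is a minimal plain-C++ sketch of the 64x64x64 blocking idea it refers to. The loop structure, the plain float element type, and the function name gemm_tiled are illustrative only; the actual tinyBLAS_Q0_PPC kernel operates on packed Q4_0/Q8_0 blocks and PowerPC MMA instructions.

```cpp
// Hypothetical tile sizes matching the PR description (mc = nc = kc = 64).
constexpr int MC = 64, NC = 64, KC = 64;

// Blocked GEMM sketch: C (MxN) += A (MxK) * B (KxN), row-major, plain floats.
// Assumes M, N, K are exact multiples of the tile sizes, as the PR's fast
// path does; the real code falls back to the existing kernel otherwise.
void gemm_tiled(int M, int N, int K,
                const float *A, const float *B, float *C) {
    for (int i0 = 0; i0 < M; i0 += MC)
    for (int j0 = 0; j0 < N; j0 += NC)
    for (int k0 = 0; k0 < K; k0 += KC)
        // Inner micro-tile: in the real kernel this is where packed
        // Q4_0/Q8_0 panels are fed to MMA outer-product instructions.
        for (int i = i0; i < i0 + MC; ++i)
        for (int j = j0; j < j0 + NC; ++j) {
            float acc = C[i * N + j];
            for (int k = k0; k < k0 + KC; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```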
Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary
The analysis reveals minor performance changes in the PowerPC MMA implementation within the SGEMM optimization module, specifically affecting the …

Key Findings
- Performance Metrics:
- Core Function Impact:
- Power Consumption Analysis:
- Flame Graph and CFG Analysis:
- Code Review Insights:
- Impact Assessment:
Force-pushed 44faeaa to d7421a0 (Compare)
Force-pushed 76ea07c to 4b0bde9 (Compare)
This commit addresses review comments. We have also separated out the legacy mnpack path and the matmul_tiled path in the tinyBLAS_Q0_PPC class. 10-30% improvement in PP speed with Q4_0 and Q8_0 models. Tested with Meta-Llama3-8B quantized models using llama-bench and llama-batched-bench. Signed-off-by: Shalini Salomi Bodapati <[email protected]>
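As a rough picture of the path separation described above, here is a hedged sketch in which a shape check routes exact 64-multiples to matmul_tiled() and everything else to the legacy mnpack() path. The method names follow the commit message, but the class layout and signatures are assumptions, not the real tinyBLAS_Q0_PPC interface.

```cpp
#include <cstdint>

// Illustrative stand-in for the real class; only the dispatch logic matters here.
struct tinyBLAS_Q0_PPC_sketch {
    static constexpr int64_t MC = 64, NC = 64, KC = 64;

    void matmul(int64_t m, int64_t n, int64_t k) {
        if (m % MC == 0 && n % NC == 0 && k % KC == 0) {
            matmul_tiled(m, n, k);   // fast path: packed 64x64 tiles + MMA
        } else {
            mnpack(0, m, 0, n);      // legacy path: correct for any shape
        }
    }

    void matmul_tiled(int64_t m, int64_t n, int64_t k) { /* tiled kernel */ }
    void mnpack(int64_t m0, int64_t m, int64_t n0, int64_t n) { /* legacy kernel */ }
};
```

Keeping the two paths separate means the tiled kernel only ever has to handle aligned shapes, while the legacy path preserves correctness for everything else.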
Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #81

Overview
PR #81 implements a tiled GEMM optimization for Q4_0 and Q8_0 quantized matrix multiplication on the PowerPC architecture with MMA instructions. The changes introduce a 64x64 block-based matrix multiplication strategy with optimized packing routines, targeting a 30-50% performance improvement for aligned matrices in LLM inference workloads.

Analysis Status: Function-level performance data is unavailable for the compared versions.

Code Changes
The implementation refactors … The optimization targets matrices whose dimensions are exact multiples of the tile sizes (mc=64, nc=64, kc=64), with an automatic fallback ensuring correctness for all inputs.

Key Findings

Performance-Critical Area Impact
Matrix Multiplication Operations: The changes directly affect quantized matrix multiplication within the GGML CPU backend, specifically the Q4_0 and Q8_0 formats on PowerPC systems. Without function-level metrics, the actual impact on … cannot be quantified.

Inference Impact: No measurable impact on tokens per second can be determined from the available data. The optimization is PowerPC-specific and does not affect x86_64 systems. For the reference configuration (smollm:135m on a 12th Gen Intel i7-1255U), this PR introduces no performance changes, as the tiled GEMM path is conditionally compiled for PowerPC only.

Power Consumption Analysis
Binary-level analysis shows minimal power consumption changes. All changes fall within measurement noise, indicating no meaningful power consumption regression or improvement between versions. The optimization's energy-efficiency benefits would only manifest on PowerPC systems executing the tiled GEMM path, which is not reflected in the current binary analysis.

Analysis Limitations
The absence of function-level performance data prevents a detailed assessment of response time and throughput changes. The power consumption analysis reflects compilation differences rather than runtime optimization effects, as the tiled implementation is architecture-specific and not exercised in the analyzed binaries.
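One detail worth spelling out about the tile sizes above: ggml stores 32 values per block for both Q4_0 and Q8_0, so a kc = 64 panel spans exactly two quantized blocks per row. The snippet below merely restates that arithmetic; the block-size constants are assumptions taken from ggml's quantization formats rather than included headers.

```cpp
constexpr int QK4_0 = 32;   // values per Q4_0 block (assumed from ggml)
constexpr int QK8_0 = 32;   // values per Q8_0 block (assumed from ggml)
constexpr int KC    = 64;   // K tile depth described by the PR

static_assert(KC % QK4_0 == 0 && KC / QK4_0 == 2, "kc spans two Q4_0 blocks per row");
static_assert(KC % QK8_0 == 0 && KC / QK8_0 == 2, "kc spans two Q8_0 blocks per row");
```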
Force-pushed 297c352 to 3657a62 (Compare)
Mirrored from ggml-org/llama.cpp#16999
This patch implements tiled GEMM for large blocks: we pack 64x64 blocks and perform the matmul on them.
30-50% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0).
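To illustrate what "pack blocks of 64x64" can mean in practice, here is a hedged sketch of copying 64 rows of quantized blocks into one contiguous tile buffer so the micro-kernel reads them with unit stride. The struct and function names are hypothetical, the scale is kept as a plain float instead of ggml's fp16 ggml_half, and the real routine additionally re-lays the data out for the MMA accumulators.

```cpp
#include <cstdint>
#include <cstring>

// Simplified stand-in for block_q8_0 (real block: fp16 scale + 32 int8 quants).
struct q8_block_sketch {
    float  d;        // per-block scale
    int8_t qs[32];   // 32 quantized values
};

constexpr int TILE_ROWS = 64;   // mc/nc from the PR description

// Copy a 64-row tile into a contiguous buffer.
// blocks_per_row: quantized blocks covered by the K panel (kc / 32),
// src_row_stride: distance between consecutive source rows, in blocks.
void pack_tile(const q8_block_sketch *src, q8_block_sketch *dst,
               int blocks_per_row, int src_row_stride) {
    for (int r = 0; r < TILE_ROWS; ++r) {
        std::memcpy(dst + r * blocks_per_row,
                    src + r * src_row_stride,
                    sizeof(q8_block_sketch) * blocks_per_row);
    }
}
```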
Make sure to read the contributing guidelines before submitting a PR