Skip to content

Conversation

@DajanaV
Copy link
Collaborator

@DajanaV DajanaV commented Oct 28, 2025

Mirrored from ggml-org/llama.cpp#16536

This heavily refactors the caching structure of the MMQ shader and also makes it more modular, to work with other kinds of quants.

Basically instead of turning the quants into 8-bit integers during load to shared memory, the quant structs now get copied through shared memory into registers and only reshaped into 8-bit integers directly before the integer dot operation. This saves on shared memory and on registers.

TODO:

  • Q2_K
  • Q3_K
  • Q4_K
  • Q5_K
  • Q6_K

Q2_K performance is not that good yet. Mapping the 256-wide quant structure to 32-wide Q8_1 structures is not that easy to do efficiently, so I'm still trying to find the best way to do that. @jeffbolznv Let me know if you see any obvious issues with the implementation.

@loci-agentic-ai-dev
Copy link

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #5 - Vulkan MMQ Integer Dot Refactor

Key Findings

Performance Degradations

The analysis identified minimal performance degradations across all metrics:

  • Response Time: std::codecvt_abstract_base::in function shows 0.068% degradation (+0.02 ns, from 29.41 ns to 29.43 ns)
  • Throughput: Same function exhibits identical 0.068% degradation in self-time execution
  • Bottleneck: std::__invoke_r function in llama context graph callbacks shows 0.116% degradation (+0.018 ns)

Impact on Core Functions

No direct impact on critical llama.cpp components:

  • The degraded functions are C++ standard library utilities, not core inference functions
  • Primary performance-critical areas remain unaffected:
    • Matrix multiplication kernels
    • Attention mechanisms
    • Quantization/dequantization routines
    • Memory management (KV cache, mmap operations)
    • Batch processing efficiency

Power Consumption Analysis

Zero measurable impact on energy efficiency:

  • All binaries show 0.0% change in power consumption
  • build.bin.libllama.so: 303,651 nJ (unchanged)
  • build.bin.libggml.so: 6,339 nJ (unchanged)
  • Total system energy profile remains stable despite individual function degradations

Flame Graph and CFG Analysis

Minimal execution complexity:

  • Flame graph reveals single, flat execution profile for degraded function
  • CFG comparison shows identical assembly code between versions
  • Performance degradation stems from micro-architectural factors (cache behavior, instruction scheduling) rather than code changes
  • No structural changes in control flow or branching patterns

GitHub Code Review Insights

Successful major refactoring with no functional regressions:

  • Scope: 928 additions, 401 deletions across 18 files
  • Purpose: Vulkan MMQ shader refactoring and K-Quant support (Q2_K through Q6_K)
  • Architecture: Improved memory efficiency by caching quant structures in shared memory instead of converting to 8-bit integers
  • No Direct Code Impact: The degraded functions are unmodified by the PR changes

Overall Assessment

Change Impact Evaluation

Highly successful optimization with negligible side effects:

  • Primary Objective Achieved: The PR successfully implements K-Quant support and refactors MMQ shader architecture for better memory efficiency
  • Performance Impact: Sub-nanosecond degradations represent measurement noise rather than functional regressions
  • System Stability: Core inference pipeline performance remains unchanged
  • Energy Efficiency: No impact on overall power consumption

Maintainability and Future Considerations

Positive long-term outlook:

  • Code Quality: The refactoring improves modularity by separating MMQ functions into dedicated shader files
  • Scalability: New architecture supports additional quantization formats more efficiently
  • Technical Debt: Reduced shared memory usage and improved register management
  • Risk Profile: Minimal risk of performance regressions in future development

Root Cause of Degradations

Environmental factors, not algorithmic changes:

  • Large-scale refactoring altered global memory layout and compilation patterns
  • Link-time optimizations may have shifted instruction cache behavior
  • Template instantiation changes affected compiler optimization decisions
  • These effects are typical and expected for major architectural refactoring

Conclusion

The performance analysis confirms that PR #5 successfully delivers its intended improvements to Vulkan MMQ quantization support without introducing meaningful performance regressions. The observed sub-nanosecond degradations in standard library functions are within measurement error margins and do not impact the core llama.cpp inference capabilities. The refactoring enhances code maintainability and sets a solid foundation for future quantization optimizations.

Recommendation: Approve and merge the PR. The benefits of improved K-Quant support and memory efficiency far outweigh the negligible timing variations in auxiliary functions.

@DajanaV DajanaV force-pushed the main branch 3 times, most recently from 1983956 to 326a60a Compare October 29, 2025 12:13
@DajanaV DajanaV added the dev-stale Stale dev environment — dashboard not accessible label Oct 30, 2025
@DajanaV DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV DajanaV closed this Oct 30, 2025
@DajanaV DajanaV deleted the upstream-PR16536-branch_ggml-org-0cc4m/vulkan-mmq-dp4a-k-quants branch October 30, 2025 15:26
loci-dev pushed a commit that referenced this pull request Nov 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dev-stale Stale dev environment — dashboard not accessible

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants