Skip to content

Conversation

@loci-dev
Copy link

Mirrored from ggml-org/llama.cpp#17496

Also added VLEN-agnostic kernel selection to ggml_vec_dot_q2_K_q8_K for RVV-disabled and wider devices.

Also vlen-agnostic kernel selection is added to ggml_vec_dot_q2_K_q8_K
for rvv-disabled and wider devices.
@loci-agentic-ai
Copy link

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #321

PR Context: Runtime RISC-V Vector (RVV) detection implementation with VLEN-agnostic kernel selection for Q2_K quantized dot products.

Overview

This PR refactors compile-time VLEN detection into runtime selection, extracting RVV-specific implementations into separate functions. The changes affect 4 files with 297 additions and 229 deletions, primarily in the RISC-V quantization module.

Key Findings

Performance-Critical Functions Impact

Quantization Module - Q2_K Dot Product:

  • ggml_vec_dot_q2_K_q8_K() refactored from switch-based dispatch to runtime function pointer selection
  • First-call overhead: 261 ns for VLEN detection and caching
  • Subsequent calls use cached function pointer with 1-2 cycle indirection overhead
  • The optimized kernels (rvv256, rvv128) remain algorithmically unchanged
  • No direct impact measured on this function in the performance data

Parameter Accessor Functions:

  • Nine functions show improvements ranging from 7 ns to 23 ns in response time
  • Functions: ggml_set_op_params_i32, ggml_set_op_params, ggml_get_op_params_i32 across vec.cpp, quants.c, sgemm.cpp, ggml-cpu.c, traits.cpp, and unary-ops.cpp
  • These improvements are indirect effects from code layout optimization, not direct PR changes
  • Total improvement range: 7-23 ns per function call

Feature Detection:

  • ggml_backend_cpu_get_features() operator shows 4775 ns response time increase
  • Throughput increase: 261 ns for VLEN query and string conversion
  • This is initialization code, called once during backend setup

Inference and Tokens Per Second Impact

Tokenization/Inference Functions:

  • No changes detected in llama_decode, llama_encode, or llama_tokenize functions
  • The Q2_K dot product refactoring affects quantized model inference but shows no measurable regression in the hot path
  • Parameter accessor improvements (7-23 ns per call) provide minor cumulative benefit across tensor operations

Expected Impact on Tokens/Second:

  • Negligible impact on inference throughput
  • The 261 ns first-call overhead is amortized across thousands of token generations
  • No core inference functions show response time changes that would affect tokens per second
  • Reference: 2 ms slowdown in llama_decode causes 7% tokens/second reduction; this PR shows no such changes

Power Consumption Analysis

Impacted Binaries:

  • build.bin.libggml-cpu.so: +0.028% (+36 nJ) - marginal increase from feature detection overhead
  • build.bin.libllama.so: -0.000% (negligible change)
  • build.bin.llama-cvector-generator: -0.000% (negligible change)
  • build.bin.llama-run: -0.000% (negligible change)
  • All other binaries: 0.000% (no change)

Analysis:
The power consumption remains essentially stable. The 36 nJ increase in libggml-cpu.so is attributed to the feature detection overhead during initialization, offset by improvements in parameter accessor functions.

Technical Summary

The PR successfully implements runtime RVV detection without measurable impact on inference performance. The refactoring improves portability by enabling single-binary deployment across RISC-V devices with varying vector lengths (128-bit, 256-bit, or no RVV support). The measured performance changes are primarily in initialization code and parameter accessors, with no impact on core inference functions or tokens per second throughput.

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 3163acc to 409b78f Compare November 26, 2025 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants