
Conversation

@DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16988

Fixes ggml-org/llama.cpp#16976.

The problem is that the CUDA kernel selection logic does not check strides, so it tries to run kernels on tensors whose strides they cannot handle. The tests don't detect this because the strides are always constructed as 2*ne00.

@ggerganov I didn't see a warning w.r.t. the KV cache having an inconvenient size; I think it would make sense to add one.
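
As a rough illustration of the missing check, the guard below validates byte strides before a fast path is chosen, assuming (as the analysis further down describes) that the fast kernels require every stride to be a multiple of twice the element type size. The struct and function names are illustrative and are not the upstream ggml_cuda_should_use_mmf/mmvf code.

```cpp
// Minimal sketch of a stride guard for CUDA kernel selection. Assumption: the
// fast matrix-multiplication kernels require each byte stride to be a multiple
// of 2 * type size (2*ts). Names are illustrative, not upstream code.
#include <cstddef>

struct tensor_strides {
    size_t nb[4];      // byte stride of each dimension, ggml-style
    size_t type_size;  // bytes per element (e.g. 2 for f16)
};

// Returns true only when the strides satisfy the alignment the fast kernels
// assume; otherwise the caller should fall back to a generic kernel rather
// than launch one that reads past row boundaries.
bool strides_fit_fast_kernel(const tensor_strides & t) {
    const size_t required = 2 * t.type_size;
    for (int i = 1; i < 4; ++i) {   // nb[0] is just the element size
        if (t.nb[i] % required != 0) {
            return false;
        }
    }
    return true;
}
```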

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp Critical Functions

Critical Function Performance Analysis

Based on the analysis of key performance-critical functions in llama.cpp, the following functions show no measurable performance changes between versions:

Core Inference Functions

  • llama_decode: 43,183,520 ns response time (no change)
  • llama_encode: 10,861,871 ns response time (no change)
  • llama_tokenize: 876,478 ns response time (no change)

Model Loading Functions

  • llama_model_load_from_file: 364,761,020 ns response time (no change)

Memory Management Functions

  • llama_memory_clear: 49 ns response time (no change)

Batch Processing Functions

  • llama_batch_init: 250 ns response time (no change)

Key Performance Indicator Impact Assessment

1. Tokens Per Second

Status: No Impact

  • Critical inference functions unchanged: llama_decode, llama_encode, and llama_tokenize show zero performance delta
  • Reference baseline maintained: The reference correlation of a roughly 7% tokens/second reduction per 2 ms llama_decode slowdown does not apply, since no slowdown occurred
  • Inference pipeline stability: Core tokenization and inference execution paths remain unaffected

2. Power Consumption

Status: Minimal Impact

  • Affected binaries:
    • build.bin.libllama.so: 0.0002% reduction (280,661.60 nJ vs. 280,662.15 nJ)
    • build.bin.llama-cvector-generator: 0.0001% reduction
    • build.bin.llama-tts: 0.0003% reduction
  • Impact magnitude: Changes are within measurement noise tolerance
  • Root cause: Minor compiler optimization differences rather than algorithmic changes

3. Quantization Efficiency

Status: No Impact

  • llama_model_quantize function: No performance metrics available, indicating no changes
  • Quantization pipeline: GGML quantization backends show no measurable performance delta
  • Format support: No changes to quantization format handling (Q4_0, Q4_1, Q8_0, etc.)

4. Memory Usage

Status: No Impact

  • KV cache management: llama_memory_clear shows identical 49 ns execution time
  • Memory allocation: GGML allocator functions show no performance changes
  • Buffer management: Unified and recurrent memory systems maintain consistent performance

5. Batch Processing

Status: No Impact

  • Batch initialization: llama_batch_init maintains 250 ns execution time
  • Parallel processing: Core batch processing functions show no performance delta
  • Dynamic batching: No changes to adaptive batch size management efficiency

CUDA Kernel Selection Changes Analysis

The PR introduces stride validation improvements in CUDA kernel selection:

Modified Functions

  • ggml_cuda_should_use_mmf: Added stride alignment validation
  • ggml_cuda_should_use_mmvf: Added stride alignment validation

Performance Implications

  • Validation overhead: An additional loop checks src0_nb[i] % (2*ts) != 0 for every dimension
  • Kernel selection: More conservative selection prevents crashes but could reduce optimization opportunities (see the dispatch sketch after this list)
  • Stability improvement: Prevents kernel crashes on tensors with uneven strides, such as KV-cache views with inconvenient sizes
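
A minimal sketch of how the more conservative selection plays out at dispatch time, assuming a specialized fast path and a stride-agnostic fallback; every name below (should_use_fast_mm, launch_fast_mm, launch_generic_mm, dispatch_mm) is a hypothetical stand-in, not the CUDA backend's actual API.

```cpp
// Sketch of conservative kernel dispatch: when the stride check fails, fall
// back to a slower but stride-agnostic path instead of the specialized kernel.
// All identifiers are hypothetical stand-ins for the real CUDA backend.
#include <cstddef>

struct mm_args {
    size_t src0_nb[4];   // byte strides of the first operand
    size_t type_size;    // bytes per element
};

bool should_use_fast_mm(const mm_args & a) {
    for (int i = 1; i < 4; ++i) {
        if (a.src0_nb[i] % (2 * a.type_size) != 0) {
            return false;   // uneven stride: the fast kernel would misread rows
        }
    }
    return true;
}

void launch_fast_mm(const mm_args &)    { /* specialized kernel launch */ }
void launch_generic_mm(const mm_args &) { /* stride-agnostic fallback */ }

void dispatch_mm(const mm_args & a) {
    if (should_use_fast_mm(a)) {
        launch_fast_mm(a);      // optimal path, taken only when strides fit
    } else {
        launch_generic_mm(a);   // conservative path: slower, but never crashes
    }
}
```

The trade-off is the one noted above: the fallback keeps previously crashing cases correct at the cost of the specialized kernel's throughput.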

Action Items for Performance Optimization

Code-Level Optimizations

  1. Monitor CUDA kernel selection: Track matrix multiplication performance to ensure stride validation doesn't overly restrict optimized kernel usage
  2. Optimize stride validation: Consider caching stride validation results or making checks conditional based on tensor properties (a hedged caching sketch follows this list)
  3. Validate tensor configurations: Test edge cases with unusual stride patterns to ensure no performance regressions
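
A minimal sketch of the caching idea from item 2, assuming the result is keyed on the tensor's byte strides and element size. The class, key scheme, and call pattern are hypothetical; in practice the result could equally be computed once when the tensor view is created rather than memoized per dispatch.

```cpp
// Sketch of memoizing the stride-alignment result so repeated dispatches of
// identically laid-out tensors skip the modulo loop. Hypothetical design, not
// part of the upstream code.
#include <cstddef>
#include <cstdint>
#include <unordered_map>

class stride_check_cache {
public:
    bool fits_fast_kernel(const size_t nb[4], size_t type_size) {
        const uint64_t key = pack(nb, type_size);
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;   // previously validated layout
        }
        bool ok = true;
        for (int i = 1; i < 4; ++i) {
            if (nb[i] % (2 * type_size) != 0) { ok = false; break; }
        }
        cache_.emplace(key, ok);
        return ok;
    }

private:
    // Illustrative key mix; a production key would need to be collision-free
    // for the stride combinations that actually occur.
    static uint64_t pack(const size_t nb[4], size_t ts) {
        uint64_t h = ts;
        for (int i = 0; i < 4; ++i) {
            h = h * 1315423911ull ^ nb[i];
        }
        return h;
    }

    std::unordered_map<uint64_t, bool> cache_;
};
```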

Build-Level Optimizations

  1. Compiler optimization flags: Ensure consistent optimization levels across builds to minimize measurement noise
  2. Binary layout optimization: Consider link-time optimization to reduce performance variations from binary layout changes
  3. CUDA compilation: Verify CUDA kernel compilation settings maintain optimal performance characteristics

Conclusion

The version comparison shows stable performance across all critical llama.cpp functions. The CUDA kernel selection improvements enhance stability without measurable performance impact on core inference operations. The minimal power consumption changes reflect compiler optimization differences rather than functional modifications. No action is required to preserve performance, but monitoring CUDA workloads is recommended to confirm that the stride validation does not restrict kernel selection in practice.

@DajanaV DajanaV force-pushed the upstream-PR16988-branch_JohannesGaessler-cuda-fix-uneven-ctx branch from e5cc811 to 7c48209 on November 4, 2025 13:42
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

This analysis examines PR #71 implementing CUDA kernel selection fixes for uneven context sizes, comparing version 938a44c5-e27b-4dbb-af24-c3f53a3e65b5 against base a98c0b17-e20d-4b11-8978-6d6d10c53020.

Key Findings

Performance Impact:

  • Highest Response Time Change: llama_context::opt_init() shows a +0.14% increase (+5 ns absolute), representing the most significant performance degradation
  • Highest Throughput Change: llama_context::state_seq_save_file() shows a +1.51% increase (+4 ns absolute)
  • Core Function Impact: No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize), indicating no impact on tokens per second performance

Power Consumption Analysis:
All binaries show negligible power consumption changes (<0.001%). The largest absolute change is a +0.70 nJ increase in build.bin.libllama.so, which is effectively zero in percentage terms across the entire binary ecosystem.

Flame Graph Analysis:
The opt_init() function shows a structured execution pattern with 59% of runtime concentrated in 13 sequential llama_set_param calls (149 ns each). Smart pointer operations consume 21% of execution time, with the performance degradation primarily attributed to std::unique_ptr dereferencing overhead during optimizer setup.

CFG Comparison:
Control flow analysis reveals identical structural organization between versions. The performance degradation stems from compiler-generated debug information updates (line number changes in error messages) rather than algorithmic modifications. Assembly instruction sequences remain functionally identical.

Code Review Insights:
The changes add stride validation to the CUDA kernel selection functions (ggml_cuda_should_use_mmf, ggml_cuda_should_use_mmvf) to prevent crashes on uneven context sizes. The modifications add tensor stride alignment checks (src0_nb[i] % (2*ts) != 0) that introduce minimal validation overhead during CUDA kernel selection.

Critical Assessment:
The 0.14% response-time increase in opt_init() represents a one-time initialization cost that accompanies a change preventing runtime CUDA kernel crashes. Since no core inference functions are affected, the changes maintain inference throughput while improving system stability. The trade-off between minimal initialization overhead and crash prevention is acceptable for production deployments.

Actionable Recommendations:

  • Monitor GPU kernel fallback behavior in edge cases with non-power-of-2 context sizes
  • Validate that stride validation doesn't affect custom tensor layouts in specialized use cases

@DajanaV DajanaV force-pushed the main branch 15 times, most recently from b1ace60 to bff7103 on November 6, 2025 08:11
@DajanaV DajanaV force-pushed the upstream-PR16988-branch_JohannesGaessler-cuda-fix-uneven-ctx branch from 7c48209 to 41735c2 on November 6, 2025 08:40
@DajanaV DajanaV force-pushed the main branch 6 times, most recently from 94381d7 to 0eeb29b on November 7, 2025 18:11
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 47d1dc9 to 297c352 on December 4, 2025 11:09