
Conversation

@gaugarg-nv (Contributor) commented Apr 3, 2025

The vector flash decoding kernel was not being picked for models with head dimension 256; Gemma models fall into this category. Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models.
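
For illustration, a minimal sketch of the kind of head-size gate the PR description refers to. This is not the actual ggml-cuda dispatch code; the selector name, enum, and the exact set of supported head sizes are hypothetical, it only shows how an upper bound on the head dimension can exclude the vector kernel for head size 256.

```cpp
// Illustrative sketch only -- not the actual ggml-cuda FlashAttention dispatch.
// Names (pick_fattn_kernel, fattn_kernel) are hypothetical.
#include <cstdint>

enum class fattn_kernel { VEC, TILE };

// Hypothetical selector: use the vector flash decoding kernel in the
// generation (batch size 1) phase whenever the head size is supported.
fattn_kernel pick_fattn_kernel(int64_t head_size, int64_t n_batch) {
    // Before: an upper bound such as `head_size <= 128` would reject
    // head_size == 256 (Gemma) and fall back to the slower tile kernel.
    // After: 256 is accepted, so the vector kernel is picked in gen phase.
    const bool head_supported =
        head_size == 64 || head_size == 128 || head_size == 256;
    if (n_batch == 1 && head_supported) {
        return fattn_kernel::VEC;
    }
    return fattn_kernel::TILE;
}
```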

Performance:

RTX 4090, CUDA 12.8, Master vs PR

| Model | ISL | OSL | Master: gen phase tok/sec | PR: gen phase tok/sec | Speed-up |
|---|---|---|---|---|---|
| gemma3 1B Q4_K - Medium | 10 | 200 | 318.6111 | 333.6684 | 1.047259 |
| gemma3 1B Q4_K - Medium | 100 | 200 | 309.473 | 328.0762 | 1.060113 |
| gemma3 1B Q4_K - Medium | 1000 | 200 | 284.6962 | 319.3516 | 1.121728 |
| gemma3 1B Q4_K - Medium | 10000 | 200 | 183.7296 | 206.1121 | 1.121823 |
| gemma3 4B Q4_K - Medium | 10 | 200 | 175.7797 | 184.4036 | 1.049061 |
| gemma3 4B Q4_K - Medium | 100 | 200 | 174.9861 | 181.8483 | 1.039215 |
| gemma3 4B Q4_K - Medium | 1000 | 200 | 165.9151 | 175.7443 | 1.059242 |
| gemma3 4B Q4_K - Medium | 10000 | 200 | 120.0141 | 126.6009 | 1.054884 |
| gemma3 12B Q4_K - Medium | 10 | 200 | 83.11534 | 85.4468 | 1.028051 |
| gemma3 12B Q4_K - Medium | 100 | 200 | 82.62634 | 84.6703 | 1.024737 |
| gemma3 12B Q4_K - Medium | 1000 | 200 | 80.07223 | 81.96644 | 1.023656 |
| gemma3 12B Q4_K - Medium | 10000 | 200 | 56.99771 | 59.67587 | 1.046987 |
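
The speed-up column is simply PR tok/sec divided by Master tok/sec; a quick worked check of the "up to 12%" figure using the gemma3 1B rows from the table above (values copied verbatim, the snippet itself is only illustrative):

```cpp
// Recompute the speed-up column for the gemma3 1B rows:
// speed-up = PR tok/sec / Master tok/sec.
#include <cstdio>

int main() {
    const double master[] = {318.6111, 309.4730, 284.6962, 183.7296};
    const double pr[]     = {333.6684, 328.0762, 319.3516, 206.1121};
    for (int i = 0; i < 4; ++i) {
        const double speedup = pr[i] / master[i];
        // prints ~1.047, 1.060, 1.122, 1.122 -> up to ~12% faster
        printf("speed-up: %.6f (+%.1f%%)\n", speedup, (speedup - 1.0) * 100.0);
    }
    return 0;
}
```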

Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models.
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 3, 2025
@JohannesGaessler (Collaborator) left a comment


Thank you, I probably forgot to adapt the logic at some point.

Co-authored-by: Johannes Gäßler <[email protected]>
@JohannesGaessler JohannesGaessler merged commit c262bed into ggml-org:master Apr 3, 2025
48 checks passed