
Conversation

@gaugarg-nv (Contributor) commented Apr 3, 2025

The vector flash decoding kernel was not being picked for models with head dimension 256; Gemma models fall into this category. Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models.
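
For illustration, a minimal sketch of the kind of head-size gate the PR description refers to. This is not the actual ggml-cuda dispatch code; the selector name, enum, and the exact set of supported head sizes are hypothetical, it only shows how an upper bound on the head dimension can exclude the vector kernel for head size 256.

```cpp
// Illustrative sketch only -- not the actual ggml-cuda FlashAttention dispatch.
// Names (pick_fattn_kernel, fattn_kernel) are hypothetical.
#include <cstdint>

enum class fattn_kernel { VEC, TILE };

// Hypothetical selector: use the vector flash decoding kernel in the
// generation (batch size 1) phase whenever the head size is supported.
fattn_kernel pick_fattn_kernel(int64_t head_size, int64_t n_batch) {
    // Before: an upper bound such as `head_size <= 128` would reject
    // head_size == 256 (Gemma) and fall back to the slower tile kernel.
    // After: 256 is accepted, so the vector kernel is picked in gen phase.
    const bool head_supported =
        head_size == 64 || head_size == 128 || head_size == 256;
    if (n_batch == 1 && head_supported) {
        return fattn_kernel::VEC;
    }
    return fattn_kernel::TILE;
}
```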

Performance:

RTX 4090, CUDA 12.8, Master vs PR

| Model | ISL | OSL | Master: gen phase tok/sec | PR: gen phase tok/sec | Speed-up |
|---|---|---|---|---|---|
| gemma3 1B Q4_K - Medium | 10 | 200 | 318.6111 | 333.6684 | 1.047259 |
| gemma3 1B Q4_K - Medium | 100 | 200 | 309.473 | 328.0762 | 1.060113 |
| gemma3 1B Q4_K - Medium | 1000 | 200 | 284.6962 | 319.3516 | 1.121728 |
| gemma3 1B Q4_K - Medium | 10000 | 200 | 183.7296 | 206.1121 | 1.121823 |
| gemma3 4B Q4_K - Medium | 10 | 200 | 175.7797 | 184.4036 | 1.049061 |
| gemma3 4B Q4_K - Medium | 100 | 200 | 174.9861 | 181.8483 | 1.039215 |
| gemma3 4B Q4_K - Medium | 1000 | 200 | 165.9151 | 175.7443 | 1.059242 |
| gemma3 4B Q4_K - Medium | 10000 | 200 | 120.0141 | 126.6009 | 1.054884 |
| gemma3 12B Q4_K - Medium | 10 | 200 | 83.11534 | 85.4468 | 1.028051 |
| gemma3 12B Q4_K - Medium | 100 | 200 | 82.62634 | 84.6703 | 1.024737 |
| gemma3 12B Q4_K - Medium | 1000 | 200 | 80.07223 | 81.96644 | 1.023656 |
| gemma3 12B Q4_K - Medium | 10000 | 200 | 56.99771 | 59.67587 | 1.046987 |
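
The speed-up column is simply PR tok/sec divided by Master tok/sec; a quick worked check of the "up to 12%" figure using the gemma3 1B rows from the table above (values copied verbatim, the snippet itself is only illustrative):

```cpp
// Recompute the speed-up column for the gemma3 1B rows:
// speed-up = PR tok/sec / Master tok/sec.
#include <cstdio>

int main() {
    const double master[] = {318.6111, 309.4730, 284.6962, 183.7296};
    const double pr[]     = {333.6684, 328.0762, 319.3516, 206.1121};
    for (int i = 0; i < 4; ++i) {
        const double speedup = pr[i] / master[i];
        // prints ~1.047, 1.060, 1.122, 1.122 -> up to ~12% faster
        printf("speed-up: %.6f (+%.1f%%)\n", speedup, (speedup - 1.0) * 100.0);
    }
    return 0;
}
```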

Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models.
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 3, 2025
@JohannesGaessler (Collaborator) left a comment


Thank you, I probably forgot to adapt the logic at some point.

Co-authored-by: Johannes Gäßler <[email protected]>
@JohannesGaessler JohannesGaessler merged commit c262bed into ggml-org:master Apr 3, 2025
48 checks passed