
@ArshM17-NV

Summary:

CUDA Graphs (CG) were being disabled for the Embedding Gemma model by one of the graph-compatibility heuristics. This PR addresses that case to enable CG usage and improve performance.

Fixes:

  • Add-op heuristic:
    • Location: ggml-cuda.cu#L2831-L2849
    • Issue: The heuristic incorrectly triggered for Embedding Gemma, which disabled CUDA Graphs.
    • Fix: Skip the heuristic for the specific nodes that erroneously trigger it (see the sketch after this list). Verified that the selected node names are unique and do not appear in other architectures.
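A minimal sketch of the kind of name-based exemption described above, assuming the heuristic's existing shape check on GGML_OP_ADD nodes; the node name "example_norm" and the helper name are hypothetical placeholders, not the actual names from the patch:

```cpp
// Sketch only, not the actual patch. The add-op heuristic disables CUDA
// graphs when an ADD node looks like batched processing; the exemption
// lets specifically named nodes through.
#include <cstring>
#include "ggml.h"

static bool add_heuristic_disables_cuda_graphs(const ggml_tensor * node) {
    // Existing heuristic: an ADD whose second source has ne[1] > 1 is
    // treated as a sign of batched work that is unsafe to capture.
    const bool looks_batched = node->op == GGML_OP_ADD &&
                               node->src[1] != nullptr &&
                               node->src[1]->ne[1] > 1;
    if (!looks_batched) {
        return false;
    }
    // Exemption: node names verified to be unique to the affected
    // architecture stay graph-compatible ("example_norm" is hypothetical).
    if (strstr(node->name, "example_norm") != nullptr) {
        return false;
    }
    return true;
}
```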

Results:

Performance before: [benchmark screenshot]

Performance after: [benchmark screenshot]

@github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Oct 29, 2025
@ArshM17-NV (Author)

@CISC @slaren please review this PR when you get a chance. Thanks!

@slaren (Member) commented Oct 31, 2025

The problem I see with this approach is that it will only work if every sequence is the same length. If the sequence length changes, the CUDA graph will need to be rebuilt. I am not sure how common that is in practice.
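For illustration, a minimal sketch of the rebuild cost being described, assuming a hypothetical launch_model_kernels() helper whose launch dimensions depend on the sequence length; a captured graph bakes those dimensions in, so a new length forces re-capture and re-instantiation:

```cpp
// Sketch: replaying a captured CUDA graph is cheap, but any change in
// sequence length invalidates the capture and forces a full rebuild.
#include <cuda_runtime.h>

void launch_model_kernels(cudaStream_t stream, int seq_len); // hypothetical

void run_with_graph(cudaStream_t stream, int seq_len) {
    static cudaGraphExec_t graph_exec = nullptr;
    static int captured_seq_len = -1;

    if (graph_exec == nullptr || captured_seq_len != seq_len) {
        // Sequence length changed: destroy the old instance, re-capture,
        // and re-instantiate. This cost is paid on every length change.
        if (graph_exec != nullptr) {
            cudaGraphExecDestroy(graph_exec);
        }
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        launch_model_kernels(stream, seq_len);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graph_exec, graph, 0); // CUDA 12 signature
        cudaGraphDestroy(graph);
        captured_seq_len = seq_len;
    }
    cudaGraphLaunch(graph_exec, stream); // cheap replay when the shape matches
}
```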

@ggerganov (Member)

> I am not sure how common that is in practice.

Likely it is going to occur very often: my impression is that most user code does not pad the input embeddings, so they typically have varying lengths.

We can add logic in llama_encode to do the padding automatically. It could be made optional through a llama_context_param parameter.
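A minimal sketch of the padding idea, assuming a hypothetical pad granularity field (no such llama_context_param exists yet); rounding the token count up to a fixed multiple keeps the graph's shapes stable across calls:

```cpp
// Sketch of optional automatic padding in llama_encode (hypothetical;
// no such parameter exists yet). Rounding the batch up to a multiple
// of pad_to keeps tensor shapes, and thus the captured CUDA graph, reusable.
#include <cstdint>

static int32_t padded_n_tokens(int32_t n_tokens, int32_t pad_to) {
    if (pad_to <= 1) {
        return n_tokens;                     // padding disabled
    }
    return ((n_tokens + pad_to - 1) / pad_to) * pad_to;
}
// The extra positions would be filled with padding embeddings and
// masked out so they do not affect the pooled output.
```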
