
Conversation

anavp-nvidia (Contributor)

Summary:

CUDA Graphs (CG) were being disabled for the Nemotron Nano v2 (NemotronH) model due to a combination of heuristics and graph-splitting issues. This PR addresses those cases to enable CG usage and improve performance.

Fixes:

  • Add-op batch size heuristic:
    • Location: ggml-cuda.cu#L2666-L2682
    • Issue: The heuristic incorrectly triggered for NemotronH even with batch size = 1 (similar to the Gemma 3n issue)
    • Fix: Skipped the heuristic for specific nodes that erroneously trigger it. Verified that the selected node names are unique and do not appear in other architectures.
  • Copy-op heuristic:
    • Location: ggml-cuda.cu#L2684-L2698
    • Issue: These copies used cudaMemcpyAsync, whose parameters cannot be updated via indirection from within a CUDA Graph, leading to CG disablement.
    • Fix: For required data types, rerouted to existing GGML CUDA copy kernels (instead of cudaMemcpyAsync), enabling indirection from CG. Other types continue to fall back to cudaMemcpyAsync.
  • Excessive CG updates (split ggml_cgraphs):
    • Issue: The get_rows CPU op for input embeddings was scheduled after some GPU ops for Nemotron Nano v2, causing the graph to split and forcing excessive CG updates.
    • Fix: Used ggml_build_forward_expand to move the CPU op to the start of the graph, thus avoiding split ggml_cgraphs.

Results (RTX 5090):

Weights: bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF
Quantization: Q4_K_M

Performance before:

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_batch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       | 100 |       1 |  1 |           tg200 |        120.78 ± 0.60 |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       | 100 |       1 |  1 |    pp1000+tg200 |        125.85 ± 0.40 |

Performance after:

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_batch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       | 100 |       1 |  1 |           tg200 |        164.68 ± 1.36 |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       | 100 |       1 |  1 |    pp1000+tg200 |        173.88 ± 0.04 |

@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 29, 2025
pwilkin (Collaborator) commented on Sep 29, 2025

This is a very cool enhancement, Nemotron Nano v2 was already very fast. Big thanks!

@ggerganov ggerganov merged commit a014310 into ggml-org:master Sep 30, 2025
57 of 60 checks passed
