anavp-nvidia
Contributor

As discussed in PR #16471, this PR removes the legacy copy-op pointer indirection code. With the indirection gone, contiguous F32 tensors can be copied with cudaMemcpyAsync instead of a CUDA copy kernel, which gives a ~4% performance improvement for the Nemotron Nano v2 (NemotronH) model on an RTX 5090.
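The dispatch decision described above can be sketched on the host side roughly as follows. This is a minimal illustration, not the actual ggml-cuda code: `TensorDesc`, `CopyPath`, and `choose_copy_path` are hypothetical names, and the exact conditions used in the real backend may differ.

```cpp
#include <cstddef>

// Hypothetical, simplified tensor descriptor (NOT ggml's real struct).
enum class DType { F32, F16 };

struct TensorDesc {
    DType       type;        // element type
    bool        contiguous;  // true if the data has no gaps or permuted strides
    std::size_t nbytes;      // total size in bytes
};

// Which mechanism would perform the copy.
enum class CopyPath {
    MemcpyAsync,  // a plain device-to-device cudaMemcpyAsync
    CopyKernel    // a dedicated CUDA copy/convert kernel
};

// Assumption for this sketch: a raw memcpy is only valid when both sides are
// contiguous F32 buffers of the same size, so no conversion or strided
// access is needed. Everything else falls back to the copy kernel.
CopyPath choose_copy_path(const TensorDesc & src, const TensorDesc & dst) {
    const bool plain_f32 = src.type == DType::F32 && dst.type == DType::F32;
    const bool flat      = src.contiguous && dst.contiguous;
    const bool same_size = src.nbytes == dst.nbytes;
    return (plain_f32 && flat && same_size) ? CopyPath::MemcpyAsync
                                            : CopyPath::CopyKernel;
}
```

In this sketch, a contiguous F32-to-F32 copy selects `CopyPath::MemcpyAsync`, while a non-contiguous source or a type conversion selects `CopyPath::CopyKernel`.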

Results:

Weights: bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF
Quantization: Q4_K_M

Performance before:

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 |    tg200 @ d100 |        165.50 ± 0.19 |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 | pp100+tg200 @ d100 |        174.14 ± 2.02 |

Performance after:

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 |    tg200 @ d100 |        172.22 ± 0.16 |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 | pp100+tg200 @ d100 |        180.91 ± 0.15 |
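For reference, the relative gains implied by the two tables above work out as follows (using only the mean t/s values, ignoring the reported ± spread):

```math
\begin{aligned}
\text{tg200 @ d100:} \quad & \frac{172.22 - 165.50}{165.50} \approx 0.0406 \;(\approx 4.1\%) \\
\text{pp100+tg200 @ d100:} \quad & \frac{180.91 - 174.14}{174.14} \approx 0.0389 \;(\approx 3.9\%)
\end{aligned}
```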

@anavp-nvidia anavp-nvidia requested a review from slaren as a code owner October 9, 2025 12:43
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 9, 2025
Member

@slaren slaren left a comment

Since changing addresses of cpy operations in CUDA graphs is no longer supported, the exception for GGML_OP_CPY in ggml_graph_node_has_matching_properties should also be removed.

The indirections in cpy ops should also be removed, along with ggml_cuda_cpy_dest_ptrs_copy and ggml_cuda_graph::cpy_dest_ptrs, since their only purpose was to allow this.
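To illustrate what removing the GGML_OP_CPY exception means, here is a simplified model of a graph-node property check. It is only a sketch: `Node`, `CapturedProps`, `matches_legacy`, and `matches_strict` are hypothetical names, and the real ggml_graph_node_has_matching_properties compares more properties than the two shown here.

```cpp
// Hypothetical, reduced op set and node types (NOT ggml's real definitions).
enum class Op { Add, MulMat, Cpy };

struct Node {
    Op           op;
    const void * data;  // current data address of the node
};

struct CapturedProps {
    Op           op;
    const void * data;  // address recorded when the CUDA graph was captured
};

// Legacy behaviour (assumed): a CPY node whose address changed still counts
// as matching, because the pointer indirection patched the address later.
bool matches_legacy(const Node & n, const CapturedProps & p) {
    if (n.op != p.op) {
        return false;
    }
    if (n.data != p.data && n.op != Op::Cpy) {
        return false;
    }
    return true;
}

// Without the exception: any address change, including on a CPY node, is a
// mismatch, so the caller would have to update or re-capture the graph.
bool matches_strict(const Node & n, const CapturedProps & p) {
    return n.op == p.op && n.data == p.data;
}
```

Under this model, a CPY node whose destination address moved passes `matches_legacy` but fails `matches_strict`, which is the behaviour the review comment asks for once the indirection is gone.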
