Consolidation of tensor copies to backend to reduce API overhead #162

@jakexcosme

Description

Note: This issue was copied from ggml-org#15749

Original Author: @agray3
Original Issue Number: ggml-org#15749
Created: 2025-09-02T16:02:57Z


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Before each token is generated, input tensors are copied from the host CPU to the backend GPU. Currently there is a separate copy for each iteration of the loop:
for (int input_id = 0; input_id < split->n_inputs; input_id++)
where, e.g., for llama3 8B Q4_K_M, n_inputs is 6. This results in 6 separate copies, which incurs significant CUDA API overhead (when running on an NVIDIA GPU) from the repeated memcpy and streamSync calls:
[Image: profiler trace showing a separate copy and sync for each input tensor]

These can be consolidated into a single copy, substantially reducing the overheads:
[Image: profiler trace after consolidation into a single copy]

This can be done as an isolated change that introduces buffers holding the multiple tensors in consecutive memory on both host and device.

Motivation

The optimisation improves inference performance. At the moment the improvement is modest but significant (e.g. around 1% on a Blackwell RTX Pro 6000 for llama3 8B Q4_K_M). Optimisation of such CPU-side API overheads will become increasingly important as GPUs continue to get faster while CPU performance remains largely stagnant.

Implementation

PR: ggml-org#15750

Metadata

Labels: enhancement (New feature or request)