Consolidation of tensor copies to backend to reduce API overhead #162

@jakexcosme

Description

Note: This issue was copied from ggml-org#15749

Original Author: @agray3
Original Issue Number: ggml-org#15749
Created: 2025-09-02T16:02:57Z


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Before each token is generated, input tensors are copied from the host CPU to the backend GPU. Currently there is a separate copy for each iteration of the loop:
for (int input_id = 0; input_id < split->n_inputs; input_id++)
where, e.g., for llama3 8B Q4_K_M, n_inputs is 6. This results in 6 separate copies, which incurs significant CUDA API overhead (when running on an NVIDIA GPU) from the repeated memcpy and streamSync calls:
[Image: profiler trace showing a separate copy and sync for each input tensor]

These can be consolidated into a single copy, substantially reducing the overheads:
[Image: profiler trace after consolidation into a single copy]

This can be done as an isolated change that introduces buffers holding the multiple tensors in consecutive memory on both host and device.

Motivation

The optimisation improves inference performance. At the moment the improvement is modest but significant (e.g. around 1% on a Blackwell RTX Pro 6000 for llama3 8B Q4_K_M). Optimisation of such CPU-side API overheads will become increasingly important as GPUs continue to get faster while CPU performance remains largely stagnant.

Implementation

PR: ggml-org#15750

Metadata

Labels: enhancement (New feature or request)