Consolidation of tensor copies to backend to reduce API overhead #15749

@agray3

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Before each token is generated, tensors are copied from the host CPU to the backend GPU. There is currently a separate copy for each iteration of the loop

```cpp
for (int input_id = 0; input_id < split->n_inputs; input_id++)
```

where, e.g. for llama3 8B Q4_K_M, `n_inputs` is 6. This results in 6 separate copies, which incurs significant CUDA API overhead (when running on an NVIDIA GPU) from the repeated memcpy and streamSync calls:
(Profiler screenshot: six separate host-to-device copies per token, each incurring its own memcpy and streamSync API calls.)
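
As a rough illustration (not the actual ggml scheduler code; the helper name and signature here are made up for the sketch), the per-input pattern amounts to one memcpy and one stream synchronization per tensor:

```cpp
// Illustrative sketch only: hypothetical helper, not the real ggml code path.
#include <cuda_runtime.h>
#include <cstddef>
#include <utility>
#include <vector>

// One host-to-device transfer per input tensor. For n_inputs == 6 this
// issues 6 cudaMemcpyAsync calls and 6 cudaStreamSynchronize calls per token.
void upload_inputs_separately(const std::vector<std::pair<const void *, size_t>> & inputs,
                              const std::vector<void *> & dev_ptrs,
                              cudaStream_t stream) {
    for (size_t i = 0; i < inputs.size(); i++) {
        cudaMemcpyAsync(dev_ptrs[i], inputs[i].first, inputs[i].second,
                        cudaMemcpyHostToDevice, stream);
        // synchronize so the host buffer can safely be reused; this
        // per-tensor round trip is the API overhead described above
        cudaStreamSynchronize(stream);
    }
}
```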

These can be consolidated into a single copy, substantially reducing the overheads:
(Profiler screenshot: the same section of the timeline with the copies consolidated into a single memcpy and streamSync.)

This can be done as an isolated change that introduces buffers holding the multiple tensors in consecutive memory on both host and device.
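
A minimal sketch of the consolidation idea, again with hypothetical names and buffer management (allocation, sizing, and offset bookkeeping are the real work; see PR #15750 for the actual implementation):

```cpp
// Illustrative sketch only: hypothetical staging-buffer layout.
#include <cuda_runtime.h>
#include <cstring>
#include <utility>
#include <vector>

// Pack all input tensors back-to-back in a pinned host staging buffer,
// then transfer them with a single memcpy and a single sync.
void upload_inputs_consolidated(const std::vector<std::pair<const void *, size_t>> & inputs,
                                char * host_staging,  // pinned host buffer, sized for all inputs
                                char * dev_staging,   // contiguous device buffer of the same size
                                cudaStream_t stream) {
    size_t offset = 0;
    for (const auto & [data, size] : inputs) {
        std::memcpy(host_staging + offset, data, size);
        offset += size;
    }
    // one API round trip regardless of n_inputs
    cudaMemcpyAsync(dev_staging, host_staging, offset, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    // each tensor's device data now lives at dev_staging + its packed offset
}
```

The extra host-side packing memcpys are cheap relative to the saved API round trips, since they stay within CPU memory.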

Motivation

The optimisation improves inference performance. At the moment the improvement is modest but significant (e.g. around 1% on a Blackwell RTX Pro 6000 for llama3 8B Q4_K_M). Optimisation of such CPU-side API overheads will become increasingly important as GPUs continue to get faster while CPU performance remains largely stagnant.

Implementation

PR: #15750

Metadata

Labels: enhancement (New feature or request)