Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Before the generation of each token, tensors are copied from the host CPU to the backend GPU. There currently exists a separate copy for each iteration of a loop:
```cpp
for (int input_id = 0; input_id < split->n_inputs; input_id++)
```
where, e.g., for llama3 8B Q4_K_M, n_inputs is 6, resulting in 6 separate copies. These incur significant CUDA API overhead (when running on an NVIDIA GPU) from the repeated memcpy and streamSync calls:

These can be consolidated into a single copy, substantially reducing the overheads:

This can be done in an isolated change which involves introducing buffers to hold the multiple tensors in consecutive memory on host and device.
Motivation
The optimisation improves inference performance. At the moment the improvement is modest but significant (e.g. around 1% on Blackwell RTX Pro 6000 for llama3 8B Q4_K_M). Optimisation of such CPU-side API overheads will become increasingly important as GPUs continue to get faster while CPU performance is largely stagnant.
Implementation
PR: #15750