Conversation
@orozery orozery commented Aug 10, 2025

This PR adds an offloading connector that delegates to a generic API introduced in #19848.
The actual implementation of this API is instantiated through a factory, which is currently empty.
A small follow-up PR will register a CPU implementation based on #20075 (scheduler-side implementation) and #21448 (worker-side implementation).

Part of RFC #19854.
Depends on PRs #19728, #19848, #19737.
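For context, here is a minimal sketch of what such a factory-based registration could look like. All names below (OffloadingBackendFactory, register_backend, create_backend, CPUOffloadingBackend) are illustrative assumptions, not the actual API introduced by this PR:

# Hypothetical sketch of a factory that maps backend names to constructors.
from typing import Callable


class OffloadingBackend:
    """Placeholder base class for a concrete offloading implementation."""


class OffloadingBackendFactory:
    _registry: dict[str, Callable[[], OffloadingBackend]] = {}

    @classmethod
    def register_backend(cls, name: str,
                         ctor: Callable[[], OffloadingBackend]) -> None:
        # The factory starts empty; follow-up PRs register concrete backends.
        cls._registry[name] = ctor

    @classmethod
    def create_backend(cls, name: str) -> OffloadingBackend:
        if name not in cls._registry:
            raise ValueError(f"Unknown offloading backend: {name}")
        return cls._registry[name]()


# A follow-up PR could then register a CPU backend, e.g.:
# OffloadingBackendFactory.register_backend("cpu", CPUOffloadingBackend)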


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a new offloading connector. The implementation is extensive and adds a lot of new components. My review found several critical issues that need to be addressed. These include a race condition in the tests, a critical assertion that would crash workers on transfer failures, a resource leak due to unjoined threads, and an incorrect list slicing that would lead to errors. These issues affect both the correctness of the new feature and the reliability of its tests.

@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from 4b24d03 to 4fca175 on August 10, 2025 at 14:49
@KuntaiDu
Collaborator

mark, will take a look and review after this PR gets stable.

@orozery orozery force-pushed the offloading-connector branch from 4fca175 to 8d7a0d7 on August 11, 2025 at 13:43
@mergify mergify bot added the documentation label Aug 11, 2025
@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from 866a51c to 4872976 on August 11, 2025 at 19:11
This commit adds a new offloading component, composed of:
1. A scheduler-side OffloadingManager (abstract) which kicks off KV data transfers and keeps track of offloaded data.
2. A worker-side OffloadingQueueManager which asynchronously manages KV transfers.

Signed-off-by: Or Ozeri <[email protected]>
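Roughly, the two components described in this commit could be sketched as follows; the method names and signatures below are paraphrased for illustration and may differ from the actual code:

# Paraphrased sketch of the scheduler-side and worker-side components.
from abc import ABC, abstractmethod
from typing import Any


class OffloadingManager(ABC):
    """Scheduler-side: kicks off KV transfers and tracks offloaded blocks."""

    @abstractmethod
    def prepare_store(self, block_hashes: list[int]) -> Any:
        """Decide which blocks to offload and where to store them."""

    @abstractmethod
    def prepare_load(self, block_hashes: list[int]) -> Any:
        """Resolve where previously offloaded blocks can be loaded from."""


class OffloadingQueueManager:
    """Worker-side: executes KV transfer jobs asynchronously."""

    def transfer_async(self, job_id: int, transfer_spec: Any) -> None:
        """Queue a transfer job without blocking the forward pass."""

    def get_finished(self) -> list[tuple[int, bool]]:
        """Return (job_id, success) pairs for completed transfers."""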

mergify bot commented Aug 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 14, 2025
@orozery orozery force-pushed the offloading-connector branch from 4872976 to 9d2e0b9 on August 14, 2025 at 12:40
@mergify mergify bot removed the needs-rebase label Aug 14, 2025
This commit moves the request block hashes from the KVCacheManager
to the Request object itself.
In particular, this allows connectors to access the request block hashes.

Signed-off-by: Or Ozeri <[email protected]>
This commit adds a new scheduler-side connector API
to collect KV cache events.
Additionally, we add a medium field to KV events to allow
distinguishing KV events across different mediums
(e.g. blocks stored on CPU, disk, or GPU, which is the default).

Signed-off-by: Or Ozeri <[email protected]>
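To illustrate the medium field mentioned above (the event class and field names here are simplified and may not match the actual vLLM KV event classes):

# Simplified sketch of a KV cache event carrying a medium tag.
from dataclasses import dataclass


@dataclass
class BlockStoredEvent:
    block_hashes: list[int]
    medium: str = "gpu"  # default medium; connectors may emit "cpu" or "disk"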
This commit introduces a new OffloadingConnector for
offloading blocks of KV data via a generic interface.

Signed-off-by: Or Ozeri <[email protected]>
@orozery orozery force-pushed the offloading-connector branch from 9d2e0b9 to 11e1629 on August 14, 2025 at 12:50
@ApostaC
Collaborator

ApostaC commented Aug 19, 2025

@orozery Hey, thanks for the amazing work!

Is there a centralized branch for us to run some benchmarks? We are excited to test it and would like to push for landing this connector-based CPU offloading solution if it has good performance 🚀.

@ZrBac

ZrBac commented Aug 22, 2025

> @orozery Hey, thanks for the amazing work!
>
> Is there a centralized branch for us to run some benchmarks? We are excited to test it and would like to push for landing this connector-based CPU offloading solution if it has good performance 🚀.

Try this branch: https://github.com/orozery/vllm/tree/cpu-offloading-afa5b7

Collaborator

@ApostaC ApostaC left a comment

@orozery Thanks for the great effort! Some high-level comments:

  1. The current implementation is a bit over-complicated. We should simplify the transfer_fn and the LoadStoreSpec abstractions to get better performance and maintainability.
  2. There are a few potential performance improvements we can make (immediately or as follow-ups):
    (a) Launch the d2h/h2d copy kernels on a separate CUDA stream.
    (b) Use CUDA events to implement the async loading so that we don't need to launch extra Python threads in the worker process.

@dataclass
class PrepareStoreOutput:
    block_hashes_to_store: list[int]
    store_specs: list[LoadStoreSpec]
Collaborator

Having store_specs be a list of Python objects will be pretty heavy.
From other related PRs, I see that we are going to transmit this list between processes and threads, plus doing some for loops over it in the worker process. This can incur a huge amount of Python-level overhead.

A proposal is to use a torch.Tensor for now, since BlockIDLoadStoreSpec is just a wrapper around an integer.
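For example, a tensor-based variant of the proposal might look like the sketch below (field names are illustrative, assuming store targets reduce to plain block IDs):

import torch
from dataclasses import dataclass


@dataclass
class PrepareStoreOutputTensors:
    # One entry per block to store, instead of a list of per-block
    # Python spec objects.
    block_hashes_to_store: torch.Tensor  # int64, shape (num_blocks,)
    store_block_ids: torch.Tensor        # int64, shape (num_blocks,)


# Construction stays cheap and can be passed between processes without
# per-element Python overhead:
out = PrepareStoreOutputTensors(
    block_hashes_to_store=torch.tensor([101, 102, 103], dtype=torch.int64),
    store_block_ids=torch.tensor([7, 8, 9], dtype=torch.int64),
)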

Comment on lines +8 to +37
class BlockIDLoadStoreSpec(LoadStoreSpec, ABC):
    """
    Spec for loading/storing a KV block from a given block number.
    """

    def __init__(self, block_id: int):
        self.block_id = block_id

    def __repr__(self) -> str:
        return str(self.block_id)


class GPULoadStoreSpec(BlockIDLoadStoreSpec):
    """
    Spec for loading/storing a KV block to GPU memory.
    """

    @staticmethod
    def medium() -> str:
        return "GPU"


class CPULoadStoreSpec(BlockIDLoadStoreSpec):
    """
    Spec for loading/storing a KV block to CPU memory.
    """

    @staticmethod
    def medium() -> str:
        return "CPU"
Collaborator

This LoadStoreSpec abstraction seems over-complicated. Why do we need it? Are there simpler alternatives (e.g., just using two lists or two tensors for cpu->gpu block IDs and gpu->cpu block IDs)?

Comment on lines +47 to +61
    @abstractmethod
    def get_transfer_functions(
        self, kv_caches: dict[str, torch.Tensor]
    ) -> Iterator[tuple[type[LoadStoreSpec], type[LoadStoreSpec],
                        TransferFunction, int]]:
        """
        Get transfer functions along with their respective src and dst types.

        Args:
            kv_caches: A dictionary of layer_name -> gpu_kv_cache tensor.

        Yields:
            Tuples of (src_type, dst_type, transfer_function, num_threads).
        """
        pass
Collaborator

I'm not sure of the purpose of having such an abstraction for CPU offloading. The logic is a bit hard to follow here.

Can we directly call swap_blocks in the connector? That would be simpler and easier to understand.
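As a rough sketch of this simpler alternative, assuming the existing vllm._custom_ops.swap_blocks(src, dst, block_mapping) cache op, where block_mapping is an int64 tensor of (src_block_id, dst_block_id) pairs (the helper function name below is hypothetical):

import torch
from vllm import _custom_ops as ops


def offload_to_cpu(gpu_kv_caches: dict[str, torch.Tensor],
                   cpu_kv_caches: dict[str, torch.Tensor],
                   block_mapping: torch.Tensor) -> None:
    """Copy the given GPU blocks into CPU blocks, layer by layer."""
    for layer_name, gpu_cache in gpu_kv_caches.items():
        # block_mapping: (num_blocks, 2) tensor of (gpu_block_id, cpu_block_id)
        ops.swap_blocks(gpu_cache, cpu_kv_caches[layer_name], block_mapping)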

Comment on lines +51 to +56
        for thread_idx in range(num_threads):
            t = threading.Thread(target=self.run,
                                 args=(thread_idx, ),
                                 name=f"{transfer_type}-worker-{thread_idx}")
            t.start()
            self._worker_threads.append(t)
Collaborator

Having another thread in the worker process may incur extra overhead.

At a high level, we might want to use CUDA events to achieve async so that we don't need to create new threads.
IIUC, this could be a longer discussion, and we can gradually push the implementation in.
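A rough sketch of the CUDA-event approach, using plain torch.cuda primitives (the surrounding function names are illustrative):

import torch

copy_stream = torch.cuda.Stream()


def submit_copies(enqueue_copies) -> torch.cuda.Event:
    """Queue the copies on a side stream and return an event to poll later."""
    event = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        enqueue_copies()        # launches async copy kernels on copy_stream
        event.record(copy_stream)
    return event


# Later, from the existing worker loop (non-blocking):
# if event.query():
#     report_transfer_finished(job_id, success=True)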

                               job_id)

        try:
            success = self.transfer_fn(transfer_spec)
Collaborator

From what I saw in other PRs, the transfer_fn and the internal swap_blocks are launched in the same CUDA stream as the LLM inference.
This will make CPU offloading a blocking operation, resulting in a negative performance impact, especially when there is no KV cache reuse.

For performance's sake, we should make sure the d2h and h2d copies are launched on separate CUDA streams.
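For instance, a minimal sketch of launching the d2h copy on a dedicated stream so it can overlap with model execution (buffer names are illustrative; the CPU tensor must be in pinned memory for the copy to be truly asynchronous):

import torch

d2h_stream = torch.cuda.Stream()


def copy_block_to_cpu(gpu_block: torch.Tensor, cpu_block: torch.Tensor) -> None:
    # Make sure pending writes to this block on the compute stream finish first.
    d2h_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(d2h_stream):
        # Asynchronous d2h copy; does not block the compute stream.
        cpu_block.copy_(gpu_block, non_blocking=True)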

Labels: ci/build, documentation, v1