# KV-Cache Manager

### Introduction

Efficiently caching Key & Value (KV) tensors is crucial for optimizing LLM inference.
Reusing the KV-Cache, rather than recomputing it, significantly improves both Time To First Token (TTFT) and overall throughput, while also maximizing system resource utilization.
As a distributed LLM inference platform, `llm-d` provides a comprehensive suite of KV-Cache management capabilities to achieve these goals.

This repository contains the `llm-d-kv-cache-manager`, a pluggable service designed to enable **KV-Cache Aware Routing** and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.

### Project Northstar

See the [Project Northstar](https://docs.google.com/document/d/1EM1QtDUaw7pVRkbHQFTSCQhmWqAcRPJugJgqPbvzGTA/edit?tab=t.ikcvw3heciha) document for a detailed overview of the project's goals and vision.

-----

## KV-Cache Indexer Overview

One of the major components of this project is the **KVCache Indexer**: a high-performance Go service that maintains a global, near-real-time view of KV-Cache block locality.

It is powered by `KVEvents` streamed from vLLM, which provide structured metadata as KV-blocks are created or evicted from a vLLM instance's KV-cache.
This allows the indexer to track which blocks reside on which nodes and on which tier (e.g., GPU or CPU).
This metadata is the foundation for intelligent routing, enabling schedulers to make optimal, cache-aware placement decisions.
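The core of such an index can be sketched as a small in-memory structure that ingests block-lifecycle events. This is an illustrative sketch only: the type names, event fields, and tier labels below are assumptions for the example, not the actual `llm-d-kv-cache-manager` API.

```go
package main

import "fmt"

// BlockLocation records where a cached KV-block lives.
// Tier distinguishes storage media, e.g. "gpu" or "cpu" (labels assumed).
type BlockLocation struct {
	Pod  string
	Tier string
}

// KVEvent is a simplified stand-in for the structured metadata
// vLLM streams as KV-blocks are stored or evicted.
type KVEvent struct {
	BlockHash uint64
	Pod       string
	Tier      string
	Evicted   bool
}

// BlockIndex maps a KV-block hash to the set of locations holding it.
type BlockIndex map[uint64]map[BlockLocation]struct{}

// Apply ingests one event, adding or removing a block location.
func (idx BlockIndex) Apply(ev KVEvent) {
	loc := BlockLocation{Pod: ev.Pod, Tier: ev.Tier}
	if ev.Evicted {
		delete(idx[ev.BlockHash], loc) // no-op if the block is unknown
		return
	}
	if idx[ev.BlockHash] == nil {
		idx[ev.BlockHash] = map[BlockLocation]struct{}{}
	}
	idx[ev.BlockHash][loc] = struct{}{}
}

func main() {
	idx := BlockIndex{}
	idx.Apply(KVEvent{BlockHash: 42, Pod: "vllm-pod-1", Tier: "gpu"})
	idx.Apply(KVEvent{BlockHash: 42, Pod: "vllm-pod-2", Tier: "cpu"})
	idx.Apply(KVEvent{BlockHash: 42, Pod: "vllm-pod-2", Tier: "cpu", Evicted: true})
	fmt.Println(len(idx[42])) // one location remains after the eviction
}
```

A real implementation additionally needs concurrency control and eviction of stale entries; the sketch only captures the event-to-index data flow.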

The diagram below shows the primary data flows: the **Read Path** (scoring) and the **Write Path** (event ingestion).

```mermaid
graph TD
    subgraph SchedulerRouter["Scheduler / Router"]
        A[Scheduler]
    end

    subgraph KVCacheManager["KV-Cache Manager"]
        B[KVCache Indexer API]
        C[KV-Block Index]
        D[Event Subscriber]
    end

    subgraph vLLMFleet["vLLM Fleet"]
        E[vLLM Pod 1]
        F[vLLM Pod 2]
        G[...]
    end

    A -- "1. Score(prompt, pods)" --> B
    B -- "2. Query Index" --> C
    B -- "3. Return Scores" --> A

    E -- "4. Emit KVEvents" --> D
    F -- "4. Emit KVEvents" --> D
    D -- "5. Update Index" --> C
```

1. **Scoring Request**: A scheduler asks the **KVCache Indexer** to score a set of pods for a given prompt.
2. **Index Query**: The indexer calculates the necessary KV-block keys from the prompt and queries the **KV-Block Index** to see which pods hold those blocks.
3. **Return Scores**: The indexer returns a map of pods and their corresponding KV-cache-hit scores to the scheduler.
4. **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes.
5. **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time.
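To make the read path (steps 1-3) concrete, here is a minimal Go sketch of prefix-aware scoring. The chained block hashing, the block size, and the "count consecutive cached leading blocks" scoring rule are simplified assumptions for illustration, not the actual `kvcache.Indexer` implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const blockSize = 4 // tokens per KV-block (illustrative; real block sizes differ)

// blockKeys chains a running hash over fixed-size token blocks, so each
// block's key depends on its entire prefix, mirroring the fact that
// KV-cache reuse is only valid for an exactly matching prefix.
func blockKeys(tokens []int) []uint64 {
	var keys []uint64
	h := fnv.New64a()
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		for _, t := range tokens[i : i+blockSize] {
			fmt.Fprintf(h, "%d,", t) // hash state carries over across blocks
		}
		keys = append(keys, h.Sum64())
	}
	return keys
}

// score counts, per pod, how many consecutive leading blocks of the
// prompt are already cached on that pod; a gap ends the reusable prefix.
func score(keys []uint64, index map[uint64]map[string]bool, pods []string) map[string]int {
	scores := map[string]int{}
	for _, pod := range pods {
		for _, k := range keys {
			if !index[k][pod] {
				break
			}
			scores[pod]++
		}
	}
	return scores
}

func main() {
	tokens := []int{1, 2, 3, 4, 5, 6, 7, 8}
	keys := blockKeys(tokens) // two block keys for eight tokens
	index := map[uint64]map[string]bool{
		keys[0]: {"pod-a": true, "pod-b": true},
		keys[1]: {"pod-a": true},
	}
	// pod-a holds both leading blocks, pod-b only the first
	fmt.Println(score(keys, index, []string{"pod-a", "pod-b"}))
}
```

A scheduler would then route to the highest-scoring pod, falling back to load-based placement on ties or empty scores.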

For a more detailed breakdown, see the high-level [Architecture Document](docs/architecture.md).

-----

### Examples

* [**KVCache Indexer**](examples/kv_cache_index/README.md):
  A reference implementation showing how to run and use the `kvcache.Indexer` module.
* [**KVCache Aware Scorer**](examples/kv_cache_aware_scorer/README.md):
  A reference implementation of how to integrate the `kvcache.Indexer` into a scheduler such as the [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler).
* [**KV-Events**](examples/kv_events/README.md):
  Demonstrates how the KV-Cache Manager handles KV-Events, with both an offline example using a dummy ZMQ publisher and an online example using a vLLM Helm chart.