Commit d85bdbb

vMaroon and elevran authored
[KV-Events] KV-Events Processing - Part 3 of 3 (#44)
* minor refactoring
* implemented kv-events processing with a zmq subscriber and workers pool
* updated example with in-memory indexing
* switch to ungated bert model in example
* general refactoring; switched to int64 kvblock-hashes instead of string
* added kv-events offline/online examples; updated helm chart to enable kv-events and run online example; general refactoring and minor fixes
* added TODOs
* Update examples/kv-events/README.md (Co-authored-by: Etai Lev Ran <[email protected]>)
* general refactoring (addressed review comments)
* typo fix
* Update README.md
* fixed default ZMQEndpoint config typo
* update vLLM fork commit hash
* updated vLLM temporary deployment
* improved docs; minor refactoring
* typo fix

---------

Signed-off-by: Maroon Ayoub <[email protected]>
Co-authored-by: Etai Lev Ran <[email protected]>
1 parent 2d3b68d commit d85bdbb
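
One of the commits above introduces "kv-events processing with a zmq subscriber and workers pool". As a rough sketch of that pattern only — not the repository's actual implementation; the endpoint, subscription filter, worker count, and payload handling below are all placeholders — a subscriber feeding a worker pool in Go with the pebbe/zmq4 bindings could look like this:

```go
package main

import (
	"log"
	"sync"

	zmq "github.com/pebbe/zmq4"
)

func main() {
	// SUB socket subscribed to all topics; the endpoint is a placeholder,
	// not the manager's actual default ZMQEndpoint.
	sub, err := zmq.NewSocket(zmq.SUB)
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Close()
	if err := sub.Connect("tcp://localhost:5557"); err != nil {
		log.Fatal(err)
	}
	if err := sub.SetSubscribe(""); err != nil {
		log.Fatal(err)
	}

	// Workers pool: events are handed off on a buffered channel so that
	// slow event processing never blocks the subscriber loop.
	events := make(chan [][]byte, 1024)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range events {
				// Placeholder: decode the KV-event payload and update the index here.
				log.Printf("processing %d-frame event", len(msg))
			}
		}()
	}

	for {
		msg, err := sub.RecvMessageBytes(0) // blocks until a multi-frame message arrives
		if err != nil {
			break
		}
		events <- msg
	}
	close(events)
	wg.Wait()
}
```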

File tree

38 files changed: +1944 −482


.github/workflows/ci-pr-checks.yaml

Lines changed: 5 additions & 0 deletions

```diff
@@ -13,6 +13,11 @@ jobs:
       - name: Checkout source
         uses: actions/checkout@v4
 
+      - name: Install system dependencies (ZeroMQ)
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libzmq3-dev pkg-config
+
       - name: Sanity check repo contents
         run: ls -la
 
```
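For context on why this job installs `libzmq3-dev` and `pkg-config`: the manager is built with CGO, and Go's cgo tool typically resolves a C library like ZeroMQ through pkg-config. A minimal illustrative sketch of that mechanism (an assumed setup, not a file from this repository) — if pkg-config cannot find libzmq, a file like this fails to build, which is exactly what the CI step guards against:

```go
package main

/*
#cgo pkg-config: libzmq
#include <zmq.h>
*/
import "C"

import "fmt"

func main() {
	// zmq_version reports the version of the linked libzmq;
	// building this requires the ZeroMQ headers found via pkg-config.
	var major, minor, patch C.int
	C.zmq_version(&major, &minor, &patch)
	fmt.Printf("linked against ZeroMQ %d.%d.%d\n", major, minor, patch)
}
```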

Dockerfile

Lines changed: 15 additions & 6 deletions

```diff
@@ -20,7 +20,11 @@ ARG TARGETARCH
 WORKDIR /workspace
 
 USER root
-RUN dnf install -y gcc-c++ libstdc++ libstdc++-devel clang && dnf clean all
+# Install EPEL repository directly and then ZeroMQ, as epel-release is not in default repos.
+# The builder is based on UBI8, so we need epel-release-8.
+RUN dnf install -y 'https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm' && \
+    dnf install -y gcc-c++ libstdc++ libstdc++-devel clang zeromq-devel pkgconfig && \
+    dnf clean all
 
 # Copy the Go Modules manifests
 COPY go.mod go.mod
@@ -30,7 +34,7 @@ COPY go.sum go.sum
 RUN go mod download
 
 # Copy the go source
-COPY examples/kv-cache-index/main.go cmd/cmd.go
+COPY examples/kv_events examples/kv_events
 COPY . .
 
 # HuggingFace tokenizer bindings
@@ -44,15 +48,20 @@ RUN ranlib lib/*.a
 # the docker BUILDPLATFORM arg will be linux/arm64 when for Apple x86 it will be linux/amd64. Therefore,
 # by leaving it empty we can ensure that the container and binary shipped on it will have the same platform.
 
-RUN CGO_ENABLED=1 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build -ldflags="-extldflags '-L$(pwd)/lib'" -a -o bin/kv-cache-manager cmd/cmd.go
+RUN CGO_ENABLED=1 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH:-amd64} go build -ldflags="-extldflags '-L$(pwd)/lib'" -a -o bin/kv-cache-manager examples/kv_events/online/main.go
 
 # Use distroless as minimal base image to package the manager binary
 # Refer to https://github.com/GoogleContainerTools/distroless for more details
 FROM registry.access.redhat.com/ubi9/ubi:latest
 WORKDIR /
+# Install zeromq runtime library needed by the manager.
+# The final image is UBI9, so we need epel-release-9.
+USER root
+RUN dnf install -y 'https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm' && \
+    dnf install -y zeromq
+
 COPY --from=builder /workspace/bin/kv-cache-manager /app/kv-cache-manager
 USER 65532:65532
 
-CMD ["sleep", "infinity"]
-
-
+# Set the entrypoint to the kv-cache-manager binary
+ENTRYPOINT ["/app/kv-cache-manager"]
```

Makefile

Lines changed: 1 addition & 1 deletion

```diff
@@ -88,7 +88,7 @@ e2e-test: download-tokenizer
 .PHONY: build
 build: check-go download-tokenizer ##
 	@printf "\033[33;1m==== Building ====\033[0m\n"
-	go build -ldflags="$(LDFLAGS)" -o bin/$(PROJECT_NAME) examples/kv-cache-index/main.go
+	go build -ldflags="$(LDFLAGS)" -o bin/$(PROJECT_NAME) examples/kv_cache_index/main.go
 
 .PHONY: image-build
 image-build: check-container-tool load-version-json ## Build Docker image ## Build Docker image using $(CONTAINER_TOOL)
```

README.md

Lines changed: 58 additions & 46 deletions

````diff
@@ -1,60 +1,72 @@
-# KVCache Manager
+# KV-Cache Manager
 
-## Introduction
+### Introduction
 
-LLM inference can be computationally expensive due to the sequential nature of token generation.
-KV-caching plays a critical role in optimizing this process. By storing previously computed key and value attention vectors,
-KVCache reuse avoids redundant computations during inference, significantly reducing latency and resource consumption.
-This is particularly beneficial for long context multi-turn conversations or Agentic (&RAG) applications where
-previously computed information can be leveraged effectively.
-Efficient KVCache management and routing are essential for scaling LLM inference and delivering a responsive user experience.
+Efficiently caching Key & Value (KV) tensors is crucial for optimizing LLM inference.
+Reusing the KV-Cache, rather than recomputing it, significantly improves both Time To First Token (TTFT) and overall throughput, while also maximizing system resource-utilization.
+As a distributed LLM inference platform, `llm-d` provides a comprehensive suite of KV-Cache management capabilities to achieve these goals.
 
-llmd-kv-cache-manager is a pluggable KVCache Manager for KVCache Aware Routing in vLLM-based serving platforms.
+This repository contains the `llm-d-kv-cache-manager`, a pluggable service designed to enable **KV-Cache Aware Routing** and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.
 
-See [docs](docs/README.md) for more information on goals, architecture and more.
-## Overview
+### Project Northstar
 
-The code defines a [KVCacheIndexer](pkg/kv-cache/indexer.go) module that efficiently maintains a global view of KVCache states and localities.
-In the current state of vLLM, the only available information on KVCache availability is that of the offloaded tensors to KVCache Engines via the Connector API.
+See the [Project Northstar](https://docs.google.com/document/d/1EM1QtDUaw7pVRkbHQFTSCQhmWqAcRPJugJgqPbvzGTA/edit?tab=t.ikcvw3heciha) document for a detailed overview of the project's goals and vision.
 
-The `kvcache.Indexer` module is a pluggable Go package designed for use by orchestrators to enable KVCache-aware scheduling decisions.
+-----
+
+## KV-Cache Indexer Overview
+
+One of the major components of this project is the **KVCache Indexer**: a high-performance Go service that maintains a global, near-real-time view of KV-Cache block locality.
+
+It is powered by `KVEvents` streamed from vLLM, which provide structured metadata as KV-blocks are created or evicted from a vLLM instance's KV-cache.
+This allows the indexer to track which blocks reside on which nodes and on which tier (e.g., GPU or CPU).
+This metadata is the foundation for intelligent routing, enabling schedulers to make optimal, cache-aware placement decisions.
+
+The diagram below shows the primary data flows: the **Read Path** (scoring) and the **Write Path** (event ingestion).
 
 ```mermaid
-graph
-    subgraph Cluster
-        Router
-        subgraph KVCacheManager[KVCache Manager]
-            KVCacheIndexer[KVCache Indexer]
-            PrefixStore[LRU Prefix Store]
-            KVBlockToPodIndex[KVBlock to Pod availability Index]
+graph TD
+    subgraph Scheduler / Router
+        A[Scheduler]
    end
-        subgraph vLLMNode[vLLM Node]
-            vLLMCore[vLLM Core]
-            KVCacheEngine["KVCache Engine (LMCache)"]
+
+    subgraph KVCacheManager["KV-Cache Manager"]
+        B[KVCache Indexer API]
+        C[KV-Block Index]
+        D[Event Subscriber]
    end
-        Redis
-    end
-
-    Router -->|"Score(prompt, ModelName, relevantPods)"| KVCacheIndexer
-    KVCacheIndexer -->|"{Pod to Scores map}"| Router
-    Router -->|Route| vLLMNode
-
-    KVCacheIndexer -->|"FindLongestTokenizedPrefix(prompt, ModelName) -> tokens"| PrefixStore
-    PrefixStore -->|"DigestPromptAsync"| PrefixStore
-    KVCacheIndexer -->|"GetPodsForKeys(tokens) -> {KVBlock keys to Pods} availability map"| KVBlockToPodIndex
-    KVBlockToPodIndex -->|"Redis MGet(blockKeys) -> {KVBlock keys to Pods}"| Redis
-
-    vLLMCore -->|Connector API| KVCacheEngine
-    KVCacheEngine -->|"UpdateIndex(KVBlock keys, nodeIP)"| Redis
+
+    subgraph vLLM Fleet
+        E[vLLM Pod 1]
+        F[vLLM Pod 2]
+        G[...]
+    end
+
+    A -- "1. Score(prompt, pods)" --> B
+    B -- "2. Query Index" --> C
+    B -- "3. Return Scores" --> A
+
+    E -- "4. Emit KVEvents" --> D
+    F -- "4. Emit KVEvents" --> D
+    D -- "5. Update Index" --> C
+
 ```
-This overview greatly simplifies the actual architecture and combines steps across several submodules.
-For a detailed architecture, refer to the [architecture](docs/architecture.md) document.
 
-## Examples
+1. **Scoring Request**: A scheduler asks the **KVCache Indexer** to score a set of pods for a given prompt
+2. **Index Query**: The indexer calculates the necessary KV-block keys from the prompt and queries the **KV-Block Index** to see which pods have those blocks
+3. **Return Scores**: The indexer returns a map of pods and their corresponding KV-cache-hit scores to the scheduler
+4. **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes
+5. **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time
+
+* For a more detailed breakdown, please see the high-level [Architecture Document](docs/architecture.md).
+
+-----
 
-- [KVCache Indexer](examples/kv-cache-index/README.md):
-  - A reference implementation of using the `kvcache.Indexer` module.
-- [KVCache Aware Scorer](examples/kv-cache-aware-scorer/README.md):
-  - A reference implementation of integrating the `kvcache.Indexer` module in
-    [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler) in a KVCache aware scorer.
+### Examples
 
+* [**KVCache Indexer**](examples/kv_cache_index/README.md):
+  A reference implementation showing how to run and use the `kvcache.Indexer` module
+* [**KVCache Aware Scorer**](examples/kv_cache_aware_scorer/README.md):
+  A reference implementation of how to integrate the `kvcache.Indexer` into a scheduler like the `llm-d-inference-scheduler`
+* [**KV-Events**](examples/kv_events/README.md):
+  Demonstrates how the KV-Cache Manager handles KV-Events through both an offline example with a dummy ZMQ publisher and an online example using a vLLM Helm chart.
````
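
To make the read path described in the updated README concrete, here is a hypothetical Go sketch of the scoring flow. The `Indexer` interface, the `Score` signature, and the dummy scores are illustrative assumptions derived from the diagram labels, not the actual `kvcache.Indexer` API:

```go
package main

import (
	"context"
	"fmt"
)

// Indexer is a hypothetical shape for the read path in the README diagram:
// Score(prompt, pods) returns a pod -> KV-cache-hit score map.
// The real kvcache.Indexer API may differ.
type Indexer interface {
	Score(ctx context.Context, prompt, modelName string, pods []string) (map[string]float64, error)
}

// fakeIndexer stands in for the real index lookup (steps 1-3 in the README).
type fakeIndexer struct{}

func (fakeIndexer) Score(_ context.Context, _, _ string, pods []string) (map[string]float64, error) {
	scores := make(map[string]float64, len(pods))
	for i, p := range pods {
		scores[p] = float64(i) // dummy scores for illustration only
	}
	return scores, nil
}

// pickPod routes to the pod with the highest cache-hit score,
// mirroring step 3 ("Return Scores") feeding a routing decision.
func pickPod(ctx context.Context, idx Indexer, prompt, model string, pods []string) (string, error) {
	scores, err := idx.Score(ctx, prompt, model, pods)
	if err != nil {
		return "", err
	}
	best, bestScore := "", -1.0
	for pod, s := range scores {
		if s > bestScore {
			best, bestScore = pod, s
		}
	}
	return best, nil
}

func main() {
	pod, err := pickPod(context.Background(), fakeIndexer{}, "Hello, world", "bert-base-uncased", []string{"pod-a", "pod-b"})
	if err != nil {
		panic(err)
	}
	fmt.Println("routing to", pod)
}
```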
