KVCache Manager

Overview

The KVCache-Manager is designed to connect high-level serving-stack goals with concrete system capabilities through a layered objective structure:

Improve user experience
- By reducing Time-To-First-Token (TTFT)
  - Enabled through higher KVCache hit rates and reduced tensor transfers
  - Supported by smart routing and distributed cache availability
  - Optimized by proactive pre-placement of hot caches and session duplication/migration
Reduce serving costs
- By improving compute utilization
  - Minimize re-compute via KVCache reuse and locality-aware request handling
  - Leverage zero-copy cache transfers across nodes
Enable system scalability
- Through a distributed KVCache pool
  - Allows cache offloading and reuse across multiple serving instances
- User session duplication/migration for true and seamless load balancing

Vision

This goal structure above is shaped by our vision for emerging use cases like RAG and agentic workflows, which involve heavy context-reuse across sessions and instances. Shared documents, tool prompts, and workflow steps create overlapping token streams that benefit significantly from cross-instance KVCache coordination.

To implement this vision, the KVCache-Manager incorporates proactive cache placement, session duplication, and cluster-level cache APIs - bridging gaps in current serving stacks where KVCache management and utilization is not yet treated as a first-class concern.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KVCache Manager

Overview

Vision

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

KVCache Manager

Overview

Vision