The KVCache-Manager is designed to connect high-level serving-stack goals with concrete system capabilities through a layered objective structure:
- Improve user experience
- By reducing Time-To-First-Token (TTFT)
- Enabled through higher KVCache hit rates and reduced tensor transfers
- Supported by smart routing and distributed cache availability
- Optimized by proactive pre-placement of hot caches and session duplication/migration
- By reducing Time-To-First-Token (TTFT)
- Reduce serving costs
- By improving compute utilization
- Minimize re-compute via KVCache reuse and locality-aware request handling
- Leverage zero-copy cache transfers across nodes
- By improving compute utilization
- Enable system scalability
- Through a distributed KVCache pool
- Allows cache offloading and reuse across multiple serving instances
- User session duplication/migration for true and seamless load balancing
- Through a distributed KVCache pool
This goal structure above is shaped by our vision for emerging use cases like RAG and agentic workflows, which involve heavy context-reuse across sessions and instances. Shared documents, tool prompts, and workflow steps create overlapping token streams that benefit significantly from cross-instance KVCache coordination.
To implement this vision, the KVCache-Manager incorporates proactive cache placement, session duplication, and cluster-level cache APIs - bridging gaps in current serving stacks where KVCache management and utilization is not yet treated as a first-class concern.