Skip to content

Latest commit

 

History

History
33 lines (25 loc) · 1.49 KB

File metadata and controls

33 lines (25 loc) · 1.49 KB

KVCache Manager

Overview

The KVCache-Manager is designed to connect high-level serving-stack goals with concrete system capabilities through a layered objective structure:

  • Improve user experience
    • By reducing Time-To-First-Token (TTFT)
      • Enabled through higher KVCache hit rates and reduced tensor transfers
      • Supported by smart routing and distributed cache availability
      • Optimized by proactive pre-placement of hot caches and session duplication/migration
  • Reduce serving costs
    • By improving compute utilization
      • Minimize re-compute via KVCache reuse and locality-aware request handling
      • Leverage zero-copy cache transfers across nodes
  • Enable system scalability
    • Through a distributed KVCache pool
      • Allows cache offloading and reuse across multiple serving instances
    • User session duplication/migration for true and seamless load balancing

Vision

This goal structure above is shaped by our vision for emerging use cases like RAG and agentic workflows, which involve heavy context-reuse across sessions and instances. Shared documents, tool prompts, and workflow steps create overlapping token streams that benefit significantly from cross-instance KVCache coordination.

To implement this vision, the KVCache-Manager incorporates proactive cache placement, session duplication, and cluster-level cache APIs - bridging gaps in current serving stacks where KVCache management and utilization is not yet treated as a first-class concern.