
Benchmark and support KV cache-aware routing for multi-replica deployments #16

@anfredette

Description

llm-d experiments show that routing requests to the replica with the highest KV cache hit rate dramatically improves latency and throughput. We should measure this impact and incorporate it into recommendations.

Acceptance Criteria

  • Benchmark multi-replica deployments with and without KV cache-aware routing
  • Measure latency and throughput improvements
  • Update capacity planning logic to account for routing efficiency
  • Document routing strategies (round-robin vs cache-aware vs semantic routing)
  • Integrate with router/gateway selection (see #11, "Add llm-d as deployment target alongside KServe/vLLM")
  • Recommend KV cache-aware routing when beneficial
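
To make the round-robin vs cache-aware comparison concrete, here is a minimal, hypothetical sketch of both routing policies. It is not llm-d's actual implementation; all names (`Replica`, `route_cache_aware`, the prefix-overlap score) are illustrative, and the "KV cache" is modeled simply as the set of token prefixes a replica has already served.

```python
# Hypothetical sketch of cache-aware routing: score each replica by the
# longest token-prefix overlap between the incoming prompt and anything in
# its (simulated) KV cache, then route to the highest-scoring replica.
# All names and data structures here are illustrative, not llm-d's API.

def shared_prefix_len(a, b):
    """Length of the common prefix of two token-ID sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Replica:
    def __init__(self, name):
        self.name = name
        self.cached_prefixes = []  # token sequences currently "in" the KV cache

    def cache_score(self, prompt_tokens):
        """Best prefix overlap between the prompt and any cached sequence."""
        if not self.cached_prefixes:
            return 0
        return max(shared_prefix_len(p, prompt_tokens) for p in self.cached_prefixes)

    def serve(self, prompt_tokens):
        # Serving a request leaves its prefix in this replica's cache.
        self.cached_prefixes.append(prompt_tokens)

def route_round_robin(replicas, counter):
    """Baseline: ignore cache state, rotate through replicas."""
    return replicas[counter % len(replicas)]

def route_cache_aware(replicas, prompt_tokens):
    """Pick the replica with the highest cache overlap; break ties by
    preferring the replica with fewer cached entries (a load proxy)."""
    return max(
        replicas,
        key=lambda r: (r.cache_score(prompt_tokens), -len(r.cached_prefixes)),
    )
```

Under this toy model, a follow-up prompt that shares a prefix with an earlier request routes back to the replica that served it, which is exactly the effect the benchmark should quantify.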

Notes

  • Most impactful for workloads with high prompt similarity (e.g., RAG, customer support)
  • Requires router/gateway support (llm-d has built-in support)
  • Benchmarking should use realistic traffic patterns with varying cache hit rates
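
One way to get "varying cache hit rates" in the benchmark is to synthesize a request stream where a tunable fraction of prompts reuse a shared prefix, approximating RAG- or support-style workloads. This is a hypothetical sketch; the function name and parameters (`make_traffic`, `reuse_rate`) are illustrative, not part of any existing tool.

```python
import random

# Hypothetical traffic generator: emit token-ID prompts where `reuse_rate`
# controls the fraction of requests that share a common prefix (and hence
# the expected KV cache hit rate). Parameters are illustrative only.

def make_traffic(n_requests, reuse_rate, shared_prefix, rng=None):
    """Return a list of prompts (token-ID lists) for a benchmark run."""
    rng = rng or random.Random(0)
    prompts = []
    for _ in range(n_requests):
        # Each request gets a unique "user query" suffix.
        suffix = [rng.randrange(1000) for _ in range(8)]
        if rng.random() < reuse_rate:
            prompts.append(shared_prefix + suffix)  # cache-friendly request
        else:
            prompts.append(suffix)                  # cold request
    return prompts
```

Sweeping `reuse_rate` (e.g., 0.0, 0.5, 0.9) against each routing policy would give the latency/throughput curves the acceptance criteria call for.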
