
Benchmark and support KV cache-aware routing for multi-replica deployments #16

@anfredette

Description

llm-d experiments show that routing requests to the replica with the highest KV cache hit rate dramatically improves latency and throughput. We should measure this impact and incorporate it into recommendations.

Acceptance Criteria

  • Benchmark multi-replica deployments with and without KV cache-aware routing
  • Measure latency and throughput improvements
  • Update capacity planning logic to account for routing efficiency
  • Document routing strategies (round-robin vs cache-aware vs semantic routing)
  • Integrate with router/gateway selection (see #11, "Add llm-d as deployment target alongside KServe/vLLM")
  • Recommend KV cache-aware routing when beneficial
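
To make the round-robin vs cache-aware comparison concrete, here is a minimal, hypothetical sketch of both routing policies. It is not llm-d's actual implementation; all names (`Replica`, `route_cache_aware`, the prefix-overlap score) are illustrative, and the "KV cache" is modeled simply as the set of token prefixes a replica has already served.

```python
# Hypothetical sketch of cache-aware routing: score each replica by the
# longest token-prefix overlap between the incoming prompt and anything in
# its (simulated) KV cache, then route to the highest-scoring replica.
# All names and data structures here are illustrative, not llm-d's API.

def shared_prefix_len(a, b):
    """Length of the common prefix of two token-ID sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Replica:
    def __init__(self, name):
        self.name = name
        self.cached_prefixes = []  # token sequences currently "in" the KV cache

    def cache_score(self, prompt_tokens):
        """Best prefix overlap between the prompt and any cached sequence."""
        if not self.cached_prefixes:
            return 0
        return max(shared_prefix_len(p, prompt_tokens) for p in self.cached_prefixes)

    def serve(self, prompt_tokens):
        # Serving a request leaves its prefix in this replica's cache.
        self.cached_prefixes.append(prompt_tokens)

def route_round_robin(replicas, counter):
    """Baseline: ignore cache state, rotate through replicas."""
    return replicas[counter % len(replicas)]

def route_cache_aware(replicas, prompt_tokens):
    """Pick the replica with the highest cache overlap; break ties by
    preferring the replica with fewer cached entries (a load proxy)."""
    return max(
        replicas,
        key=lambda r: (r.cache_score(prompt_tokens), -len(r.cached_prefixes)),
    )
```

Under this toy model, a follow-up prompt that shares a prefix with an earlier request routes back to the replica that served it, which is exactly the effect the benchmark should quantify.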

Notes

  • Most impactful for workloads with high prompt similarity (e.g., RAG, customer support)
  • Requires router/gateway support (llm-d has built-in support)
  • Benchmarking should use realistic traffic patterns with varying cache hit rates
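
One way to get "varying cache hit rates" in the benchmark is to synthesize a request stream where a tunable fraction of prompts reuse a shared prefix, approximating RAG- or support-style workloads. This is a hypothetical sketch; the function name and parameters (`make_traffic`, `reuse_rate`) are illustrative, not part of any existing tool.

```python
import random

# Hypothetical traffic generator: emit token-ID prompts where `reuse_rate`
# controls the fraction of requests that share a common prefix (and hence
# the expected KV cache hit rate). Parameters are illustrative only.

def make_traffic(n_requests, reuse_rate, shared_prefix, rng=None):
    """Return a list of prompts (token-ID lists) for a benchmark run."""
    rng = rng or random.Random(0)
    prompts = []
    for _ in range(n_requests):
        # Each request gets a unique "user query" suffix.
        suffix = [rng.randrange(1000) for _ in range(8)]
        if rng.random() < reuse_rate:
            prompts.append(shared_prefix + suffix)  # cache-friendly request
        else:
            prompts.append(suffix)                  # cold request
    return prompts
```

Sweeping `reuse_rate` (e.g., 0.0, 0.5, 0.9) against each routing policy would give the latency/throughput curves the acceptance criteria call for.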
