Topology Aware Routing to lower KV Cache transfer latency #545

@RishabhSaini

Problem

Prefill and decode pods are currently selected independently, with no awareness of network locality. When the two pods land on different nodes, the KV cache transfer goes over the inter-node network (InfiniBand, up to 400 GB/s) rather than the intra-node interconnect (NVLink, up to 900 GB/s). The added transfer latency directly impacts TTFT SLO compliance.

Proposal

Add topology-aware prefill pod selection:

  • After selecting the decode pod, prefer prefill pods on the same node by giving them a higher score
  • Fall back to global selection if a non-colocated pod still has a higher score
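
The two rules above can be read as a single scoring pass: add a bonus to prefill candidates that share a node with the chosen decode pod, then take the best score overall. The following is a minimal sketch under stated assumptions, not the scheduler's actual code. The `Pod` type, the `base_score` field, the `select_prefill_pod` function, and the `colocation_bonus` value of 0.2 are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Hypothetical pod record; field names are assumptions for this sketch."""
    name: str
    node: str          # node the pod is scheduled on
    base_score: float  # score from the existing, topology-unaware scorers


def select_prefill_pod(decode_pod, prefill_pods, colocation_bonus=0.2):
    """Return the prefill candidate with the highest topology-adjusted score.

    Candidates on the decode pod's node get `colocation_bonus` (an assumed,
    tunable weight) added to their base score. Because the maximum is taken
    over all candidates, a pod on another node still wins whenever its base
    score exceeds the boosted colocated scores, which is the global fallback.
    """
    if not prefill_pods:
        raise ValueError("no prefill pods available")

    def adjusted(pod):
        bonus = colocation_bonus if pod.node == decode_pod.node else 0.0
        return pod.base_score + bonus

    return max(prefill_pods, key=adjusted)


decode = Pod("decode-0", "node-a", 0.0)
candidates = [
    Pod("prefill-0", "node-a", 0.60),  # colocated with the decode pod
    Pod("prefill-1", "node-b", 0.70),
    Pod("prefill-2", "node-c", 0.95),
]

# The colocated pod beats a slightly better remote pod once boosted.
print(select_prefill_pod(decode, candidates[:2]).name)  # prefill-0
# A much better remote pod still wins: the global fallback.
print(select_prefill_pod(decode, candidates).name)      # prefill-2
```

Expressing the preference as an additive bonus rather than a hard filter keeps the fallback implicit, so no separate code path is needed when no colocated prefill pod exists. How large the bonus should be relative to the other scorers is an open tuning question.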

Metadata

Labels

triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Projects

Status

Backlog
