What happened:
While doing performance experiments comparing Knative requests-based load balancing with llm-d load-aware routing, I have found some cases where the llm-d inference scheduler underperforms requests-based load balancing.
For example, with 8 replicas of Llama-3.3-70b-fp8 on 2xH100 per replica (TP=2), and with prefix caching and prefix-aware routing disabled (queue-scorer and kv-cache-scorer only, each with weight 1.0), running guidellm at ISL=1000 and OSL=1000 in concurrency mode with concurrency=800 gives slightly lower throughput with llm-d. The vllm:num_requests_waiting metric also shows quite a bit of queuing at some replicas throughout the test.

I believe this is because this workload is quite bursty: it is concurrency-based with fixed sequence lengths per request, so requests come in waves, and when many requests arrive simultaneously they all get routed to the same Pod.
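
As a toy illustration of that herding effect (not llm-d code, and assuming the load signal the scorers see lags behind the routing decisions): if a whole wave of requests is scored against the same, not-yet-updated snapshot of per-replica queue depth, every request in the wave picks the same "least loaded" replica.

```go
package main

import "fmt"

func main() {
	// Stale queue-depth snapshot seen by the scheduler for 4 replicas.
	queueDepth := []int{3, 5, 1, 4}
	wave := 100 // burst of near-simultaneous requests

	assigned := make([]int, len(queueDepth))
	for i := 0; i < wave; i++ {
		// Every request in the wave picks the argmin of the same snapshot,
		// because the snapshot is not refreshed between decisions.
		best := 0
		for r := range queueDepth {
			if queueDepth[r] < queueDepth[best] {
				best = r
			}
		}
		assigned[best]++
	}
	fmt.Println(assigned) // [0 0 100 0]: the whole wave lands on replica 2
}
```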
Please note that this is a corner case, picked because it is the best-case scenario for Knative. In cases where the sequence lengths are heterogeneous or prefix caching comes into play, llm-d significantly outperforms Knative.
What you expected to happen:
Ideally we would like to see on-par or better performance from llm-d-inference-scheduler's intelligent routing.
Anything else we need to know?:
Related issue: #228. An alternative solution may be to introduce a random filter that adds some randomness to the scheduler, e.g. pick 2 random endpoints and route to the better of the two (power of two choices); see the sketch below.
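
A minimal sketch of that power-of-two-choices idea, with hypothetical names (endpoint, pickPowerOfTwo) rather than the actual llm-d-inference-scheduler plugin API:

```go
package main

import (
	"fmt"
	"math/rand"
)

type endpoint struct {
	name       string
	queueDepth int
}

// pickPowerOfTwo samples two endpoints uniformly at random and returns the
// one with the lower queue depth ("power of two choices").
func pickPowerOfTwo(endpoints []endpoint, rng *rand.Rand) endpoint {
	a := endpoints[rng.Intn(len(endpoints))]
	b := endpoints[rng.Intn(len(endpoints))]
	if b.queueDepth < a.queueDepth {
		return b
	}
	return a
}

func main() {
	pods := []endpoint{{"pod-0", 3}, {"pod-1", 5}, {"pod-2", 1}, {"pod-3", 4}}
	rng := rand.New(rand.NewSource(42))

	counts := map[string]int{}
	for i := 0; i < 100; i++ {
		counts[pickPowerOfTwo(pods, rng).name]++
	}
	// Unlike argmin over a stale snapshot, the burst spreads across pods.
	fmt.Println(counts)
}
```

In the scheduler this could take the form of a filter that narrows the candidate set to two randomly sampled endpoints before the existing scorers pick between them.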
Environment:
llm-d 0.2.0, installed with the llm-d-infra quickstart/examples, including inference-scheduling and precise-prefix-cache-aware.
I will continue experimenting to better understand the root cause of the issue and to test some potential improvements, but I wanted to open this issue to get the discussion started!