What happened:
While doing performance experiments comparing Knative requests-based load balancing with llm-d load-aware routing, I have found some cases where the llm-d inference scheduler underperforms requests-based load balancing.
For example, with 8 replicas of Llama-3.3-70b-fp8 on 2xH100 per replica (TP=2), and with prefix caching and prefix-aware routing disabled (queue-scorer and kv-cache-scorer only, each with weight 1.0), running guidellm at ISL=1000 and OSL=1000 in concurrency mode with concurrency=800 gives slightly lower throughput with llm-d. The vllm:num_requests_waiting metric also shows quite a bit of queuing at some replicas throughout the test.

I believe this is because this workload is quite bursty: it is concurrency-based with fixed sequence lengths per request, so requests come in waves, and when many requests arrive simultaneously they all get routed to the same Pod.
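
As a toy illustration of that herding effect (not llm-d code, and assuming the load signal the scorers see lags behind the routing decisions): if a whole wave of requests is scored against the same, not-yet-updated snapshot of per-replica queue depth, every request in the wave picks the same "least loaded" replica.

```go
package main

import "fmt"

func main() {
	// Stale queue-depth snapshot seen by the scheduler for 4 replicas.
	queueDepth := []int{3, 5, 1, 4}
	wave := 100 // burst of near-simultaneous requests

	assigned := make([]int, len(queueDepth))
	for i := 0; i < wave; i++ {
		// Every request in the wave picks the argmin of the same snapshot,
		// because the snapshot is not refreshed between decisions.
		best := 0
		for r := range queueDepth {
			if queueDepth[r] < queueDepth[best] {
				best = r
			}
		}
		assigned[best]++
	}
	fmt.Println(assigned) // [0 0 100 0]: the whole wave lands on replica 2
}
```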
Please note that this is a corner case, picked because it is the best-case scenario for Knative. In cases where the sequence lengths are heterogeneous or prefix caching comes into play, llm-d significantly outperforms Knative.
What you expected to happen:
Ideally we would like to see on-par or better performance from llm-d-inference-scheduler's intelligent routing.
Anything else we need to know?:
Related issue: #228. An alternative solution may be to introduce a random filter that adds some randomness to the scheduler, e.g. pick 2 random endpoints and route to the better of the two (power of two choices); see the sketch below.
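
A minimal sketch of that power-of-two-choices idea, with hypothetical names (endpoint, pickPowerOfTwo) rather than the actual llm-d-inference-scheduler plugin API:

```go
package main

import (
	"fmt"
	"math/rand"
)

type endpoint struct {
	name       string
	queueDepth int
}

// pickPowerOfTwo samples two endpoints uniformly at random and returns the
// one with the lower queue depth ("power of two choices").
func pickPowerOfTwo(endpoints []endpoint, rng *rand.Rand) endpoint {
	a := endpoints[rng.Intn(len(endpoints))]
	b := endpoints[rng.Intn(len(endpoints))]
	if b.queueDepth < a.queueDepth {
		return b
	}
	return a
}

func main() {
	pods := []endpoint{{"pod-0", 3}, {"pod-1", 5}, {"pod-2", 1}, {"pod-3", 4}}
	rng := rand.New(rand.NewSource(42))

	counts := map[string]int{}
	for i := 0; i < 100; i++ {
		counts[pickPowerOfTwo(pods, rng).name]++
	}
	// Unlike argmin over a stale snapshot, the burst spreads across pods.
	fmt.Println(counts)
}
```

In the scheduler this could take the form of a filter that narrows the candidate set to two randomly sampled endpoints before the existing scorers pick between them.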
Environment:
llm-d 0.2.0, installed with the llm-d-infra quickstart/examples, including inference-scheduling and precise-prefix-cache-aware.
I will continue experimenting to better understand the root cause of the issue and to test some potential improvements, but I wanted to open this issue to get the discussion started!