Poor performance with bursty workloads #298

@dagrayvid

What happened:
While running performance experiments comparing Knative requests-based load balancing with llm-d load-aware routing, I found some cases where the llm-d inference scheduler underperforms requests-based load balancing.

For example, take 8 replicas of Llama-3.3-70b-fp8, each on 2xH100 (TP=2), with prefix caching and prefix-aware routing disabled (queue-scorer and kv-cache-scorer only, each with weight 1.0). If we run guidellm in concurrency mode with concurrency=800, an input sequence length (ISL) of 1000, and an output sequence length (OSL) of 1000, we get slightly lower throughput with llm-d:

Image

and significantly worse TTFT:
Image

The vllm:num_requests_waiting metric shows substantial queuing at some replicas throughout the test:
Image

I believe this is because the workload is quite bursty: it is concurrency-based with fixed sequence lengths per request, so requests arrive in waves, and when many requests arrive simultaneously they get routed to the same Pod.
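
To illustrate the suspected effect, here is a minimal sketch, not the scheduler's actual code. It assumes (unconfirmed) that every request in a burst is scored against the same snapshot of queue depths, so a greedy "pick the least-loaded Pod" rule keeps returning the same Pod. The pod names, queue depths, and the function `route_burst_greedy` are hypothetical.

```python
from collections import Counter


def route_burst_greedy(snapshot, burst_size):
    """Route each request in a burst to the pod with the shortest queue.

    `snapshot` maps pod name -> queue depth and is deliberately NOT updated
    between requests; this models the assumption that load metrics lag
    behind the routing decisions made during a burst.
    """
    assignments = Counter()
    for _ in range(burst_size):
        # Same stale snapshot every time, so the winner never changes.
        target = min(snapshot, key=snapshot.get)
        assignments[target] += 1
    return assignments


# Hypothetical queue depths for 8 replicas.
snapshot = {f"pod-{i}": q for i, q in enumerate([3, 2, 4, 2, 5, 3, 4, 3])}
result = route_burst_greedy(snapshot, 100)
print(result)
```

Under that assumption the whole burst lands on a single Pod, which would be consistent with the uneven vllm:num_requests_waiting values above.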

Please note that this is a corner case, picked because it is the best-case scenario for Knative. In cases where the sequence lengths are heterogeneous or prefix caching comes into play, llm-d significantly outperforms Knative.

What you expected to happen:
Ideally, llm-d-inference-scheduler intelligent routing would perform on par with or better than requests-based load balancing.

Anything else we need to know?:
Related issue: #228. An alternative solution may be to introduce a random filter that adds some randomness to the scheduler, e.g. pick 2 random endpoints and route to the better of the two (power of two choices).
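
A rough sketch of the power-of-two-choices idea, for discussion only. It is not an existing llm-d filter; `pick_power_of_two`, `route_burst`, the pod names, and the queue depths are all hypothetical. As a worst case, it assumes the load values stay stale for the whole burst.

```python
import random
from collections import Counter


def pick_power_of_two(load, rng):
    """Sample two distinct endpoints uniformly and return the less loaded one."""
    a, b = rng.sample(sorted(load), 2)
    return a if load[a] <= load[b] else b


def route_burst(load, burst_size, rng):
    """Route a burst of requests without refreshing `load` in between."""
    assignments = Counter()
    for _ in range(burst_size):
        assignments[pick_power_of_two(load, rng)] += 1
    return assignments


# Hypothetical, distinct queue depths for 8 replicas.
load = {f"pod-{i}": q for i, q in enumerate([3, 2, 4, 1, 5, 6, 7, 8])}
result = route_burst(load, 800, random.Random(0))
print(result.most_common())
```

Even with stale load values, the least-loaded endpoint only wins when it is one of the two sampled candidates (a 2-in-8 chance here), so a burst spreads across several endpoints instead of piling onto one. The most-loaded endpoint is never chosen, since it loses every comparison.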

Environment:
llm-d 0.2.0 installed with llm-d-infra quickstart/examples, including inference-scheduling and precise-prefix-cache-aware.

I will continue experimenting to better understand the root cause of the issue and to test some potential improvements, but I wanted to open this issue to get the discussion started!

Labels: kind/bug, needs-triage
