Description
What happened:
According to the documentation, by default the seed is different for each instance because it is taken from the nanoseconds part of the time the instance is started.
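As I understand it, that kind of nanosecond-based seeding would look roughly like the Go sketch below (illustrative only, not the simulator's actual code); two instances started at different instants should get different seeds and therefore different random latency sequences:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Seed derived from the nanosecond-resolution start time,
	// so each process start normally yields a different sequence.
	seed := time.Now().UnixNano()
	r := rand.New(rand.NewSource(seed))
	fmt.Println(seed, r.NormFloat64())
}
```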
I'm running 3 instances and I'm getting exactly the same TTFT for all 3. More precisely, using the following PromQL query:
histogram_quantile(0.3,
sum by(le, instance) (
rate(vllm:time_to_first_token_seconds_bucket[30s])
)
)
I get the same values, or values with only minimal differences. I have tried changing the percentile and the time window, but I always get the same TTFT values across all instances. How is this possible?
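The only explanation I can think of is that all instances somehow ended up with the same seed: with Go's math/rand, two generators seeded identically produce exactly the same sample stream, which would make the per-instance TTFT histograms match. A quick hypothetical check:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	a := rand.New(rand.NewSource(42))
	b := rand.New(rand.NewSource(42))
	for i := 0; i < 3; i++ {
		// Prints "true" every time: identical seeds give identical samples.
		fmt.Println(a.NormFloat64() == b.NormFloat64())
	}
}
```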
What you expected to happen:
Different TTFT values per instance over the last X seconds
How to reproduce it (as minimally and precisely as possible):
I'm running 3 instances with the following parameters:
- args:
- --model
- TinyLlama/TinyLlama-1.1B-Chat-v1.0
#- --max-model-len
#- "2048"
- --served-model-name=HighEndLLM
- --port
- "8000"
- --mode=random
- --time-to-first-token=5000
- --enable-kvcache
- --max-num-seqs=25
- --time-factor-under-load=3
- --inter-token-latency=100
# only if prefill/decode disaggregation enabled
#- --kv-cache-transfer-latency=10
# can't be more than 30%
- --time-to-first-token-std-dev=1500
- --inter-token-latency-std-dev=30
#- --kv-cache-transfer-time-std-dev=3
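For context, my understanding of the latency knobs above (a hypothetical sketch, not the simulator's actual implementation) is that each request's TTFT is drawn from a normal distribution with mean --time-to-first-token and standard deviation --time-to-first-token-std-dev:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// sampleTTFT draws a TTFT from N(meanMs, stdDevMs), clamped at zero.
// With the flags above: meanMs = 5000, stdDevMs = 1500.
func sampleTTFT(r *rand.Rand, meanMs, stdDevMs float64) time.Duration {
	ms := meanMs + stdDevMs*r.NormFloat64()
	if ms < 0 {
		ms = 0
	}
	return time.Duration(ms) * time.Millisecond
}

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	fmt.Println(sampleTTFT(r, 5000, 1500))
}
```

With a different seed per instance, these draws should differ across instances, which is why identical percentiles are so surprising.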
And sending the following workload:
ab -v 1 -n 10000 -c 200 -T application/json -p /tmp/request.json http://$m/v1/completions
Anything else we need to know?:
Thanks!
Environment:
ghcr.io/llm-d/llm-d-inference-sim:v0.6.1