
Commit 28f880b

Add doc for prefill-overhead
Signed-off-by: Qifan Deng <[email protected]>
1 parent 679e6ff

2 files changed: +33 -17 lines
README.md

Lines changed: 16 additions & 7 deletions
````diff
@@ -48,13 +48,13 @@ API responses contains a subset of the fields provided by the OpenAI API.
 <summary>Click to show the structure of requests/responses</summary>
 
 - `/v1/chat/completions`
-  - **request**
+  - request
     - stream
     - model
     - messages
       - role
       - content
-  - **response**
+  - response
     - id
     - created
     - model
@@ -63,19 +63,19 @@ API responses contains a subset of the fields provided by the OpenAI API.
     - finish_reason
     - message
 - `/v1/completions`
-  - **request**
+  - request
     - stream
     - model
     - prompt
     - max_tokens (for future usage)
-  - **response**
+  - response
     - id
     - created
     - model
     - choices
       - text
 - `/v1/models`
-  - **response**
+  - response
     - object (list)
     - data
       - id
@@ -107,6 +107,15 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 - `inter-token-latency-std-dev`: standard deviation for time between generated tokens, in milliseconds, optional, default is 0, can't be more than 30% of `inter-token-latency`, will not cause the actual inter token latency to differ by more than 70% from `inter-token-latency`
 - `kv-cache-transfer-latency`: time for KV-cache transfer from a remote vLLM (in milliseconds), by default zero. Usually much shorter than `time-to-first-token`
 - `kv-cache-transfer-latency-std-dev`: standard deviation for time to "transfer" kv-cache from another vLLM instance in case P/D is activated, in milliseconds, optional, default is 0, can't be more than 30% of `kv-cache-transfer-latency`, will not cause the actual latency to differ by more than 70% from `kv-cache-transfer-latency`
+- `prefill-overhead`: The base overhead in milliseconds for prefilling a single token. This value, in conjunction with `prefill-complexity` and `prefill-overhead-std-dev`, determines the overall time taken to prefill the entire context. It's an optional parameter with a default of `0` and is ignored if `time-to-first-token` is not `0`.
+- `prefill-complexity`: Defines how the prefill time scales with the number of prompt tokens. This is required if `prefill-overhead` is used. Options are `"n^2"` and `"nlog(n)"`, with a default of `"n^2"`.
+- `prefill-overhead-std-dev`: The standard deviation in milliseconds for the time taken before the first token is returned. This is required if `prefill-overhead` is used, with a default of `0`.
+
+---
+
+- `kv-cache-transfer-overhead`: The base overhead in milliseconds for transferring the KV-cache of a single token from another vLLM instance when P/D is activated. Along with `kv-cache-transfer-complexity` and `kv-cache-transfer-overhead-std-dev`, it defines the total time for the KV-cache transfer of the entire context. This parameter is optional with a default of `0` and is ignored if `kv-cache-transfer-latency` is not `0`.
+- `kv-cache-transfer-complexity`: The complexity of the KV-cache transfer relative to the number of prompt tokens. This is required if `kv-cache-transfer-overhead` is used. Options are `"linear"` and `"in-place"`, with a default of `"linear"`.
+- `kv-cache-transfer-overhead-std-dev`: The standard deviation in milliseconds for the time taken to transfer the KV-cache. This is required if `kv-cache-transfer-overhead` is used, with a default of `0`.
 - `seed`: random seed for operations (if not set, current Unix time in nanoseconds is used)
 - `max-tool-call-integer-param`: the maximum possible value of integer parameters in a tool call, optional, defaults to 100
 - `min-tool-call-integer-param`: the minimum possible value of integer parameters in a tool call, optional, defaults to 0
@@ -174,9 +183,9 @@ To run the vLLM Simulator image under Docker, run:
 ```bash
 docker run --rm --publish 8000:8000 ghcr.io/llm-d/llm-d-inference-sim:dev --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora-modules '{"name":"tweet-summary-0"}' '{"name":"tweet-summary-1"}'
 ```
-**Note:** To run the vLLM Simulator with the latest release version, in the above docker command replace `dev` with the current release which can be found on [GitHub](https://github.com/llm-d/llm-d-inference-sim/pkgs/container/llm-d-inference-sim).
+Note: To run the vLLM Simulator with the latest release version, in the above docker command replace `dev` with the current release which can be found on [GitHub](https://github.com/llm-d/llm-d-inference-sim/pkgs/container/llm-d-inference-sim).
 
-**Note:** The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
+Note: The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
 
 ## Standalone testing
 
````
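The new README entries describe prefill time as a per-token overhead scaled by a complexity policy. As a rough illustration only (this commit adds documentation, not the computation itself), the sketch below shows one plausible way the `"n^2"` and `"nlog(n)"` options could combine with `prefill-overhead`; the function `prefillTime` and the exact formula are assumptions, not the simulator's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// prefillTime is a hypothetical sketch of how a per-token prefill overhead (in ms)
// could be scaled by the chosen complexity policy to yield a total prefill delay.
// Illustrative only; not taken from llm-d-inference-sim.
func prefillTime(overheadMs float64, complexity string, promptTokens int) float64 {
	n := float64(promptTokens)
	switch complexity {
	case "nlog(n)":
		return overheadMs * n * math.Log2(n+1)
	default: // "n^2" is the documented default
		return overheadMs * n * n
	}
}

func main() {
	// Compare the two policies for a 128-token prompt with a 1 ms base overhead.
	fmt.Printf("n^2:     %.0f ms\n", prefillTime(1, "n^2", 128))
	fmt.Printf("nlog(n): %.0f ms\n", prefillTime(1, "nlog(n)", 128))
}
```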
pkg/common/config.go

Lines changed: 17 additions & 10 deletions
```diff
@@ -66,13 +66,17 @@ type Configuration struct {
 	// cause the actual time to first token to differ by more than 70% from TimeToFirstToken
 	TimeToFirstTokenStdDev int `yaml:"time-to-first-token-std-dev" json:"time-to-first-token-std-dev"`
 
-	// PrefillOverhead time taken to prefill the context, in milliseconds
-	// PrefillOverhead along with PrefillComplexity defines the time taken to prefill the context
+	// PrefillOverhead the base overhead for prefill of one token, in milliseconds;
+	// in conjunction with PrefillComplexity and PrefillOverheadStdDev
+	// this value defines the time taken to prefill the whole context
 	PrefillOverhead int `yaml:"prefill-overhead" json:"prefill-overhead"`
-	// PrefillOverheadStdDev similar to TimeToFirstTokenStdDev
-	PrefillOverheadStdDev int `yaml:"prefill-overhead-std-dev" json:"prefill-overhead-std-dev"`
-	// options are "n^2" and "nlog(n)"
+	// PrefillComplexity defines how prefill time scales with the number of prompt tokens,
+	// options are "n^2" and "nlog(n)", default is "n^2"
 	PrefillComplexity string `yaml:"prefill-complexity" json:"prefill-complexity"`
+	// PrefillOverheadStdDev standard deviation for the time before the first token is returned,
+	// in milliseconds, required if PrefillOverhead is used, default is 0, the range depends on
+	// the implementation of the policy, see PrefillComplexity
+	PrefillOverheadStdDev int `yaml:"prefill-overhead-std-dev" json:"prefill-overhead-std-dev"`
 
 	// InterTokenLatency time between generated tokens, in milliseconds
 	InterTokenLatency int `yaml:"inter-token-latency" json:"inter-token-latency"`
@@ -89,14 +93,17 @@ type Configuration struct {
 	// KVCacheTransferLatency
 	KVCacheTransferLatencyStdDev int `yaml:"kv-cache-transfer-latency-std-dev" json:"kv-cache-transfer-latency-std-dev"`
 
-	// KVCacheTransfer overhead time taken to transfer kv-cache from another vLLM instance in case P/D is activated,
-	// in milliseconds.
-	// KVCacheTransferOverhead along with KVCacheTransferComplexity defines the time taken to transfer kv-cache.
+	// KVCacheTransferOverhead time taken to transfer the kv-cache of one token from another vLLM instance in case P/D is activated,
+	// in milliseconds; in conjunction with KVCacheTransferComplexity and KVCacheTransferOverheadStdDev it defines the time taken to transfer the kv-cache for the whole context
 	KVCacheTransferOverhead int `yaml:"kv-cache-transfer-overhead" json:"kv-cache-transfer-overhead"`
-	// KVCacheTransferOverheadStdDev similar to TimeToFirstTokenStdDev
-	KVCacheTransferOverheadStdDev int `yaml:"kv-cache-transfer-overhead-std-dev" json:"kv-cache-transfer-overhead-std-dev"`
+	// KVCacheTransferComplexity the complexity of kv-cache transfer against the number of prompt tokens,
 	// options are "linear" and "in-place", default is "linear"
 	KVCacheTransferComplexity string `yaml:"kv-cache-transfer-complexity" json:"kv-cache-transfer-complexity"`
+	// KVCacheTransferOverheadStdDev standard deviation for the time taken to transfer kv-cache
+	// from another vLLM instance in case P/D is activated, in milliseconds,
+	// required if KVCacheTransferOverhead is used, default is 0, the range depends on
+	// the implementation of the policy, see KVCacheTransferComplexity
+	KVCacheTransferOverheadStdDev int `yaml:"kv-cache-transfer-overhead-std-dev" json:"kv-cache-transfer-overhead-std-dev"`
 
 	// Mode defines the simulator response generation mode, valid values: echo, random
 	Mode string `yaml:"mode" json:"mode"`
```
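In the same spirit, here is a hedged sketch of how `KVCacheTransferComplexity` might differentiate the two documented options: `"linear"` grows with the number of prompt tokens, while `"in-place"` keeps a constant cost. The helper name and formula are assumptions for illustration, not the repository's implementation.

```go
package main

import "fmt"

// kvCacheTransferTime is a hypothetical sketch: it scales the per-token transfer
// overhead (in ms) with the prompt length for "linear", and treats "in-place"
// as a constant cost. Not taken from llm-d-inference-sim.
func kvCacheTransferTime(overheadMs, promptTokens int, complexity string) int {
	if complexity == "in-place" {
		return overheadMs
	}
	// "linear" is the documented default
	return overheadMs * promptTokens
}

func main() {
	fmt.Println("linear,   256 tokens:", kvCacheTransferTime(2, 256, "linear"), "ms")
	fmt.Println("in-place, 256 tokens:", kvCacheTransferTime(2, 256, "in-place"), "ms")
}
```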
