README.md (16 additions, 7 deletions)
API responses contain a subset of the fields provided by the OpenAI API.

<summary>Click to show the structure of requests/responses</summary>

- `/v1/chat/completions`
  - request
    - stream
    - model
    - messages
      - role
      - content
  - response
    - id
    - created
    - model
    - choices
      - index
      - finish_reason
      - message
- `/v1/completions`
  - request
    - stream
    - model
    - prompt
    - max_tokens (for future usage)
  - response
    - id
    - created
    - model
    - choices
      - text
- `/v1/models`
  - response
    - object (list)
    - data
      - id
For more details see the vLLM documentation at <https://docs.vllm.ai/en/stable/getting_started>.

- `inter-token-latency-std-dev`: standard deviation of the time between generated tokens, in milliseconds; optional, default is 0. It cannot exceed 30% of `inter-token-latency`, and the actual inter-token latency will not differ from `inter-token-latency` by more than 70%.
- `kv-cache-transfer-latency`: time to transfer the KV-cache from a remote vLLM instance, in milliseconds; optional, default is 0. Usually much shorter than `time-to-first-token`.
- `kv-cache-transfer-latency-std-dev`: standard deviation of the time to "transfer" the KV-cache from another vLLM instance when P/D is activated, in milliseconds; optional, default is 0. It cannot exceed 30% of `kv-cache-transfer-latency`, and the actual latency will not differ from `kv-cache-transfer-latency` by more than 70%.
- `prefill-overhead`: base overhead for prefilling a single token, in milliseconds. Together with `prefill-complexity` and `prefill-overhead-std-dev`, it determines the total time taken to prefill the entire context; optional, default is 0; ignored if `time-to-first-token` is not 0.
- `prefill-complexity`: defines how the prefill time scales with the number of prompt tokens; required if `prefill-overhead` is used. Options are `"n^2"` and `"nlog(n)"`; default is `"n^2"`.
- `prefill-overhead-std-dev`: standard deviation of the time taken before the first token is returned, in milliseconds; required if `prefill-overhead` is used; default is 0.
- `kv-cache-transfer-overhead`: base overhead for transferring the KV-cache of a single token from another vLLM instance when P/D is activated, in milliseconds. Together with `kv-cache-transfer-complexity` and `kv-cache-transfer-overhead-std-dev`, it defines the total time for transferring the KV-cache of the entire context; optional, default is 0; ignored if `kv-cache-transfer-latency` is not 0.
- `kv-cache-transfer-complexity`: complexity of the KV-cache transfer relative to the number of prompt tokens; required if `kv-cache-transfer-overhead` is used. Options are `"linear"` and `"in-place"`; default is `"linear"`.
- `kv-cache-transfer-overhead-std-dev`: standard deviation of the KV-cache transfer time, in milliseconds; required if `kv-cache-transfer-overhead` is used; default is 0.
- `seed`: random seed for operations (if not set, the current Unix time in nanoseconds is used)
- `max-tool-call-integer-param`: the maximum possible value of integer parameters in a tool call; optional, defaults to 100
- `min-tool-call-integer-param`: the minimum possible value of integer parameters in a tool call; optional, defaults to 0
To run the vLLM Simulator image under Docker, run:

Note: To run the vLLM Simulator with the latest release version, replace `dev` in the above docker command with the current release, which can be found on [GitHub](https://github.com/llm-d/llm-d-inference-sim/pkgs/container/llm-d-inference-sim).

Note: The above command exposes the simulator on port 8000 and serves the Qwen/Qwen2.5-1.5B-Instruct model.
The corresponding field comment in the simulator's Go configuration reads:

```go
// KVCacheTransferOverhead time taken to transfer kv-cache from another vLLM instance in case P/D is activated, for one token,
// in milliseconds; in conjunction with KVCacheTransferComplexity and KVCacheTransferOverheadStdDev, defines the time taken to transfer the kv-cache for the whole context.
```