
Commit 20a6580

Authored by Shern Shiou Tan <shernshiou@gmail.com>

[Feat] Include runner and convert flag (#803)

* feat: Include runner and convert flag
* chores: Add validation and description of runner and convert
* docs: Include runner and convert flag at helm README
* feat: Move runner and convert to vLLM Configuration

Signed-off-by: Shern Shiou Tan <shernshiou@gmail.com>

1 parent fd69fbc commit 20a6580

File tree

4 files changed: +20 -0 lines changed


helm/README.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -132,6 +132,8 @@ This table documents all available configuration values for the Production Stack
 | `servingEngineSpec.modelSpec[].vllmConfig.maxNumSeqs` | integer | `256` | Maximum number of sequences to be processed in a single iteration |
 | `servingEngineSpec.modelSpec[].vllmConfig.maxLoras` | integer | `0` | The maximum number of LoRA models to be loaded in a single batch |
 | `servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization` | number | `0.9` | The fraction of GPU memory to be used for the model executor (0-1) |
+| `servingEngineSpec.modelSpec[].vllmConfig.runner` | string | `""` | The runner type for the model, can be "auto" or "pooling" |
+| `servingEngineSpec.modelSpec[].vllmConfig.convert` | string | `""` | The conversion type for the model, can be "token_embed", "embed", "token_classify", "classify", or "score" |
 | `servingEngineSpec.modelSpec[].vllmConfig.extraArgs` | list | `["--disable-log-requests"]` | Extra command line arguments to pass to vLLM |

 #### LMCache Configuration
```

helm/templates/deployment-vllm-multi.yaml

Lines changed: 8 additions & 0 deletions

```diff
@@ -168,6 +168,14 @@ spec:
 - "--max_loras"
 - {{ .maxLoras | quote }}
 {{- end }}
+{{- if hasKey . "runner" }}
+- "--runner"
+- {{ .runner | quote }}
+{{- end }}
+{{- if hasKey . "convert" }}
+- "--convert"
+- {{ .convert | quote }}
+{{- end }}
 {{- if .extraArgs }}
 {{- range .extraArgs }}
 - {{ . | quote }}
```
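The template change above guards each new flag with `hasKey`, so `--runner` and `--convert` are only appended when the key exists in `vllmConfig`. That assembly logic can be mirrored in plain Python for illustration; `build_vllm_args` is a hypothetical helper, not part of the chart:

```python
def build_vllm_args(vllm_config: dict) -> list:
    """Mirror the Helm template's conditional argument assembly:
    each optional key is emitted as a CLI flag pair only when present,
    matching the `{{- if hasKey . "runner" }}` style guards."""
    args = []
    if "maxLoras" in vllm_config:
        args += ["--max_loras", str(vllm_config["maxLoras"])]
    if "runner" in vllm_config:
        args += ["--runner", vllm_config["runner"]]
    if "convert" in vllm_config:
        args += ["--convert", vllm_config["convert"]]
    # extraArgs are appended verbatim, after the structured flags
    args += vllm_config.get("extraArgs", [])
    return args

print(build_vllm_args({"runner": "pooling", "convert": "embed"}))
# → ['--runner', 'pooling', '--convert', 'embed']
```

An absent key produces no flag at all, so existing deployments that never set `runner` or `convert` render exactly the same container args as before this commit.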

helm/values.schema.json

Lines changed: 8 additions & 0 deletions

```diff
@@ -208,6 +208,14 @@
 "dtype": {
 "type": "string"
 },
+"runner": {
+"type": "string",
+"enum": ["auto", "pooling"]
+},
+"convert": {
+"type": "string",
+"enum": ["token_embed", "embed", "token_classify", "classify", "score"]
+},
 "extraArgs": {
 "type": "array",
 "items": {
```
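The schema change constrains both fields to fixed enums, so invalid values are rejected at install time rather than surfacing as vLLM startup errors. A minimal stand-alone sketch of the same enum check (`validate_vllm_config` is hypothetical, not chart code):

```python
# Enum values copied from the values.schema.json fragment above.
ALLOWED = {
    "runner": {"auto", "pooling"},
    "convert": {"token_embed", "embed", "token_classify", "classify", "score"},
}

def validate_vllm_config(cfg: dict) -> list:
    """Return a list of error messages; an empty list means the config
    passes the same enum constraints the JSON schema enforces."""
    errors = []
    for key, allowed in ALLOWED.items():
        if key in cfg and cfg[key] not in allowed:
            errors.append(f"{key}: {cfg[key]!r} not one of {sorted(allowed)}")
    return errors

print(validate_vllm_config({"runner": "pooling", "convert": "embed"}))  # → []
print(validate_vllm_config({"runner": "gpu"}))
```

In the real chart this validation is performed by Helm against values.schema.json, so a typo such as `runner: "gpu"` fails `helm install` before any pods are created.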

helm/values.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -82,6 +82,8 @@ servingEngineSpec:
 # - maxNumSeqs: (optional, int) Maximum number of sequences to be processed in a single iteration., e.g., 32
 # - maxLoras: (optional, int) The maximum number of LoRA models to be loaded in a single batch, e.g., 4
 # - gpuMemoryUtilization: (optional, float) The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. e.g., 0.95
+# - runner: (optional, string) The runner type for the model, can be "auto" or "pooling". e.g., "pooling"
+# - convert: (optional, string) The conversion type for the model, can be "token_embed", "embed", "token_classify", "classify", or "score". e.g., "embed"
 # - extraArgs: (optional, list) Extra command line arguments to pass to vLLM, e.g., ["--disable-log-requests"]
 #
 # - lmcacheConfig: (optional, map) The configuration of the LMCache for KV offloading, supported options are:
```
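Putting the two new keys together, a `modelSpec` entry using them might look like the following sketch (the model name is illustrative and other required `modelSpec` fields are omitted; only `runner` and `convert` come from this commit):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "embedding-model"   # illustrative; other required modelSpec fields omitted
      vllmConfig:
        runner: "pooling"       # one of "auto", "pooling"
        convert: "embed"        # one of "token_embed", "embed", "token_classify", "classify", "score"
```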
