
Commit cdacc5c

add guide, update helm charts and readme, minor scorer changes
1 parent 04db6e9 commit cdacc5c

6 files changed: +278 −4 lines changed

config/charts/inferencepool/README.md

Lines changed: 40 additions & 0 deletions

@@ -121,6 +121,46 @@ $ helm install triton-llama3-8b-instruct \
     oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
 ```
 
+### Install with SLO-Aware Routing
+
+To enable SLO-aware routing, you must enable the latency predictor, which is deployed as a set of sidecar containers alongside the Endpoint Picker. When the latency predictor is enabled, the `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
+
+You can enable the latency predictor by setting `inferenceExtension.latencyPredictor.enabled` to `true` in your `values.yaml` file, or by using the `--set` flag on the command line.
+
+Here is an example of how to install the chart with SLO-aware routing enabled:
+
+```txt
+$ helm install vllm-llama3-8b-instruct . \
+    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+    --set inferenceExtension.latencyPredictor.enabled=true \
+    --set provider.name=gke
+```
+
+#### SLO-Aware Router Environment Variables
+
+The behavior of the SLO-aware router can be fine-tuned using the following environment variables in the Endpoint Picker deployment. These can be set under `inferenceExtension.env` in your `values.yaml` file.
+
+| Environment Variable | Description | Default |
+| -------------------- | ----------- | ------- |
+| `SAMPLING_MEAN` | The sampling mean (lambda) for the Poisson distribution of token sampling. | `100.0` |
+| `MAX_SAMPLED_TOKENS` | The maximum number of tokens to sample for TPOT prediction. | `20` |
+| `SLO_BUFFER_FACTOR` | A buffer to apply to the SLO to make it more or less strict. | `1.0` |
+| `NEG_HEADROOM_TTFT_WEIGHT` | The weight to give to the TTFT when a pod has negative headroom. | `0.8` |
+| `NEG_HEADROOM_TPOT_WEIGHT` | The weight to give to the TPOT when a pod has negative headroom. | `0.2` |
+| `HEADROOM_TTFT_WEIGHT` | The weight to give to the TTFT when a pod has positive headroom. | `0.8` |
+| `HEADROOM_TPOT_WEIGHT` | The weight to give to the TPOT when a pod has positive headroom. | `0.2` |
+| `HEADROOM_SELECTION_STRATEGY` | The strategy to use for selecting a pod based on headroom. Options: `least`, `most`, `composite-least`, `composite-most`, `composite-only`. | `least` |
+| `COMPOSITE_KV_WEIGHT` | The weight to give to the KV cache utilization in the composite score. | `1` |
+| `COMPOSITE_QUEUE_WEIGHT` | The weight to give to the queue size in the composite score. | `1` |
+| `COMPOSITE_PREFIX_WEIGHT` | The weight to give to the prefix cache score in the composite score. | `1` |
+| `STICKY_EPSILON` | The probability of exploring a non-sticky pod. | `0.01` |
+| `NEG_HEADROOM_EPSILON` | The probability of exploring a pod with negative headroom. | `0.01` |
+| `AFFINITY_GATE_TAU` | The stickiness threshold for the affinity gate. | `0.80` |
+| `AFFINITY_GATE_TAU_GLOBAL` | The global stickiness threshold for the affinity gate. | `0.99` |
+| `POD_SELECTION_MODE` | The mode for selecting a pod from the weighted list. Options: `linear` (weighted random), `max` (argmax). | `linear` |
+
+**Note:** Enabling SLO-aware routing also exposes a number of Prometheus metrics for monitoring the feature, including actual vs. predicted latency, SLO violations, and more.
+
 ### Install with High Availability (HA)
 
 To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
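The same installation can also be driven from `values.yaml` instead of `--set` flags. A minimal sketch, assuming illustrative tuning values (not recommendations); note that the deployment template in this commit injects entries from `inferenceExtension.latencyPredictor.eppEnv` into the EPP container, so the router variables are placed there:

```yaml
# Sketch of a values.yaml fragment. The variable names come from the table
# above; the specific values here are illustrative only.
inferenceExtension:
  latencyPredictor:
    enabled: true
    eppEnv:
      SLO_BUFFER_FACTOR: "0.9"                       # slightly stricter than the default 1.0
      HEADROOM_SELECTION_STRATEGY: "composite-least"
      POD_SELECTION_MODE: "linear"
```

Helm merges this fragment with the chart defaults, so only the keys you override need to appear.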

config/charts/inferencepool/templates/epp-config.yaml

Lines changed: 37 additions & 4 deletions

@@ -11,7 +11,28 @@ data:
     - type: queue-scorer
     - type: kv-cache-utilization-scorer
     - type: prefix-cache-scorer
+    {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - type: slo-aware-routing
+    - type: slo-aware-profile-handler
+    - type: max-score-picker
+    {{- end }}
     schedulingProfiles:
+    {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - name: prefix
+      plugins:
+      - pluginRef: prefix-cache-scorer
+    - name: default
+      plugins:
+      - pluginRef: slo-aware-routing
+        weight: 0
+      - pluginRef: queue-scorer
+      - pluginRef: kv-cache-utilization-scorer
+      - pluginRef: max-score-picker
+    - name: slo
+      plugins:
+      - pluginRef: slo-aware-routing
+      - pluginRef: max-score-picker
+    {{- else }}
     - name: default
       plugins:
       - pluginRef: queue-scorer
@@ -20,17 +41,29 @@ data:
         weight: 2
       - pluginRef: prefix-cache-scorer
         weight: 3
+    {{- end }}
  {{- if (hasKey .Values.inferenceExtension "pluginsCustomConfig") }}
  {{- .Values.inferenceExtension.pluginsCustomConfig | toYaml | nindent 2 }}
  {{- end }}
-
 ---
-{{- if .Values.inferenceExtension.sidecar.enabled }}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: {{ .Values.inferenceExtension.sidecar.configMap.name }}
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
   namespace: {{ .Release.Namespace }}
 data:
-  {{- .Values.inferenceExtension.sidecar.configMap.data | toYaml | nindent 2 }}
+  {{- range $key, $value := .Values.inferenceExtension.latencyPredictor.trainingServer.config }}
+  {{ $key }}: {{ $value | quote }}
+  {{- end }}
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-prediction
+  namespace: {{ .Release.Namespace }}
+data:
+  {{- range $key, $value := .Values.inferenceExtension.latencyPredictor.predictionServers.config }}
+  {{ $key }}: {{ $value | quote }}
+  {{- end }}
 {{- end }}
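With `latencyPredictor.enabled` set, the plugin configuration above renders to plain YAML along these lines (a hand-rendered sketch of the template with the braces resolved, not output captured from a cluster):

```yaml
plugins:
- type: queue-scorer
- type: kv-cache-utilization-scorer
- type: prefix-cache-scorer
- type: slo-aware-routing
- type: slo-aware-profile-handler
- type: max-score-picker
schedulingProfiles:
- name: prefix
  plugins:
  - pluginRef: prefix-cache-scorer
- name: default
  plugins:
  - pluginRef: slo-aware-routing   # weight 0: scored but does not affect ranking
    weight: 0
  - pluginRef: queue-scorer
  - pluginRef: kv-cache-utilization-scorer
  - pluginRef: max-score-picker
- name: slo
  plugins:
  - pluginRef: slo-aware-routing   # here the SLO score drives pod selection
  - pluginRef: max-score-picker
```

Note that `slo-aware-routing` appears in the `default` profile with `weight: 0`, presumably so it can observe requests without influencing the default scoring, while the dedicated `slo` profile lets it drive selection.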

config/charts/inferencepool/templates/epp-deployment.yaml

Lines changed: 98 additions & 0 deletions

@@ -96,6 +96,9 @@ spec:
         {{- if gt (.Values.inferenceExtension.replicas | int) 1 }}
         - --ha-enable-leader-election
         {{- end }}
+        {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+        - --enable-latency-predictor
+        {{- end }}
         # Pass additional flags via the inferenceExtension.flags field in values.yaml.
         {{- range $key, $value := .Values.inferenceExtension.flags }}
         - --{{ $key }}
@@ -147,6 +150,20 @@ spec:
           valueFrom:
             fieldRef:
               fieldPath: metadata.namespace
+        {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+        - name: PREDICTION_SERVER_URL
+          value: "{{- $count := int .Values.inferenceExtension.latencyPredictor.predictionServers.count -}}
+            {{- $startPort := int .Values.inferenceExtension.latencyPredictor.predictionServers.startPort -}}
+            {{- range $i := until $count -}}
+            {{- if $i }},{{ end }}http://localhost:{{ add $startPort $i }}
+            {{- end }}"
+        - name: TRAINING_SERVER_URL
+          value: "http://localhost:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+        {{- range $key, $value := .Values.inferenceExtension.latencyPredictor.eppEnv }}
+        - name: {{ $key }}
+          value: {{ $value | quote }}
+        {{- end }}
+        {{- end }}
         {{- if .Values.inferenceExtension.tracing.enabled }}
         - name: OTEL_SERVICE_NAME
           value: "gateway-api-inference-extension"
@@ -177,13 +194,94 @@ spec:
         volumeMounts:
         - name: plugins-config-volume
           mountPath: "/config"
+      {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+      # Training Server Sidecar Container
+      - name: training-server
+        image: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.hub }}/{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.name }}:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.tag }}
+        imagePullPolicy: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.pullPolicy }}
+        ports:
+        - containerPort: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+          name: training-port
+        livenessProbe:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.livenessProbe | nindent 10 }}
+        readinessProbe:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.readinessProbe | nindent 10 }}
+        resources:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.resources | nindent 10 }}
+        envFrom:
+        - configMapRef:
+            name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
+        env:
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: SERVER_TYPE
+          value: "training"
+        volumeMounts:
+        - name: training-server-storage
+          mountPath: /models
+      {{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+      # Prediction Server Sidecar Container {{ add $i 1 }}
+      - name: prediction-server-{{ add $i 1 }}
+        image: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.hub }}/{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.name }}:{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.tag }}
+        imagePullPolicy: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.pullPolicy }}
+        command: ["uvicorn"]
+        args: ["prediction_server:app", "--host", "0.0.0.0", "--port", "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"]
+        ports:
+        - containerPort: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          name: predict-port-{{ add $i 1 }}
+        livenessProbe:
+          httpGet:
+            path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.httpGet.path }}
+            port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.initialDelaySeconds }}
+          periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.periodSeconds }}
+        readinessProbe:
+          httpGet:
+            path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.httpGet.path }}
+            port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.initialDelaySeconds }}
+          periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.periodSeconds }}
+          failureThreshold: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.failureThreshold }}
+        resources:
+          {{- toYaml $.Values.inferenceExtension.latencyPredictor.predictionServers.resources | nindent 10 }}
+        envFrom:
+        - configMapRef:
+            name: {{ include "gateway-api-inference-extension.name" $ }}-latency-predictor-prediction
+        env:
+        - name: PREDICT_PORT
+          value: "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: SERVER_TYPE
+          value: "prediction-{{ add $i 1 }}"
+        - name: TRAINING_SERVER_URL
+          value: "http://localhost:{{ $.Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+        volumeMounts:
+        - name: prediction-server-{{ add $i 1 }}-storage
+          mountPath: /server_models
+      {{- end }}
+      {{- end }}
       volumes:
       {{- if .Values.inferenceExtension.sidecar.volumes }}
       {{- tpl (toYaml .Values.inferenceExtension.sidecar.volumes) $ | nindent 6 }}
       {{- end }}
       - name: plugins-config-volume
         configMap:
           name: {{ include "gateway-api-inference-extension.name" . }}
+      {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+      - name: training-server-storage
+        emptyDir:
+          sizeLimit: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.volumeSize }}
+      {{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+      - name: prediction-server-{{ add $i 1 }}-storage
+        emptyDir:
+          sizeLimit: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.volumeSize }}
+      {{- end }}
+      {{- end }}
       {{- if .Values.inferenceExtension.affinity }}
       affinity:
         {{- toYaml .Values.inferenceExtension.affinity | nindent 8 }}
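To make the `PREDICTION_SERVER_URL` construction above concrete: the template joins one `localhost` URL per prediction server. With hypothetical values `predictionServers.count: 2`, `predictionServers.startPort: 8001`, and `trainingServer.port: 8000` (assumed here, not chart defaults), the EPP container env would render roughly as:

```yaml
# Hand-rendered sketch; the ports and count are assumed values.
- name: PREDICTION_SERVER_URL
  value: "http://localhost:8001,http://localhost:8002"
- name: TRAINING_SERVER_URL
  value: "http://localhost:8000"
```

Each prediction server also gets its own sidecar container (`prediction-server-1`, `prediction-server-2`, …) and a `PREDICT_PORT` env var; because all sidecars share the pod's network namespace, plain `localhost` URLs are sufficient.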

config/charts/inferencepool/templates/epp-service.yaml

Lines changed: 12 additions & 0 deletions

@@ -15,6 +15,18 @@ spec:
   - name: http-metrics
     protocol: TCP
     port: {{ .Values.inferenceExtension.metricsPort | default 9090 }}
+  {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+  - name: latency-predictor-training
+    protocol: TCP
+    port: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+    targetPort: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+  {{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+  - name: latency-predictor-{{ add $i 1 }}
+    protocol: TCP
+    port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+    targetPort: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+  {{- end }}
+  {{- end }}
   {{- with .Values.inferenceExtension.extraServicePorts }}
   {{- toYaml . | nindent 4 }}
   {{- end }}
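Under the same hypothetical values used above (training port 8000, two prediction servers starting at 8001), the Service would gain ports roughly as follows:

```yaml
# Hand-rendered sketch with assumed port values.
- name: latency-predictor-training
  protocol: TCP
  port: 8000
  targetPort: 8000
- name: latency-predictor-1
  protocol: TCP
  port: 8001
  targetPort: 8001
- name: latency-predictor-2
  protocol: TCP
  port: 8002
  targetPort: 8002
```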

site-src/guides/index.md

Lines changed: 6 additions & 0 deletions

@@ -274,6 +274,12 @@ Deploy the sample InferenceObjective which allows you to specify priority of requests
 
 --8<-- "site-src/_includes/bbr.md"
 
+### Next Steps: Advanced Features
+
+You have now deployed a basic Inference Gateway with a simple routing strategy. To explore more advanced features, such as SLO-aware routing, refer to the following guide:
+
+* [SLO-Aware Routing](./slo-aware-routing.md)
+
 ### Cleanup
 
 The following instructions assume you would like to clean up ALL resources that were created in this quickstart guide.
