
Commit 0a25e74

add guide, update helm charts and readme, minor scorer changes
1 parent b2a7d45 commit 0a25e74

File tree

7 files changed: +291 −3 lines changed

config/charts/inferencepool/README.md

Lines changed: 40 additions & 0 deletions
@@ -121,6 +121,46 @@ $ helm install triton-llama3-8b-instruct \
     oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
 ```
 
+### Install with SLO-Aware Routing
+
+To enable SLO-aware routing, you must enable the latency predictor, which is deployed as a set of sidecar containers alongside the Endpoint Picker. When the latency predictor is enabled, the `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
+
+You can enable the latency predictor by setting `inferenceExtension.latencyPredictor.enabled` to `true` in your `values.yaml` file, or by using the `--set` flag on the command line.
+
+Here is an example of how to install the chart with SLO-aware routing enabled:
+
+```txt
+$ helm install vllm-llama3-8b-instruct . \
+  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+  --set inferenceExtension.latencyPredictor.enabled=true \
+  --set provider.name=gke
+```
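[Editor's note] The same configuration can live in `values.yaml` instead of `--set` flags; a minimal sketch mirroring the command above:

```yaml
# Sketch of a values.yaml equivalent to the --set flags above.
inferencePool:
  modelServers:
    matchLabels:
      app: vllm-llama3-8b-instruct
inferenceExtension:
  latencyPredictor:
    enabled: true
provider:
  name: gke
```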
+
+#### SLO-Aware Router Environment Variables
+
+The behavior of the SLO-aware router can be fine-tuned using the following environment variables in the Endpoint Picker deployment. These can be set under `inferenceExtension.env` in your `values.yaml` file.
+
+| Environment Variable | Description | Default |
+| -------------------- | ----------- | ------- |
+| `SAMPLING_MEAN` | The sampling mean (lambda) for the Poisson distribution of token sampling. | `100.0` |
+| `MAX_SAMPLED_TOKENS` | The maximum number of tokens to sample for TPOT prediction. | `20` |
+| `SLO_BUFFER_FACTOR` | A buffer applied to the SLO to make it more or less strict. | `1.0` |
+| `NEG_HEADROOM_TTFT_WEIGHT` | The weight given to TTFT when a pod has negative headroom. | `0.8` |
+| `NEG_HEADROOM_TPOT_WEIGHT` | The weight given to TPOT when a pod has negative headroom. | `0.2` |
+| `HEADROOM_TTFT_WEIGHT` | The weight given to TTFT when a pod has positive headroom. | `0.8` |
+| `HEADROOM_TPOT_WEIGHT` | The weight given to TPOT when a pod has positive headroom. | `0.2` |
+| `HEADROOM_SELECTION_STRATEGY` | The strategy for selecting a pod based on headroom. Options: `least`, `most`, `composite-least`, `composite-most`, `composite-only`. | `least` |
+| `COMPOSITE_KV_WEIGHT` | The weight given to KV cache utilization in the composite score. | `1` |
+| `COMPOSITE_QUEUE_WEIGHT` | The weight given to queue size in the composite score. | `1` |
+| `COMPOSITE_PREFIX_WEIGHT` | The weight given to the prefix cache score in the composite score. | `1` |
+| `STICKY_EPSILON` | The probability of exploring a non-sticky pod. | `0.01` |
+| `NEG_HEADROOM_EPSILON` | The probability of exploring a pod with negative headroom. | `0.01` |
+| `AFFINITY_GATE_TAU` | The stickiness threshold for the affinity gate. | `0.80` |
+| `AFFINITY_GATE_TAU_GLOBAL` | The global stickiness threshold for the affinity gate. | `0.99` |
+| `POD_SELECTION_MODE` | The mode for selecting a pod from the weighted list. Options: `linear` (weighted random), `max` (argmax). | `linear` |
+
+**Note:** Enabling SLO-aware routing also exposes a number of Prometheus metrics for monitoring the feature, including actual vs. predicted latency, SLO violations, and more.
+
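[Editor's note] As a worked example, the following `values.yaml` fragment relaxes the SLO buffer and switches pod selection to argmax; a sketch assuming the `inferenceExtension.env` passthrough described above (values illustrative):

```yaml
inferenceExtension:
  latencyPredictor:
    enabled: true
  env:
    SLO_BUFFER_FACTOR: "1.2"                      # slightly looser than the default 1.0
    HEADROOM_SELECTION_STRATEGY: "composite-least"
    POD_SELECTION_MODE: "max"                     # argmax instead of weighted random
```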
 ### Install with High Availability (HA)
 
 To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.

config/charts/inferencepool/templates/epp-config.yaml

Lines changed: 44 additions & 1 deletion
@@ -11,7 +11,28 @@ data:
     - type: queue-scorer
     - type: kv-cache-utilization-scorer
     - type: prefix-cache-scorer
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - type: slo-aware-routing
+    - type: slo-aware-profile-handler
+    - type: max-score-picker
+{{- end }}
     schedulingProfiles:
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - name: prefix
+      plugins:
+      - pluginRef: prefix-cache-scorer
+    - name: default
+      plugins:
+      - pluginRef: slo-aware-routing
+        weight: 0
+      - pluginRef: queue-scorer
+      - pluginRef: kv-cache-utilization-scorer
+      - pluginRef: max-score-picker
+    - name: slo
+      plugins:
+      - pluginRef: slo-aware-routing
+      - pluginRef: max-score-picker
+{{- else }}
     - name: default
       plugins:
       - pluginRef: queue-scorer
@@ -20,7 +41,29 @@ data:
         weight: 2
       - pluginRef: prefix-cache-scorer
         weight: 3
+{{- end }}
 {{- if (hasKey .Values.inferenceExtension "pluginsCustomConfig") }}
 {{- .Values.inferenceExtension.pluginsCustomConfig | toYaml | nindent 2 }}
 {{- end }}
-
+---
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
+  namespace: {{ .Release.Namespace }}
+data:
+{{- range $key, $value := .Values.inferenceExtension.latencyPredictor.trainingServer.config }}
+  {{ $key }}: {{ $value | quote }}
+{{- end }}
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-prediction
+  namespace: {{ .Release.Namespace }}
+data:
+{{- range $key, $value := .Values.inferenceExtension.latencyPredictor.predictionServers.config }}
+  {{ $key }}: {{ $value | quote }}
+{{- end }}
+{{- end }}
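[Editor's note] The two ConfigMaps above are filled from arbitrary key/value maps in `values.yaml`; a hypothetical sketch of the corresponding values (key names illustrative, not chart defaults):

```yaml
inferenceExtension:
  latencyPredictor:
    enabled: true
    trainingServer:
      config:
        RETRAINING_INTERVAL_SEC: "60"   # hypothetical key, passed through verbatim
    predictionServers:
      config:
        MODEL_SYNC_INTERVAL_SEC: "10"   # hypothetical key, passed through verbatim
```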

config/charts/inferencepool/templates/epp-deployment.yaml

Lines changed: 98 additions & 0 deletions
@@ -57,6 +57,9 @@ spec:
 {{- if gt (.Values.inferenceExtension.replicas | int) 1 }}
         - --ha-enable-leader-election
 {{- end }}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+        - --enable-latency-predictor
+{{- end }}
         # Pass additional flags via the inferenceExtension.flags field in values.yaml.
 {{- range $key, $value := .Values.inferenceExtension.flags }}
         - --{{ $key }}
@@ -108,6 +111,20 @@ spec:
           valueFrom:
             fieldRef:
               fieldPath: metadata.namespace
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+        - name: PREDICTION_SERVER_URL
+          value: "{{- $count := int .Values.inferenceExtension.latencyPredictor.predictionServers.count -}}
+                 {{- $startPort := int .Values.inferenceExtension.latencyPredictor.predictionServers.startPort -}}
+                 {{- range $i := until $count -}}
+                 {{- if $i }},{{ end }}http://localhost:{{ add $startPort $i }}
+                 {{- end }}"
+        - name: TRAINING_SERVER_URL
+          value: "http://localhost:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+{{- range $key, $value := .Values.inferenceExtension.latencyPredictor.eppEnv }}
+        - name: {{ $key }}
+          value: {{ $value | quote }}
+{{- end }}
+{{- end }}
 {{- if .Values.inferenceExtension.tracing.enabled }}
         - name: OTEL_SERVICE_NAME
           value: "gateway-api-inference-extension"
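[Editor's note] For illustration, with hypothetical values `predictionServers.count=3`, `predictionServers.startPort=8001`, and `trainingServer.port=8000`, the loop above renders roughly:

```yaml
- name: PREDICTION_SERVER_URL
  value: "http://localhost:8001,http://localhost:8002,http://localhost:8003"
- name: TRAINING_SERVER_URL
  value: "http://localhost:8000"
```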
@@ -138,10 +155,91 @@ spec:
         volumeMounts:
         - name: plugins-config-volume
           mountPath: "/config"
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+      # Training Server Sidecar Container
+      - name: training-server
+        image: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.hub }}/{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.name }}:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.tag }}
+        imagePullPolicy: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.pullPolicy }}
+        ports:
+        - containerPort: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+          name: training-port
+        livenessProbe:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.livenessProbe | nindent 10 }}
+        readinessProbe:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.readinessProbe | nindent 10 }}
+        resources:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.resources | nindent 10 }}
+        envFrom:
+        - configMapRef:
+            name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
+        env:
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: SERVER_TYPE
+          value: "training"
+        volumeMounts:
+        - name: training-server-storage
+          mountPath: /models
+{{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+      # Prediction Server Sidecar Container {{ add $i 1 }}
+      - name: prediction-server-{{ add $i 1 }}
+        image: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.hub }}/{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.name }}:{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.tag }}
+        imagePullPolicy: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.pullPolicy }}
+        command: ["uvicorn"]
+        args: ["prediction_server:app", "--host", "0.0.0.0", "--port", "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"]
+        ports:
+        - containerPort: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          name: predict-port-{{ add $i 1 }}
+        livenessProbe:
+          httpGet:
+            path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.httpGet.path }}
+            port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.initialDelaySeconds }}
+          periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.periodSeconds }}
+        readinessProbe:
+          httpGet:
+            path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.httpGet.path }}
+            port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.initialDelaySeconds }}
+          periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.periodSeconds }}
+          failureThreshold: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.failureThreshold }}
+        resources:
+          {{- toYaml $.Values.inferenceExtension.latencyPredictor.predictionServers.resources | nindent 10 }}
+        envFrom:
+        - configMapRef:
+            name: {{ include "gateway-api-inference-extension.name" $ }}-latency-predictor-prediction
+        env:
+        - name: PREDICT_PORT
+          value: "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: SERVER_TYPE
+          value: "prediction-{{ add $i 1 }}"
+        - name: TRAINING_SERVER_URL
+          value: "http://localhost:{{ $.Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+        volumeMounts:
+        - name: prediction-server-{{ add $i 1 }}-storage
+          mountPath: /server_models
+{{- end }}
+{{- end }}
       volumes:
       - name: plugins-config-volume
         configMap:
           name: {{ include "gateway-api-inference-extension.name" . }}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+      - name: training-server-storage
+        emptyDir:
+          sizeLimit: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.volumeSize }}
+{{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+      - name: prediction-server-{{ add $i 1 }}-storage
+        emptyDir:
+          sizeLimit: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.volumeSize }}
+{{- end }}
+{{- end }}
 {{- if .Values.inferenceExtension.affinity }}
       affinity:
       {{- toYaml .Values.inferenceExtension.affinity | nindent 8 }}
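[Editor's note] With hypothetical values `predictionServers.count=2` and `volumeSize=10Gi`, the volumes block above renders as a sketch like:

```yaml
volumes:
- name: plugins-config-volume
  configMap:
    name: vllm-llama3-8b-instruct   # illustrative release name
- name: training-server-storage
  emptyDir:
    sizeLimit: 10Gi
- name: prediction-server-1-storage
  emptyDir:
    sizeLimit: 10Gi
- name: prediction-server-2-storage
  emptyDir:
    sizeLimit: 10Gi
```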

config/charts/inferencepool/templates/epp-service.yaml

Lines changed: 12 additions & 0 deletions
@@ -15,6 +15,18 @@ spec:
     - name: http-metrics
       protocol: TCP
       port: {{ .Values.inferenceExtension.metricsPort | default 9090 }}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - name: latency-predictor-training
+      protocol: TCP
+      port: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+      targetPort: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+{{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+    - name: latency-predictor-{{ add $i 1 }}
+      protocol: TCP
+      port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+      targetPort: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+{{- end }}
+{{- end }}
 {{- with .Values.inferenceExtension.extraServicePorts }}
 {{- toYaml . | nindent 4 }}
 {{- end }}
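[Editor's note] Assuming `trainingServer.port=8000`, `predictionServers.count=2`, and `startPort=8001` (hypothetical values), the Service picks up ports like:

```yaml
- name: latency-predictor-training
  protocol: TCP
  port: 8000
  targetPort: 8000
- name: latency-predictor-1
  protocol: TCP
  port: 8001
  targetPort: 8001
- name: latency-predictor-2
  protocol: TCP
  port: 8002
  targetPort: 8002
```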

pkg/epp/scheduling/framework/plugins/multi/slo_aware_router/scorer.go

Lines changed: 6 additions & 2 deletions
@@ -82,7 +82,11 @@ func (s *SLOAwareRouter) epsilonGreedyAffinityGate(
 	prefixStickyThreshold float64,
 ) ([]PodPredictionResult, bool) {
 	logger := log.FromContext(ctx)
-
+	if prefixStickyThreshold <= 0 {
+		// Affinity gating disabled
+		logger.V(logutil.DEBUG).Info("Affinity gating disabled (threshold <= 0)", "path", label)
+		return candidates, false
+	}
 	eligible := make([]PodPredictionResult, 0, len(candidates))
 	for _, p := range candidates {
 		if p.PrefixCacheScore >= prefixStickyThreshold {
@@ -301,7 +305,7 @@ func (s *SLOAwareRouter) getPrefixCacheScoreForPod(ctx context.Context, cycleSta
 
 	if err != nil {
 		// The prefix cache plugin might not be enabled, which is a valid scenario.
-		log.FromContext(ctx).V(logutil.DEBUG).Info("Prefix cache state not found in cycle state, returning prefix cache score of 0.0", "pod", pod.GetPod().String())
+		log.FromContext(ctx).V(logutil.DEBUG).Info("prefix cache state not found in cycle state, returning prefix cache score of 0.0", "error", err, "pod", pod.GetPod().String())
 		return 0.0
 	}
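[Editor's note] The new guard makes a non-positive threshold an explicit off switch for affinity gating. Assuming the thresholds are wired from the `AFFINITY_GATE_TAU` variables documented in the README (an assumption; this diff does not show the wiring), gating could be disabled via:

```yaml
inferenceExtension:
  env:
    AFFINITY_GATE_TAU: "0"          # assumed to feed prefixStickyThreshold; <= 0 disables the gate
    AFFINITY_GATE_TAU_GLOBAL: "0"   # assumed global-path counterpart
```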

site-src/guides/index.md

Lines changed: 6 additions & 0 deletions
@@ -207,6 +207,12 @@ Deploy the sample InferenceObjective which allows you to specify priority of req
 
 --8<-- "site-src/_includes/bbr.md"
 
+### Next Steps: Advanced Features
+
+You have now deployed a basic Inference Gateway with a simple routing strategy. To explore more advanced features, such as SLO-aware routing, refer to the following guide:
+
+* [SLO-Aware Routing](./slo-aware-routing.md)
+
 ### Cleanup
 
 The following instructions assume you would like to clean up ALL resources that were created in this quickstart guide.

0 commit comments