
Commit cdacc5c

add guide, update helm charts and readme, minor scorer changes
1 parent 04db6e9 commit cdacc5c

6 files changed: +278 −4 lines changed

config/charts/inferencepool/README.md

Lines changed: 40 additions & 0 deletions

@@ -121,6 +121,46 @@ $ helm install triton-llama3-8b-instruct \
     oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
 ```
 
+### Install with SLO-Aware Routing
+
+To enable SLO-aware routing, you must enable the latency predictor, which is deployed as a set of sidecar containers alongside the Endpoint Picker. When the latency predictor is enabled, the `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
+
+You can enable the latency predictor by setting `inferenceExtension.latencyPredictor.enabled` to `true` in your `values.yaml` file, or by using the `--set` flag on the command line.
+
+Here is an example of how to install the chart with SLO-aware routing enabled:
+
+```txt
+$ helm install vllm-llama3-8b-instruct . \
+    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+    --set inferenceExtension.latencyPredictor.enabled=true \
+    --set provider.name=gke
+```
+
+#### SLO-Aware Router Environment Variables
+
+The behavior of the SLO-aware router can be fine-tuned using the following environment variables in the Endpoint Picker deployment. These can be set under `inferenceExtension.env` in your `values.yaml` file.
+
+| Environment Variable | Description | Default |
+| -------------------- | ----------- | ------- |
+| `SAMPLING_MEAN` | The sampling mean (lambda) for the Poisson distribution of token sampling. | `100.0` |
+| `MAX_SAMPLED_TOKENS` | The maximum number of tokens to sample for TPOT prediction. | `20` |
+| `SLO_BUFFER_FACTOR` | A buffer to apply to the SLO to make it more or less strict. | `1.0` |
+| `NEG_HEADROOM_TTFT_WEIGHT` | The weight to give to the TTFT when a pod has negative headroom. | `0.8` |
+| `NEG_HEADROOM_TPOT_WEIGHT` | The weight to give to the TPOT when a pod has negative headroom. | `0.2` |
+| `HEADROOM_TTFT_WEIGHT` | The weight to give to the TTFT when a pod has positive headroom. | `0.8` |
+| `HEADROOM_TPOT_WEIGHT` | The weight to give to the TPOT when a pod has positive headroom. | `0.2` |
+| `HEADROOM_SELECTION_STRATEGY` | The strategy to use for selecting a pod based on headroom. Options: `least`, `most`, `composite-least`, `composite-most`, `composite-only`. | `least` |
+| `COMPOSITE_KV_WEIGHT` | The weight to give to the KV cache utilization in the composite score. | `1` |
+| `COMPOSITE_QUEUE_WEIGHT` | The weight to give to the queue size in the composite score. | `1` |
+| `COMPOSITE_PREFIX_WEIGHT` | The weight to give to the prefix cache score in the composite score. | `1` |
+| `STICKY_EPSILON` | The probability of exploring a non-sticky pod. | `0.01` |
+| `NEG_HEADROOM_EPSILON` | The probability of exploring a pod with negative headroom. | `0.01` |
+| `AFFINITY_GATE_TAU` | The stickiness threshold for the affinity gate. | `0.80` |
+| `AFFINITY_GATE_TAU_GLOBAL` | The global stickiness threshold for the affinity gate. | `0.99` |
+| `POD_SELECTION_MODE` | The mode for selecting a pod from the weighted list. Options: `linear` (weighted random), `max` (argmax). | `linear` |
+
+**Note:** Enabling SLO-aware routing also exposes a number of Prometheus metrics for monitoring the feature, including actual vs. predicted latency, SLO violations, and more.
+
 ### Install with High Availability (HA)
 
 To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
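The same installation can also be driven from `values.yaml` instead of `--set` flags. A minimal sketch, assuming illustrative tuning values (not recommendations); note that the deployment template in this commit injects entries from `inferenceExtension.latencyPredictor.eppEnv` into the EPP container, so the router variables are placed there:

```yaml
# Sketch of a values.yaml fragment. The variable names come from the table
# above; the specific values here are illustrative only.
inferenceExtension:
  latencyPredictor:
    enabled: true
    eppEnv:
      SLO_BUFFER_FACTOR: "0.9"                       # slightly stricter than the default 1.0
      HEADROOM_SELECTION_STRATEGY: "composite-least"
      POD_SELECTION_MODE: "linear"
```

Helm merges this fragment with the chart defaults, so only the keys you override need to appear.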

config/charts/inferencepool/templates/epp-config.yaml

Lines changed: 37 additions & 4 deletions

@@ -11,7 +11,28 @@ data:
     - type: queue-scorer
     - type: kv-cache-utilization-scorer
     - type: prefix-cache-scorer
+    {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - type: slo-aware-routing
+    - type: slo-aware-profile-handler
+    - type: max-score-picker
+    {{- end }}
     schedulingProfiles:
+    {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+    - name: prefix
+      plugins:
+      - pluginRef: prefix-cache-scorer
+    - name: default
+      plugins:
+      - pluginRef: slo-aware-routing
+        weight: 0
+      - pluginRef: queue-scorer
+      - pluginRef: kv-cache-utilization-scorer
+      - pluginRef: max-score-picker
+    - name: slo
+      plugins:
+      - pluginRef: slo-aware-routing
+      - pluginRef: max-score-picker
+    {{- else }}
     - name: default
       plugins:
       - pluginRef: queue-scorer
@@ -20,17 +41,29 @@ data:
         weight: 2
       - pluginRef: prefix-cache-scorer
         weight: 3
+    {{- end }}
  {{- if (hasKey .Values.inferenceExtension "pluginsCustomConfig") }}
  {{- .Values.inferenceExtension.pluginsCustomConfig | toYaml | nindent 2 }}
  {{- end }}
-
 ---
-{{- if .Values.inferenceExtension.sidecar.enabled }}
+{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: {{ .Values.inferenceExtension.sidecar.configMap.name }}
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
   namespace: {{ .Release.Namespace }}
 data:
-  {{- .Values.inferenceExtension.sidecar.configMap.data | toYaml | nindent 2 }}
+  {{- range $key, $value := .Values.inferenceExtension.latencyPredictor.trainingServer.config }}
+  {{ $key }}: {{ $value | quote }}
+  {{- end }}
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-prediction
+  namespace: {{ .Release.Namespace }}
+data:
+  {{- range $key, $value := .Values.inferenceExtension.latencyPredictor.predictionServers.config }}
+  {{ $key }}: {{ $value | quote }}
+  {{- end }}
 {{- end }}
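With `latencyPredictor.enabled` set, the plugin configuration above renders to plain YAML along these lines (a hand-rendered sketch of the template with the braces resolved, not output captured from a cluster):

```yaml
plugins:
- type: queue-scorer
- type: kv-cache-utilization-scorer
- type: prefix-cache-scorer
- type: slo-aware-routing
- type: slo-aware-profile-handler
- type: max-score-picker
schedulingProfiles:
- name: prefix
  plugins:
  - pluginRef: prefix-cache-scorer
- name: default
  plugins:
  - pluginRef: slo-aware-routing   # weight 0: scored but does not affect ranking
    weight: 0
  - pluginRef: queue-scorer
  - pluginRef: kv-cache-utilization-scorer
  - pluginRef: max-score-picker
- name: slo
  plugins:
  - pluginRef: slo-aware-routing   # here the SLO score drives pod selection
  - pluginRef: max-score-picker
```

Note that `slo-aware-routing` appears in the `default` profile with `weight: 0`, presumably so it can observe requests without influencing the default scoring, while the dedicated `slo` profile lets it drive selection.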

config/charts/inferencepool/templates/epp-deployment.yaml

Lines changed: 98 additions & 0 deletions

@@ -96,6 +96,9 @@ spec:
         {{- if gt (.Values.inferenceExtension.replicas | int) 1 }}
         - --ha-enable-leader-election
         {{- end }}
+        {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+        - --enable-latency-predictor
+        {{- end }}
         # Pass additional flags via the inferenceExtension.flags field in values.yaml.
         {{- range $key, $value := .Values.inferenceExtension.flags }}
         - --{{ $key }}
@@ -147,6 +150,20 @@ spec:
           valueFrom:
             fieldRef:
               fieldPath: metadata.namespace
+        {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+        - name: PREDICTION_SERVER_URL
+          value: "{{- $count := int .Values.inferenceExtension.latencyPredictor.predictionServers.count -}}
+            {{- $startPort := int .Values.inferenceExtension.latencyPredictor.predictionServers.startPort -}}
+            {{- range $i := until $count -}}
+            {{- if $i }},{{ end }}http://localhost:{{ add $startPort $i }}
+            {{- end }}"
+        - name: TRAINING_SERVER_URL
+          value: "http://localhost:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+        {{- range $key, $value := .Values.inferenceExtension.latencyPredictor.eppEnv }}
+        - name: {{ $key }}
+          value: {{ $value | quote }}
+        {{- end }}
+        {{- end }}
         {{- if .Values.inferenceExtension.tracing.enabled }}
         - name: OTEL_SERVICE_NAME
           value: "gateway-api-inference-extension"
@@ -177,13 +194,94 @@ spec:
         volumeMounts:
         - name: plugins-config-volume
           mountPath: "/config"
+      {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+      # Training Server Sidecar Container
+      - name: training-server
+        image: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.hub }}/{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.name }}:{{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.tag }}
+        imagePullPolicy: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.image.pullPolicy }}
+        ports:
+        - containerPort: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+          name: training-port
+        livenessProbe:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.livenessProbe | nindent 10 }}
+        readinessProbe:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.readinessProbe | nindent 10 }}
+        resources:
+          {{- toYaml .Values.inferenceExtension.latencyPredictor.trainingServer.resources | nindent 10 }}
+        envFrom:
+        - configMapRef:
+            name: {{ include "gateway-api-inference-extension.name" . }}-latency-predictor-training
+        env:
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: SERVER_TYPE
+          value: "training"
+        volumeMounts:
+        - name: training-server-storage
+          mountPath: /models
+      {{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+      # Prediction Server Sidecar Container {{ add $i 1 }}
+      - name: prediction-server-{{ add $i 1 }}
+        image: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.hub }}/{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.name }}:{{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.tag }}
+        imagePullPolicy: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.image.pullPolicy }}
+        command: ["uvicorn"]
+        args: ["prediction_server:app", "--host", "0.0.0.0", "--port", "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"]
+        ports:
+        - containerPort: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          name: predict-port-{{ add $i 1 }}
+        livenessProbe:
+          httpGet:
+            path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.httpGet.path }}
+            port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.initialDelaySeconds }}
+          periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.livenessProbe.periodSeconds }}
+        readinessProbe:
+          httpGet:
+            path: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.httpGet.path }}
+            port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+          initialDelaySeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.initialDelaySeconds }}
+          periodSeconds: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.periodSeconds }}
+          failureThreshold: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.readinessProbe.failureThreshold }}
+        resources:
+          {{- toYaml $.Values.inferenceExtension.latencyPredictor.predictionServers.resources | nindent 10 }}
+        envFrom:
+        - configMapRef:
+            name: {{ include "gateway-api-inference-extension.name" $ }}-latency-predictor-prediction
+        env:
+        - name: PREDICT_PORT
+          value: "{{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}"
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: SERVER_TYPE
+          value: "prediction-{{ add $i 1 }}"
+        - name: TRAINING_SERVER_URL
+          value: "http://localhost:{{ $.Values.inferenceExtension.latencyPredictor.trainingServer.port }}"
+        volumeMounts:
+        - name: prediction-server-{{ add $i 1 }}-storage
+          mountPath: /server_models
+      {{- end }}
+      {{- end }}
       volumes:
       {{- if .Values.inferenceExtension.sidecar.volumes }}
       {{- tpl (toYaml .Values.inferenceExtension.sidecar.volumes) $ | nindent 6 }}
       {{- end }}
       - name: plugins-config-volume
         configMap:
           name: {{ include "gateway-api-inference-extension.name" . }}
+      {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+      - name: training-server-storage
+        emptyDir:
+          sizeLimit: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.volumeSize }}
+      {{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+      - name: prediction-server-{{ add $i 1 }}-storage
+        emptyDir:
+          sizeLimit: {{ $.Values.inferenceExtension.latencyPredictor.predictionServers.volumeSize }}
+      {{- end }}
+      {{- end }}
       {{- if .Values.inferenceExtension.affinity }}
       affinity:
         {{- toYaml .Values.inferenceExtension.affinity | nindent 8 }}
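To make the `PREDICTION_SERVER_URL` construction above concrete: the template joins one `localhost` URL per prediction server. With hypothetical values `predictionServers.count: 2`, `predictionServers.startPort: 8001`, and `trainingServer.port: 8000` (assumed here, not chart defaults), the EPP container env would render roughly as:

```yaml
# Hand-rendered sketch; the ports and count are assumed values.
- name: PREDICTION_SERVER_URL
  value: "http://localhost:8001,http://localhost:8002"
- name: TRAINING_SERVER_URL
  value: "http://localhost:8000"
```

Each prediction server also gets its own sidecar container (`prediction-server-1`, `prediction-server-2`, …) and a `PREDICT_PORT` env var; because all sidecars share the pod's network namespace, plain `localhost` URLs are sufficient.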

config/charts/inferencepool/templates/epp-service.yaml

Lines changed: 12 additions & 0 deletions

@@ -15,6 +15,18 @@ spec:
   - name: http-metrics
     protocol: TCP
     port: {{ .Values.inferenceExtension.metricsPort | default 9090 }}
+  {{- if .Values.inferenceExtension.latencyPredictor.enabled }}
+  - name: latency-predictor-training
+    protocol: TCP
+    port: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+    targetPort: {{ .Values.inferenceExtension.latencyPredictor.trainingServer.port }}
+  {{- range $i := until (int .Values.inferenceExtension.latencyPredictor.predictionServers.count) }}
+  - name: latency-predictor-{{ add $i 1 }}
+    protocol: TCP
+    port: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+    targetPort: {{ add $.Values.inferenceExtension.latencyPredictor.predictionServers.startPort $i }}
+  {{- end }}
+  {{- end }}
   {{- with .Values.inferenceExtension.extraServicePorts }}
   {{- toYaml . | nindent 4 }}
   {{- end }}
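Under the same hypothetical values used above (training port 8000, two prediction servers starting at 8001), the Service would gain ports roughly as follows:

```yaml
# Hand-rendered sketch with assumed port values.
- name: latency-predictor-training
  protocol: TCP
  port: 8000
  targetPort: 8000
- name: latency-predictor-1
  protocol: TCP
  port: 8001
  targetPort: 8001
- name: latency-predictor-2
  protocol: TCP
  port: 8002
  targetPort: 8002
```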

site-src/guides/index.md

Lines changed: 6 additions & 0 deletions

@@ -274,6 +274,12 @@ Deploy the sample InferenceObjective which allows you to specify priority of requests
 
 --8<-- "site-src/_includes/bbr.md"
 
+### Next Steps: Advanced Features
+
+You have now deployed a basic Inference Gateway with a simple routing strategy. To explore more advanced features, such as SLO-aware routing, refer to the following guide:
+
+* [SLO-Aware Routing](./slo-aware-routing.md)
+
 ### Cleanup
 
 The following instructions assume you would like to clean up ALL resources that were created in this quickstart guide.
