
Commit 729c53b

Add helm values and polish README and SLO routing guide

1 parent 2e220d7 commit 729c53b

3 files changed: +93 −38 lines changed

config/charts/inferencepool/README.md

Lines changed: 3 additions & 9 deletions

````diff
@@ -132,8 +132,10 @@ Here is an example of how to install the chart with SLO-aware routing enabled:
 ```txt
 $ helm install vllm-llama3-8b-instruct . \
   --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+  --set inferenceExtension.monitoring.gke.enabled=true \
   --set inferenceExtension.latencyPredictor.enabled=true \
-  --set provider.name=gke
+  --set provider.name=gke \
+  -f values.yaml
 ```
 
 #### SLO-Aware Router Environment Variables
@@ -150,14 +152,6 @@ The behavior of the SLO-aware router can be fine-tuned using the following envir
 | `HEADROOM_TTFT_WEIGHT` | The weight to give to the TTFT when a pod has positive headroom. | `0.8` |
 | `HEADROOM_TPOT_WEIGHT` | The weight to give to the TPOT when a pod has positive headroom. | `0.2` |
 | `HEADROOM_SELECTION_STRATEGY` | The strategy to use for selecting a pod based on headroom. Options: `least`, `most`, `composite-least`, `composite-most`, `composite-only`. | `least` |
-| `COMPOSITE_KV_WEIGHT` | The weight to give to the KV cache utilization in the composite score. | `1` |
-| `COMPOSITE_QUEUE_WEIGHT` | The weight to give to the queue size in the composite score. | `1` |
-| `COMPOSITE_PREFIX_WEIGHT` | The weight to give to the prefix cache score in the composite score. | `1` |
-| `STICKY_EPSILON` | The probability of exploring a non-sticky pod. | `0.01` |
-| `NEG_HEADROOM_EPSILON` | The probability of exploring a pod with negative headroom. | `0.01` |
-| `AFFINITY_GATE_TAU` | The stickiness threshold for the affinity gate. | `0.80` |
-| `AFFINITY_GATE_TAU_GLOBAL` | The global stickiness threshold for the affinity gate. | `0.99` |
-| `POD_SELECTION_MODE` | The mode for selecting a pod from the weighted list. Options: `linear` (weighted random), `max` (argmax). | `linear` |
 
 **Note:** Enabling SLO-aware routing also exposes a number of Prometheus metrics for monitoring the feature, including actual vs. predicted latency, SLO violations, and more.
 
````
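To illustrate how the `HEADROOM_*` knobs in the table above fit together, here is a minimal sketch (not the actual EPP implementation): headroom is the slack between a pod's predicted latency and the SLO, positive-headroom pods are scored with the TTFT/TPOT weights, and the default `least` strategy picks the pod with the smallest remaining slack. Pod names and latency values are hypothetical.

```python
# Illustrative sketch of headroom-based pod selection; values mirror the
# documented defaults (HEADROOM_TTFT_WEIGHT=0.8, HEADROOM_TPOT_WEIGHT=0.2).
HEADROOM_TTFT_WEIGHT = 0.8
HEADROOM_TPOT_WEIGHT = 0.2

def headroom_score(pred_ttft_ms, pred_tpot_ms, slo_ttft_ms, slo_tpot_ms):
    """Weighted slack between the SLO and predicted latency (positive = room to spare)."""
    ttft_headroom = slo_ttft_ms - pred_ttft_ms
    tpot_headroom = slo_tpot_ms - pred_tpot_ms
    return HEADROOM_TTFT_WEIGHT * ttft_headroom + HEADROOM_TPOT_WEIGHT * tpot_headroom

def pick_least_headroom(pods, slo_ttft_ms, slo_tpot_ms):
    """'least' strategy: among pods with positive headroom, choose the tightest fit."""
    scored = [(headroom_score(p["ttft"], p["tpot"], slo_ttft_ms, slo_tpot_ms), name)
              for name, p in pods.items()]
    positive = [(score, name) for score, name in scored if score > 0]
    return min(positive)[1] if positive else None

pods = {
    "pod-a": {"ttft": 80.0, "tpot": 20.0},  # hypothetical predicted latencies (ms)
    "pod-b": {"ttft": 40.0, "tpot": 15.0},
}
print(pick_least_headroom(pods, slo_ttft_ms=100.0, slo_tpot_ms=30.0))  # -> pod-a
```

With a 100 ms TTFT SLO, `pod-a` has the least (but still positive) weighted headroom, so `least` selects it; `most` would invert the choice to `pod-b`.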

config/charts/inferencepool/values.yaml

Lines changed: 88 additions & 27 deletions

```diff
@@ -1,9 +1,9 @@
 inferenceExtension:
   replicas: 1
   image:
-    name: epp
-    hub: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension
-    tag: main
+    name: epp-wlp-latencypredictor-helm-v2
+    hub: us-docker.pkg.dev/kaushikmitra-gke-dev/kaushikmitra-docker-repo
+    tag: latest
     pullPolicy: Always
   extProcPort: 9002
   env: []
@@ -12,11 +12,6 @@ inferenceExtension:
   extraContainerPorts: []
   # Define additional service ports
   extraServicePorts: []
-  # extraServicePorts:
-  #   - name: http
-  #     port: 8081
-  #     protocol: TCP
-  #     targetPort: 8081
 
   # This is the plugins configuration file.
   # pluginsCustomConfig:
@@ -43,10 +38,6 @@ inferenceExtension:
   affinity: {}
 
   tolerations: []
-
-  # Sidecar configuration for EPP
-  sidecar:
-    enabled: false
 
   # Monitoring configuration for EPP
   monitoring:
@@ -71,6 +62,89 @@ inferenceExtension:
     sampler: "parentbased_traceidratio"
     samplerArg: "0.1"
 
+  # Latency Predictor Configuration
+  latencyPredictor:
+    enabled: false
+
+    # Training Server Configuration
+    trainingServer:
+      image:
+        hub: us-docker.pkg.dev/kaushikmitra-gke-dev/kaushikmitra-docker-repo
+        name: latencypredictor-v3-training-server
+        tag: latest
+        pullPolicy: Always
+      port: 8000
+      resources:
+        requests:
+          cpu: "2000m"
+          memory: "4Gi"
+        limits:
+          cpu: "4000m"
+          memory: "8Gi"
+      livenessProbe:
+        httpGet:
+          path: /healthz
+          port: 8000
+        initialDelaySeconds: 30
+        periodSeconds: 20
+      readinessProbe:
+        httpGet:
+          path: /readyz
+          port: 8000
+        initialDelaySeconds: 45
+        periodSeconds: 10
+      volumeSize: "20Gi"
+      config:
+        LATENCY_RETRAINING_INTERVAL_SEC: "1"
+        LATENCY_MIN_SAMPLES_FOR_RETRAIN: "100"
+        LATENCY_TTFT_MODEL_PATH: "/models/ttft.joblib"
+        LATENCY_TPOT_MODEL_PATH: "/models/tpot.joblib"
+        LATENCY_TTFT_SCALER_PATH: "/models/ttft_scaler.joblib"
+        LATENCY_TPOT_SCALER_PATH: "/models/tpot_scaler.joblib"
+        LATENCY_MODEL_TYPE: "xgboost"
+        LATENCY_MAX_TRAINING_DATA_SIZE_PER_BUCKET: "5000"
+        LATENCY_QUANTILE_ALPHA: "0.9"
+
+    # Prediction Server Configuration
+    predictionServers:
+      count: 10
+      startPort: 8001
+      image:
+        hub: us-docker.pkg.dev/kaushikmitra-gke-dev/kaushikmitra-docker-repo
+        name: latencypredictor-v3-prediction-server
+        tag: latest
+        pullPolicy: Always
+      resources:
+        requests:
+          cpu: "500m"
+          memory: "1Gi"
+        limits:
+          cpu: "1000m"
+          memory: "2Gi"
+      livenessProbe:
+        httpGet:
+          path: /healthz
+        initialDelaySeconds: 15
+        periodSeconds: 15
+      readinessProbe:
+        httpGet:
+          path: /readyz
+        initialDelaySeconds: 10
+        periodSeconds: 5
+        failureThreshold: 10
+      volumeSize: "10Gi"
+      config:
+        LATENCY_MODEL_TYPE: "xgboost"
+        PREDICT_HOST: "0.0.0.0"
+        LOCAL_TTFT_MODEL_PATH: "/server_models/ttft.joblib"
+        LOCAL_TPOT_MODEL_PATH: "/server_models/tpot.joblib"
+        LOCAL_TTFT_SCALER_PATH: "/server_models/ttft_scaler.joblib"
+        LOCAL_TPOT_SCALER_PATH: "/server_models/tpot_scaler.joblib"
+
+    # EPP Environment Variables for Latency Predictor
+    eppEnv:
+      LATENCY_MAX_SAMPLE_SIZE: "10000"
+
 inferencePool:
   targetPorts:
     - number: 8000
@@ -94,25 +168,12 @@ provider:
   # Set to true if the cluster is an Autopilot cluster.
   autopilot: false
 
-  # Istio-specific configuration.
-  # This block is only used if name is "istio".
-  istio:
-    destinationRule:
-      # Provide a way to override the default calculated host
-      host: ""
-      # Optional: Enables customization of the traffic policy
-      trafficPolicy: {}
-      # connectionPool:
-      #   http:
-      #     maxRequestsPerConnection: 256000
-
-# DEPRECATED and will be removed in v1.3. Instead, use `provider.istio.*`.
 istio:
   destinationRule:
     # Provide a way to override the default calculated host
-    host: ""
+    host: ""
     # Optional: Enables customization of the traffic policy
     trafficPolicy: {}
     # connectionPool:
     #   http:
-    #     maxRequestsPerConnection: 256000
+    #     maxRequestsPerConnection: 256000
```
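Since the README's install command passes `-f values.yaml`, these settings can be pinned in a values override file rather than repeated as `--set` flags. A minimal, hypothetical sketch, assuming the chart injects `inferenceExtension.env` entries as EPP container environment variables (as the default `env: []` above suggests):

```yaml
# values-slo.yaml — illustrative override file, not shipped with the chart
inferenceExtension:
  latencyPredictor:
    enabled: true
  monitoring:
    gke:
      enabled: true
  env:
    - name: HEADROOM_SELECTION_STRATEGY
      value: "composite-least"
    - name: HEADROOM_TTFT_WEIGHT
      value: "0.9"
provider:
  name: gke
```

Applied with `helm upgrade --install vllm-llama3-8b-instruct . -f values-slo.yaml`.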

site-src/guides/slo-aware-routing.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -61,6 +61,8 @@ Key categories of metrics include:
 - **SLO Violations**: Counters and gauges are available to track when SLOs are violated. This can be used to alert on SLO breaches.
 - **SLO Thresholds**: The current SLO thresholds for TTFT and TPOT are also exposed as metrics.
 
+NOTE: TPOT is equivalent to vLLM's **ITL** (Inter Token Latency), as vLLM defines TPOT as the average time per output token *including the TTFT*. This is commonly known as NTPOT in other contexts, and we don't capture that metric here.
+
 The following is a comprehensive list of the Prometheus metrics exposed:
 
 | Metric Name | Description |
@@ -81,5 +83,3 @@ The following is a comprehensive list of the Prometheus metrics exposed:
 | `inference_objective_request_ttft_slo_violation_total` | Counter of TTFT SLO violations for each model and target model. |
 | `inference_objective_request_tpot_slo_violation` | Boolean indicator (0 or 1) of whether the last TPOT measurement violated the SLO threshold for each model and target model. |
 | `inference_objective_request_tpot_slo_violation_total` | Counter of TPOT SLO violations for each model and target model. |
-| `inference_objective_request_ttft_slo_threshold_seconds` | Current TTFT SLO threshold in seconds for each model and target model. |
-| `inference_objective_request_tpot_slo_threshold_seconds` | Current TPOT SLO threshold in seconds for each model and target model. |
```
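As a usage sketch, the violation counters kept by this change can drive a rate-based alert. A hypothetical PromQL expression (the `model_name` label is an assumption; check the actual labels the EPP emits):

```promql
# Fire when any model accumulates TTFT SLO violations over the last 5 minutes.
sum by (model_name) (
  rate(inference_objective_request_ttft_slo_violation_total[5m])
) > 0
```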

0 commit comments
