Commit 248f4e9

Consolidate ha config into a single enableLeaderElection, also fix rolling update stuck bug (#1620)
1 parent b4a418c commit 248f4e9

3 files changed: +49 -19 lines changed

config/charts/inferencepool/README.md

Lines changed: 22 additions & 12 deletions
@@ -103,19 +103,30 @@ $ helm install triton-llama3-8b-instruct \
103103

104104
To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, you can enable leader election. When enabled, the EPP deployment will have multiple replicas, but only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
105105

106-
To enable HA, set `inferenceExtension.flags.has-enable-leader-election` to `true` and increase the number of replicas in your `values.yaml` file:
106+
To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.
107107

108-
```yaml
109-
inferenceExtension:
110-
replicas: 3
111-
has-enable-leader-election: true
112-
```
108+
* Via `--set` flag:
113109

114-
Then apply it with:
110+
```txt
111+
helm install vllm-llama3-8b-instruct \
112+
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
113+
--set inferenceExtension.enableLeaderElection=true \
114+
--set provider=[none|gke] \
115+
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
116+
```
115117

116-
```txt
117-
helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
118-
```
118+
* Via `values.yaml`:
119+
120+
```yaml
121+
inferenceExtension:
122+
enableLeaderElection: true
123+
```
124+
125+
Then apply it with:
126+
127+
```txt
128+
helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
129+
```
119130

120131
### Install with Monitoring
121132

@@ -171,8 +182,7 @@ The following table list the configurable parameters of the chart.
171182
| `inferenceExtension.extraServicePorts` | List of additional service ports to expose. Defaults to `[]`. |
172183
| `inferenceExtension.flags` | List of flags which are passed through to endpoint picker. Example flags, enable-pprof, grpc-port etc. Refer [runner.go](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/cmd/epp/runner/runner.go) for complete list. |
173184
| `inferenceExtension.affinity` | Affinity for the endpoint picker. Defaults to `{}`. |
174-
| `inferenceExtension.tolerations` | Tolerations for the endpoint picker. Defaults to `[]`. |
175-
| `inferenceExtension.flags.has-enable-leader-election` | Enable leader election for high availability. When enabled, only one EPP pod (the leader) will be ready to serve traffic. |
185+
| `inferenceExtension.tolerations` | Tolerations for the endpoint picker. Defaults to `[]`. |
176186
| `inferenceExtension.monitoring.interval` | Metrics scraping interval for monitoring. Defaults to `10s`. |
177187
| `inferenceExtension.monitoring.secret.name` | Name of the service account token secret for metrics authentication. Defaults to `inference-gateway-sa-metrics-reader-secret`. |
178188
| `inferenceExtension.monitoring.prometheus.enabled` | Enable Prometheus ServiceMonitor creation for EPP metrics collection. Defaults to `false`. |
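
As a rough sketch of how the consolidated option might sit next to pass-through flags in `values.yaml`: the `name`/`value` shape of each `flags` entry follows the `range` loop in `epp-deployment.yaml`, and `grpc-port` is one of the example flags the README names. The value `9002` is an assumption, chosen only because it matches the `grpc` container port in the template.

```yaml
inferenceExtension:
  # Single switch introduced by this commit. The chart template derives the
  # replica count and the --ha-enable-leader-election argument from it.
  enableLeaderElection: true
  # Each entry is rendered as a "--<name>" "<value>" argument pair.
  # The entry below is illustrative only, not a recommended setting.
  flags:
    - name: grpc-port
      value: "9002"
```

With this layout, leader election no longer needs to be passed through `flags`, so that list is left for other endpoint picker options.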

config/charts/inferencepool/templates/epp-deployment.yaml

Lines changed: 23 additions & 7 deletions
@@ -6,7 +6,19 @@ metadata:
66
labels:
77
{{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
88
spec:
9-
replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
9+
{{- if .Values.inferenceExtension.enableLeaderElection }}
10+
replicas: 3
11+
{{- else }}
12+
replicas: 1
13+
{{- end }}
14+
strategy:
15+
# The current recommended EPP deployment pattern is to have a single active replica. This ensures
16+
# optimal performance of stateful operations such as the prefix-cache-aware scorer.
17+
# With the Recreate strategy, the old replica is killed immediately, allowing the new replica(s) to
18+
# quickly take over. This is particularly important in the high-availability setup with leader
19+
# election, because the RollingUpdate strategy would prevent the old leader from being killed:
20+
# killing it would bring unavailability to 100%.
21+
type: Recreate
1022
selector:
1123
matchLabels:
1224
{{- include "gateway-api-inference-extension.selectorLabels" . | nindent 6 }}
@@ -33,10 +45,6 @@ spec:
3345
- "json"
3446
- --config-file
3547
- "/config/{{ .Values.inferenceExtension.pluginsConfigFile }}"
36-
{{- range .Values.inferenceExtension.flags }}
37-
- "--{{ .name }}"
38-
- "{{ .value }}"
39-
{{- end }}
4048
{{- if eq (.Values.inferencePool.modelServerType | default "vllm") "triton-tensorrt-llm" }}
4149
- --total-queued-requests-metric
4250
- "nv_trt_llm_request_metrics{request_type=waiting}"
@@ -45,6 +53,14 @@ spec:
4553
- --lora-info-metric
4654
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
4755
{{- end }}
56+
{{- if .Values.inferenceExtension.enableLeaderElection }}
57+
- --ha-enable-leader-election
58+
{{- end }}
59+
# Pass additional flags via the inferenceExtension.flags field in values.yaml.
60+
{{- range .Values.inferenceExtension.flags }}
61+
- "--{{ .name }}"
62+
- "{{ .value }}"
63+
{{- end }}
4864
ports:
4965
- name: grpc
5066
containerPort: 9002
@@ -77,8 +93,8 @@ spec:
7793
port: 9003
7894
service: inference-extension
7995
{{- end }}
80-
initialDelaySeconds: 5
81-
periodSeconds: 10
96+
periodSeconds: 2
97+
8298
env:
8399
- name: NAMESPACE
84100
valueFrom:
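
To make the template change above easier to follow, here is an approximate fragment of what the Deployment might render to when `inferenceExtension.enableLeaderElection` is `true`. This is a sketch, not captured `helm template` output: unrelated fields are omitted, the `args` key and the nesting are assumed from standard Deployment structure (the diff does not show them), and the `grpc-port` pair assumes a hypothetical `inferenceExtension.flags` entry.

```yaml
spec:
  replicas: 3            # rendered as 1 when enableLeaderElection is false or unset
  strategy:
    type: Recreate       # set unconditionally by the template
  template:
    spec:
      containers:
        - args:          # assumed key; earlier arguments omitted
            - --config-file
            - "/config/<pluginsConfigFile>"   # placeholder for the templated value
            - --ha-enable-leader-election     # emitted only when enabled
            - "--grpc-port"                   # hypothetical flags entry
            - "9002"
```

After this change the template no longer reads `inferenceExtension.replicas`, so the replica count follows the leader election switch alone.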

config/charts/inferencepool/templates/gke.yaml

Lines changed: 4 additions & 0 deletions
@@ -13,6 +13,10 @@ spec:
1313
kind: InferencePool
1414
name: {{ .Release.Name }}
1515
default:
16+
# Set a more aggressive health check than the default 5s for faster
17+
# switchover during EPP rollout.
18+
timeoutSec: 2
19+
checkIntervalSec: 2
1620
config:
1721
type: HTTP
1822
httpHealthCheck:
