Commit de41a96

liu-cong authored and kfswain committed
Consolidate ha config into a single enableLeaderElection, also fix rolling update stuck bug (#1620)
1 parent 1529635 commit de41a96

3 files changed, 56 additions and 20 deletions

config/charts/inferencepool/README.md

Lines changed: 23 additions & 11 deletions
@@ -83,19 +83,30 @@ $ helm install triton-llama3-8b-instruct \

 To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, you can enable leader election. When enabled, the EPP deployment will have multiple replicas, but only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.

-To enable HA, set `inferenceExtension.flags.has-enable-leader-election` to `true` and increase the number of replicas in your `values.yaml` file:
+To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.

-```yaml
-inferenceExtension:
-  replicas: 3
-  has-enable-leader-election: true
-```
+* Via `--set` flag:

-Then apply it with:
+```txt
+helm install vllm-llama3-8b-instruct \
+  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+  --set inferenceExtension.enableLeaderElection=true \
+  --set provider=[none|gke] \
+  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
+```

-```txt
-helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
-```
+* Via `values.yaml`:
+
+```yaml
+inferenceExtension:
+  enableLeaderElection: true
+```
+
+Then apply it with:
+
+```txt
+helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
+```

 ### Install with Monitoring

@@ -150,7 +161,8 @@ The following table list the configurable parameters of the chart.
 | `inferenceExtension.extraContainerPorts` | List of additional container ports to expose. Defaults to `[]`. |
 | `inferenceExtension.extraServicePorts` | List of additional service ports to expose. Defaults to `[]`. |
 | `inferenceExtension.flags` | List of flags which are passed through to endpoint picker. Example flags, enable-pprof, grpc-port etc. Refer [runner.go](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/cmd/epp/runner/runner.go) for complete list. |
-| `inferenceExtension.flags.has-enable-leader-election` | Enable leader election for high availability. When enabled, only one EPP pod (the leader) will be ready to serve traffic. |
+| `inferenceExtension.affinity` | Affinity for the endpoint picker. Defaults to `{}`. |
+| `inferenceExtension.tolerations` | Tolerations for the endpoint picker. Defaults to `[]`. |
 | `inferenceExtension.monitoring.interval` | Metrics scraping interval for monitoring. Defaults to `10s`. |
 | `inferenceExtension.monitoring.secret.name` | Name of the service account token secret for metrics authentication. Defaults to `inference-gateway-sa-metrics-reader-secret`. |
 | `inferenceExtension.monitoring.prometheus.enabled` | Enable Prometheus ServiceMonitor creation for EPP metrics collection. Defaults to `false`. |
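
For orientation, a `values.yaml` that combines the new `enableLeaderElection` switch with the pass-through `flags` list might look like the sketch below. It assumes `flags` is a list of `name`/`value` pairs, which is how the deployment template in this commit iterates over it. The `grpc-port` value is purely illustrative and not taken from the commit.

```yaml
# Illustrative values.yaml sketch; the grpc-port value is an assumption.
inferenceExtension:
  # Replaces the old flags-based switch. The chart now derives the replica
  # count (3) and the --ha-enable-leader-election argument from this field.
  enableLeaderElection: true
  # Each entry is passed to the endpoint picker as "--<name>" followed by "<value>".
  flags:
    - name: grpc-port
      value: "9002"
```

With this layout, leader election no longer needs to appear under `flags`, and the replica count is no longer set by hand.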

config/charts/inferencepool/templates/epp-deployment.yaml

Lines changed: 29 additions & 9 deletions
@@ -6,7 +6,19 @@ metadata:
   labels:
     {{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
 spec:
-  replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
+  {{- if .Values.inferenceExtension.enableLeaderElection }}
+  replicas: 3
+  {{- else }}
+  replicas: 1
+  {{- end }}
+  strategy:
+    # The current recommended EPP deployment pattern is to have a single active replica. This ensures
+    # optimal performance of stateful operations such as the prefix-cache-aware scorer.
+    # With the Recreate strategy, the old replica is killed immediately, allowing the new replica(s) to
+    # quickly take over. This is particularly important in the high-availability setup with leader
+    # election, because a rolling update would not kill the old leader: doing so would push
+    # maxUnavailable to 100%.
+    type: Recreate
   selector:
     matchLabels:
       {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 6 }}
@@ -35,10 +47,6 @@ spec:
         - "json"
         - --config-file
         - "/config/{{ .Values.inferenceExtension.pluginsConfigFile }}"
-        {{- range .Values.inferenceExtension.flags }}
-        - "--{{ .name }}"
-        - "{{ .value }}"
-        {{- end }}
         {{- if eq (.Values.inferencePool.modelServerType | default "vllm") "triton-tensorrt-llm" }}
         - --total-queued-requests-metric
         - "nv_trt_llm_request_metrics{request_type=waiting}"
@@ -47,6 +55,14 @@ spec:
         - --lora-info-metric
         - "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
         {{- end }}
+        {{- if .Values.inferenceExtension.enableLeaderElection }}
+        - --ha-enable-leader-election
+        {{- end }}
+        # Pass additional flags via the inferenceExtension.flags field in values.yaml.
+        {{- range .Values.inferenceExtension.flags }}
+        - "--{{ .name }}"
+        - "{{ .value }}"
+        {{- end }}
         ports:
         - name: grpc
           containerPort: 9002
@@ -79,11 +95,15 @@ spec:
             port: 9003
             service: inference-extension
           {{- end }}
-          initialDelaySeconds: 5
-          periodSeconds: 10
-        {{- with .Values.inferenceExtension.env }}
+          periodSeconds: 2
+
         env:
-        {{- toYaml . | nindent 8 }}
+        - name: NAMESPACE
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.namespace
+        {{- if .Values.inferenceExtension.env }}
+        {{- toYaml .Values.inferenceExtension.env | nindent 8 }}
         {{- end }}
         volumeMounts:
         - name: plugins-config-volume
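
As a rough guide to what these template changes produce, the fragment below sketches the parts of the rendered Deployment that this commit touches when `inferenceExtension.enableLeaderElection` is `true`. It is abridged and hand-written rather than actual `helm template` output. The nesting under `template.spec.containers` and the use of `args` follow the standard Deployment schema and are assumed to match the chart's layout.

```yaml
# Abridged sketch of the rendered Deployment with enableLeaderElection: true.
# Fields not affected by this commit are omitted.
spec:
  replicas: 3              # would be 1 when enableLeaderElection is false
  strategy:
    type: Recreate         # set unconditionally, with or without leader election
  template:
    spec:
      containers:
        - args:
            - --ha-enable-leader-election
          env:
            # Always injected; user-supplied inferenceExtension.env entries follow.
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
```

Because the `env` key is now always emitted, entries from `inferenceExtension.env` are appended after `NAMESPACE` instead of creating the list themselves.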

config/charts/inferencepool/templates/gke.yaml

Lines changed: 4 additions & 0 deletions
@@ -13,6 +13,10 @@ spec:
     kind: InferencePool
     name: {{ .Release.Name }}
   default:
+    # Set a more aggressive health check than the default 5s for faster switch
+    # over during EPP rollout.
+    timeoutSec: 2
+    checkIntervalSec: 2
     config:
       type: HTTP
       httpHealthCheck:
