Commit fe516d7

nirrozenbaum authored and kfswain committed
use replicas field in helm to decide if EPP should run in HA mode (#1628)
Signed-off-by: Nir Rozenbaum <[email protected]>
1 parent 148af93 commit fe516d7

3 files changed, 10 additions and 14 deletions

config/charts/inferencepool/README.md

Lines changed: 5 additions & 5 deletions
@@ -81,16 +81,16 @@ $ helm install triton-llama3-8b-instruct \

 ### Install with High Availability (HA)

-To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, you can enable leader election. When enabled, the EPP deployment will have multiple replicas, but only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
+To deploy the EndpointPicker in a high-availability (HA) active-passive configuration set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.

-To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.
+To enable HA, set `inferenceExtension.replicas` to a number greater than 1.

 * Via `--set` flag:

 ```txt
 helm install vllm-llama3-8b-instruct \
   --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
-  --set inferenceExtension.enableLeaderElection=true \
+  --set inferenceExtension.replicas=3 \
   --set provider=[none|gke] \
   oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
 ```
@@ -99,7 +99,7 @@ To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.

 ```yaml
 inferenceExtension:
-  enableLeaderElection: true
+  replicas: 3
 ```

 Then apply it with:
@@ -152,7 +152,7 @@ The following table list the configurable parameters of the chart.
 | `inferencePool.targetPortNumber` | Target port number for the vllm backends, will be used to scrape metrics by the inference extension. Defaults to 8000. |
 | `inferencePool.modelServerType` | Type of the model servers in the pool, valid options are [vllm, triton-tensorrt-llm], default is vllm. |
 | `inferencePool.modelServers.matchLabels` | Label selector to match vllm backends managed by the inference pool. |
-| `inferenceExtension.replicas` | Number of replicas for the endpoint picker extension service. Defaults to `1`. |
+| `inferenceExtension.replicas` | Number of replicas for the endpoint picker extension service. If More than one replica is used, EPP will run in HA active-passive mode. Defaults to `1`. |
 | `inferenceExtension.image.name` | Name of the container image used for the endpoint picker. |
 | `inferenceExtension.image.hub` | Registry URL where the endpoint picker image is hosted. |
 | `inferenceExtension.image.tag` | Image tag of the endpoint picker. |

config/charts/inferencepool/templates/epp-deployment.yaml

Lines changed: 4 additions & 8 deletions
@@ -6,11 +6,7 @@ metadata:
 labels:
 {{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
 spec:
-{{- if .Values.inferenceExtension.enableLeaderElection }}
-replicas: 3
-{{- else }}
-replicas: 1
-{{- end }}
+replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
 strategy:
 # The current recommended EPP deployment pattern is to have a single active replica. This ensures
 # optimal performance of the stateful operations such prefix cache aware scorer.
@@ -55,7 +51,7 @@ spec:
 - --lora-info-metric
 - "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
 {{- end }}
-{{- if .Values.inferenceExtension.enableLeaderElection }}
+{{- if gt .Values.inferenceExtension.replicas 1 }}
 - --ha-enable-leader-election
 {{- end }}
 # Pass additional flags via the inferenceExtension.flags field in values.yaml.
@@ -74,7 +70,7 @@ spec:
 {{- toYaml . | nindent 8 }}
 {{- end }}
 livenessProbe:
-{{- if .Values.inferenceExtension.enableLeaderElection }}
+{{- if gt .Values.inferenceExtension.replicas 1 }}
 grpc:
 port: 9003
 service: liveness
@@ -86,7 +82,7 @@ spec:
 initialDelaySeconds: 5
 periodSeconds: 10
 readinessProbe:
-{{- if .Values.inferenceExtension.enableLeaderElection }}
+{{- if gt .Values.inferenceExtension.replicas 1 }}
 grpc:
 port: 9003
 service: readiness
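
As a rough illustration of the template change above, setting `inferenceExtension.replicas: 3` would make the conditionals in this diff render to something like the abridged fragment below. Only fields visible in the hunks are included. The nesting under `spec.template.spec.containers` is assumed from the standard `apps/v1` Deployment layout because the scraped diff lost its indentation, so treat this as a sketch and not the chart's exact output.

```yaml
# Abridged sketch of the rendered Deployment for replicas: 3.
# Indentation and nesting are assumed; fields not shown in the diff are omitted.
spec:
  replicas: 3
  template:
    spec:
      containers:
      - args:
        # ... other flags omitted ...
        - --ha-enable-leader-election   # rendered only when replicas > 1
        livenessProbe:
          grpc:                         # gRPC probe branch, taken when replicas > 1
            port: 9003
            service: liveness
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          grpc:
            port: 9003
            service: readiness
```

With the default of one replica, the `--ha-enable-leader-election` flag and the gRPC probe branches are skipped; the alternative probe branch is not part of this diff.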

config/charts/inferencepool/templates/leader-election-rbac.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-{{- if .Values.inferenceExtension.enableLeaderElection }}
+{{- if gt .Values.inferenceExtension.replicas 1 }}
 ---
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
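
The `gt .Values.inferenceExtension.replicas 1` check now appears in both `epp-deployment.yaml` and `leader-election-rbac.yaml`. One possible way to keep the two in sync, sketched here as an assumption and not part of this commit, is a named helper. The helper name `inferencepool.haEnabled` is hypothetical. The sketch applies `default 1` and `int` before comparing, since the value may be unset and Go's template `gt` rejects operands of mixed integer and float types, which numeric values can be depending on how they are supplied.

```yaml
{{/* Hypothetical helper, not part of this commit: renders "true" when more than one replica is requested. */}}
{{- define "inferencepool.haEnabled" -}}
{{- if gt (int (.Values.inferenceExtension.replicas | default 1)) 1 -}}
true
{{- end -}}
{{- end -}}

{{/* Possible usage in a template: an empty string from include is falsy. */}}
{{- if include "inferencepool.haEnabled" . }}
- --ha-enable-leader-election
{{- end }}
```

Note that `default` treats `0` as empty, so under this sketch a value of `0` is handled as `1`, matching the `| default 1` already used for the `replicas` field in this commit.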
