use replicas field in helm to decide if EPP should run in HA mode (#1628)

nirrozenbaum · web-flow · commit 6e6823af0c31 · 2025-09-22T12:42:14.000-07:00
Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
diff --git a/config/charts/inferencepool/README.md b/config/charts/inferencepool/README.md
@@ -101,16 +101,16 @@ $ helm install triton-llama3-8b-instruct \
 
 ### Install with High Availability (HA)
 
-To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, you can enable leader election. When enabled, the EPP deployment will have multiple replicas, but only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
+To deploy the EndpointPicker in a high-availability (HA) active-passive configuration set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
 
-To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.
+To enable HA, set `inferenceExtension.replicas` to a number greater than 1.
 
 * Via `--set` flag:
 
   ```txt
   helm install vllm-llama3-8b-instruct \
   --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
-  --set inferenceExtension.enableLeaderElection=true \
+  --set inferenceExtension.replicas=3 \
   --set provider=[none|gke] \
   oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
   ```
@@ -119,7 +119,7 @@ To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.
 
   ```yaml
   inferenceExtension:
-    enableLeaderElection: true
+    replicas: 3
   ```
 
   Then apply it with:
@@ -172,7 +172,7 @@ The following table list the configurable parameters of the chart.
 | `inferencePool.targetPortNumber`            | Target port number for the vllm backends, will be used to scrape metrics by the inference extension. Defaults to 8000. |
 | `inferencePool.modelServerType`            | Type of the model servers in the pool, valid options are [vllm, triton-tensorrt-llm], default is vllm. |
 | `inferencePool.modelServers.matchLabels`    | Label selector to match vllm backends managed by the inference pool.                                                   |
-| `inferenceExtension.replicas`               | Number of replicas for the endpoint picker extension service. Defaults to `1`.                                         |
+| `inferenceExtension.replicas`               | Number of replicas for the endpoint picker extension service. If More than one replica is used, EPP will run in HA active-passive mode. Defaults to `1`.                                         |
 | `inferenceExtension.image.name`             | Name of the container image used for the endpoint picker.                                                              |
 | `inferenceExtension.image.hub`              | Registry URL where the endpoint picker image is hosted.                                                                |
 | `inferenceExtension.image.tag`              | Image tag of the endpoint picker.                                                                                      |
diff --git a/config/charts/inferencepool/templates/epp-deployment.yaml b/config/charts/inferencepool/templates/epp-deployment.yaml
@@ -6,11 +6,7 @@ metadata:
   labels:
     {{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
 spec:
-  {{- if .Values.inferenceExtension.enableLeaderElection }}
-  replicas: 3
-  {{- else }}
-  replicas: 1
-  {{- end }}
+  replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
   strategy:
     # The current recommended EPP deployment pattern is to have a single active replica. This ensures 
     # optimal performance of the stateful operations such prefix cache aware scorer.
@@ -53,7 +49,7 @@ spec:
         - --lora-info-metric
         - "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
         {{- end }}
-        {{- if .Values.inferenceExtension.enableLeaderElection }}
+        {{- if gt .Values.inferenceExtension.replicas 1 }}
         - --ha-enable-leader-election
         {{- end }}
         # Pass additional flags via the inferenceExtension.flags field in values.yaml.
@@ -72,7 +68,7 @@ spec:
         {{- toYaml .Values.inferenceExtension.extraContainerPorts | nindent 8 }}
         {{- end }}
         livenessProbe:
-          {{- if .Values.inferenceExtension.enableLeaderElection }}
+          {{- if gt .Values.inferenceExtension.replicas 1 }}
           grpc:
             port: 9003
             service: liveness
@@ -84,7 +80,7 @@ spec:
           initialDelaySeconds: 5
           periodSeconds: 10
         readinessProbe:
-          {{- if .Values.inferenceExtension.enableLeaderElection }}
+          {{- if gt .Values.inferenceExtension.replicas 1 }}
           grpc:
             port: 9003
             service: readiness
diff --git a/config/charts/inferencepool/templates/leader-election-rbac.yaml b/config/charts/inferencepool/templates/leader-election-rbac.yaml
@@ -1,4 +1,4 @@
-{{- if .Values.inferenceExtension.enableLeaderElection }}
+{{- if gt .Values.inferenceExtension.replicas 1 }}
 ---
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
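
The `gt .Values.inferenceExtension.replicas 1` gate introduced in this change can be tried outside Helm with Go's standard `text/template` package. The following is a minimal sketch, not the chart's rendering path: `renderFlags` and `flagTemplate` are hypothetical names, the template is a one-line reduction of the conditional in `epp-deployment.yaml`, and the value is passed as a Go `int` (an assumption, since `text/template` rejects `gt` comparisons between an integer and a float).

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// flagTemplate is a reduced, hypothetical version of the conditional used in
// the diff: the leader-election flag is emitted only when replicas > 1.
const flagTemplate = `{{- if gt .Values.inferenceExtension.replicas 1 }}--ha-enable-leader-election{{- end }}`

// renderFlags renders flagTemplate for the given replica count.
// It is an illustrative helper, not part of the chart or of Helm.
func renderFlags(replicas int) (string, error) {
	tmpl, err := template.New("epp").Parse(flagTemplate)
	if err != nil {
		return "", err
	}
	// Assumption: replicas is supplied as an int so that it compares
	// cleanly against the integer literal 1 in the template.
	data := map[string]any{
		"Values": map[string]any{
			"inferenceExtension": map[string]any{"replicas": replicas},
		},
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	for _, n := range []int{1, 3} {
		flags, err := renderFlags(n)
		if err != nil {
			fmt.Println("render error:", err)
			return
		}
		fmt.Printf("replicas=%d flags=%q\n", n, flags)
	}
}
```

With a single replica the sketch renders an empty string; with three replicas it renders the `--ha-enable-leader-election` flag, matching the behavior described in the README hunk.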
