Merged
10 changes: 5 additions & 5 deletions config/charts/inferencepool/README.md
@@ -101,16 +101,16 @@ $ helm install triton-llama3-8b-instruct \

### Install with High Availability (HA)

- To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, you can enable leader election. When enabled, the EPP deployment will have multiple replicas, but only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.
+ To deploy the EndpointPicker in a high-availability (HA) active-passive configuration set replicas to be greater than one. In such a setup, only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.

- To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.
+ To enable HA, set `inferenceExtension.replicas` to a number greater than 1.

* Via `--set` flag:

```txt
helm install vllm-llama3-8b-instruct \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
- --set inferenceExtension.enableLeaderElection=true \
+ --set inferenceExtension.replicas=3 \
--set provider=[none|gke] \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
```
@@ -119,7 +119,7 @@ To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.

```yaml
inferenceExtension:
- enableLeaderElection: true
+ replicas: 3
```

Then apply it with:
@@ -172,7 +172,7 @@ The following table list the configurable parameters of the chart.
| `inferencePool.targetPortNumber` | Target port number for the vllm backends, will be used to scrape metrics by the inference extension. Defaults to 8000. |
| `inferencePool.modelServerType` | Type of the model servers in the pool, valid options are [vllm, triton-tensorrt-llm], default is vllm. |
| `inferencePool.modelServers.matchLabels` | Label selector to match vllm backends managed by the inference pool. |
- | `inferenceExtension.replicas` | Number of replicas for the endpoint picker extension service. Defaults to `1`. |
+ | `inferenceExtension.replicas` | Number of replicas for the endpoint picker extension service. If More than one replica is used, EPP will run in HA active-passive mode. Defaults to `1`. |
| `inferenceExtension.image.name` | Name of the container image used for the endpoint picker. |
| `inferenceExtension.image.hub` | Registry URL where the endpoint picker image is hosted. |
| `inferenceExtension.image.tag` | Image tag of the endpoint picker. |
12 changes: 4 additions & 8 deletions config/charts/inferencepool/templates/epp-deployment.yaml
@@ -6,11 +6,7 @@ metadata:
labels:
{{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
spec:
- {{- if .Values.inferenceExtension.enableLeaderElection }}
- replicas: 3
- {{- else }}
- replicas: 1
- {{- end }}
+ replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
strategy:
# The current recommended EPP deployment pattern is to have a single active replica. This ensures
# optimal performance of the stateful operations such prefix cache aware scorer.
@@ -53,7 +49,7 @@ spec:
- --lora-info-metric
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
{{- end }}
- {{- if .Values.inferenceExtension.enableLeaderElection }}
+ {{- if gt .Values.inferenceExtension.replicas 1 }}
- --ha-enable-leader-election
{{- end }}
# Pass additional flags via the inferenceExtension.flags field in values.yaml.
@@ -72,7 +68,7 @@ spec:
{{- toYaml .Values.inferenceExtension.extraContainerPorts | nindent 8 }}
{{- end }}
livenessProbe:
- {{- if .Values.inferenceExtension.enableLeaderElection }}
+ {{- if gt .Values.inferenceExtension.replicas 1 }}
grpc:
port: 9003
service: liveness
@@ -84,7 +80,7 @@ spec:
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
- {{- if .Values.inferenceExtension.enableLeaderElection }}
+ {{- if gt .Values.inferenceExtension.replicas 1 }}
grpc:
port: 9003
service: readiness
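
Taken together, the template changes in this file suggest that a release with `inferenceExtension.replicas: 3` renders roughly the Deployment fragment below. This is a hand-written sketch, not real `helm template` output: the container name `epp` is a placeholder, the nesting under `args:` and the indentation are assumed, and every field not shown in the diff is elided.

```yaml
# Sketch only: approximate rendering for inferenceExtension.replicas=3.
spec:
  replicas: 3                          # from `.Values.inferenceExtension.replicas | default 1`
  template:
    spec:
      containers:
      - name: epp                      # placeholder name, not taken from the chart
        args:
        - --ha-enable-leader-election  # emitted only when replicas > 1
        livenessProbe:
          grpc:                        # guarded by the same replicas > 1 condition
            port: 9003
            service: liveness
        readinessProbe:
          grpc:
            port: 9003
            service: readiness
```

With the default of `1`, the `--ha-enable-leader-election` flag and the gRPC probe settings guarded by the same condition are not rendered.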
@@ -1,4 +1,4 @@
- {{- if .Values.inferenceExtension.enableLeaderElection }}
+ {{- if gt .Values.inferenceExtension.replicas 1 }}
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
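
A possible caveat for the `gt .Values.inferenceExtension.replicas 1` guards in this PR: Go's `text/template` comparison functions refuse to compare an integer with a float, and Helm may hand numbers from a values file to the template as `float64`, while `--set` values arrive as integers. If that applies to this chart, coercing the value first with Sprig's `int` function (available in Helm templates) would make the guard independent of how the value was supplied. A sketch of that variant, which is not part of this PR's diff:

```yaml
# Sketch only: the same guard with an explicit integer coercion.
# `int` is a Sprig function; an unset value should coerce to 0,
# so the branch is skipped when replicas is missing.
{{- if gt (int .Values.inferenceExtension.replicas) 1 }}
- --ha-enable-leader-election
{{- end }}
```

The same coercion would apply to the probe and RBAC guards, since they use the identical condition.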