EPP HA deployment #692

**What would you like to be added**:

Our current user guide sets 1 replica for EPP: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/92431f582ca4f8c6d75781e303acd8c84492dbea/config/manifests/inferencepool-resources.yaml#L39

We should provide the option to run EPP in HA mode. I suggest the following priorities:

- [x] Short-term: Active-passive mode. This is a good starting point: in some benchmarks we saw that EPP can handle >500 QPS and 10s (potentially 100s) of model servers without a clear bottleneck, so a single leader EPP is sufficient for many use cases. Currently we run EPP without leader election ([code](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/92431f582ca4f8c6d75781e303acd8c84492dbea/pkg/epp/server/runserver.go#L124)). We will need to modify EPP to support leader election and run some tests to measure the downtime when the leader dies.

- [ ] Mid- to long-term: Support active-active EPPs, with options to shard the InferenceModels and the InferencePool.
With multiple active EPP replicas, we expect decreased performance on stateful operations such as queuing and prefix cache aware routing. This will require some experiments to quantify the impact. 
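For the short-term active-passive item, the downtime after a leader failure is worth estimating before running real tests. In a real deployment this would presumably be delegated to Kubernetes Lease-based leader election (controller-runtime's manager exposes `LeaderElection` and `LeaderElectionID` options for this). The sketch below is only an in-memory simulation of lease semantics, not EPP code: the `Lease` type, the replica names `epp-0`/`epp-1`, and `FailoverDowntime` are all hypothetical and exist for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Lease is a hypothetical in-memory stand-in for a Kubernetes Lease object.
type Lease struct {
	Holder    string
	RenewedAt time.Time
	Duration  time.Duration
}

// TryAcquireOrRenew reports whether candidate holds the lease after the call.
// A candidate succeeds if it already holds the lease, or if the lease is
// unheld or has not been renewed for at least Duration.
func (l *Lease) TryAcquireOrRenew(candidate string, now time.Time) bool {
	expired := l.Holder == "" || now.Sub(l.RenewedAt) >= l.Duration
	if l.Holder == candidate || expired {
		l.Holder = candidate
		l.RenewedAt = now
		return true
	}
	return false
}

// FailoverDowntime simulates a leader (epp-0) that renews every retryPeriod
// and crashes at crashAfter, plus a standby (epp-1) that retries every
// retryPeriod. It returns the time between the crash and the moment the
// standby takes over. retryPeriod must be > 0.
func FailoverDowntime(leaseDuration, retryPeriod, crashAfter time.Duration) time.Duration {
	start := time.Unix(0, 0) // simulated clock, no real sleeping
	lease := &Lease{Duration: leaseDuration}
	for elapsed := time.Duration(0); ; elapsed += retryPeriod {
		now := start.Add(elapsed)
		if elapsed < crashAfter {
			lease.TryAcquireOrRenew("epp-0", now) // leader is still alive
		}
		if lease.TryAcquireOrRenew("epp-1", now) {
			return elapsed - crashAfter
		}
	}
}

func main() {
	d := FailoverDowntime(15*time.Second, 2*time.Second, 5*time.Second)
	fmt.Println("simulated downtime:", d)
}
```

In this model the downtime is bounded by roughly the lease duration plus one retry period, so the lease settings trade failover speed against the risk of spurious leader changes. The real figure still needs to be measured, as noted above.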

If there are too many InferenceModels, or if the pool becomes very large, we may need to shard them further.
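One possible way to shard InferenceModels across active EPP replicas is to hash the model name onto a replica index. The sketch below assumes a fixed replica count and uses FNV-1a modulo the number of shards. `ShardFor`, `AssignModels`, and the model names are hypothetical and not part of the current EPP code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ShardFor maps a model name to a shard index in [0, numShards) using
// FNV-1a modulo numShards. numShards must be > 0.
func ShardFor(modelName string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(modelName)) // hash.Hash writes never return an error
	return int(h.Sum32() % uint32(numShards))
}

// AssignModels groups model names by the shard (EPP replica) that owns them.
func AssignModels(models []string, numShards int) map[int][]string {
	out := make(map[int][]string, numShards)
	for _, m := range models {
		s := ShardFor(m, numShards)
		out[s] = append(out[s], m)
	}
	return out
}

func main() {
	// Hypothetical model names, for illustration only.
	models := []string{"model-a", "model-b", "model-c", "model-d"}
	assignment := AssignModels(models, 3)
	for shard := 0; shard < 3; shard++ {
		fmt.Printf("epp shard %d: %v\n", shard, assignment[shard])
	}
}
```

Plain modulo hashing remaps most models whenever the replica count changes. If EPP replicas are expected to scale up and down, consistent or rendezvous hashing would likely be a better fit, at the cost of some extra complexity.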

**Why is this needed**:
