Description
What would you like to be added:
Our current user guide sets 1 replica for EPP:
replicas: 1
We should provide the option to run EPP in HA mode. I suggest the following priorities:
- Short-term: Active-passive mode. This is a good starting point: in some benchmarks EPP handled >500 QPS and tens (potentially hundreds) of model servers without a clear bottleneck, so a single leader EPP is sufficient for many use cases. EPP currently runs without leader election code, so we will need to modify it to support leader election and run tests to measure the downtime when the leader dies.
- Mid- to long-term: Support active-active EPPs, with options to shard the InferenceModels and InferencePool.
  With multiple active EPP replicas, we expect decreased performance on stateful operations such as queuing and prefix-cache-aware routing. This will require some experiments to quantify the impact.
  If there are too many inference models, or if the pool becomes very large, we may need to further shard them.
Why is this needed: