EPP HA deployment #692

@liu-cong

What would you like to be added:

Our current user guide sets 1 replica for EPP:

We should provide the option to run EPP in HA mode. I suggest the following priorities:

  • Short-term: Active-passive mode. This is a good starting point: in some benchmarks we saw that EPP can handle >500 QPS and 10s (potentially 100s) of model servers without a clear bottleneck, so a single leader EPP is sufficient for many use cases. EPP currently runs without leader election code, so we will need to modify it to support leader election and run some tests to measure the downtime when the leader dies.

  • Mid- to long-term: Support active-active EPPs, with options to shard the InferenceModels and the InferencePool.
    With multiple active EPP replicas, we expect decreased performance on stateful operations such as queuing and prefix-cache-aware routing. Experiments will be needed to quantify the impact.

If there are too many inference models, or if the pool becomes very large, we may need to shard them further.
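One possible way to shard models across active EPP replicas is a deterministic hash of the model name. The sketch below is only an illustration under that assumption: `shardFor`, the model names, and the replica count of 3 are hypothetical and do not reflect an existing EPP API or a decided design.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor is a hypothetical helper that maps a model name to one of
// `replicas` EPP instances. replicas must be greater than zero.
func shardFor(modelName string, replicas int) int {
	h := fnv.New32a()
	h.Write([]byte(modelName)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(replicas))
}

func main() {
	// Every replica computes the same assignment without coordination.
	for _, m := range []string{"model-a", "model-b", "model-c"} {
		fmt.Printf("%s -> epp-%d\n", m, shardFor(m, 3))
	}
}
```

Plain modulo hashing reassigns most models whenever the replica count changes, so a scheme such as consistent hashing may be a better fit if replicas are expected to scale up and down.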

Why is this needed:

Labels

triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions