Description
I see quite a bit of work being done in 2.9 and 2.10 to improve Pipeline stability, and my team is looking to use Pipelines in a production environment, so we are running some tests. In our EKS clusters, we have a Karpenter group dedicated to Seldon. To keep nodes fresh, Karpenter drains nodes every 6 days. I therefore want to simulate what happens (on 2.10 with an MSK cluster) when a node gets drained. To do this, I:
- Started a Locust test with 2 workers to invoke the Pipeline.
- While the test was running, rolled one of the seldon-dataflow-engine pods and observed the number of failed invocations.
I saw the Pipeline object's PipelineReady status switch to False. After about 1 minute, the Pipeline objects recovered. During that time, some requests succeeded, but most timed out (the request timeout was set to 5 s).
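For anyone trying to reproduce this, here is a minimal sketch of how the failure window could be measured from load-test results. It assumes each request is recorded as a timestamp plus a success flag; `RequestResult` and `outage_window` are hypothetical names for illustration and are not part of Locust or Seldon.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RequestResult:
    timestamp: float  # seconds since the start of the test
    ok: bool          # False for a timeout or an error response


def outage_window(results: List[RequestResult]) -> Optional[Tuple[float, float, float]]:
    """Return (first_failure, last_failure, failure_rate_in_window), or None if nothing failed."""
    failures = [r.timestamp for r in results if not r.ok]
    if not failures:
        return None
    start, end = min(failures), max(failures)
    # Count every request between the first and last failure, since some
    # requests can still succeed while the Pipeline is recovering.
    in_window = [r for r in results if start <= r.timestamp <= end]
    failed = sum(1 for r in in_window if not r.ok)
    return start, end, failed / len(in_window)


if __name__ == "__main__":
    # Made-up sample data, only to show the call shape.
    sample = [
        RequestResult(0.0, True),
        RequestResult(10.0, False),
        RequestResult(20.0, True),
        RequestResult(30.0, False),
        RequestResult(40.0, True),
    ]
    print(outage_window(sample))
```

Reporting the window together with the failure rate inside it makes runs with different settings easier to compare than a raw failure count.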
Since all our pods have to be rolled every 6 days, this introduces at least a minute of model downtime every 6 days. Are there any configuration changes I can make to Seldon to reduce that 1-minute window?