Pipeline Invocations Sporadically Time Out When a Dataflow Engine Pod Gets Rolled #6870

I see quite a bit of work has been done in 2.9 and 2.10 to improve Pipeline stability, and my team is looking to use Pipelines in a production environment, so we are running some tests. In our EKS clusters, we have a Karpenter group dedicated to Seldon. To keep nodes fresh, Karpenter drains nodes every 6 days. I therefore want to simulate what happens (on 2.10, with an MSK cluster) when a node gets drained. To do this, I:

1. Started a Locust test with 2 workers to invoke the Pipeline.
2. While the test was running, rolled one of the seldon-dataflow-engine pods and observed the number of failed invocations.
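For anyone who wants to reproduce this without Locust, a rough stdlib stand-in for step 1 might look like the sketch below. It is not the Locust test I ran. The URL, pipeline name, payload, and the `Seldon-Model` header value are placeholder assumptions and should be checked against your own deployment; only the 5 s timeout comes from the test above.

```python
import http.client
import json
import time
import urllib.request
from typing import Callable, List, Tuple

# Hypothetical values: replace with your own ingress address and pipeline name.
URL = "http://localhost:9000/v2/models/my-pipeline/infer"
HEADERS = {
    "Content-Type": "application/json",
    "Seldon-Model": "my-pipeline.pipeline",  # assumed routing header, verify for your setup
}
PAYLOAD = {"inputs": [{"name": "x", "datatype": "FP32", "shape": [1], "data": [1.0]}]}
TIMEOUT_S = 5.0  # same request timeout as in the test described above


def send_once() -> bool:
    """Send one inference request; True on HTTP 2xx, False on any error or timeout."""
    req = urllib.request.Request(
        URL, data=json.dumps(PAYLOAD).encode(), headers=HEADERS, method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=TIMEOUT_S) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def probe(
    send: Callable[[], bool], count: int, interval_s: float
) -> List[Tuple[float, bool]]:
    """Call `send` `count` times and record (monotonic start time, succeeded) pairs."""
    results = []
    for _ in range(count):
        started = time.monotonic()
        results.append((started, send()))
        time.sleep(interval_s)
    return results


if __name__ == "__main__":
    # Roll a seldon-dataflow-engine pod while this loop is running.
    for ts, ok in probe(send_once, count=600, interval_s=0.2):
        print(f"{ts:.2f} {'ok' if ok else 'FAIL'}")
```

The sender is passed in as a function so the loop can be tried against a fake before pointing it at a real cluster. Unlike Locust with 2 workers, this sends requests sequentially, so the failure counts will not be directly comparable.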

I saw the Pipeline object's `PipelineReady` status condition switch to `False`. After about 1 minute, the Pipeline objects recovered. During that time, there were some successful requests, but most requests timed out (request timeout set to 5 s).
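To put a number on that window rather than eyeballing it, one option is to compute the span between the first and last failed request from the load-test results. The sketch below assumes the results are available as `(timestamp_seconds, succeeded)` pairs; the function name and the sample values are illustrative only and are not measurements from my test.

```python
from typing import List, Optional, Tuple


def outage_window(
    results: List[Tuple[float, bool]]
) -> Optional[Tuple[float, float, int]]:
    """Return (first_failure_ts, last_failure_ts, failed_count), or None if nothing failed."""
    failures = [ts for ts, ok in results if not ok]
    if not failures:
        return None
    return min(failures), max(failures), len(failures)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = [(0.0, True), (1.0, False), (2.0, True), (3.0, False), (4.0, True)]
    window = outage_window(sample)
    if window is not None:
        first, last, failed = window
        print(f"{failed} failures spanning {last - first:.1f} s")
```

Because some requests still succeed while the Pipeline is not ready, the span between the first and last failure is a more useful figure than the raw failure count.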

Since all our pods have to be rolled every 6 days, this introduces at least a minute of model downtime every 6 days. Are there any configuration changes I can make to Seldon to reduce that 1-minute window?
