
Pipeline Invocations Sporadically Time Out When a Dataflow Engine Pod Gets Rolled #6870

@charleschangdp

I see that quite a bit of work has been done in 2.9 and 2.10 to improve Pipeline stability. My team is looking to use Pipelines in a production environment, so we are conducting some tests. In our EKS clusters, we have a Karpenter group dedicated to Seldon. To keep nodes fresh, Karpenter drains nodes every 6 days. I therefore want to simulate what happens (on 2.10 with an MSK cluster) when a node gets drained. To do this, I:

  1. Started a Locust test with 2 workers to invoke the Pipeline.
  2. While the test was running, rolled one of the seldon-dataflow-engine pods and observed the number of failed invocations.

I saw the Pipeline object's PipelineReady status switch to False. After about 1 minute, the Pipeline object recovered. During that time some requests succeeded, but most timed out (the request timeout was set to 5 s).
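To put a number on that window, a minimal sketch like the one below could be run over the per-request results of the load test. It assumes each request is recorded as a timestamp (seconds since test start) plus a success flag. `RequestResult`, `failure_window`, and `failure_rate_in_window` are hypothetical names used for illustration, not part of Locust or Seldon, and the sample data is made up rather than taken from the test above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RequestResult:
    timestamp: float  # seconds since the start of the load test
    ok: bool          # False if the request failed or timed out


def failure_window(results: List[RequestResult]) -> Optional[Tuple[float, float]]:
    """Return (first_failure, last_failure) timestamps, or None if nothing failed."""
    failures = [r.timestamp for r in results if not r.ok]
    if not failures:
        return None
    return min(failures), max(failures)


def failure_rate_in_window(results: List[RequestResult],
                           window: Tuple[float, float]) -> float:
    """Fraction of requests inside the window (inclusive) that failed."""
    start, end = window
    in_window = [r for r in results if start <= r.timestamp <= end]
    failed = sum(1 for r in in_window if not r.ok)
    return failed / len(in_window)


# Hypothetical sample: failures between t=10 s and t=40 s, with one success in between.
sample = [
    RequestResult(0.0, True),
    RequestResult(10.0, False),
    RequestResult(20.0, False),
    RequestResult(30.0, True),
    RequestResult(40.0, False),
    RequestResult(50.0, True),
]

window = failure_window(sample)
if window is not None:
    duration = window[1] - window[0]
    rate = failure_rate_in_window(sample, window)
    print(f"failure window: {duration:.0f} s, failure rate inside it: {rate:.0%}")
```

Reporting both the window length and the failure rate inside it separates "how long the Pipeline was degraded" from "how badly", which matters here because some requests did succeed while PipelineReady was False.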

Since all our pods have to be rolled every 6 days, this introduces at least a minute of model downtime every 6 days. Are there any configuration changes I can make to Seldon to reduce that 1-minute window?
