
User pod launching timeout with large single user image when cluster scaling up #3010

@xcompass

Bug description

While the cluster is scaling up, user pods can be scheduled onto a newly provisioned node as soon as it is ready in k8s. If the continuous image puller is still pulling the single-user image at that point, because the image is large (a few GB) or the network is slow, the user gets a timeout error in the UI when the pull does not finish within the configured timeout. Increasing the timeout might be a quick fix, but it is not a good user experience, as users may have to wait a long time to get access (for us, it can take 10 minutes).

Expected behaviour

The user pods should not be scheduled to the new node before image pulling is done.
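One way to express "image pulling is done" for a node is to look at the image list the kubelet reports in the Node object's `status.images`. The following is a minimal sketch under that assumption, not an existing JupyterHub or Kubernetes helper: `node_has_image` and the image name are hypothetical, and the input is a plain list of dicts shaped like `status.images`. Note that the kubelet reports only a limited number of images per node, so a missing entry does not prove that the image is absent.

```python
from typing import Any, Dict, List


def node_has_image(node_images: List[Dict[str, Any]], image: str) -> bool:
    """Return True if `image` appears among the names the node reports.

    `node_images` mirrors the Node `status.images` field: a list of entries,
    each with a `names` list (tag and digest forms) and a `sizeBytes` value.
    """
    return any(image in (entry.get("names") or []) for entry in node_images)


# Hypothetical data for illustration only.
reported = [
    {
        "names": [
            "registry.example.org/singleuser:1.0",
            "registry.example.org/singleuser@sha256:abc",
        ],
        "sizeBytes": 4_000_000_000,
    }
]
ready_for_users = node_has_image(reported, "registry.example.org/singleuser:1.0")
```

A check like this could gate the removal of a scheduling restriction on the node, so that user pods only land there once the single-user image is present.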

Actual behaviour

See Bug description

How to reproduce

  1. Create a large single-user image, or use a slow registry/network
  2. Trigger a scale-up event
  3. Spawn a few user pods; some of them will be scheduled to the new node as soon as it is ready
  4. Those users will see a timeout in the UI

Our current workaround

Add a taint before the image puller starts, or when the node is provisioned, to prevent user pods from being scheduled to the node. Then remove the taint after the image puller is done.

We have the code ready to send as a PR, but we want to see if there is a better way to solve it. Happy to send the PR anytime.
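To make the taint workaround concrete, here is a minimal illustrative sketch of the add/remove logic. It is not the code mentioned above. It only manipulates a list of taint dicts shaped like a Node's `spec.taints` and builds a patch body; the taint key `example.org/image-pull-pending` and the helper names are hypothetical. Sending the patch to the cluster (for example with `kubectl patch node` or a Kubernetes client) is left out, and the body is assumed to replace the node's whole taint list.

```python
from typing import Any, Dict, List, Optional

# Hypothetical taint key, chosen only for this illustration.
PULL_TAINT_KEY = "example.org/image-pull-pending"

Taint = Dict[str, str]


def with_taint(
    taints: List[Taint],
    key: str,
    effect: str = "NoSchedule",
    value: Optional[str] = None,
) -> List[Taint]:
    """Return a copy of `taints` with the (key, effect) taint present exactly once."""
    kept = [t for t in taints if not (t["key"] == key and t["effect"] == effect)]
    taint: Taint = {"key": key, "effect": effect}
    if value is not None:
        taint["value"] = value
    return kept + [taint]


def without_taint(taints: List[Taint], key: str) -> List[Taint]:
    """Return a copy of `taints` with every taint using `key` removed."""
    return [t for t in taints if t["key"] != key]


def taint_patch(taints: List[Taint]) -> Dict[str, Any]:
    """Build a patch body that sets the node's taint list to `taints`."""
    return {"spec": {"taints": taints}}


# Hypothetical existing taint on a freshly provisioned node.
existing = [{"key": "example.org/dedicated", "value": "user", "effect": "NoSchedule"}]

# Before the image puller starts: block user pods.
block_body = taint_patch(with_taint(existing, PULL_TAINT_KEY))

# After the image puller is done: allow user pods again.
unblock_body = taint_patch(without_taint(block_body["spec"]["taints"], PULL_TAINT_KEY))
```

For this to work, the image puller pod itself needs a toleration for the taint, otherwise it would be kept off the node as well. Taints that were already on the node are preserved in both directions.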
