Bug description
During a cluster scale-up, once a new node is provisioned and ready in k8s, user pods may be scheduled to it. However, if the continuous image puller is still pulling the single-user image, because the image is large (a few GB) or the network is slow, the user gets a timeout error in the UI if the pull does not complete within the configured timeout. Increasing the timeout might be a quick fix, but it is not a good user experience, as users may have to wait a long time to get access (for us, it can take 10 minutes).
Expected behaviour
The user pods should not be scheduled to the new node before image pulling is done.
Actual behaviour
See Bug description
How to reproduce
- Create a large single-user image, or use a slow registry or network
- Trigger a scale-up event
- Spawn a few user pods; some of them will be scheduled to the new node as soon as it is ready
- Those users will see a timeout in the UI
Our current workaround
Add a taint before starting the image puller, or when provisioning the node, to prevent user pods from being scheduled to the node. Then remove the taint after the image puller is done.
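A minimal sketch of the taint bookkeeping this workaround relies on, not the actual code from our PR. The taint key `example.com/image-pulling` and the helper names are hypothetical and chosen only for illustration. The helpers work on plain lists of taint dicts shaped like a Node's `spec.taints`, so the add and remove steps can be checked without a cluster. Note that the image puller pods would themselves need a toleration for this taint, otherwise they could not be scheduled to the node either.

```python
# Hypothetical taint used to keep user pods off a node while images are pulled.
# The key is an assumption for this sketch; "NoSchedule" is a standard taint effect.
PULLING_TAINT = {
    "key": "example.com/image-pulling",
    "value": "true",
    "effect": "NoSchedule",
}


def _same_taint(a, b):
    """Two taints are treated as the same if key and effect match."""
    return (a["key"], a["effect"]) == (b["key"], b["effect"])


def with_taint(taints, taint=PULLING_TAINT):
    """Return a new taint list that contains `taint` exactly once."""
    kept = [t for t in taints if not _same_taint(t, taint)]
    return kept + [dict(taint)]


def without_taint(taints, taint=PULLING_TAINT):
    """Return a new taint list with `taint` removed, other taints untouched."""
    return [t for t in taints if not _same_taint(t, taint)]


def node_patch(taints):
    """Build a patch body that replaces spec.taints on a Node."""
    return {"spec": {"taints": taints}}
```

With the official Kubernetes Python client, the resulting body could be passed to `CoreV1Api().patch_node(node_name, body)`: once with `with_taint(...)` when the node is provisioned, and once with `without_taint(...)` after the image puller reports that the pull has finished. How "done" is detected is left open here.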
We have the code ready to send as a PR, but we want to see if there is a better way to solve this first. Happy to send the PR anytime.