-
Notifications
You must be signed in to change notification settings - Fork 4.3k
OCI provider: Remove unprovisioned nodes in error state as requested instead of returning an error #8806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
OCI provider: Remove unprovisioned nodes in error state as requested instead of returning an error #8806
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| package common | ||
|
|
||
| const ( | ||
| // InstanceStateUnfulfilled is a status indicating that the pool was unable to fulfill the operation | ||
| InstanceStateUnfulfilled = "Unfulfilled" | ||
| // InstanceIDUnfulfilled is the generic placeholder name for upcoming instances | ||
| InstanceIDUnfulfilled = "instance_placeholder" | ||
| ) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -9,6 +9,7 @@ import ( | |
| "net/http" | ||
| "sync" | ||
|
|
||
| ocicommon "k8s.io/autoscaler/cluster-autoscaler/cloudprovider/oci/common" | ||
| "k8s.io/klog/v2" | ||
|
|
||
| "github.com/pkg/errors" | ||
|
|
@@ -70,16 +71,23 @@ func (c *nodePoolCache) rebuild(staticNodePools map[string]NodePool, maxGetNodep | |
|
|
||
| // removeInstance tries to remove the instance from the node pool. | ||
| func (c *nodePoolCache) removeInstance(nodePoolID, instanceID string, nodeName string) error { | ||
| c.mu.Lock() | ||
| defer c.mu.Unlock() | ||
|
|
||
| if instanceID == "" { | ||
| klog.Errorf("Node %s doesn't have an instance id so it can't be deleted.", nodeName) | ||
| klog.Errorf("This could be due to a Compute Instance issue in OCI such as Out Of Host Capacity error. Check the instance status on OCI Console.") | ||
| return errors.Errorf("Node %s doesn't have an instance id so it can't be deleted.", nodeName) | ||
| } else if ocicommon.InstanceIDUnfulfilled == instanceID { | ||
| // Remove an unprovisioned instance caused by capacity or quota issues so that it does not prevent or delay the | ||
| // autoscaler from attempting to scale a different pool that meets the scheduling requirements. | ||
| size, err := c.getSize(nodePoolID) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| return c.setSize(nodePoolID, size-1) | ||
| } | ||
|
|
||
| klog.Infof("Deleting instance %q from node pool %q", instanceID, nodePoolID) | ||
| c.mu.Lock() | ||
| defer c.mu.Unlock() | ||
|
|
||
| // always try to remove the instance. This call is idempotent | ||
| scaleDown := true | ||
|
|
@@ -142,8 +150,11 @@ func (c *nodePoolCache) getByInstance(instanceID string) (*oke.NodePool, error) | |
|
|
||
| for _, nodePool := range c.cache { | ||
| for _, node := range nodePool.Nodes { | ||
| // Either the IDs match or we're looking for an unfulfilled instance, and we've found an unfulfilled node. | ||
| if *node.Id == instanceID { | ||
| return nodePool, nil | ||
| } else if ocicommon.InstanceIDUnfulfilled == instanceID && *node.Id == "" { | ||
| return nodePool, nil | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible that this line is going to cause us to return the wrong node pool? For instance if a user used But when we do
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I pushed a commit to only match an unfulfilled instance to a node pool if that pool has a node in In short, if a node in a pool can't be created due to an error, we treat that pool as the match to an unfulfilled instance. |
||
| } | ||
| } | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this code change would we ever have a situation where the above if-condition (instanceID == "")is true?
If yes, should we decrement the size of the nodepool there as well?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposed code change is handling a specific scenario where the Cluster Autoscaler has requested that we remove an unregistered node whose underlying compute instance ID cannot be established. Outside of that scenario, I am not sure why a compute ID could not be determined or if whether decrementing the size of the node-pool is the right thing to do.