Status: Open
Labels: lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale)
Description
Lately, we have been seeing continuous failures when rolling out new MachineDeployments (MDs) in Azure environments.
The error is always the machine-controller webhook timing out. It occurs in KubeOne as well as in KKP user clusters.
Some Azure API calls (mostly those listing VM sizes) have become very slow, or we need better filters in our API call.
Here are logs from an MD in a KKP user cluster:
failed to create machine deployment: Internal error occurred: failed calling webhook "machine-controller.kubermatic.io-machinedeployments": failed to call webhook: Post "https://machine-controller-webhook.cluster-XXXXX.svc.cluster.local./machinedeployments?timeout=10s": context deadline exceeded
{
"error": {
"code": 500,
"message": "failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'"
}
}
I have seen that increasing the webhook timeout to 30s improves the situation a bit.
But since a webhook timeout can be at most 30s, we should consider caching the list of VM sizes (SKUs) to speed things up.