azure machine-controller webhook timeout #1857

@dharapvj

Lately, we have been seeing continuous failures when rolling out new MachineDeployments (MDs) in Azure environments.

The error is always that the machine-controller webhook times out. It is seen in KubeOne clusters as well as in KKP user clusters.

Some Azure API (most likely the one that lists VM sizes) has become very slow, or we need better filters in our API call.

Here are the logs from an MD in a KKP user cluster:

failed to create machine deployment: Internal error occurred: failed calling webhook "machine-controller.kubermatic.io-machinedeployments": failed to call webhook: Post "https://machine-controller-webhook.cluster-XXXXX.svc.cluster.local./machinedeployments?timeout=10s": context deadline exceeded
{
  "error": {
    "code": 500,
    "message": "failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'"
  }
}

I have seen that increasing the webhook timeout to 30s improves the situation a bit.

In general, though, since a webhook timeout cannot exceed 30s, we should consider caching the list of VM sizes to speed things up.
