
AKS With Linkerd CNI Fails "Unauthorized" Whenever Daemonset Is Restarted #14643

@ark-bensoer

Description

What is the issue?

Hello, I have consistently been running into issues with the Linkerd CNI where restarting the pods causes them to fail to start up. The only way to resolve it has been to cycle all of the nodes. The issue only appears when the Linkerd CNI pod is restarted on the same node. This has become a problem during upgrades, which fail part way through because these pods get stuck and are unable to boot.

Restarting the pod does not resolve the issue. The only workflow that has worked so far is to drain each node, kill the node in Azure, and let Azure create a new one. A fresh install of the Linkerd CNI on a new node succeeds.
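Concretely, the workaround amounts to the following for each node (a sketch; the resource group, VMSS name, and instance ID below are placeholders for our actual values):

kubectl drain aks-monitor-22419019-vmss000009 --ignore-daemonsets --delete-emptydir-data
# Delete the underlying scale-set instance; AKS brings up a fresh node,
# where a clean linkerd-cni install then succeeds
az vmss delete-instances \
  --resource-group MC_example-rg_example-cluster_eastus \
  --name aks-monitor-22419019-vmss \
  --instance-ids 9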

A nearly identical error was also described in #11478.

In my case, though, I am running the Linkerd CNI with privileged: true (an early attempt to get around the error) and I also have Cilium configured with cni.exclusive: false, yet I am still having the same issues.
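With cni.exclusive: false, the linkerd-cni installer should be appending its plugin entry to Cilium's existing conflist rather than the two fighting over the config directory. One way to confirm the chain is actually in place on a node (a sketch assuming kubectl debug node access; conflist file names under /etc/cni/net.d vary by Cilium version):

kubectl debug node/aks-monitor-22419019-vmss000009 -it --image=busybox -- \
  sh -c 'cat /host/etc/cni/net.d/*.conflist'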

How can it be reproduced?

The problem is also not completely consistent. Some AKS clusters are fine and others are not, and there is no pattern across clusters: some versions cause issues in some clusters but not in others.

The basic setup, though, is done with Terraform:

  1. An AKS Cluster
  2. Cilium set up via Helm and configured as BYOCNI for AKS
resource "helm_release" "cilium" {
  name       = "cilium"
  repository = "https://helm.cilium.io/"
  chart      = "cilium"
  version    = var.cilium_version
  namespace  = "kube-system"

  set {
    name  = "aksbyocni.enabled"
    value = true
  }

  set {
    name  = "nodeinit.enabled"
    value = true
  }

  set {
    name  = "hubble.relay.enabled"
    value = true
  }

  set {
    name  = "hubble.ui.enabled"
    value = true
  }

  set {
    name  = "cni.exclusive"
    value = false
  }

  set {
    name  = "upgradeCompatibility"
    value = var.upgrade_compatible_version
  }
}
  3. Linkerd, its CRDs, and the CNI set up via Helm
resource "kubernetes_namespace" "linkerd_namespace" {
  metadata {
    name = "linkerd"
    annotations = {
      "linkerd.io/inject" = "disabled"
    }
    labels = {
        "linkerd.io/is-control-plane" = "true"
        "config.linkerd.io/admission-webhooks" = "disabled"
        "linkerd.io/control-plane-ns" = "linkerd"
        "pod-security.kubernetes.io/enforce" = "restricted"
    }
  }
}

resource "kubernetes_namespace" "linkerd_cni_namespace" {
  metadata {
    name = "linkerd-cni"
    labels = {
      "config.linkerd.io/admission-webhooks" = "disabled"
      "linkerd.io/cni-resource"              = "true",
      "pod-security.kubernetes.io/enforce" = "privileged"
    }
  }
}

resource "helm_release" "linkerd_cni" {
  depends_on = [kubernetes_namespace.linkerd_namespace, helm_release.linkerd_crds]
  name       = "linkerd-cni"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd2-cni"
  namespace  = "linkerd-cni"
  version    = var.linkerd_version
  wait       = true

  set {
    name = "privileged"
    value = true
  }
}

resource "helm_release" "linkerd_crds" {
  depends_on = [kubernetes_namespace.linkerd_namespace]
  name       = "linkerd-crds"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-crds"
  namespace  = "linkerd"
  version    = var.linkerd_version
  wait       = true

  set {
    name  = "cniEnabled"
    value = true
  }

  set {
    name = "installGatewayAPI"
    value = false
  }
}

resource "helm_release" "linkerd" { 
  name       = "linkerd-control-plane"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-control-plane"
  namespace  = "linkerd"
  version    = var.linkerd_version
  wait       = true

  values = [
    file("${path.module}/values-ha.yaml")
  ]

  set {
    name  = "controllerReplicas"
    value = var.replicas
  }

  set {
    name = "controller.podDisruptionBudget.maxUnavailable"
    value = (var.replicas - 1)
  }

  set {
    name  = "cniEnabled"
    value = true
  }

  set {
    name  = "identity.externalCA"
    value = true
  }

  set {
    name  = "identity.issuer.scheme"
    value = "kubernetes.io/tls"
  }

  set {
    name  = "disableHeartBeat"
    value = true
  }

  set {
    name  = "webhookFailurePolicy"
    value = "Fail"
  }

  set {
    name  = "networkValidator.connectAddr"
    value = "0.0.0.0:4140"
  }

  depends_on = [
    helm_release.linkerd_crds,
    kubectl_manifest.linkerd_identity_trust_roots
  ]
}
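Once applied, a quick sanity check that the privileged flag from the linkerd-cni release actually lands on the pods (a jsonpath sketch; the install container is the first and only container in the linkerd-cni daemonset):

kubectl get ds linkerd-cni -n linkerd-cni \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'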

  4. Then simply try restarting the CNI daemonset:
kubectl rollout restart daemonset linkerd-cni -n linkerd-cni
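When the failure reproduces, the replacement pod sticks in ContainerCreating, and the sandbox error shows up in its events (assuming the chart's k8s-app=linkerd-cni pod label):

kubectl get pods -n linkerd-cni -w
kubectl describe pod -n linkerd-cni -l k8s-app=linkerd-cni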

The issue has been reproduced on the following versions: 2024.11.8, 2025.1.1, and 2025.7.6. On 2024.11.8 it could most often be triggered by a plain restart; on 2025.1.1 and 2025.7.6 it occurred while upgrading to those versions, which caused the Linkerd CNI pods to cycle.

2025.7.6 does not fail on restart in some of our clusters, but it reliably fails in others.

Logs, error output, etc

There are no Linkerd CNI logs, as the pod never gets far enough to boot. There are only the following events when describing the pod:

  Warning  FailedCreatePodSandBox  35s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0198e98f6b2399f9f9d1f67bf9f92a9515f57cef85a69271a2a6a5d786ef7430": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized

Digging into the kubelet logs shows a similar story.
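(One way to pull these off an AKS node, for anyone reproducing: a kubectl debug sketch, which may need --profile=sysadmin on newer kubectl versions.)

kubectl debug node/aks-monitor-22419019-vmss000009 -it --image=busybox -- \
  sh -c 'chroot /host journalctl -u kubelet --no-pager | grep linkerd-cni'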

Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: I1021 21:34:27.589370    3962 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="linkerd-cni/linkerd-cni-k6mfr"

Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: I1021 21:34:27.589767    3962 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="linkerd-cni/linkerd-cni-k6mfr"

Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951174    3962 log.go:32] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\": plugin type=\"linkerd-cni\" name=\"linkerd-cni\" failed (add): Unauthorized"

Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951265    3962 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\": plugin type=\"linkerd-cni\" name=\"linkerd-cni\" failed (add): Unauthorized" pod="linkerd-cni/linkerd-cni-k6mfr"

Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951307    3962 kuberuntime_manager.go:1170] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\": plugin type=\"linkerd-cni\" name=\"linkerd-cni\" failed (add): Unauthorized" pod="linkerd-cni/linkerd-cni-k6mfr"

Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951366    3962 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"linkerd-cni-k6mfr_linkerd-cni(ebdb75aa-dae1-44e6-ab99-c5ccdbfac270)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"linkerd-cni-k6mfr_linkerd-cni(ebdb75aa-dae1-44e6-ab99-c5ccdbfac270)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\\\": plugin type=\\\"linkerd-cni\\\" name=\\\"linkerd-cni\\\" failed (add): Unauthorized\"" pod="linkerd-cni/linkerd-cni-k6mfr" podUID="ebdb75aa-dae1-44e6-ab99-c5ccdbfac270"

Output of linkerd check -o short:

bensoer@BSOER-MBP ~ % linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.7.6 but the latest edge version is 25.10.5
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.7.6 but the latest edge version is 25.10.5
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-6d96f76896-kbvpp (edge-25.7.6)
	* linkerd-destination-6d96f76896-pqcs8 (edge-25.7.6)
	* linkerd-destination-6d96f76896-wrv8t (edge-25.7.6)
	* linkerd-identity-bc5659f75-78zwx (edge-25.7.6)
	* linkerd-identity-bc5659f75-jjsrq (edge-25.7.6)
	* linkerd-identity-bc5659f75-zdfwt (edge-25.7.6)
	* linkerd-proxy-injector-9dc897ccb-b9mpv (edge-25.7.6)
	* linkerd-proxy-injector-9dc897ccb-p4k5r (edge-25.7.6)
	* linkerd-proxy-injector-9dc897ccb-vc7lf (edge-25.7.6)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

Status check results are √

Environment

  • Kubernetes Version: v1.31.2
  • Cluster Environment: AKS
  • Linkerd Version:
bensoer@BSOER-MBP ~ % linkerd version
Client version: edge-25.7.6
Server version: edge-25.7.6

Possible solution

I'm out of ideas, unfortunately. The following attempts have been made as a solution, but did not change anything (one further diagnostic that might be worth running is sketched after this list):

  • Set Cilium cni.exclusive to false
  • Set LinkerdCNI privileged to true
  • Upgrade to the latest stable versions (25.7.6 was marked "recommended" in the release notes when it was applied to our other clusters, though I notice it is now marked "not recommended")
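Since the plugin's add fails with an RBAC-style Unauthorized, the one diagnostic I can still think of is to check, on an affected node, whether the kubeconfig the plugin was installed with still carries a token the API server accepts after the restart (a sketch; ZZZ-linkerd-cni-kubeconfig is the file name the linkerd2-cni installer writes by default, and the path assumes the chart's default destCNINetDir):

# Open a debug shell on the affected node (newer kubectl may need --profile=sysadmin)
kubectl debug node/aks-monitor-22419019-vmss000009 -it --image=busybox -- sh
# Inside the debug pod, the host filesystem is mounted at /host
ls -l /host/etc/cni/net.d/
# If the token embedded here was invalidated when the old pod went away,
# subsequent CNI add calls could plausibly come back Unauthorized
cat /host/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig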

Additional context

If there is something in my configuration you believe I am doing wrong, I am happy to give it a try, though I feel like I have exhausted most, if not all, options.

Would you like to work on fixing this bug?

None
