What is the issue?
Hello, I have consistently been running into issues with the Linkerd CNI: restarting its pods causes them to fail to start up, and the only way to resolve it has been to cycle all of the nodes. The issue only appears when a Linkerd CNI pod is restarted on the same node. This has become a problem during upgrades, which fail part way through because these pods get stuck and are unable to boot.
Restarting the pod does not resolve the issue. The only workflow that has worked for me so far is to drain each node, kill the node in Azure, and have Azure create a new one; a fresh install of Linkerd CNI on a new node is successful. A rough sketch of that workaround follows.
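For reference, the drain-and-replace loop I run per node looks roughly like this. It is only a sketch, assuming the node pool is backed by a VMSS that recreates deleted instances; the node name is taken from the logs below, the resource group is a placeholder, and the instance id is assumed from the vmss000009 suffix:
# Drain the affected node so workloads reschedule elsewhere
kubectl drain aks-monitor-22419019-vmss000009 --ignore-daemonsets --delete-emptydir-data

# Delete the node object, then the underlying VMSS instance; AKS brings
# up a fresh replacement
kubectl delete node aks-monitor-22419019-vmss000009
az vmss delete-instances --resource-group <node-resource-group> --name aks-monitor-22419019-vmss --instance-ids 9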
Nearly the same error was also described here: #11478
Unlike in that issue, though, I am running the Linkerd CNI with privileged: true (an early attempt to work around the error) and I also have Cilium configured with cni.exclusive: false, yet I am still seeing the same failures.
How can it be reproduced?
The problem is not completely consistent, either: some AKS clusters are fine and others are not, and there is no pattern across clusters; some versions cause issues in some clusters, and other versions in others.
The basic setup, though, is with Terraform:
- An AKS cluster
- Cilium installed via Helm, set up as BYOCNI for AKS
resource "helm_release" "cilium" {
name = "cilium"
repository = "https://helm.cilium.io/"
chart = "cilium"
version = var.cilium_version
namespace = "kube-system"
set {
name = "aksbyocni.enabled"
value = true
}
set {
name = "nodeinit.enabled"
value = true
}
set {
name = "hubble.relay.enabled"
value = true
}
set {
name = "hubble.ui.enabled"
value = true
}
set {
name = "cni.exclusive"
value = false
}
set {
name = "upgradeCompatibility"
value = var.upgrade_compatible_version
}
}
- Linkerd, its CRDs, and the CNI installed via Helm
resource "kubernetes_namespace" "linkerd_namespace" {
metadata {
name = "linkerd"
annotations = {
"linkerd.io/inject" = "disabled"
}
labels = {
"linkerd.io/is-control-plane" = "true"
"config.linkerd.io/admission-webhooks" = "disabled"
"linkerd.io/control-plane-ns" = "linkerd"
"pod-security.kubernetes.io/enforce" = "restricted"
}
}
}
resource "kubernetes_namespace" "linkerd_cni_namespace" {
metadata {
name = "linkerd-cni"
labels = {
"config.linkerd.io/admission-webhooks" = "disabled"
"linkerd.io/cni-resource" = "true",
"pod-security.kubernetes.io/enforce" = "privileged"
}
}
}
resource "helm_release" "linkerd_cni" {
depends_on = [kubernetes_namespace.linkerd_namespace, helm_release.linkerd_crds]
name = "linkerd-cni"
repository = "https://helm.linkerd.io/edge"
chart = "linkerd2-cni"
namespace = "linkerd-cni"
version = var.linkerd_version
wait = true
set {
name = "privileged"
value = true
}
}
resource "helm_release" "linkerd_crds" {
depends_on = [kubernetes_namespace.linkerd_namespace]
name = "linkerd-crds"
repository = "https://helm.linkerd.io/edge"
chart = "linkerd-crds"
namespace = "linkerd"
version = var.linkerd_version
wait = true
set {
name = "cniEnabled"
value = true
}
set {
name = "installGatewayAPI"
value = false
}
}
resource "helm_release" "linkerd" {
name = "linkerd-control-plane"
repository = "https://helm.linkerd.io/edge"
chart = "linkerd-control-plane"
namespace = "linkerd"
version = var.linkerd_version
wait = true
values = [
"${file("${path.module}/values-ha.yaml")}"
]
set {
name = "controllerReplicas"
value = var.replicas
}
set {
name = "controller.podDisruptionBudget.maxUnavailable"
value = (var.replicas - 1)
}
set {
name = "cniEnabled"
value = true
}
set {
name = "identity.externalCA"
value = true
}
set {
name = "identity.issuer.scheme"
value = "kubernetes.io/tls"
}
set {
name = "disableHeartBeat"
value = true
}
set {
name = "webhookFailurePolicy"
value = "Fail"
}
set {
name = "networkValidator.connectAddr"
value = "0.0.0.0:4140"
}
depends_on = [
helm_release.linkerd_crds,
kubectl_manifest.linkerd_identity_trust_roots
]
}
- Then simply try restarting the CNI DaemonSet (a node-inspection sketch follows the command):
kubectl rollout restart daemonset linkerd-cni -n linkerd-cni
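When a pod gets stuck like this, the node's CNI config can be checked directly to confirm the chained setup survived the restart. A minimal sketch, assuming the usual AKS paths; the conflist filename 05-cilium.conflist is an assumption and may differ per Cilium version:
# Open a debug shell on the node (host filesystem is mounted at /host)
kubectl debug node/aks-monitor-22419019-vmss000009 -it --image=busybox -- sh

# Inside the debug pod: with cni.exclusive=false, the cilium conflist
# should still contain the chained linkerd-cni plugin entry
ls -la /host/etc/cni/net.d/
cat /host/etc/cni/net.d/05-cilium.conflist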
The issue has been reproduced on the following versions: 2024.11.8, 2025.1.1, and 2025.7.6. 2024.11.8 could trigger it most frequently with a plain restart; 2025.1.1 and 2025.7.6 hit it during upgrades to those versions, which caused the Linkerd CNI pods to cycle.
2025.7.6 does not fail on restart in some of our clusters, but it reliably fails in others.
Logs, error output, etc
There are no Linkerd CNI logs, as the pod never gets far enough to boot. There are only the following events when describing the pod:
Warning FailedCreatePodSandBox 35s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0198e98f6b2399f9f9d1f67bf9f92a9515f57cef85a69271a2a6a5d786ef7430": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
Digging into the kubelet logs shows a similar story:
Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: I1021 21:34:27.589370 3962 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="linkerd-cni/linkerd-cni-k6mfr"
Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: I1021 21:34:27.589767 3962 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="linkerd-cni/linkerd-cni-k6mfr"
Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951174 3962 log.go:32] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\": plugin type=\"linkerd-cni\" name=\"linkerd-cni\" failed (add): Unauthorized"
Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951265 3962 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\": plugin type=\"linkerd-cni\" name=\"linkerd-cni\" failed (add): Unauthorized" pod="linkerd-cni/linkerd-cni-k6mfr"
Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951307 3962 kuberuntime_manager.go:1170] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\": plugin type=\"linkerd-cni\" name=\"linkerd-cni\" failed (add): Unauthorized" pod="linkerd-cni/linkerd-cni-k6mfr"
Oct 21 21:34:27 aks-monitor-22419019-vmss000009 kubelet[3962]: E1021 21:34:27.951366 3962 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"linkerd-cni-k6mfr_linkerd-cni(ebdb75aa-dae1-44e6-ab99-c5ccdbfac270)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"linkerd-cni-k6mfr_linkerd-cni(ebdb75aa-dae1-44e6-ab99-c5ccdbfac270)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"174cc5753bce5e78a29e5f6810b82c94e42f9491d8e0d2ca5fd12ce14e24b7da\\\": plugin type=\\\"linkerd-cni\\\" name=\\\"linkerd-cni\\\" failed (add): Unauthorized\"" pod="linkerd-cni/linkerd-cni-k6mfr" podUID="ebdb75aa-dae1-44e6-ab99-c5ccdbfac270"
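The failed (add): Unauthorized is returned by the plugin's call to the Kubernetes API, which makes me suspect stale credentials. As far as I understand the install, linkerd-cni writes a kubeconfig containing a service account token onto the host (named ZZZ-linkerd-cni-kubeconfig on my nodes; treat the name and paths here as assumptions for your setup), so the token can be replayed from a node debug shell:
# From a node debug shell (host mounted at /host): read the API server
# address and token out of the kubeconfig the plugin uses
SERVER=$(awk '/server:/ {print $2}' /host/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig)
TOKEN=$(awk '/token:/ {print $2}' /host/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig)

# A 401 here would match the Unauthorized the kubelet reports above
curl -sk -H "Authorization: Bearer $TOKEN" "$SERVER/version"
If this returns a 401 on an affected node but a 200 on a fresh one, that would point at the token not being refreshed when the DaemonSet pod restarts in place.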
Output of linkerd check -o short:
bensoer@BSOER-MBP ~ % linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
is running version 25.7.6 but the latest edge version is 25.10.5
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 25.7.6 but the latest edge version is 25.10.5
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-6d96f76896-kbvpp (edge-25.7.6)
* linkerd-destination-6d96f76896-pqcs8 (edge-25.7.6)
* linkerd-destination-6d96f76896-wrv8t (edge-25.7.6)
* linkerd-identity-bc5659f75-78zwx (edge-25.7.6)
* linkerd-identity-bc5659f75-jjsrq (edge-25.7.6)
* linkerd-identity-bc5659f75-zdfwt (edge-25.7.6)
* linkerd-proxy-injector-9dc897ccb-b9mpv (edge-25.7.6)
* linkerd-proxy-injector-9dc897ccb-p4k5r (edge-25.7.6)
* linkerd-proxy-injector-9dc897ccb-vc7lf (edge-25.7.6)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
Status check results are √
Environment
- Kubernetes Version: v1.31.2
- Cluster Environment: AKS
- Linkerd Version:
bensoer@BSOER-MBP ~ % linkerd version
Client version: edge-25.7.6
Server version: edge-25.7.6
Possible solution
I'm out of ideas, unfortunately. The following attempts have been made as a solution, but did not change anything (a quick verification sketch follows the list):
- Set Cilium cni.exclusive to false
- Set Linkerd CNI privileged to true
- Upgrade to the latest stable versions (25.7.6 was "recommended" in the release notes when it was applied on our other clusters, though I notice it is now marked as "not recommended")
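A quick way to double-check that the first two attempts actually landed on the cluster; the cilium-config key name cni-exclusive is an assumption on my part, so adjust if your Cilium version stores it differently:
# Cilium: confirm cni.exclusive=false made it into the agent config
kubectl -n kube-system get configmap cilium-config -o jsonpath='{.data.cni-exclusive}'

# Linkerd CNI: confirm the container really runs privileged
kubectl -n linkerd-cni get daemonset linkerd-cni -o jsonpath='{.spec.template.spec.containers[0].securityContext.privileged}'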
Additional context
If there is something in my configuration that you believe I am doing wrong, I am happy to give it a try as well, though I feel like I have exhausted most, if not all, options.
Would you like to work on fixing this bug?
None