-
Description

After the automatic cluster update, there seems to be an issue with mounting volumes again. Seemingly at random, some pods can't mount their volumes. In the Hetzner Cloud Console I can see that the respective volumes are attached to the nodes.

kubectl -n my-namespace get events
LAST SEEN TYPE REASON OBJECT MESSAGE
3m25s Warning FailedMount pod/db-0 MountVolume.SetUp failed for volume "pvc-438cfa7b-d698-46e9-8ce3-828c2915d07b" : rpc error: code = InvalidArgument desc = missing device path
23m Warning FailedMount pod/db-0 Unable to attach or mount volumes: unmounted volumes=[db-claim0], unattached volumes=[kube-api-access-dp87p db-claim0]: timed out waiting for the condition
11m Warning FailedMount pod/db-0 Unable to attach or mount volumes: unmounted volumes=[db-claim0], unattached volumes=[db-claim0 kube-api-access-dp87p]: timed out waiting for the condition

So I found this issue: hetznercloud/csi-driver#278, and the solution was apparently to delete the dangling VolumeAttachment (a small consolidated sketch of this cleanup is included further down, after the version check).

kubectl get volumeattachments | grep pvc-438cfa7b-d698-46e9-8ce3-828c2915d07b

outputs something like:

csi-b6c4d43c15054265d620d98d8c3757c213703cf8833a3fed4deaed06e50f163e   csi.hetzner.cloud   pvc-438cfa7b-d698-46e9-8ce3-828c2915d07b   my-cluster-agent-large-fsn1-our   true   14h

The first column of this output is the VolumeAttachment name, so I deleted it:

kubectl delete volumeattachments.storage.k8s.io csi-b6c4d43c15054265d620d98d8c3757c213703cf8833a3fed4deaed06e50f163e

After that I deleted the pod that used this PVC. It got recreated and everything worked fine again. According to the aforementioned GitHub issue, the root cause seems to be a bug in the Hetzner CSI driver.

EDIT: Checking the CSI driver version:

kubectl describe -n kube-system pod hcloud-csi-controller-0
...
hcloud-csi-driver:
Container ID: containerd://4b9f44607d50017bd841745638cba6357f0ac75c30f9ef4813a12f83fa9e5105
Image: hetznercloud/hcloud-csi-driver:1.6.0
Image ID: docker.io/hetznercloud/hcloud-csi-driver@sha256:1475d525f9a4039ae8f1d81666a0fc912d92f34415f6c53723656dff0ee16bd1
...

So it seems to be 1.6.0. According to the mentioned issue, v2.1.1 should fix the issue.
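For reference, here is the manual cleanup above condensed into a small script. It is only a sketch: namespace, pod, and PVC names are placeholders taken from the events above, and deleting a VolumeAttachment for a volume that is genuinely still in use is risky, so double-check first.

#!/usr/bin/env bash
# Sketch of the manual workaround: find the dangling VolumeAttachment for a
# stuck PVC, delete it, then delete the pod so it gets recreated and remounts.
set -euo pipefail

NAMESPACE="my-namespace"
POD="db-0"
PV_NAME="pvc-438cfa7b-d698-46e9-8ce3-828c2915d07b"

# VolumeAttachments are cluster-scoped; pick the one(s) referencing our PV.
VA=$(kubectl get volumeattachments.storage.k8s.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.source.persistentVolumeName}{"\n"}{end}' \
  | awk -v pv="$PV_NAME" '$2 == pv {print $1}')

for va in $VA; do
  kubectl delete volumeattachments.storage.k8s.io "$va"
done

# Recreate the pod so the volume gets attached and mounted cleanly again.
kubectl -n "$NAMESPACE" delete pod "$POD"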
Kube.tf file

locals {
hcloud_token = "xxxxxxxxxxx"
ssh_port = 22
}
module "kube-hetzner" {
providers = {
hcloud = hcloud
}
hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
source = "kube-hetzner/kube-hetzner/hcloud"
ssh_public_key = file("~/.ssh/id_ed25519.pub")
ssh_private_key = file("~/.ssh/id_ed25519")
network_region = "eu-central"
control_plane_nodepools = [
{
name = "control-plane-fsn1",
server_type = "cpx21",
location = "fsn1",
labels = [],
taints = [],
count = 1
},
{
name = "control-plane-nbg1",
server_type = "cpx21",
location = "nbg1",
labels = [],
taints = [],
count = 1
},
{
name = "control-plane-hel1",
server_type = "cpx21",
location = "hel1",
labels = [],
taints = [],
count = 1
}
]
agent_nodepools = [
{
name = "agent-large-fsn1",
server_type = "cpx21",
location = "fsn1",
labels = [],
taints = [],
count = 1
},
{
name = "agent-large-nbg1",
server_type = "cpx21",
location = "nbg1",
labels = [],
taints = [],
count = 1
},
{
name = "agent-large-hel1",
server_type = "cpx21",
location = "hel1",
labels = [],
taints = [],
count = 1
},
]
enable_wireguard = true
load_balancer_type = "lb11"
load_balancer_location = "nbg1"
ingress_controller = "nginx"
kured_options = {
"reboot-days": "su"
"start-time": "3am"
"end-time": "8am"
"time-zone": "Local"
}
initial_k3s_channel = "v1.25"
cluster_name = "my-cluster"
extra_firewall_rules = [
{
description = "Allow inbound traffic to the Kube API server from our office ip"
direction = "in"
protocol = "tcp"
port = "6443"
source_ips = ["x.x.x.x/32"]
destination_ips = [] # Won't be used for this rule
},
{
description = "Allow inbound SSH traffic from our office ip"
direction = "in"
protocol = "tcp"
port = local.ssh_port
source_ips = ["x.x.x.x/32"]
destination_ips = [] # Won't be used for this rule
},
{
description = "Allow outbound SSH traffic"
direction = "out"
protocol = "tcp"
port = "22"
source_ips = [] # Won't be used for this rule
destination_ips = ["0.0.0.0/0", "::/0"]
},
]
additional_tls_sans = ["my.cluster.com"]
lb_hostname = "lb.cluster.com"
enable_rancher = true
create_kubeconfig = false
}
provider "hcloud" {
token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}
terraform {
required_version = ">= 1.3.3"
required_providers {
hcloud = {
source = "hetznercloud/hcloud"
version = ">= 1.38.2"
}
}
}
output "kubeconfig" {
value = module.kube-hetzner.kubeconfig
sensitive = true
}
variable "hcloud_token" {
sensitive = true
default = ""
} ScreenshotsNo response PlatformClient: Arch Linux |
-
The question arises: can I just set the CSI driver version in kube.tf to update it?
-
@thobens Yes, that should work!
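Roughly like this (a sketch only; whichever variable your kube.tf exposes for the CSI version, set it, re-apply, and then check which image is actually running on the cluster):

terraform init -upgrade    # only needed if you also bump the module version
terraform apply

# Verify the hcloud-csi-driver image that is actually running; this works
# regardless of how the controller pod is named after the upgrade.
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
  | grep hcloud-csi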
-
I've tried to update the CSI driver, but after running terraform apply it still doesn't work.

The only change I made to kube.tf was uncommenting the CSI driver version setting. After that I upgraded the kube-hetzner module from 2.2.0 to 2.2.4 and tried to apply again, but still no luck. Thank you for your time and let me know if you need more info.
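For what it's worth, this is how I am double-checking locally that the module upgrade was actually picked up (as far as I understand, Terraform records the resolved module versions in .terraform/modules/modules.json):

terraform init -upgrade

# Resolved module versions as recorded by Terraform after init.
grep -o '"Version": *"[^"]*"' .terraform/modules/modules.json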
cat'ing the file at /var/post-install/hcloud-csi.yaml outputs 404: Not Found, which suggests that the path used to download the hcloud-csi.yaml file is not correct. Looking at the code in init.tf it becomes clear that the value 2.1.1 should be v2.1.1. Deploying with that value works fine. So thanks for your comment, it helped me to understand a bit more about the debugging process :)
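For anyone else landing here: the release tags in the csi-driver repo all carry a leading "v", so a bare 2.1.1 ends up building a download path that returns 404: Not Found. A quick (hedged) way to see the valid tag names is the GitHub releases API:

# List the most recent csi-driver release tags; note the leading "v".
curl -s 'https://api.github.com/repos/hetznercloud/csi-driver/releases?per_page=5' \
  | grep '"tag_name"'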