Service connectivity from robot agent #1891

AndrinGautschi · 2025-09-02T08:57:51Z

Hi,

Networking is not exactly my strong suit and it seems that I just reached the end of my (limited) wisdom with the following issue:

I added a robot node, configured everything according to the docs, and got the robot node to show up in the cluster and pods to be scheduled on it. So far so good.

If I SSH into the robot node, I can ping all my nodes and pods. What I can't ping, and this is where my issue starts, are the service IPs (everything in 10.43.0.0/16). This causes the Longhorn pod scheduled by a DaemonSet to fail with:

...
E0902 08:36:10.772693       1 reflector.go:200] "Failed to watch" err="failed to list *v1.Pod: Get \"https://10.43.0.1:443/api/v1/pods?resourceVersion=4906402\": dial tcp 10.43.0.1:443: connect: no route to host" logger="UnhandledError" reflector="k8s.io/client-go/informers/factory.go:160" type="*v1.Pod"
E0902 08:36:10.772698       1 reflector.go:200] "Failed to watch" err="failed to list *v1beta2.SystemBackup: Get \"https://10.43.0.1:443/apis/longhorn.io/v1beta2/systembackups?resourceVersion=4906270\": dial tcp 10.43.0.1:443: connect: no route to host" logger="UnhandledError" reflector="pkg/client/informers/externalversions/factory.go:141" type="*v1beta2.SystemBackup"
...

Clearly, this is a networking issue. I use the default flannel as CNI with WireGuard enabled and have configured the flannel interface (flannel-iface: enp0s31f6) in my robot k3s-agent-config. Nonetheless, neither pinging 10.43.0.0/16 from the node nor connections attempted from within pods are successful.

I'm currently testing with 'not sharing routes with vSwitch' in hetzner/console/network, unsuccessfully, I might add.

Did I miss something? Do I have to enable some additional routing? HCCM seems to be configured correctly (networking is enabled, clusterCIDR is set correctly)...

Any suggestion or idea would be greatly appreciated.

Answered by AndrinGautschi


AndrinGautschi · 2025-09-04T10:14:42Z

It seems as if I managed to solve it. For future reference:

Do NOT patch the provided CCM; instead, install it with Helm, preferably from the get-go. My kube.tf:

  hetzner_ccm_use_helm = true
  hetzner_ccm_values = <<EOT
args:
  cloud-provider: hcloud
  webhook-secure-port: "0"

kind: Deployment

replicaCount: 1

env:
  HCLOUD_TOKEN:
    valueFrom:
      secretKeyRef:
        name: hcloud
        key: token
  ROBOT_USER:
    valueFrom:
      secretKeyRef:
        name: robot-secret
        key: robot-username
        optional: true
  ROBOT_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: robot-secret
        key: robot-password
        optional: true
  HCLOUD_LOAD_BALANCERS_ENABLED:
    value: "true"
  HCLOUD_LOAD_BALANCERS_LOCATION:
    value: "nbg1" # your value
  HCLOUD_LOAD_BALANCERS_USE_PRIVATE_IP:
    value: "true"
  HCLOUD_LOAD_BALANCERS_DISABLE_PRIVATE_INGRESS:
    value: "true"
  HCLOUD_NETWORK_ROUTES_ENABLED:
    value: "false" # this is not needed anyway if you just use flannel and not another, fancier networking layer like cilium

image:
  repository: docker.io/hetznercloud/hcloud-cloud-controller-manager

monitoring:
  enabled: true
  podMonitor:
    enabled: false

networking:
  enabled: true
  clusterCIDR: 10.42.0.0/16 # this is the default value set by k3s and kube-hetzner uses it as well
  network:
    valueFrom:
      secretKeyRef:
        name: hcloud
        key: network

resources:
  requests:
    cpu: 100m
    memory: 50Mi

additionalTolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule

priorityClassName: "system-cluster-critical"

robot:
  enabled: false # if you enable this on cluster creation, the secret referenced in the env vars must already exist, otherwise the deploy runs into a timeout; in my cluster I redeploy the helm chart with robot enabled once the secret is available

rbac:
  create: true

  EOT
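
For reference, the secret that the ROBOT_USER and ROBOT_PASSWORD env vars above point to could look roughly like the sketch below. The name (robot-secret) and the keys (robot-username, robot-password) are taken from the values above; the kube-system namespace is an assumption (it has to be whichever namespace the CCM is deployed in), and the credential values are placeholders.

```yaml
# Sketch of the secret referenced by ROBOT_USER / ROBOT_PASSWORD.
# Assumption: the CCM runs in kube-system; adjust if yours differs.
apiVersion: v1
kind: Secret
metadata:
  name: robot-secret
  namespace: kube-system
type: Opaque
stringData:
  robot-username: "<your-robot-username>"   # placeholder
  robot-password: "<your-robot-password>"   # placeholder
```

Both secretKeyRef entries are marked optional: true, so the CCM pod can start without this secret; it only has to exist before robot.enabled is switched to true.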

Once you have made sure your CCM is deployed correctly, check that the network interfaces on the robot node are not cluttered by previous attempts and that your k3s-agent is properly configured (you should use node-ip: <your-internal-node-ip>). Also, I attached flannel not to the main interface but to the vlan4000 interface I created (the one for the vSwitch), then added iptables rules that allow anything from the vlan4000 interface but strictly block most packets from the main interface. This reproduces the standard firewall rules set for all the cloud nodes and enables seamless communication (and also iptables lookups) within the cluster.
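
As a rough sketch, the relevant part of the k3s agent config on the robot node would then look something like this. It assumes the default k3s config file location and that the VLAN interface is literally named vlan4000; check the actual name with `ip link`, since it depends on how the interface was created.

```yaml
# /etc/rancher/k3s/config.yaml on the robot node (default k3s config path)
# Only the two settings discussed above; keep your existing server/token entries.
node-ip: <your-internal-node-ip>   # the node's address on the vSwitch VLAN
flannel-iface: vlan4000            # assumption: name of the vSwitch VLAN interface
```

The k3s-agent service needs a restart for changes to this file to take effect.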
