
Add retry to net.LookupHost() in nodeadm #2585

Merged
mselim00 merged 6 commits into awslabs:main from nkvetsinski:main
Jan 19, 2026


@nkvetsinski
Contributor

Issue #, if available:

Description of changes:

Our team runs continuous integration tests on Outpost nodes. We noticed that nodes intermittently fail in nodeadm-config.service with the following error:

Jan 13 10:16:50 localhost nodeadm[1463]: fatal cli/main.go:35 Command failed {"error": "lookup REDACTED-URL..amazonaws.com on 10.0.0.2:53: dial udp 10.0.0.2:53: connect: network is unreachable"}

Upon examining the code, I found that the error comes from net.LookupHost(). During node bootstrap, networking takes a nondeterministic amount of time to become ready after nodeadm-boot-hook.service reconfigures it, so nodeadm-config.service can call net.LookupHost() before the network is reachable.

I compared log timings for nodes that failed and nodes that succeeded. In both cases nodeadm-config.service starts after nodeadm-boot-hook.service finishes. However, in some cases the bootstrap fails because network setup takes slightly longer:

# successful run
Jan 13 18:06:11.980960 localhost systemd[1]: Starting nodeadm-boot-hook.service - EKS Nodeadm Boot Hook...
Jan 13 18:06:13.266558 localhost systemd[1]: nodeadm-boot-hook.service: Deactivated successfully.
Jan 13 18:06:13.299117 localhost systemd[1]: Finished nodeadm-boot-hook.service - EKS Nodeadm Boot Hook.

Jan 13 18:06:13.421353 localhost systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
Jan 13 18:06:14.205935 localhost systemd[1]: nodeadm-config.service: Deactivated successfully.
Jan 13 18:06:14.249065 localhost systemd[1]: Finished nodeadm-config.service - EKS Nodeadm Config.

Jan 13 18:06:17.249319 ip-10-0-104-24.us-west-2.compute.internal systemd[1]: Starting nodeadm-run.service - EKS Nodeadm Run...
Jan 13 18:06:18.291598 ip-10-0-104-24.us-west-2.compute.internal systemd[1]: nodeadm-run.service: Deactivated successfully.
Jan 13 18:06:18.339084 ip-10-0-104-24.us-west-2.compute.internal systemd[1]: Finished nodeadm-run.service - EKS Nodeadm Run.

# failed run
Jan 13 01:06:50.555727 localhost systemd[1]: Starting nodeadm-boot-hook.service - EKS Nodeadm Boot Hook...
Jan 13 01:06:50.683977 localhost systemd[1]: nodeadm-boot-hook.service: Deactivated successfully.
Jan 13 01:06:50.724072 localhost systemd[1]: Finished nodeadm-boot-hook.service - EKS Nodeadm Boot Hook.

Jan 13 01:06:50.934320 localhost systemd[1]: Starting nodeadm-config.service - EKS Nodeadm Config...
Jan 13 01:06:51.634914 localhost systemd[1]: nodeadm-config.service: Main process exited, code=exited, status=1/FAILURE
Jan 13 01:06:51.635117 localhost systemd[1]: nodeadm-config.service: Failed with result 'exit-code'.
Jan 13 01:06:51.724136 localhost systemd[1]: Failed to start nodeadm-config.service - EKS Nodeadm Config.

Hence I decided to add retries around the net.LookupHost() call in nodeadm.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

I manually built the AMI and was able to join a node to my cluster:

○ nodeadm-config.service - EKS Nodeadm Config
     Loaded: loaded (/etc/systemd/system/nodeadm-config.service; enabled; preset: disabled)
     Active: inactive (dead) since Tue 2026-01-13 23:33:47 UTC; 41s ago
       Docs: https://github.com/awslabs/amazon-eks-ami
    Process: 1582 ExecStart=/usr/bin/nodeadm init --skip run --config-source imds://user-data --config-cache /run/eks/nodeadm/config.json (code=exited, status=0/SUCCESS)
   Main PID: 1582 (code=exited, status=0/SUCCESS)
        CPU: 69ms

Jan 13 23:33:47 localhost nodeadm[1582]: info containerd/config.go:77 Writing containerd config to file.. {"path": "/etc/containerd/config.toml"}
Jan 13 23:33:47 localhost nodeadm[1582]: info init/init.go:219 Configured daemon {"name": "containerd"}
Jan 13 23:33:47 localhost nodeadm[1582]: info init/init.go:215 Configuring daemon... {"name": "kubelet"}
Jan 13 23:33:47 localhost nodeadm[1582]: info kubelet/config.go:171 Setting up outpost..
Jan 13 23:33:47 localhost nodeadm[1582]: info kubelet/config.go:221 Setup IP for node {"ip": "10.1.1.153"}
Jan 13 23:33:47 localhost nodeadm[1582]: info kubelet/config.go:374 Writing kubelet config to file.. {"path": "/etc/kubernetes/kubelet/config.json"}
Jan 13 23:33:47 localhost nodeadm[1582]: info init/init.go:219 Configured daemon {"name": "kubelet"}
Jan 13 23:33:47 localhost nodeadm[1582]: info init/init.go:154 done! {"duration": 0.410509977}
Jan 13 23:33:47 localhost systemd[1]: nodeadm-config.service: Deactivated successfully.
Jan 13 23:33:47 localhost systemd[1]: Finished nodeadm-config.service - EKS Nodeadm Config.

I also tested the retry logic by passing a broken API URL:

× nodeadm-config.service - EKS Nodeadm Config
     Loaded: loaded (/etc/systemd/system/nodeadm-config.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2026-01-13 23:09:07.140011 UTC; 2min 48s ago
       Docs: https://github.com/awslabs/amazon-eks-ami
   Main PID: 1558 (code=exited, status=1/FAILURE)
        CPU: 67ms

Jan 13 23:09:03 localhost nodeadm[1558]: info kubelet/config.go:171 Setting up outpost..
Jan 13 23:09:03 localhost nodeadm[1558]: info kubelet/config.go:191 Retrying DNS lookup after error {"error": "lookup broken-url.b5005t.rv3.us-west-2.eks.amazonaws.com on 10.1.0.2:53: no such host"}
Jan 13 23:09:04 localhost nodeadm[1558]: info kubelet/config.go:191 Retrying DNS lookup after error {"error": "lookup broken-url.b5005t.rv3.us-west-2.eks.amazonaws.com on 10.1.0.2:53: no such host"}
Jan 13 23:09:04 localhost nodeadm[1558]: info kubelet/config.go:191 Retrying DNS lookup after error {"error": "lookup broken-url.b5005t.rv3.us-west-2.eks.amazonaws.com on 10.1.0.2:53: no such host"}
Jan 13 23:09:05 localhost nodeadm[1558]: info kubelet/config.go:191 Retrying DNS lookup after error {"error": "lookup broken-url.b5005t.rv3.us-west-2.eks.amazonaws.com on 10.1.0.2:53: no such host"}
Jan 13 23:09:07 localhost nodeadm[1558]: info kubelet/config.go:191 Retrying DNS lookup after error {"error": "lookup broken-url.b5005t.rv3.us-west-2.eks.amazonaws.com on 10.1.0.2:53: no such host"}
Jan 13 23:09:07 localhost nodeadm[1558]: fatal cli/main.go:35 Command failed {"error": "All attempts fail:\n#1: lookup broken-url.b5005t.rv3.us-west-2.eks.amazonaws.com on 10.1.0.2:53: no such host\n#2: lookup broken-url.b5>
Jan 13 23:09:07 localhost systemd[1]: nodeadm-config.service: Main process exited, code=exited, status=1/FAILURE

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

@mselim00 mselim00 merged commit 2076e98 into awslabs:main Jan 19, 2026
12 checks passed