
Commit f264c8a

Add kubectl debug node and more troubleshooting for Auto Mode (#850)
* Add `kubectl debug node` and more troubleshooting for Auto Mode
* Fix spelling, formatting in Auto Mode troubleshooting. Add 'reachability' to custom dictionary.
* Add pricing note to VPC Reachability Analyzer
* Add manual ToCs and titles to Auto Mode troubleshooting
* Fix unnecessary pluses in auto-troubleshoot
1 parent ae691bf commit f264c8a

File tree

2 files changed: +175 -18 lines changed


latest/ug/automode/auto-troubleshoot.adoc

Lines changed: 172 additions & 17 deletions
@@ -10,24 +10,35 @@ With {eam}, {aws} assumes more {resp} for {e2i}s in {yaa}. EKS assumes {resp} fo
 
 You must use {aws} and {k8s} APIs to troubleshoot nodes. You can:
 
-* Use a Kubernetes `NodeDiagnostic` resource to {ret} node logs.
-* Use the {aws} EC2 CLI command `get-console-output` to {ret} console output from nodes.
+* Use a Kubernetes `NodeDiagnostic` resource to {ret} node logs by using the <<auto-node-monitoring-agent>>. For more steps, see <<auto-get-logs>>.
+* Use the {aws} EC2 CLI command `get-console-output` to {ret} console output from nodes. For more steps, see <<auto-node-console>>.
+* Use Kubernetes _debugging containers_ to {ret} node logs. For more steps, see <<auto-node-debug-logs>>.
 
 [NOTE]
 ====
 {eam} uses {emi}s. You cannot directly access {emi}s, including by SSH.
 ====
 
-If you have a problem with a controller, you should research:
+You might have the following problems that have solutions specific to EKS Auto Mode components:
 
-* If the resources associated with that controller are properly formatted and valid.
-* If the {aws} IAM and Kubernetes RBAC resources are properly configured for your cluster. For more information, see <<auto-learn-iam>>.
+* Pods stuck in the `Pending` state that aren't being scheduled onto Auto Mode nodes. For solutions, see <<auto-troubleshoot-schedule>>.
+* EC2 managed instances that don't join the cluster as Kubernetes nodes. For solutions, see <<auto-troubleshoot-join>>.
+* Errors and issues with the `NodePools`, `PersistentVolumes`, and `Services` that use the controllers included in EKS Auto Mode. For solutions, see <<auto-troubleshoot-controllers>>.
+
+You can use the following methods to troubleshoot EKS Auto Mode components:
+
+* <<auto-node-console>>
+* <<auto-node-debug-logs>>
+* <<auto-node-ec2-web>>
+* <<auto-node-iam>>
+* <<auto-node-reachability>>
 
 [[auto-node-monitoring-agent,auto-node-monitoring-agent.title]]
 == Node monitoring agent
 
 {eam} includes the Amazon EKS node monitoring agent. You can use this agent to view troubleshooting and debugging information about nodes. The node monitoring agent publishes Kubernetes `events` and node `conditions`. For more information, see <<node-health>>.
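
To see what the agent publishes, you can query events and conditions directly with `kubectl`. A quick sketch (the field selector and the jsonpath projection are illustrative, not part of the commit above):

[source,cli]
----
# List events whose involved object is a Node (what the agent publishes).
kubectl get events --field-selector involvedObject.kind=Node
# Print each node's currently true conditions.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.status=="True")].type}{"\n"}{end}'
----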
 
+[[auto-node-console,auto-node-console.title]]
 == Get console output from an {emi} by using the {aws} EC2 CLI
 
 This procedure helps with troubleshooting boot-time or kernel-level issues.
@@ -54,10 +65,61 @@ kubectl get pod <pod-name> -o wide
 aws ec2 get-console-output --instance-id <instance id> --latest --output text
 ----
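
If you have a node name from `kubectl get pod <pod-name> -o wide` and need the matching instance ID for `get-console-output`, a jsonpath sketch (assumes the standard `providerID` format, `aws:///<zone>/<instance-id>`):

[source,cli]
----
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'
----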
 
-== Get node logs by using the kubectl CLI
+[[auto-node-debug-logs,auto-node-debug-logs.title]]
+== Get node logs by using __debug containers__ and the `kubectl` CLI
+
+The recommended way of retrieving logs from an EKS Auto Mode node is to use the `NodeDiagnostic` resource. For these steps, see <<auto-get-logs>>.
+
+However, you can stream logs live from an instance by using the `kubectl debug node` command. This command launches a new Pod on the node that you want to debug, which you can then use interactively.
+
+. Launch a debug container. The following command uses `i-01234567890123456` for the instance ID of the node, `-it` allocates a `tty` and attaches `stdin` for interactive usage, and uses the `sysadmin` profile from the kubeconfig file.
++
+[source,cli]
+----
+kubectl debug node/i-01234567890123456 -it --profile=sysadmin --image=public.ecr.aws/amazonlinux/amazonlinux:2023
+----
++
+An example output is as follows.
++
+[source,none]
+----
+Creating debugging pod node-debugger-i-01234567890123456-nxb9c with container debugger on node i-01234567890123456.
+If you don't see a command prompt, try pressing enter.
+bash-5.2#
+----
 
-For information about getting node logs, see <<auto-get-logs>>.
+. From the shell, you can now install `util-linux-core`, which provides the `nsenter` command. Use `nsenter` to enter the mount namespace of PID 1 (`init`) on the host, and run the `journalctl` command to stream logs from the `kubelet`:
++
+[source,bash]
+----
+yum install -y util-linux-core
+nsenter -t 1 -m journalctl -f -u kubelet
+----
 
+For security, the Amazon Linux container image doesn't install many binaries by default. You can use the `yum whatprovides` command to identify the package that must be installed to provide a given binary.
+
+[source,cli]
+----
+yum whatprovides ps
+----
+
+[source,none]
+----
+Last metadata expiration check: 0:03:36 ago on Thu Jan 16 14:49:17 2025.
+procps-ng-3.3.17-1.amzn2023.0.2.x86_64 : System and process monitoring utilities
+Repo        : @System
+Matched from:
+Filename    : /usr/bin/ps
+Provide     : /bin/ps
+
+procps-ng-3.3.17-1.amzn2023.0.2.x86_64 : System and process monitoring utilities
+Repo        : amazonlinux
+Matched from:
+Filename    : /usr/bin/ps
+Provide     : /bin/ps
+----
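
Note that `kubectl debug` doesn't delete the debugger Pod when you exit the shell. A cleanup sketch (assumes the default `node-debugger-` name prefix shown in the example output above):

[source,cli]
----
kubectl get pods -o name | grep node-debugger | xargs -r kubectl delete
----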
+
+[[auto-node-ec2-web,auto-node-ec2-web.title]]
 == View resources associated with {eam} in the {aws} Console
 
 You can use the {aws} console to view the status of resources associated with {yec}.
@@ -69,6 +131,7 @@ You can use the {aws} console to view the status of resources associated with {y
 * link:ec2/home#Instances["EC2 Instances",type="console"]
 ** View EKS Auto Mode instances by searching for the tag key `eks:eks-cluster-name`
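
The same search works from the CLI. A sketch using the tag key above (the `--query` projection is illustrative):

[source,cli]
----
aws ec2 describe-instances \
  --filters "Name=tag-key,Values=eks:eks-cluster-name" \
  --query 'Reservations[].Instances[].{Id:InstanceId,State:State.Name}'
----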
 
+[[auto-node-iam,auto-node-iam.title]]
 == View IAM Errors in {yaa}
 
 . Navigate to CloudTrail console
@@ -78,23 +141,115 @@ You can use the {aws} console to view the status of resources associated with {y
 ** UnauthorizedOperation
 ** InvalidClientTokenId
 
-Look for errors related to your EKS cluster. Use the error messages to update your EKS access entries, Cluster IAM Role, or Node IAM Role. You may need to attach a new policy to these roles with permissions for {eam}.
+Look for errors related to your EKS cluster. Use the error messages to update your EKS access entries, cluster IAM role, or node IAM role. You might need to attach a new policy to these roles with permissions for {eam}.
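
If you prefer the CLI to the console, a minimal sketch that pulls recent `RunInstances` events so you can scan them for those error codes (`lookup-events` can't filter by error code server-side, so inspect the returned event JSON):

[source,cli]
----
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --max-results 20 --query 'Events[].CloudTrailEvent' --output text
----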
 
 //Ensure you are running the latest version of the {aws} CLI, eksctl, etc.
 
-== Pod failing to schedule onto Auto Mode node
+[[auto-troubleshoot-schedule,auto-troubleshoot-schedule.title]]
+== Troubleshoot Pod failing to schedule onto Auto Mode node
 
-If pods are not being scheduled onto an auto mode node, verify if your pod/deployment manifest has a **nodeSelector**. If a nodeSelector is present, please ensure it is using `eks.amazonaws.com/compute-type: auto` to allow it to be scheduled. See <<associate-workload>>.
+If pods are staying in the `Pending` state and aren't being scheduled onto an Auto Mode node, verify whether your pod or deployment manifest has a `nodeSelector`. If a `nodeSelector` is present, ensure that it uses `eks.amazonaws.com/compute-type: auto` so that the pod is scheduled on nodes made by EKS Auto Mode. For more information about the node labels used by EKS Auto Mode, see <<associate-workload>>.
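
For example, a minimal Pod that opts in to Auto Mode scheduling. A sketch in which the name, image, and command are illustrative; only the `nodeSelector` label comes from this page:

[source,bash]
----
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: auto-mode-test                      # illustrative name
spec:
  nodeSelector:
    eks.amazonaws.com/compute-type: auto    # the label this section describes
  containers:
  - name: app
    image: public.ecr.aws/amazonlinux/amazonlinux:2023
    command: ["sleep", "3600"]
EOF
----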
 
-== Node not joining cluster
+[[auto-troubleshoot-join,auto-troubleshoot-join.title]]
+== Troubleshoot node not joining the cluster
 
-Run `kubectl get nodeclaim` to check for nodeclaims that are `Ready = False`.
+EKS Auto Mode automatically configures new EC2 instances with the correct information to join the cluster, including the cluster endpoint and cluster certificate authority (CA). However, these instances can still fail to join the EKS cluster as a node. Run the following commands to identify instances that didn't join the cluster:
 
-Proceed to run `kubectl describe nodeclaim <node_claim>` and look under *Status* to find any issues preventing the node from joining the cluster.
+. Run `kubectl get nodeclaim` to check for `NodeClaims` that are `Ready = False`.
++
+[source,cli]
+----
+kubectl get nodeclaim
+----
+
+. Run `kubectl describe nodeclaim <node_claim>` and look under *Status* to find any issues preventing the node from joining the cluster.
++
+[source,cli]
+----
+kubectl describe nodeclaim <node_claim>
+----
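
A one-liner variant that prints just the `Ready` condition for each `NodeClaim` (a jsonpath sketch; the field paths assume the Karpenter `NodeClaim` schema):

[source,cli]
----
kubectl get nodeclaim -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
----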
 
 *Common error messages:*
 
-* "Error getting launch template configs"
-** You may receive this error if you are setting custom tags in the NodeClass with the default cluster IAM role permissions. See <<auto-learn-iam>>.
-* "Error creating fleet"
-** There may be some authorization issue with calling the RunInstances API call. Check CloudTrail for errors and see <<auto-cluster-iam-role>> for the required IAM permissions.
+`Error getting launch template configs`::
+You might receive this error if you are setting custom tags in the `NodeClass` with the default cluster IAM role permissions. See <<auto-learn-iam>>.
+
+`Error creating fleet`::
+There might be some authorization issue with calling the `RunInstances` EC2 API operation. Check {aws-cloudtrail} for errors and see <<auto-cluster-iam-role>> for the required IAM permissions.
+
+
+[[auto-node-reachability,auto-node-reachability.title]]
+=== Detect node connectivity issues with the `VPC Reachability Analyzer`
+
+[NOTE]
+====
+You are charged for each analysis that is run in the VPC Reachability Analyzer. For pricing details, see link:vpc/pricing/[{amazon-vpc} Pricing,type="marketing"].
+====
+
+One reason that an instance didn't join the cluster is a network connectivity issue that prevents it from reaching the API server. To diagnose this issue, you can use the link:vpc/latest/reachability/what-is-reachability-analyzer.html[VPC Reachability Analyzer,type="documentation"] to perform an analysis of the connectivity between a node that is failing to join the cluster and the API server. You need two pieces of information:
+
+* the *instance ID* of a node that can't join the cluster
+* the IP address of the *Kubernetes API server endpoint*
+
+To get the *instance ID*, you need to create a workload on the cluster to cause EKS Auto Mode to launch an EC2 instance. This also creates a `NodeClaim` object in your cluster that has the instance ID. Run `kubectl get nodeclaim -o yaml` to print all of the `NodeClaims` in your cluster. Each `NodeClaim` contains the instance ID as a field, and again in the `providerID`:
+
+[source,cli]
+----
+kubectl get nodeclaim -o yaml
+----
+
+An example output is as follows.
+
+[source,bash,subs="verbatim,attributes"]
+----
+nodeName: i-01234567890123456
+providerID: aws:///us-west-2a/i-01234567890123456
+----
+
+You can determine your *Kubernetes API server endpoint* by running `kubectl get endpoints kubernetes -o yaml`. The addresses are in the `addresses` field:
+
+[source,cli]
+----
+kubectl get endpoints kubernetes -o yaml
+----
+
+An example output is as follows.
+
+[source,bash,subs="verbatim,attributes"]
+----
+apiVersion: v1
+kind: Endpoints
+metadata:
+  name: kubernetes
+  namespace: default
+subsets:
+- addresses:
+  - ip: 10.0.143.233
+  - ip: 10.0.152.17
+  ports:
+  - name: https
+    port: 443
+    protocol: TCP
+----
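
If you only want the two raw values, a jsonpath sketch (the `providerID` path assumes the Karpenter `NodeClaim` schema; the Endpoints object is the default `kubernetes` one shown above):

[source,cli]
----
kubectl get nodeclaim -o jsonpath='{range .items[*]}{.status.providerID}{"\n"}{end}'
kubectl get endpoints kubernetes -o jsonpath='{.subsets[0].addresses[*].ip}'
----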
+
+With these two pieces of information, you can perform the analysis. First, navigate to the VPC Reachability Analyzer in the {aws-management-console}.
+
+. Click "Create and Analyze Path"
+. Provide a name for the analysis (for example, "Node Join Failure")
+. For the "Source Type", select "Instances"
+. Enter the instance ID of the failing node as the "Source"
+. For the "Path Destination", select "IP Address"
+. Enter one of the IP addresses for the API server as the "Destination Address"
+. Expand the "Additional Packet Header Configuration" section
+. Enter a "Destination Port" of 443
+. Select "Protocol" as TCP if it is not already selected
+. Click "Create and Analyze Path"
+. The analysis might take a few minutes to complete. If the analysis results indicate failed reachability, they show where in the network path the failure occurred so that you can resolve the issue.
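
The same analysis can be started from the CLI. This is only a sketch: the flags for an IP-address destination are an assumption here, so verify them against the current `aws ec2 create-network-insights-path` reference before relying on them.

[source,cli]
----
# Assumed flags for an IP-address destination; confirm against your CLI version.
aws ec2 create-network-insights-path \
  --source i-01234567890123456 \
  --destination-ip 10.0.143.233 --destination-port 443 --protocol tcp
aws ec2 start-network-insights-analysis --network-insights-path-id <path-id-from-create-output>
----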
+
+[[auto-troubleshoot-controllers,auto-troubleshoot-controllers.title]]
+== Troubleshoot included controllers in Auto Mode
+
+If you have a problem with a controller, you should check:
+
+* Whether the resources associated with that controller are properly formatted and valid.
+* Whether the {aws} IAM and Kubernetes RBAC resources are properly configured for your cluster. For more information, see <<auto-learn-iam>>.

vale/styles/config/vocabularies/EksDocsVocab/accept.txt

Lines changed: 3 additions & 1 deletion
@@ -7,4 +7,6 @@ StorageClass
 PersistentVolume
 CSI
 Karpenter
-VPC
+VPC
+VPC Reachability Analyzer
+reachability
