
Commit 95077d4

Merge pull request #63736 from xenolinux/debug-nodes-hosted-control-planes
OSDOCS#5335: Hosted control planes: Debug nodes
2 parents bb723b5 + 7a09234 commit 95077d4

3 files changed: +119 -0 lines changed

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -2259,6 +2259,8 @@ Topics:
   File: hcp-managing
 - Name: Backup, restore, and disaster recovery for hosted control planes
   File: hcp-backup-restore-dr
+- Name: Troubleshooting worker node issues
+  File: hcp-debugging-nodes
 ---
 Name: Nodes
 Dir: nodes

hosted_control_planes/hcp-debugging-nodes.adoc

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
:_content-type: ASSEMBLY
[id="hcp-debugging-nodes"]
= Troubleshooting worker node issues
include::_attributes/common-attributes.adoc[]
:context: hcp-debugging-nodes

toc::[]

If your control plane API endpoint is available, but worker nodes did not join the hosted cluster on AWS, you can debug worker node issues.
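
For example, if the `hypershift` command-line interface is available, you can confirm the symptom by querying the hosted cluster API directly: the API server responds, but no worker nodes are listed. Replace `<hosted_cluster_name>` with the name of your hosted cluster.

[source,terminal]
----
$ hypershift create kubeconfig --name <hosted_cluster_name> > /tmp/hc-kubeconfig
$ oc --kubeconfig /tmp/hc-kubeconfig get nodes
----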

include::modules/debug-nodes-hcp.adoc[leveloffset=+1]

modules/debug-nodes-hcp.adoc

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-debugging-nodes.adoc

:_content-type: PROCEDURE
[id="debug-nodes-hcp_{context}"]
= Checking why worker nodes did not join the hosted cluster

To troubleshoot why worker nodes did not join the hosted cluster on AWS, you can check the following information.

.Prerequisites

* You have link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.8/html/clusters/cluster_mce_overview#hosting-service-cluster-configure-aws[configured the hosting cluster on AWS].
* Your control plane API endpoint is available.

.Procedure

. Address any error messages in the status of the `HostedCluster` and `NodePool` resources:

.. Check the status of the `HostedCluster` resource by running the following command:
+
[source,terminal]
----
$ oc get hc -n <hosted_cluster_namespace> <hosted_cluster_name> -o jsonpath='{.status}'
----

.. Check the status of the `NodePool` resource by running the following command:
+
[source,terminal]
----
$ oc get nodepool -n <hosted_cluster_namespace> <node_pool_name> -o jsonpath='{.status}'
----
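+
To scan the conditions more easily, you can print one condition per line. The following jsonpath expression is only an example; the same expression works for the `NodePool` resource:
+
[source,terminal]
----
$ oc get hc -n <hosted_cluster_namespace> <hosted_cluster_name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'
----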
+
If you did not find any error messages in the status of the `HostedCluster` and `NodePool` resources, proceed to the next step.

. Check if your worker machines are created by running the following commands, replacing values as necessary:
+
[source,terminal]
----
$ HC_NAMESPACE="clusters"
$ HC_NAME="cluster_name"
$ CONTROL_PLANE_NAMESPACE="${HC_NAMESPACE}-${HC_NAME}"
$ oc get machines.cluster.x-k8s.io -n $CONTROL_PLANE_NAMESPACE
$ oc get awsmachines -n $CONTROL_PLANE_NAMESPACE
----
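+
If machines are listed but remain in a provisioning state, describing one of them shows its conditions and the most recent error. Replace `<machine_name>` with a name from the previous output:
+
[source,terminal]
----
$ oc describe machines.cluster.x-k8s.io <machine_name> -n $CONTROL_PLANE_NAMESPACE
----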

. If worker machines do not exist, check if the `machinedeployment` and `machineset` resources are created by running the following commands:
+
[source,terminal]
----
$ oc get machinedeployment -n $CONTROL_PLANE_NAMESPACE
$ oc get machineset -n $CONTROL_PLANE_NAMESPACE
----
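+
You can also confirm that the `NodePool` resource requests a nonzero number of nodes. For example, the following command lists the node pools together with their desired and current node counts:
+
[source,terminal]
----
$ oc get nodepool -n <hosted_cluster_namespace>
----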

. If the `machinedeployment` and `machineset` resources do not exist, check the logs of the HyperShift Operator by running the following command:
+
[source,terminal]
----
$ oc logs deployment/operator -n hypershift
----
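+
The Operator log can be long. Filtering the most recent entries for errors is one way to narrow it down, for example:
+
[source,terminal]
----
$ oc logs deployment/operator -n hypershift --tail=500 | grep -i error
----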

. If worker machines exist but are not provisioned in the hosted cluster, check the log of the cluster API provider by running the following command:
+
[source,terminal]
----
$ oc logs deployment/capi-provider -c manager -n $CONTROL_PLANE_NAMESPACE
----
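+
Provisioning failures on AWS, such as quota or subnet problems, are often also reported as events in the control plane namespace, which you can list by time:
+
[source,terminal]
----
$ oc get events -n $CONTROL_PLANE_NAMESPACE --sort-by=.lastTimestamp
----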

. If worker machines exist and are provisioned in the cluster, ensure that machines are initialized through Ignition successfully by checking the system console logs. Check the system console logs of every machine by using the `console-logs` utility:
+
[source,terminal]
----
$ ./bin/hypershift console-logs aws --name $HC_NAME --aws-creds ~/.aws/credentials --output-dir /tmp/console-logs
----
+
You can access the system console logs in the `/tmp/console-logs` directory. The control plane exposes the Ignition endpoint. If you see an error related to the Ignition endpoint, the Ignition endpoint is not accessible from the worker nodes through `https`.
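+
If the `HostedCluster` resource reports an Ignition endpoint in its status, you can test whether that endpoint is reachable over `https` from a host in the same network as the worker nodes. The request is expected to be rejected because it does not carry a token; this example only verifies connectivity and TLS:
+
[source,terminal]
----
$ IGNITION_ENDPOINT="$(oc get hc -n $HC_NAMESPACE $HC_NAME -o jsonpath='{.status.ignitionEndpoint}')"
$ curl -kIs "https://${IGNITION_ENDPOINT}/ignition"
----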

. If worker machines are provisioned and initialized through Ignition successfully, you can extract and access the journal logs of every worker machine by creating a bastion machine. A bastion machine allows you to access worker machines by using SSH.

.. Create a bastion machine by running the following command:
+
[source,terminal]
----
$ ./bin/hypershift create bastion aws --aws-creds ~/.aws/credentials --name $HC_NAME --ssh-key-file /tmp/ssh/id_rsa.pub
----
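+
After the bastion is available, you can open an SSH session to a worker machine through it. The user names in this example are typical values and might differ in your environment: the bastion image commonly uses `ec2-user`, and RHCOS worker nodes use `core`:
+
[source,terminal]
----
$ ssh -o ProxyCommand="ssh -W %h:%p -i /tmp/ssh/id_rsa ec2-user@<bastion_public_ip>" -i /tmp/ssh/id_rsa core@<worker_private_ip>
----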

.. Optional: If you used the `--generate-ssh` flag when creating the cluster, you can extract the public and private key for the cluster by running the following commands:
+
[source,terminal]
----
$ mkdir /tmp/ssh
$ oc get secret -n clusters ${HC_NAME}-ssh-key -o jsonpath='{ .data.id_rsa }' | base64 -d > /tmp/ssh/id_rsa
$ oc get secret -n clusters ${HC_NAME}-ssh-key -o jsonpath='{ .data.id_rsa\.pub }' | base64 -d > /tmp/ssh/id_rsa.pub
----
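+
SSH refuses private key files that are readable by other users, so restrict the permissions on the extracted key before you use it:
+
[source,terminal]
----
$ chmod 0600 /tmp/ssh/id_rsa
----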

.. Extract journal logs from every worker machine by running the following commands:
+
[source,terminal]
----
$ mkdir /tmp/journals
$ INFRAID="$(oc get hc -n clusters $HC_NAME -o jsonpath='{ .spec.infraID }')"
$ SSH_PRIVATE_KEY=/tmp/ssh/id_rsa
$ ./test/e2e/util/dump/copy-machine-journals.sh /tmp/journals
----
+
The journal logs are placed in the `/tmp/journals` directory in a compressed format. Check for the error that indicates why the kubelet did not join the cluster.
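+
For example, if the journals are gzip-compressed, you can search all of them in place for kubelet and Ignition errors:
+
[source,terminal]
----
$ ls /tmp/journals
$ zgrep -iE 'kubelet|ignition|error' /tmp/journals/*
----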
