
Commit 95077d4

Merge pull request #63736 from xenolinux/debug-nodes-hosted-control-planes
OSDOCS#5335: Hosted control planes: Debug nodes
2 parents bb723b5 + 7a09234 commit 95077d4

3 files changed: +119 -0 lines changed

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -2259,6 +2259,8 @@ Topics:
   File: hcp-managing
 - Name: Backup, restore, and disaster recovery for hosted control planes
   File: hcp-backup-restore-dr
+- Name: Troubleshooting worker node issues
+  File: hcp-debugging-nodes
 ---
 Name: Nodes
 Dir: nodes

hosted_control_planes/hcp-debugging-nodes.adoc

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
:_content-type: ASSEMBLY
[id="hcp-debugging-nodes"]
= Troubleshooting worker node issues
include::_attributes/common-attributes.adoc[]
:context: hcp-debugging-nodes

toc::[]

If your control plane API endpoint is available, but worker nodes did not join the hosted cluster on AWS, you can debug worker node issues.
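
For example, if the `hypershift` command-line interface is available, you can confirm the symptom by querying the hosted cluster API directly: the API server responds, but no worker nodes are listed. Replace `<hosted_cluster_name>` with the name of your hosted cluster.

[source,terminal]
----
$ hypershift create kubeconfig --name <hosted_cluster_name> > /tmp/hc-kubeconfig
$ oc --kubeconfig /tmp/hc-kubeconfig get nodes
----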

include::modules/debug-nodes-hcp.adoc[leveloffset=+1]

modules/debug-nodes-hcp.adoc

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-debugging-nodes.adoc

:_content-type: PROCEDURE
[id="debug-nodes-hcp_{context}"]
= Checking why worker nodes did not join the hosted cluster

To troubleshoot why worker nodes did not join the hosted cluster on AWS, you can check the following information.

.Prerequisites

* You have link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.8/html/clusters/cluster_mce_overview#hosting-service-cluster-configure-aws[configured the hosting cluster on AWS].
* Your control plane API endpoint is available.

.Procedure

. Address any error messages in the status of the `HostedCluster` and `NodePool` resources:

.. Check the status of the `HostedCluster` resource by running the following command:
+
[source,terminal]
----
$ oc get hc -n <hosted_cluster_namespace> <hosted_cluster_name> -o jsonpath='{.status}'
----

.. Check the status of the `NodePool` resource by running the following command:
+
[source,terminal]
----
$ oc get nodepool -n <hosted_cluster_namespace> <node_pool_name> -o jsonpath='{.status}'
----
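+
To scan the conditions more easily, you can print one condition per line. The following jsonpath expression is only an example; the same expression works for the `NodePool` resource:
+
[source,terminal]
----
$ oc get hc -n <hosted_cluster_namespace> <hosted_cluster_name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'
----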
+
If you did not find any error messages in the status of the `HostedCluster` and `NodePool` resources, proceed to the next step.

. Check if your worker machines are created by running the following commands, replacing values as necessary:
+
[source,terminal]
----
$ HC_NAMESPACE="clusters"
$ HC_NAME="cluster_name"
$ CONTROL_PLANE_NAMESPACE="${HC_NAMESPACE}-${HC_NAME}"
$ oc get machines.cluster.x-k8s.io -n $CONTROL_PLANE_NAMESPACE
$ oc get awsmachines -n $CONTROL_PLANE_NAMESPACE
----
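+
If machines are listed but remain in a provisioning state, describing one of them shows its conditions and the most recent error. Replace `<machine_name>` with a name from the previous output:
+
[source,terminal]
----
$ oc describe machines.cluster.x-k8s.io <machine_name> -n $CONTROL_PLANE_NAMESPACE
----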

. If worker machines do not exist, check if the `machinedeployment` and `machineset` resources are created by running the following commands:
+
[source,terminal]
----
$ oc get machinedeployment -n $CONTROL_PLANE_NAMESPACE
$ oc get machineset -n $CONTROL_PLANE_NAMESPACE
----
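+
You can also confirm that the `NodePool` resource requests a nonzero number of nodes. For example, the following command lists the node pools together with their desired and current node counts:
+
[source,terminal]
----
$ oc get nodepool -n <hosted_cluster_namespace>
----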

. If the `machinedeployment` and `machineset` resources do not exist, check the logs of the HyperShift Operator by running the following command:
+
[source,terminal]
----
$ oc logs deployment/operator -n hypershift
----
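+
The Operator log can be long. Filtering the most recent entries for errors is one way to narrow it down, for example:
+
[source,terminal]
----
$ oc logs deployment/operator -n hypershift --tail=500 | grep -i error
----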

. If worker machines exist but are not provisioned in the hosted cluster, check the log of the cluster API provider by running the following command:
+
[source,terminal]
----
$ oc logs deployment/capi-provider -c manager -n $CONTROL_PLANE_NAMESPACE
----
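+
Provisioning failures on AWS, such as quota or subnet problems, are often also reported as events in the control plane namespace, which you can list by time:
+
[source,terminal]
----
$ oc get events -n $CONTROL_PLANE_NAMESPACE --sort-by=.lastTimestamp
----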

. If worker machines exist and are provisioned in the cluster, ensure that machines are initialized through Ignition successfully by checking the system console logs. Check the system console logs of every machine by using the `console-logs` utility:
+
[source,terminal]
----
$ ./bin/hypershift console-logs aws --name $HC_NAME --aws-creds ~/.aws/credentials --output-dir /tmp/console-logs
----
+
You can access the system console logs in the `/tmp/console-logs` directory. The control plane exposes the Ignition endpoint. If you see an error related to the Ignition endpoint, the Ignition endpoint is not accessible from the worker nodes through `https`.
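+
If the `HostedCluster` resource reports an Ignition endpoint in its status, you can test whether that endpoint is reachable over `https` from a host in the same network as the worker nodes. The request is expected to be rejected because it does not carry a token; this example only verifies connectivity and TLS:
+
[source,terminal]
----
$ IGNITION_ENDPOINT="$(oc get hc -n $HC_NAMESPACE $HC_NAME -o jsonpath='{.status.ignitionEndpoint}')"
$ curl -kIs "https://${IGNITION_ENDPOINT}/ignition"
----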

. If worker machines are provisioned and initialized through Ignition successfully, you can extract and access the journal logs of every worker machine by creating a bastion machine. A bastion machine allows you to access worker machines by using SSH.

.. Create a bastion machine by running the following command:
+
[source,terminal]
----
$ ./bin/hypershift create bastion aws --aws-creds ~/.aws/credentials --name $HC_NAME --ssh-key-file /tmp/ssh/id_rsa.pub
----
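+
After the bastion is available, you can open an SSH session to a worker machine through it. The user names in this example are typical values and might differ in your environment: the bastion image commonly uses `ec2-user`, and RHCOS worker nodes use `core`:
+
[source,terminal]
----
$ ssh -o ProxyCommand="ssh -W %h:%p -i /tmp/ssh/id_rsa ec2-user@<bastion_public_ip>" -i /tmp/ssh/id_rsa core@<worker_private_ip>
----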

.. Optional: If you used the `--generate-ssh` flag when creating the cluster, you can extract the public and private key for the cluster by running the following commands:
+
[source,terminal]
----
$ mkdir /tmp/ssh
$ oc get secret -n clusters ${HC_NAME}-ssh-key -o jsonpath='{ .data.id_rsa }' | base64 -d > /tmp/ssh/id_rsa
$ oc get secret -n clusters ${HC_NAME}-ssh-key -o jsonpath='{ .data.id_rsa\.pub }' | base64 -d > /tmp/ssh/id_rsa.pub
----
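+
SSH refuses private key files that are readable by other users, so restrict the permissions on the extracted key before you use it:
+
[source,terminal]
----
$ chmod 0600 /tmp/ssh/id_rsa
----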

.. Extract journal logs from every worker machine by running the following commands:
+
[source,terminal]
----
$ mkdir /tmp/journals
$ INFRAID="$(oc get hc -n clusters $HC_NAME -o jsonpath='{ .spec.infraID }')"
$ SSH_PRIVATE_KEY=/tmp/ssh/id_rsa
$ ./test/e2e/util/dump/copy-machine-journals.sh /tmp/journals
----
+
The journal logs are placed in the `/tmp/journals` directory in a compressed format. Check for the error that indicates why the kubelet did not join the cluster.
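+
For example, if the journals are gzip-compressed, you can search all of them in place for kubelet and Ignition errors:
+
[source,terminal]
----
$ ls /tmp/journals
$ zgrep -iE 'kubelet|ignition|error' /tmp/journals/*
----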
