
Commit 0d2defd

Merge pull request #73672 from johnwilkins/TELCODOCS-1388
TELCODOCS-1388: Add IPI troubleshooting information to our documentation

2 parents 85bd65e + 7528a12

11 files changed: +445 -32 lines

installing/installing_bare_metal_ipi/ipi-install-troubleshooting.adoc

Lines changed: 12 additions & 6 deletions
@@ -7,35 +7,41 @@ include::_attributes/common-attributes.adoc[]
 toc::[]
 
 
-== Troubleshooting the installer workflow
+== Troubleshooting the installation program workflow
 
 Prior to troubleshooting the installation environment, it is critical to understand the overall flow of the installer-provisioned installation on bare metal. The diagrams below provide a troubleshooting flow with a step-by-step breakdown for the environment.
 
 image:flow1.png[Flow-Diagram-1]
 
-_Workflow 1 of 4_ illustrates a troubleshooting workflow when the `install-config.yaml` file has errors or the {op-system-first} images are inaccessible. Troubleshooting suggestions can be found at xref:ipi-install-troubleshooting-install-config_{context}[Troubleshooting `install-config.yaml`].
+_Workflow 1 of 4_ illustrates a troubleshooting workflow when the `install-config.yaml` file has errors or the {op-system-first} images are inaccessible. Troubleshooting suggestions can be found at xref:ipi-install-troubleshooting-install-config_ipi-install-troubleshooting[Troubleshooting `install-config.yaml`].
 
 image:flow2.png[Flow-Diagram-2]
 
-_Workflow 2 of 4_ illustrates a troubleshooting workflow for xref:ipi-install-troubleshooting-bootstrap-vm_{context}[ bootstrap VM issues], xref:ipi-install-troubleshooting-bootstrap-vm-cannot-boot_{context}[ bootstrap VMs that cannot boot up the cluster nodes], and xref:ipi-install-troubleshooting-bootstrap-vm-inspecting-logs_{context}[ inspecting logs]. When installing an {product-title} cluster without the `provisioning` network, this workflow does not apply.
+_Workflow 2 of 4_ illustrates a troubleshooting workflow for xref:ipi-install-troubleshooting-bootstrap-vm_ipi-install-troubleshooting[ bootstrap VM issues], xref:ipi-install-troubleshooting-bootstrap-vm-cannot-boot_ipi-install-troubleshooting[ bootstrap VMs that cannot boot up the cluster nodes], and xref:ipi-install-troubleshooting-bootstrap-vm-inspecting-logs_ipi-install-troubleshooting[ inspecting logs]. When installing an {product-title} cluster without the `provisioning` network, this workflow does not apply.
 
 image:flow3.png[Flow-Diagram-3]
 
-_Workflow 3 of 4_ illustrates a troubleshooting workflow for xref:ipi-install-troubleshooting-cluster-nodes-will-not-pxe_{context}[ cluster nodes that will not PXE boot]. If installing using RedFish Virtual Media, each node must meet minimum firmware requirements for the installer to deploy the node. See *Firmware requirements for installing with virtual media* in the *Prerequisites* section for additional details.
+_Workflow 3 of 4_ illustrates a troubleshooting workflow for xref:ipi-install-troubleshooting-cluster-nodes-will-not-pxe_ipi-install-troubleshooting[ cluster nodes that will not PXE boot]. If installing using RedFish Virtual Media, each node must meet minimum firmware requirements for the installation program to deploy the node. See *Firmware requirements for installing with virtual media* in the *Prerequisites* section for additional details.
 
 image:flow4.png[Flow-Diagram-4]
 
 _Workflow 4 of 4_ illustrates a troubleshooting workflow from
-xref:ipi-install-troubleshooting-api-not-accessible_{context}[ a non-accessible API] to a xref:ipi-install-troubleshooting-reviewing-the-installation_{context}[validated installation].
+xref:investigating-an-unavailable-kubernetes-api_ipi-install-troubleshooting[ a non-accessible API] to a xref:ipi-install-troubleshooting-reviewing-the-installation_ipi-install-troubleshooting[validated installation].
 
 
 include::modules/ipi-install-troubleshooting-install-config.adoc[leveloffset=+1]
 include::modules/ipi-install-troubleshooting-bootstrap-vm.adoc[leveloffset=+1]
 include::modules/ipi-install-troubleshooting-bootstrap-vm-cannot-boot.adoc[leveloffset=+2]
 include::modules/ipi-install-troubleshooting-bootstrap-vm-inspecting-logs.adoc[leveloffset=+2]
+include::modules/ipi-install-troubleshooting-investigating-an-unavailable-kubernetes-api.adoc[leveloffset=+1]
+include::modules/ipi-install-troubleshooting-troubleshooting-failure-to-initialize-the-cluster.adoc[leveloffset=+1]
+include::modules/ipi-install-troubleshooting-troubleshooting-failure-to-fetch-the-console-url.adoc[leveloffset=+1]
+include::modules/ipi-install-troubleshooting-troubleshooting-failure-to-add-the-ingress-certificate-to-kubeconfig.adoc[leveloffset=+1]
+include::modules/ipi-install-troubleshooting-troubleshooting-ssh-access-to-cluster-nodes.adoc[leveloffset=+1]
 include::modules/ipi-install-troubleshooting-cluster-nodes-will-not-pxe.adoc[leveloffset=+1]
+include::modules/ipi-install-troubleshooting-installing-creates-no-worker-nodes.adoc[leveloffset=+1]
+include::modules/ipi-install-troubleshooting-troubleshooting-the-cluster-network-operator.adoc[leveloffset=+1]
 include::modules/ipi-install-troubleshooting_unable-to-discover-new-bare-metal-hosts-using-the-bmc.adoc[leveloffset=+1]
-include::modules/ipi-install-troubleshooting-api-not-accessible.adoc[leveloffset=+1]
 include::modules/ipi-install-troubleshooting_proc_worker-nodes-cannot-join-the-cluster.adoc[leveloffset=+1]
 include::modules/ipi-install-troubleshooting-cleaning-up-previous-installations.adoc[leveloffset=+1]
 include::modules/ipi-install-troubleshooting-registry-issues.adoc[leveloffset=+1]

modules/ipi-install-troubleshooting-bootstrap-vm.adoc

Lines changed: 5 additions & 12 deletions
@@ -4,7 +4,7 @@
 :_mod-docs-content-type: PROCEDURE
 [id="ipi-install-troubleshooting-bootstrap-vm_{context}"]
 
-= Bootstrap VM issues
+= Troubleshooting bootstrap VM issues
 
 The {product-title} installation program spawns a bootstrap node virtual machine, which handles provisioning the {product-title} cluster nodes.
 
@@ -28,10 +28,8 @@ $ sudo virsh list
 ====
 The name of the bootstrap VM is always the cluster name followed by a random set of characters and ending in the word "bootstrap."
 ====
-+
-If the bootstrap VM is not running after 10-15 minutes, troubleshoot why it is not running. Possible issues include:
 
-. Verify `libvirtd` is running on the system:
+. If the bootstrap VM is not running after 10-15 minutes, verify `libvirtd` is running on the system by executing the following command:
 +
 [source,terminal]
 ----
@@ -79,7 +77,6 @@ localhost login:
 When deploying an {product-title} cluster without the `provisioning` network, you must use a public IP address and not a private IP address like `172.22.0.2`.
 ====
 
-
 . After you obtain the IP address, log in to the bootstrap VM using the `ssh` command:
 +
 [NOTE]
@@ -95,12 +92,8 @@ $ ssh [email protected]
 If you are not successful logging in to the bootstrap VM, you have likely encountered one of the following scenarios:
 
 * You cannot reach the `172.22.0.0/24` network. Verify the network connectivity between the provisioner and the `provisioning` network bridge. This issue might occur if you are using a `provisioning` network.
-`
-* You cannot reach the bootstrap VM through the public network. When attempting
-to SSH via `baremetal` network, verify connectivity on the
+
+* You cannot reach the bootstrap VM through the public network. When attempting to SSH via `baremetal` network, verify connectivity on the
 `provisioner` host specifically around the `baremetal` network bridge.
 
-* You encountered `Permission denied (publickey,password,keyboard-interactive)`. When
-attempting to access the bootstrap VM, a `Permission denied` error
-might occur. Verify that the SSH key for the user attempting to log
-in to the VM is set within the `install-config.yaml` file.
+* You encountered `Permission denied (publickey,password,keyboard-interactive)`. When attempting to access the bootstrap VM, a `Permission denied` error might occur. Verify that the SSH key for the user attempting to log in to the VM is set within the `install-config.yaml` file.
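The bootstrap VM naming rule called out above (cluster name, random suffix, ending in "bootstrap") lends itself to scripting. A minimal sketch, run against invented `virsh list` output rather than a live hypervisor; the cluster name `mycluster-xf6fq` is a placeholder:

```shell
# Sketch only: locate the bootstrap VM in `sudo virsh list` output.
# The sample output is invented; the naming rule (the VM name ends in
# "bootstrap") comes from the module above.
virsh_output=' Id   Name                        State
----------------------------------------------
 1    mycluster-xf6fq-bootstrap   running'

# Print the Name column of any row whose name ends in "bootstrap".
bootstrap_vm=$(printf '%s\n' "$virsh_output" | awk '$2 ~ /bootstrap$/ {print $2}')
echo "$bootstrap_vm"
```

On a real provisioner host, the `virsh_output` variable would instead be populated by `sudo virsh list`.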
modules/ipi-install-troubleshooting-installing-creates-no-worker-nodes.adoc (new file)

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+// This module is included in the following assemblies:
+//
+// installing/installing_bare_metal_ipi/ipi-install-troubleshooting.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="installing-creates-no-worker-nodes_{context}"]
+= Installing creates no worker nodes
+
+The installation program does not provision worker nodes directly. Instead, the Machine API Operator scales nodes up and down on supported platforms. If worker nodes are not created after 15 to 20 minutes, depending on the speed of the cluster's internet connection, investigate the Machine API Operator.
+
+.Procedure
+
+. Check the Machine API Operator by running the following command:
++
+[source,terminal]
+----
+$ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig \
+--namespace=openshift-machine-api get deployments
+----
++
+If `${INSTALL_DIR}` is not set in your environment, replace the value with the name of the installation directory.
++
+.Example output
+[source,terminal]
+----
+NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
+cluster-autoscaler-operator   1/1     1            1           86m
+cluster-baremetal-operator    1/1     1            1           86m
+machine-api-controllers       1/1     1            1           85m
+machine-api-operator          1/1     1            1           86m
+----
+
+. Check the machine controller logs by running the following command:
++
+[source,terminal]
+----
+$ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig \
+--namespace=openshift-machine-api logs deployments/machine-api-controllers \
+--container=machine-controller
+----
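When the `oc get deployments` check in the new module turns up trouble, it usually shows in the READY column. A hypothetical sketch of that comparison, run against canned output modeled on the module's example (with one row deliberately altered to `0/1` so the check has something to flag):

```shell
# Sketch only: flag deployments whose READY column (ready/desired) is not full.
# The machine-api-controllers row is intentionally altered to demonstrate a hit.
deployments='NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler-operator   1/1     1            1           86m
cluster-baremetal-operator    1/1     1            1           86m
machine-api-controllers       0/1     1            0           85m
machine-api-operator          1/1     1            1           86m'

# Split the READY column on "/" and report rows where ready != desired.
not_ready=$(printf '%s\n' "$deployments" | awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }')
echo "$not_ready"
```

Against a live cluster, the same awk filter could be piped directly from the `oc get deployments` command shown in the module.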

modules/ipi-install-troubleshooting-api-not-accessible.adoc renamed to modules/ipi-install-troubleshooting-investigating-an-unavailable-kubernetes-api.adoc

Lines changed: 25 additions & 9 deletions
@@ -1,16 +1,32 @@
-// Module included in the following assemblies:
-// //installing/installing_bare_metal_ipi/installing_bare_metal_ipi/ipi-install-troubleshooting.adoc
+// This module is included in the following assemblies:
+//
+// installing/installing_bare_metal_ipi/ipi-install-troubleshooting.adoc
 
 :_mod-docs-content-type: PROCEDURE
-[id="ipi-install-troubleshooting-api-not-accessible_{context}"]
+[id="investigating-an-unavailable-kubernetes-api_{context}"]
+= Investigating an unavailable Kubernetes API
 
-= The API is not accessible
-
-When the cluster is running and clients cannot access the API, domain name resolution issues might impede access to the API.
+When the Kubernetes API is unavailable, check the control plane nodes to ensure that they are running the correct components. Also, check the hostname resolution.
 
 .Procedure
 
-. **Hostname Resolution:** Check the cluster nodes to ensure they have a fully qualified domain name, and not just `localhost.localdomain`. For example:
+. Ensure that `etcd` is running on each of the control plane nodes by running the following command:
++
+[source,terminal]
+----
+$ sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=etcd-member --quiet) --quiet)
+----
+
+. If the previous command fails, ensure that the kubelet created the `etcd` pods by running the following command:
++
+[source,terminal]
+----
+$ sudo crictl pods --name=etcd-member
+----
++
+If there are no pods, investigate `etcd`.
+
+. Check the cluster nodes to ensure they have a fully qualified domain name, and not just `localhost.localdomain`, by using the following command:
 +
 [source,terminal]
 ----
@@ -21,10 +37,10 @@ If a hostname is not set, set the correct hostname. For example:
 +
 [source,terminal]
 ----
-$ hostnamectl set-hostname <hostname>
+$ sudo hostnamectl set-hostname <hostname>
----
 
-. **Incorrect Name Resolution:** Ensure that each node has the correct name resolution in the DNS server using `dig` and `nslookup`. For example:
+. Ensure that each node has the correct name resolution in the DNS server using the `dig` command:
 +
 [source,terminal]
 ----
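The hostname check in the renamed module boils down to "is this a fully qualified name, not the localhost default". A rough sketch of that test as a shell function; the qualifying rule used here (contains a dot, and is not `localhost` or `localhost.localdomain`) is an approximation for illustration, not a definition from the docs:

```shell
# Sketch only: approximate the "fully qualified, not localhost.localdomain" check.
is_fqdn() {
    case "$1" in
        localhost|localhost.localdomain) return 1 ;;  # default names to reject
        *.*) return 0 ;;                              # has a domain part
        *) return 1 ;;                                # bare short hostname
    esac
}

if is_fqdn "$(hostname)"; then
    echo "hostname looks fully qualified"
else
    echo "set a hostname, for example: sudo hostnamectl set-hostname <hostname>"
fi
```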

modules/ipi-install-troubleshooting-misc-issues.adoc

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ After the deployment of a cluster you might receive the following error:
 `runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network`
 ----
 
-The Cluster Network Operator is responsible for deploying the networking components in response to a special object created by the installer. It runs very early in the installation process, after the control plane (master) nodes have come up, but before the bootstrap control plane has been torn down. It can be indicative of more subtle installer issues, such as long delays in bringing up control plane (master) nodes or issues with `apiserver` communication.
+The Cluster Network Operator is responsible for deploying the networking components in response to a special object created by the installation program. It runs very early in the installation process, after the control plane (master) nodes have come up, but before the bootstrap control plane has been torn down. It can be indicative of more subtle installation program issues, such as long delays in bringing up control plane (master) nodes or issues with `apiserver` communication.
 
 .Procedure
 
@@ -54,7 +54,7 @@ spec:
   networkType: OVNKubernetes
 ----
 +
-If it does not exist, the installer did not create it. To determine why the installer did not create it, execute the following:
+If it does not exist, the installation program did not create it. To determine why the installation program did not create it, execute the following:
 +
 [source,terminal]
 ----
@@ -75,7 +75,7 @@ $ kubectl -n openshift-network-operator get pods
 $ kubectl -n openshift-network-operator logs -l "name=network-operator"
 ----
 +
-On high availability clusters with three or more control plane (master) nodes, the Operator will perform leader election and all other Operators will sleep. For additional details, see https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md[Troubleshooting].
+On high availability clusters with three or more control plane nodes, the Operator will perform leader election and all other Operators will sleep. For additional details, see https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md[Troubleshooting].
 
 == Addressing the "No disk found with matching rootDeviceHints" error message

modules/ipi-install-troubleshooting-reviewing-the-installation.adoc

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@
 
 = Reviewing the installation
 
-After installation, ensure the installer deployed the nodes and pods successfully.
+After installation, ensure the installation program deployed the nodes and pods successfully.
 
 .Procedure
 
@@ -25,7 +25,7 @@ master-1.example.com Ready master,worker 4h v1.28.5
 master-2.example.com   Ready    master,worker   4h   v1.28.5
 ----
 
-. Confirm the installer deployed all pods successfully. The following command
+. Confirm the installation program deployed all pods successfully. The following command
 removes any pods that are still running or have completed as part of the output.
 +
 [source,terminal]
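The node review in this module is a readiness scan of `oc get nodes` output. A hypothetical sketch of that scan, run against canned output modeled on the module's example rather than a live cluster:

```shell
# Sketch only: count nodes that are not Ready, using canned `oc get nodes`
# output modeled on the module's example.
nodes='NAME                   STATUS   ROLES           AGE   VERSION
master-0.example.com   Ready    master,worker   4h    v1.28.5
master-1.example.com   Ready    master,worker   4h    v1.28.5
master-2.example.com   Ready    master,worker   4h    v1.28.5'

# Skip the header row; print any row whose STATUS column is not "Ready".
not_ready_count=$(printf '%s\n' "$nodes" | awk 'NR > 1 && $2 != "Ready"' | wc -l)
echo "nodes not Ready: $not_ready_count"
```

A zero count here corresponds to the "all nodes deployed successfully" outcome the procedure is checking for.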
modules/ipi-install-troubleshooting-troubleshooting-failure-to-add-the-ingress-certificate-to-kubeconfig.adoc (new file)

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+// This module is included in the following assemblies:
+//
+// installing/installing_bare_metal_ipi/ipi-install-troubleshooting.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="troubleshooting-failure-to-add-the-ingress-certificate-to-kubeconfig_{context}"]
+= Troubleshooting a failure to add the ingress certificate to kubeconfig
+
+The installation program adds the default ingress certificate to the list of trusted client certificate authorities in `${INSTALL_DIR}/auth/kubeconfig`. If the installation program fails to add the ingress certificate to the `kubeconfig` file, you can retrieve the certificate from the cluster and add it.
+
+.Procedure
+
+. Retrieve the certificate from the cluster using the following command:
++
+[source,terminal]
+----
+$ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get configmaps default-ingress-cert \
+-n openshift-config-managed -o=jsonpath='{.data.ca-bundle\.crt}'
+----
++
+[source,terminal]
+----
+-----BEGIN CERTIFICATE-----
+MIIC/TCCAeWgAwIBAgIBATANBgkqhkiG9w0BAQsFADAuMSwwKgYDVQQDDCNjbHVz
+dGVyLWluZ3Jlc3Mtb3BlcmF0b3JAMTU1MTMwNzU4OTAeFw0xOTAyMjcyMjQ2Mjha
+Fw0yMTAyMjYyMjQ2MjlaMC4xLDAqBgNVBAMMI2NsdXN0ZXItaW5ncmVzcy1vcGVy
+YXRvckAxNTUxMzA3NTg5MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
+uCA4fQ+2YXoXSUL4h/mcvJfrgpBfKBW5hfB8NcgXeCYiQPnCKblH1sEQnI3VC5Pk
+2OfNCF3PUlfm4i8CHC95a7nCkRjmJNg1gVrWCvS/ohLgnO0BvszSiRLxIpuo3C4S
+EVqqvxValHcbdAXWgZLQoYZXV7RMz8yZjl5CfhDaaItyBFj3GtIJkXgUwp/5sUfI
+LDXW8MM6AXfuG+kweLdLCMm3g8WLLfLBLvVBKB+4IhIH7ll0buOz04RKhnYN+Ebw
+tcvFi55vwuUCWMnGhWHGEQ8sWm/wLnNlOwsUz7S1/sW8nj87GFHzgkaVM9EOnoNI
+gKhMBK9ItNzjrP6dgiKBCQIDAQABoyYwJDAOBgNVHQ8BAf8EBAMCAqQwEgYDVR0T
+AQH/BAgwBgEB/wIBADANBgkqhkiG9w0BAQsFAAOCAQEAq+vi0sFKudaZ9aUQMMha
+CeWx9CZvZBblnAWT/61UdpZKpFi4eJ2d33lGcfKwHOi2NP/iSKQBebfG0iNLVVPz
+vwLbSG1i9R9GLdAbnHpPT9UG6fLaDIoKpnKiBfGENfxeiq5vTln2bAgivxrVlyiq
++MdDXFAWb6V4u2xh6RChI7akNsS3oU9PZ9YOs5e8vJp2YAEphht05X0swA+X8V8T
+C278FFifpo0h3Q0Dbv8Rfn4UpBEtN4KkLeS+JeT+0o2XOsFZp7Uhr9yFIodRsnNo
+H/Uwmab28ocNrGNiEVaVH6eTTQeeZuOdoQzUbClElpVmkrNGY0M42K0PvOQ/e7+y
+AQ==
+-----END CERTIFICATE-----
+----
+
+. Add the certificate to the `client-certificate-authority-data` field in the `${INSTALL_DIR}/auth/kubeconfig` file.
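kubeconfig files store certificate material base64-encoded on a single line, so "add the certificate" in the last step implies encoding the retrieved PEM bundle first. A sketch with a placeholder certificate and a scratch file; the field name is taken from the module's wording and is not verified against the kubeconfig schema:

```shell
# Sketch only: encode a PEM bundle the way kubeconfig stores certificate data.
# The certificate body below is a placeholder, not a real certificate.
workdir=$(mktemp -d)
cat > "$workdir/ingress-ca.crt" <<'EOF'
-----BEGIN CERTIFICATE-----
PLACEHOLDERBASE64DATA
-----END CERTIFICATE-----
EOF

# kubeconfig expects one unbroken base64 line.
encoded=$(base64 < "$workdir/ingress-ca.crt" | tr -d '\n')
printf 'client-certificate-authority-data: %s\n' "$encoded" >> "$workdir/kubeconfig"

grep -q 'client-certificate-authority-data' "$workdir/kubeconfig" && echo "certificate recorded"
rm -rf "$workdir"
```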
modules/ipi-install-troubleshooting-troubleshooting-failure-to-fetch-the-console-url.adoc (new file)

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+// This module is included in the following assemblies:
+//
+// installing/installing_bare_metal_ipi/ipi-install-troubleshooting.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="troubleshooting-failure-to-fetch-the-console-url_{context}"]
+= Troubleshooting a failure to fetch the console URL
+
+The installation program retrieves the URL for the {product-title} console by using `[route][route-object]` within the `openshift-console` namespace. If the installation program fails to retrieve the URL for the console, use the following procedure.
+
+.Procedure
+
+. Check if the console router is in the `Available` or `Failing` state by running the following command:
++
+[source,terminal]
+----
+$ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get clusteroperator console -oyaml
+----
++
+[source,yaml]
+----
+apiVersion: config.openshift.io/v1
+kind: ClusterOperator
+metadata:
+  creationTimestamp: 2019-02-27T22:46:57Z
+  generation: 1
+  name: console
+  resourceVersion: "19682"
+  selfLink: /apis/config.openshift.io/v1/clusteroperators/console
+  uid: 960364aa-3ae1-11e9-bad4-0a97b6ba9358
+spec: {}
+status:
+  conditions:
+  - lastTransitionTime: 2019-02-27T22:46:58Z
+    status: "False"
+    type: Failing
+  - lastTransitionTime: 2019-02-27T22:50:12Z
+    status: "False"
+    type: Progressing
+  - lastTransitionTime: 2019-02-27T22:50:12Z
+    status: "True"
+    type: Available
+  - lastTransitionTime: 2019-02-27T22:46:57Z
+    status: "True"
+    type: Upgradeable
+  extension: null
+  relatedObjects:
+  - group: operator.openshift.io
+    name: cluster
+    resource: consoles
+  - group: config.openshift.io
+    name: cluster
+    resource: consoles
+  - group: oauth.openshift.io
+    name: console
+    resource: oauthclients
+  - group: ""
+    name: openshift-console-operator
+    resource: namespaces
+  - group: ""
+    name: openshift-console
+    resource: namespaces
+  versions: null
+----
+
+. Manually retrieve the console URL by executing the following command:
++
+[source,terminal]
+----
+$ oc --kubeconfig=${INSTALL_DIR}/auth/kubeconfig get route console -n openshift-console \
+-o=jsonpath='{.spec.host}' console-openshift-console.apps.adahiya-1.devcluster.openshift.com
+----
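The jsonpath expression in the last step simply pulls `.spec.host` out of the route object. A hypothetical sketch of the same extraction against a canned, heavily trimmed route fragment (host value invented) instead of a live cluster:

```shell
# Sketch only: extract .spec.host from a route object, as the jsonpath query
# above does, using an invented single-line JSON fragment.
route_json='{"kind":"Route","spec":{"host":"console-openshift-console.apps.example.com"}}'

# Capture the value of the "host" key.
console_host=$(printf '%s' "$route_json" | sed -n 's/.*"host":"\([^"]*\)".*/\1/p')
echo "https://$console_host"
```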
