Commit d57a49b

Merge pull request #63394 from johnwilkins/TELCODOCS-1111
TELCODOCS-1111: D/S: OCPVE-218 LVMO recover from failure

2 parents 4bacdf5 + 8b9228a

7 files changed: +323 -0

_topic_maps/_topic_map.yml — 2 additions & 0 deletions

@@ -1530,6 +1530,8 @@ Topics:
   File: persistent-storage-hostpath
 - Name: Persistent storage using LVM Storage
   File: persistent-storage-using-lvms
+- Name: Troubleshooting local persistent storage using LVMS
+  File: troubleshooting-local-persistent-storage-using-lvms
 - Name: Using Container Storage Interface (CSI)
   Dir: container_storage_interface
   Distros: openshift-enterprise,openshift-origin
modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc — 49 additions & 0 deletions (new file)

// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="investigating-a-pvc-stuck-in-the-pending-state_{context}"]
= Investigating a PVC stuck in the Pending state

A persistent volume claim (PVC) can get stuck in a `Pending` state for a number of reasons. For example:

- Insufficient computing resources
- Network problems
- Mismatched storage class or node selector
- No available volumes
- The node with the persistent volume (PV) is in a `Not Ready` state

Identify the cause by using the `oc describe` command to review details about the stuck PVC.

.Procedure

. Retrieve the list of PVCs by running the following command:
+
[source,terminal]
----
$ oc get pvc
----
+
.Example output
[source,terminal]
----
NAME        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
lvms-test   Pending                                      lvms-vg1       11s
----

. Inspect the events associated with a PVC stuck in the `Pending` state by running the following command:
+
[source,terminal]
----
$ oc describe pvc <pvc_name> <1>
----
<1> Replace `<pvc_name>` with the name of the PVC. For example, `lvms-test`.
+
.Example output
[source,terminal]
----
Type     Reason              Age               From                         Message
----     ------              ----              ----                         -------
Warning  ProvisioningFailed  4s (x2 over 17s)  persistentvolume-controller  storageclass.storage.k8s.io "lvms-vg1" not found
----
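On a cluster with many claims, it can help to narrow the `oc get pvc` output to only the stuck PVCs before running `oc describe` on each one. The following is a minimal sketch, not part of the documented procedure; the `pending_pvcs` helper name is hypothetical.

```shell
# pending_pvcs (hypothetical helper): print only PVCs in the Pending state.
# Expects the tabular output of `oc get pvc` on stdin.
pending_pvcs() {
  # Keep the header row (NR == 1) and any row whose STATUS column is Pending.
  awk 'NR == 1 || $2 == "Pending"'
}

# Example usage (requires a configured cluster):
#   oc get pvc | pending_pvcs
```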
modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc — 97 additions & 0 deletions (new file)

// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="performing-a-forced-cleanup_{context}"]
= Performing a forced cleanup

If disk- or node-related problems persist after you complete the troubleshooting procedures, it might be necessary to perform a forced cleanup. A forced cleanup comprehensively addresses persistent issues and ensures the proper functioning of the LVMS.

.Prerequisites

* All of the persistent volume claims (PVCs) created by using the logical volume manager storage (LVMS) driver have been removed.
* The pods using those PVCs have been stopped.

.Procedure

. Switch to the `openshift-storage` namespace by running the following command:
+
[source,terminal]
----
$ oc project openshift-storage
----

. Ensure there are no `LogicalVolume` custom resources (CRs) remaining by running the following command:
+
[source,terminal]
----
$ oc get logicalvolume
----
+
.Example output
[source,terminal]
----
No resources found
----

.. If there are any `LogicalVolume` CRs remaining, remove their finalizers by running the following command:
+
[source,terminal]
----
$ oc patch logicalvolume <name> -p '{"metadata":{"finalizers":[]}}' --type=merge <1>
----
<1> Replace `<name>` with the name of the CR.

.. After removing the finalizers, delete the CRs by running the following command:
+
[source,terminal]
----
$ oc delete logicalvolume <name> <1>
----
<1> Replace `<name>` with the name of the CR.

. Ensure there are no `LVMVolumeGroup` CRs remaining by running the following command:
+
[source,terminal]
----
$ oc get lvmvolumegroup
----
+
.Example output
[source,terminal]
----
No resources found
----

.. If there are any `LVMVolumeGroup` CRs remaining, remove their finalizers by running the following command:
+
[source,terminal]
----
$ oc patch lvmvolumegroup <name> -p '{"metadata":{"finalizers":[]}}' --type=merge <1>
----
<1> Replace `<name>` with the name of the CR.

.. After removing the finalizers, delete the CRs by running the following command:
+
[source,terminal]
----
$ oc delete lvmvolumegroup <name> <1>
----
<1> Replace `<name>` with the name of the CR.

. Remove any `LVMVolumeGroupNodeStatus` CRs by running the following command:
+
[source,terminal]
----
$ oc delete lvmvolumegroupnodestatus --all
----

. Remove the `LVMCluster` CR by running the following command:
+
[source,terminal]
----
$ oc delete lvmcluster --all
----
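The cleanup steps above can be sketched as a single shell function. This is a hypothetical, destructive sketch, not part of the documented procedure: it assumes `oc` is logged in with sufficient rights, and it must only be run after all LVMS-backed PVCs and their pods are gone.

```shell
# force_cleanup (hypothetical sketch): run the forced cleanup sequence above.
# DESTRUCTIVE: clears finalizers and deletes all LVMS custom resources.
force_cleanup() {
  ns=openshift-storage
  for kind in logicalvolume lvmvolumegroup; do
    # Clear finalizers on any remaining CRs of this kind, then delete them.
    for name in $(oc get "$kind" -n "$ns" -o name 2>/dev/null); do
      oc patch "$name" -n "$ns" -p '{"metadata":{"finalizers":[]}}' --type=merge
      oc delete "$name" -n "$ns"
    done
  done
  oc delete lvmvolumegroupnodestatus --all -n "$ns"
  oc delete lvmcluster --all -n "$ns"
}
```

Defining the function is inert; invoke `force_cleanup` explicitly only when you intend the deletion.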
modules/lvms-troubleshooting-recovering-from-disk-failure.adoc — 33 additions & 0 deletions (new file)

// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="recovering-from-disk-failure_{context}"]
= Recovering from disk failure

If you see a failure message while inspecting the events associated with the persistent volume claim (PVC), there might be a problem with the underlying volume or disk. Disk and volume provisioning issues often result in a generic error first, such as `Failed to provision volume with StorageClass <storage_class_name>`. A second, more specific error message usually follows.

.Procedure

. Inspect the events associated with a PVC by running the following command:
+
[source,terminal]
----
$ oc describe pvc <pvc_name> <1>
----
<1> Replace `<pvc_name>` with the name of the PVC. The following are some examples of disk or volume failure error messages and their causes:
+
- *Failed to check volume existence:* Indicates a problem in verifying whether the volume already exists. Volume verification failure can be caused by network connectivity problems or other failures.
+
- *Failed to bind volume:* Failure to bind a volume can happen if the persistent volume (PV) that is available does not match the requirements of the PVC.
+
- *FailedMount or FailedUnMount:* This error indicates problems when trying to mount the volume to a node or unmount a volume from a node. If the disk has failed, this error might appear when a pod tries to use the PVC.
+
- *Volume is already exclusively attached to one node and can't be attached to another:* This error can appear with storage solutions that do not support `ReadWriteMany` access modes.

. Establish a direct connection to the host where the problem is occurring.

. Resolve the disk issue.

After you have resolved the issue with the disk, you might need to perform the forced cleanup procedure if failure messages persist or recur.
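The error signatures listed above can also be scanned for mechanically. The following is a minimal sketch, not part of the documented procedure; the `disk_failure_events` helper name is hypothetical.

```shell
# disk_failure_events (hypothetical helper): flag the common disk/volume
# failure messages in `oc describe pvc <pvc_name>` output supplied on stdin.
disk_failure_events() {
  grep -E 'Failed to check volume existence|Failed to bind volume|FailedMount|FailedUnMount|already exclusively attached'
}

# Example usage (requires a configured cluster):
#   oc describe pvc my-claim | disk_failure_events
```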
modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc — 77 additions & 0 deletions (new file)

// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="recovering-from-missing-lvms-or-operator-components_{context}"]
= Recovering from missing LVMS or Operator components

If you encounter a storage class "not found" error, check the `LVMCluster` resource and ensure that all the logical volume manager storage (LVMS) pods are running. You can create an `LVMCluster` resource if it does not exist.

.Procedure

. Verify the presence of the `LVMCluster` resource by running the following command:
+
[source,terminal]
----
$ oc get lvmcluster -n openshift-storage
----
+
.Example output
[source,terminal]
----
NAME            AGE
my-lvmcluster   65m
----

. If the cluster does not have an `LVMCluster` resource, create one by running the following command:
+
[source,terminal]
----
$ oc create -n openshift-storage -f <custom_resource> <1>
----
<1> Replace `<custom_resource>` with a custom resource URL or file tailored to your requirements.
+
.Example custom resource
[source,yaml,options="nowrap",role="white-space-pre"]
----
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: my-lvmcluster
spec:
  storage:
    deviceClasses:
    - name: vg1
      default: true
      thinPoolConfig:
        name: thin-pool-1
        sizePercent: 90
        overprovisionRatio: 10
----

. Check that all the pods from LVMS are in the `Running` state in the `openshift-storage` namespace by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-storage
----
+
.Example output
[source,terminal]
----
NAME                                  READY   STATUS    RESTARTS   AGE
lvms-operator-7b9fb858cb-6nsml        3/3     Running   0          70m
topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0          66m
topolvm-node-dr26h                    4/4     Running   0          66m
vg-manager-r6zdv                      1/1     Running   0          66m
----
+
The expected output is one running instance of `lvms-operator` and `topolvm-controller`. One instance of `topolvm-node` and `vg-manager` is expected for each node.
+
If `topolvm-node` is stuck in the `Init` state, it has failed to locate an available disk for LVMS to use. To retrieve the information necessary to troubleshoot this issue, review the logs of the `vg-manager` pod by running the following command:
+
[source,terminal]
----
$ oc logs -l app.kubernetes.io/component=vg-manager -n openshift-storage
----
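To spot a pod stuck in `Init` or any other non-running state at a glance, the `oc get pods` output above can be filtered. This is a minimal sketch, not part of the documented procedure; the `not_running_pods` helper name is hypothetical.

```shell
# not_running_pods (hypothetical helper): list pods that are not in the
# Running state. Expects `oc get pods -n openshift-storage` output on stdin.
not_running_pods() {
  # Skip the header row; print rows whose STATUS column is not "Running".
  awk 'NR > 1 && $3 != "Running"'
}

# Example usage (requires a configured cluster):
#   oc get pods -n openshift-storage | not_running_pods
```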
modules/lvms-troubleshooting-recovering-from-node-failure.adoc — 34 additions & 0 deletions (new file)

// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="recovering-from-node-failure_{context}"]
= Recovering from node failure

Sometimes a persistent volume claim (PVC) is stuck in a `Pending` state because a particular node in the cluster has failed. To identify the failed node, you can examine the restart count of the `topolvm-node` pod. An increased restart count indicates potential problems with the underlying node, which might require further investigation and troubleshooting.

.Procedure

* Examine the restart count of the `topolvm-node` pod instances by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-storage
----
+
.Example output
[source,terminal]
----
NAME                                  READY   STATUS    RESTARTS      AGE
lvms-operator-7b9fb858cb-6nsml        3/3     Running   0             70m
topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0             66m
topolvm-node-dr26h                    4/4     Running   0             66m
topolvm-node-54as8                    4/4     Running   0             66m
topolvm-node-78fft                    4/4     Running   17 (8s ago)   66m
vg-manager-r6zdv                      1/1     Running   0             66m
vg-manager-990ut                      1/1     Running   0             66m
vg-manager-an118                      1/1     Running   0             66m
----
+
After you resolve any issues with the node, you might need to perform the forced cleanup procedure if the PVC is still stuck in a `Pending` state.
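On a large cluster, scanning the RESTARTS column by eye is error-prone; it can be filtered instead. This is a minimal sketch, not part of the documented procedure; the `high_restart_pods` helper name is hypothetical.

```shell
# high_restart_pods (hypothetical helper): flag pods with a nonzero restart
# count. Expects `oc get pods -n openshift-storage` output on stdin.
high_restart_pods() {
  # RESTARTS is the 4th column; a value such as "17 (8s ago)" still
  # parses as 17 because $4+0 coerces the leading number.
  awk 'NR > 1 && $4+0 > 0 {print $1 " restarts=" $4}'
}

# Example usage (requires a configured cluster):
#   oc get pods -n openshift-storage | high_restart_pods
```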
storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc — 31 additions & 0 deletions (new file)

:_content-type: ASSEMBLY
[id="troubleshooting-local-persistent-storage"]
= Troubleshooting local persistent storage using LVMS
include::_attributes/common-attributes.adoc[]
:context: troubleshooting-local-persistent-storage-using-lvms

toc::[]

Because {product-title} does not scope a persistent volume (PV) to a single project, it can be shared across the cluster and claimed by any project using a persistent volume claim (PVC). This can lead to a number of issues that require troubleshooting.

include::modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc[leveloffset=+1]

include::modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc[leveloffset=+1]

include::modules/lvms-troubleshooting-recovering-from-node-failure.adoc[leveloffset=+1]

[role="_additional-resources"]
[id="additional-resources-forced-cleanup-1"]
.Additional resources

* xref:troubleshooting-local-persistent-storage-using-lvms.adoc#performing-a-forced-cleanup_troubleshooting-local-persistent-storage-using-lvms[Performing a forced cleanup]

include::modules/lvms-troubleshooting-recovering-from-disk-failure.adoc[leveloffset=+1]

[role="_additional-resources"]
[id="additional-resources-forced-cleanup-2"]
.Additional resources

* xref:troubleshooting-local-persistent-storage-using-lvms.adoc#performing-a-forced-cleanup_troubleshooting-local-persistent-storage-using-lvms[Performing a forced cleanup]

include::modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc[leveloffset=+1]
