
Commit d9027b5

Add details for Ceph drive replacement
The process for removing and replacing a Ceph drive is described. Ceph-Ansible invocation for redeployment is variable between deployments.
1 parent d821ede

3 files changed (+110, -6 lines)

source/ceph_storage.rst

Lines changed: 5 additions & 4 deletions
@@ -16,14 +16,15 @@ Ceph Storage
 
 The Ceph deployment is not managed by StackHPC Ltd.
 
+Ceph Operations and Troubleshooting
+===================================
+
+.. include:: include/ceph_troubleshooting.rst
+
 .. ifconfig:: deployment['ceph_ansible']
 
    Ceph Ansible
    ============
 
    .. include:: include/ceph_ansible.rst
 
-   Ceph Troubleshooting
-   ====================
-
-   .. include:: include/ceph_troubleshooting.rst

source/include/ceph_ansible.rst

Lines changed: 42 additions & 2 deletions
@@ -1,5 +1,45 @@

Making a Ceph-Ansible Checkout
==============================

Invoking Ceph-Ansible
=====================

Replacing a Failed Ceph Drive
=============================

Once an OSD has been identified as having a hardware failure,
the affected drive will need to be replaced.

.. note::

   Hot-swapping a failed device will change the device enumeration,
   and this could confuse the device addressing in the Kayobe LVM
   configuration.

   In kayobe-config, use ``/dev/disk/by-path`` device references to
   avoid this issue.

   Alternatively, always reboot a server when swapping drives.
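
The stable ``by-path`` reference for a drive can be found by listing
the symlinks on the storage node (shown as a general illustration;
the names are system-specific):

.. code-block:: console

   storage-0# ls -l /dev/disk/by-path/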

If rebooting a Ceph node, first set ``noout`` to prevent excess data
movement:

.. code-block:: console

   ceph# ceph osd set noout
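
To confirm the flag is in place before proceeding, the cluster flags
can be inspected (an optional check):

.. code-block:: console

   ceph# ceph osd dump | grep flags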

Apply LVM configuration using Kayobe for the replaced device (here on ``storage-0``):

.. code-block:: console

   kayobe$ kayobe overcloud host configure -t lvm -kt none -l storage-0 -kl storage-0

Before running Ceph-Ansible, also remove the vestigial state directory
from ``/var/lib/ceph/osd`` for the purged OSD.
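
For example, for OSD ID 4:

.. code-block:: console

   storage-0# rm -rf /var/lib/ceph/osd/ceph-4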

Reapply Ceph-Ansible in the usual manner.
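
The exact invocation varies between deployments (see Invoking
Ceph-Ansible above). As a rough sketch, assuming a checkout with an
``inventory`` file alongside ``site.yml``, it might resemble:

.. code-block:: console

   kayobe$ cd ~/ceph-ansible
   kayobe$ ansible-playbook -i inventory site.yml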

.. note::

   Ceph-Ansible runs can fail to complete if background activity,
   such as backfilling, is underway when the playbook is invoked.
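
   Progress of any such background activity can be monitored before
   invoking the playbook (an illustrative check):

   .. code-block:: console

      ceph# ceph -s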

source/include/ceph_troubleshooting.rst

Lines changed: 63 additions & 0 deletions
@@ -1,3 +1,66 @@

Investigating a Failed Ceph Drive
---------------------------------

After deployment, a failed drive may cause OSD crashes in Ceph.
If Ceph detects crashed OSDs, it will go into ``HEALTH_WARN`` state.
Ceph can report details about failed OSDs by running:

.. code-block:: console

   ceph# ceph health detail

A failed OSD will also be reported as down by running:

.. code-block:: console

   ceph# ceph osd tree

Note the ID of the failed OSD.

The failed hardware device is logged by the Linux kernel:

.. code-block:: console

   storage-0# dmesg -T
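
To narrow the kernel log down to disk errors, the output can be
filtered (an illustrative pattern; exact messages vary by driver):

.. code-block:: console

   storage-0# dmesg -T | grep -i 'i/o error'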

Cross-reference the hardware device and OSD ID to ensure they match
(using ``pvs`` and ``lvs`` may help make this connection).
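
For example, to map OSD logical volumes back to their physical
devices (an illustrative invocation using standard LVM report
fields):

.. code-block:: console

   storage-0# pvs -o pv_name,vg_name
   storage-0# lvs -o lv_name,vg_name,devices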

Removing a Failed Ceph Drive
----------------------------

If a drive is verified dead, stop and eject the OSD (e.g. ``osd.4``)
from the cluster:

.. code-block:: console

   storage-0# systemctl stop ceph-osd@4.service
   storage-0# systemctl disable ceph-osd@4.service
   ceph# ceph osd out osd.4

.. ifconfig:: deployment['ceph_ansible']

   Before running Ceph-Ansible, also remove the vestigial state
   directory from ``/var/lib/ceph/osd`` for the purged OSD, for
   example for OSD ID 4:

   .. code-block:: console

      storage-0# rm -rf /var/lib/ceph/osd/ceph-4

Remove Ceph OSD state for the old OSD, here OSD ID ``4`` (we will
backfill all the data when we reintroduce the drive):

.. code-block:: console

   ceph# ceph osd purge 4 --yes-i-really-mean-it

Unset ``noout`` once hardware maintenance has concluded, e.g. while
waiting for the replacement disk:

.. code-block:: console

   ceph# ceph osd unset noout

Inspecting a Ceph Block Device for a VM
---------------------------------------
