
Commit d9027b5

Add details for Ceph drive replacement
The process for removing and replacing a Ceph drive is described. Ceph-Ansible invocation for redeployment is variable between deployments.
1 parent d821ede

3 files changed (+110, -6 lines)

source/ceph_storage.rst

Lines changed: 5 additions & 4 deletions
@@ -16,14 +16,15 @@ Ceph Storage
 
 The Ceph deployment is not managed by StackHPC Ltd.
 
+Ceph Operations and Troubleshooting
+===================================
+
+.. include:: include/ceph_troubleshooting.rst
+
 .. ifconfig:: deployment['ceph_ansible']
 
    Ceph Ansible
    ============
 
    .. include:: include/ceph_ansible.rst
 
-   Ceph Troubleshooting
-   ====================
-
-   .. include:: include/ceph_troubleshooting.rst

source/include/ceph_ansible.rst

Lines changed: 42 additions & 2 deletions
@@ -1,5 +1,45 @@

Making a Ceph-Ansible Checkout
==============================

Invoking Ceph-Ansible
=====================

Replacing a Failed Ceph Drive
=============================

Once an OSD has been identified as having a hardware failure,
the affected drive will need to be replaced.

.. note::

   Hot-swapping a failed device will change the device enumeration,
   and this could confuse the device addressing in the Kayobe LVM
   configuration.

   In kayobe-config, use ``/dev/disk/by-path`` device references to
   avoid this issue.

   Alternatively, always reboot a server when swapping drives.
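
The stable ``by-path`` reference for a drive can be found by listing
the symlinks on the storage node (shown as a general illustration;
the names are system-specific):

.. code-block:: console

   storage-0# ls -l /dev/disk/by-path/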

If rebooting a Ceph node, first set ``noout`` to prevent excess data
movement:

.. code-block:: console

   ceph# ceph osd set noout
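
To confirm the flag is in place before proceeding, the cluster flags
can be inspected (an optional check):

.. code-block:: console

   ceph# ceph osd dump | grep flags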

Apply LVM configuration using Kayobe for the replaced device (here on ``storage-0``):

.. code-block:: console

   kayobe$ kayobe overcloud host configure -t lvm -kt none -l storage-0 -kl storage-0

Before running Ceph-Ansible, also remove the vestigial state directory
from ``/var/lib/ceph/osd`` for the purged OSD.
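
For example, for OSD ID 4:

.. code-block:: console

   storage-0# rm -rf /var/lib/ceph/osd/ceph-4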

Reapply Ceph-Ansible in the usual manner.
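
The exact invocation varies between deployments (see Invoking
Ceph-Ansible above). As a rough sketch, assuming a checkout with an
``inventory`` file alongside ``site.yml``, it might resemble:

.. code-block:: console

   kayobe$ cd ~/ceph-ansible
   kayobe$ ansible-playbook -i inventory site.yml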

.. note::

   Ceph-Ansible runs can fail to complete if background activity,
   such as backfilling, is underway when the playbook is invoked.
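
   Progress of any such background activity can be monitored before
   invoking the playbook (an illustrative check):

   .. code-block:: console

      ceph# ceph -s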

source/include/ceph_troubleshooting.rst

Lines changed: 63 additions & 0 deletions
@@ -1,3 +1,66 @@

Investigating a Failed Ceph Drive
---------------------------------

After deployment, a failed drive may cause OSD crashes in Ceph.
If Ceph detects crashed OSDs, it will go into ``HEALTH_WARN`` state.
Ceph can report details about failed OSDs by running:

.. code-block:: console

   ceph# ceph health detail

A failed OSD will also be reported as down by running:

.. code-block:: console

   ceph# ceph osd tree

Note the ID of the failed OSD.

The failed hardware device is logged by the Linux kernel:

.. code-block:: console

   storage-0# dmesg -T
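
To narrow the kernel log down to disk errors, the output can be
filtered (an illustrative pattern; exact messages vary by driver):

.. code-block:: console

   storage-0# dmesg -T | grep -i 'i/o error'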

Cross-reference the hardware device and OSD ID to ensure they match
(using ``pvs`` and ``lvs`` may help make this connection).
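
For example, to map OSD logical volumes back to their physical
devices (an illustrative invocation using standard LVM report
fields):

.. code-block:: console

   storage-0# pvs -o pv_name,vg_name
   storage-0# lvs -o lv_name,vg_name,devices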

Removing a Failed Ceph Drive
----------------------------

If a drive is verified dead, stop and eject the OSD (e.g. ``osd.4``)
from the cluster:

.. code-block:: console

   storage-0# systemctl stop ceph-osd@4.service
   storage-0# systemctl disable ceph-osd@4.service
   ceph# ceph osd out osd.4

.. ifconfig:: deployment['ceph_ansible']

   Before running Ceph-Ansible, also remove the vestigial state
   directory from ``/var/lib/ceph/osd`` for the purged OSD, for
   example for OSD ID 4:

   .. code-block:: console

      storage-0# rm -rf /var/lib/ceph/osd/ceph-4

Remove Ceph OSD state for the old OSD, here OSD ID ``4`` (we will
backfill all the data when we reintroduce the drive):

.. code-block:: console

   ceph# ceph osd purge 4 --yes-i-really-mean-it

Unset ``noout`` once hardware maintenance has concluded, e.g. while
waiting for the replacement disk:

.. code-block:: console

   ceph# ceph osd unset noout

Inspecting a Ceph Block Device for a VM
---------------------------------------
