- ==========================
- Managing Ceph with Cephadm
- ==========================
+ ===========================
+ Managing and Operating Ceph
+ ===========================
+
+ Working with Cephadm
+ ====================

cephadm configuration location
- ==============================
+ ------------------------------

In the kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific
Kayobe environment when using multiple environments, e.g.
- ``etc/kayobe/environments/production/cephadm.yml``)
+ ``etc/kayobe/environments/<Environment Name>/cephadm.yml``)

StackHPC's cephadm Ansible collection relies on multiple inventory groups:

@@ -19,7 +22,7 @@ StackHPC's cephadm Ansible collection relies on multiple inventory groups:
Those groups are usually defined in ``etc/kayobe/inventory/groups``.

Running cephadm playbooks
- =========================
+ -------------------------

In the kayobe-config repository, under ``etc/kayobe/ansible``, there is a set of
cephadm-based playbooks utilising the stackhpc.cephadm Ansible Galaxy collection.
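
As an illustrative sketch (the prompt is generic, and this assumes the usual
Kayobe conventions of an activated virtualenv and the ``$KAYOBE_CONFIG_PATH``
environment variable), such a playbook is run with the Kayobe CLI:

.. code-block:: console

   kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml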
@@ -36,7 +39,7 @@ cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection.
- ``cephadm-pools.yml`` - defines Ceph pools

Running Ceph commands
- =====================
+ ---------------------

Ceph commands are usually run inside a ``cephadm shell`` utility container:

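As a minimal sketch (the host prompt is illustrative and assumes root access on
a host holding the admin keyring, such as a MON):

.. code-block:: console

   storage-0# cephadm shell
   ceph# ceph -s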
@@ -47,12 +50,12 @@ Ceph commands are usually run inside a ``cephadm shell`` utility container:

Operating a cluster requires a keyring with admin access to be available for Ceph
commands. Cephadm will copy such a keyring to the nodes carrying the
- `_admin <https://docs.ceph.com/en/quincy/cephadm/host-management/#special-host-labels>`__
+ `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
label, which is present on MON servers by default when using the
`StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__.

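For example (a sketch only; the host name is hypothetical), labels can be
inspected and applied with the orchestrator CLI:

.. code-block:: console

   ceph# ceph orch host ls
   ceph# ceph orch host label add storage-0 _admin
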
Adding a new storage node
- =========================
+ -------------------------

Add a node to the respective group (e.g. ``osds``) and run the ``cephadm-deploy.yml``
playbook.
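
Once the playbook has completed, it can be useful to confirm that the new host
and its disks have been picked up (an illustrative sketch; the host name is
hypothetical):

.. code-block:: console

   ceph# ceph orch host ls
   ceph# ceph orch device ls storage-3
   ceph# ceph osd tree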
@@ -62,7 +65,7 @@ playbook.
``-e cephadm_bootstrap=True`` on playbook run.

Removing a storage node
- =======================
+ -----------------------

First drain the node

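As a sketch of what draining typically involves with the cephadm orchestrator
(the host name is hypothetical, and the exact procedure for this deployment may
differ):

.. code-block:: console

   ceph# ceph orch host drain storage-3
   ceph# ceph orch osd rm status
   ceph# ceph orch host rm storage-3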
@@ -85,7 +88,7 @@ Additional options/commands may be found in
`Host management <https://docs.ceph.com/en/latest/cephadm/host-management/>`_

Replacing a Failed Ceph Drive
- =============================
+ -----------------------------

Once an OSD has been identified as having a hardware failure,
the affected drive will need to be replaced.
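
As an illustrative sketch (the OSD ID is hypothetical), cephadm can preserve the
OSD ID for the replacement disk when the OSD is removed with the ``--replace``
flag:

.. code-block:: console

   ceph# ceph orch osd rm 42 --replace
   ceph# ceph orch osd rm status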
@@ -121,3 +124,156 @@ If this is not your desired action plan - it's best to modify the drivegroup
spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``).
Either set ``unmanaged: true`` to stop cephadm from picking up new disks or
modify it in some way that it no longer matches the drives you want to remove.
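
To see which OSD service specification cephadm is currently applying before
editing that variable, the active specs can be exported (a brief sketch; output
format varies between Ceph releases):

.. code-block:: console

   ceph# ceph orch ls osd --export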
+
+
+ Operations
+ ==========
+
+ Replacing a drive
+ -----------------
+
+ See the upstream documentation:
+ https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd
+
+ In the case where a disk holding the DB and/or WAL fails, it is necessary to
+ recreate (using the replacement procedure above) all OSDs that are associated
+ with this disk - usually an NVMe drive. The following single command is
+ sufficient to identify which OSDs are tied to which physical disks:
+
+ .. code-block:: console
+
+    ceph# ceph device ls
+
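+ As a supplementary sketch (the OSD ID and host name are hypothetical), the
+ device-to-OSD mapping can also be narrowed down per host or per OSD:
+
+ .. code-block:: console
+
+    ceph# ceph device ls-by-host storage-0
+    ceph# ceph osd metadata 12 | grep -E '"devices"|"bluefs_db_devices"'
+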
+ Host maintenance
+ ----------------
+
+ See the upstream documentation:
+ https://docs.ceph.com/en/latest/cephadm/host-management/#maintenance-mode
+
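+ As a brief sketch (the host name is hypothetical), a host is placed into and
+ taken out of maintenance mode with:
+
+ .. code-block:: console
+
+    ceph# ceph orch host maintenance enter storage-0
+    ceph# ceph orch host maintenance exit storage-0
+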
+ Upgrading
+ ---------
+
+ See the upstream documentation:
+ https://docs.ceph.com/en/latest/cephadm/upgrade/
+
+
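+ As a minimal sketch (the container image tag below is only an example), an
+ upgrade is started and monitored with:
+
+ .. code-block:: console
+
+    ceph# ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.2
+    ceph# ceph orch upgrade status
+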
+ Troubleshooting
+ ===============
+
+ Investigating a Failed Ceph Drive
+ ---------------------------------
+
+ A failing drive in a Ceph cluster will cause the OSD daemon to crash.
+ In this case, Ceph will go into a ``HEALTH_WARN`` state.
+ Ceph can report details about failed OSDs by running:
+
+ .. code-block:: console
+
+    ceph# ceph health detail
+
+ .. note::
+
+    Remember to run ceph/rbd commands from within ``cephadm shell``
+    (preferred method) or after installing the Ceph client. Details are in the
+    official `documentation <https://docs.ceph.com/en/latest/cephadm/install/#enable-ceph-cli>`__.
+    The host where commands are executed must also have an admin Ceph keyring
+    present - this is easiest to achieve by applying the
+    `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
+    label (Ceph MON servers have it by default when using the
+    `StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__).
+
+ A failed OSD will also be reported as down by running:
+
+ .. code-block:: console
+
+    ceph# ceph osd tree
+
+ Note the ID of the failed OSD.
+
+ The failed disk is usually logged by the Linux kernel too:
+
+ .. code-block:: console
+
+    storage-0# dmesg -T
+
+ Cross-reference the hardware device and OSD ID to ensure they match.
+ (Using ``pvs`` and ``lvs`` may help to make this connection.)
+
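+ For example (a sketch only; the host name is hypothetical), the LVM layout
+ backing each OSD can be listed directly on the storage node:
+
+ .. code-block:: console
+
+    storage-0# pvs -o pv_name,vg_name
+    storage-0# lvs -o lv_name,vg_name,devices
+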
+ Inspecting a Ceph Block Device for a VM
+ ---------------------------------------
+
+ To find out what block devices are attached to a VM, go to the hypervisor that
+ it is running on (an admin-level user can see this from ``openstack server
+ show``).
+
+ On this hypervisor, enter the libvirt container:
+
+ .. code-block:: console
+    :substitutions:
+
+    |hypervisor_hostname|# docker exec -it nova_libvirt /bin/bash
+
+ Find the VM name using libvirt:
+
+ .. code-block:: console
+    :substitutions:
+
+    (nova-libvirt)[root@|hypervisor_hostname| /]# virsh list
+     Id   Name                State
+    -----------------------------------
+     1    instance-00000001   running
+
+ Now inspect the properties of the VM using ``virsh dumpxml``:
+
+ .. code-block:: console
+    :substitutions:
+
+    (nova-libvirt)[root@|hypervisor_hostname| /]# virsh dumpxml instance-00000001 | grep rbd
+        <source protocol='rbd' name='|nova_rbd_pool|/51206278-e797-4153-b720-8255381228da_disk'>
+
+ On a Ceph node, the RBD pool can be inspected and the volume extracted as a RAW
+ block image:
+
+ .. code-block:: console
+    :substitutions:
+
+    ceph# rbd ls |nova_rbd_pool|
+    ceph# rbd export |nova_rbd_pool|/51206278-e797-4153-b720-8255381228da_disk blob.raw
+
+ The raw block device (``blob.raw`` above) can be mounted using the loopback device.
+
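+ For example (a sketch; the loop device and partition layout depend on the
+ image), the export can be attached and mounted as follows:
+
+ .. code-block:: console
+
+    ceph# losetup --find --show --partscan blob.raw
+    /dev/loop0
+    ceph# mount /dev/loop0p1 /mnt
+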
+ Inspecting a QCOW Image using LibGuestFS
+ ----------------------------------------
+
+ The virtual machine's root image can be inspected by installing
+ libguestfs-tools and using the guestfish command:
+
+ .. code-block:: console
+
+    ceph# export LIBGUESTFS_BACKEND=direct
+    ceph# guestfish -a blob.qcow
+    ><fs> run
+     100% [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 00:00
+    ><fs> list-filesystems
+    /dev/sda1: ext4
+    ><fs> mount /dev/sda1 /
+    ><fs> ls /
+    bin
+    boot
+    dev
+    etc
+    home
+    lib
+    lib64
+    lost+found
+    media
+    mnt
+    opt
+    proc
+    root
+    run
+    sbin
+    srv
+    sys
+    tmp
+    usr
+    var
+    ><fs> quit
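+
+ libguestfs also provides non-interactive helpers; for example (a sketch,
+ assuming ``virt-cat`` from libguestfs-tools is installed):
+
+ .. code-block:: console
+
+    ceph# virt-cat -a blob.qcow /etc/os-release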