
Commit 09c4ea8

Merge pull request ceph#54303 from zdover23/wip-doc-2023-11-02-cephadm-troubleshooting-2-of-x
doc/cephadm: edit troubleshooting.rst (2 of x)

Reviewed-by: John Mulligan <[email protected]>
2 parents 2be7a25 + b096c21 commit 09c4ea8

File tree: 1 file changed (+125, −93 lines)


doc/cephadm/troubleshooting.rst

Lines changed: 125 additions & 93 deletions
@@ -266,7 +266,7 @@ Each Ceph daemon provides an admin socket that bypasses the MONs (See
Running Various Ceph Tools
--------------------------------

To run Ceph tools such as ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example::

@@ -283,98 +283,114 @@ To run Ceph tools like ``ceph-objectstore-tool`` or
  election_strategy: 1
  0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

The cephadm shell sets up the environment in a way that is suitable for
extended daemon maintenance and for the interactive running of daemons.
289289
.. _cephadm-restore-quorum:

Restoring the Monitor Quorum
----------------------------

If the Ceph Monitor daemons (mons) cannot form a quorum, ``cephadm`` will not
be able to manage the cluster until quorum is restored.

In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps:

1. Stop all Monitors. Use ``ssh`` to connect to each Monitor's host, and then
   while connected to the Monitor's host use ``cephadm`` to stop the Monitor
   daemon:

   .. prompt:: bash

      ssh {mon-host}
      cephadm unit --name {mon.hostname} stop

2. Identify a surviving Monitor and log in to its host:

   .. prompt:: bash

      ssh {mon-host}
      cephadm enter --name {mon.hostname}

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`. A condensed
   sketch of that procedure is shown below.
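For orientation only, the linked procedure amounts to monmap surgery performed
from inside the ``cephadm enter`` session opened in step 2. A condensed sketch,
with ``{mon-id}`` and ``{bad-mon-id}`` as placeholders (consult the linked
section for the authoritative steps)::

    # export the monmap from the surviving Monitor's store
    ceph-mon -i {mon-id} --extract-monmap /tmp/monmap
    # remove each unhealthy Monitor from the map
    monmaptool /tmp/monmap --rm {bad-mon-id}
    # write the edited monmap back, then restart the Monitor
    ceph-mon -i {mon-id} --inject-monmap /tmp/monmap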
.. _cephadm-manually-deploy-mgr:

Manually Deploying a Manager Daemon
-----------------------------------

At least one Manager (``mgr``) daemon is required by cephadm in order to manage
the cluster. If the last remaining Manager has been removed from the Ceph
cluster, follow these steps in order to deploy a fresh Manager on an arbitrary
host in your cluster. In this example, the freshly-deployed Manager daemon is
called ``mgr.hostname.smfvfd``.

#. Disable the cephadm scheduler, in order to prevent ``cephadm`` from removing
   the new Manager. See :ref:`cephadm-enable-cli`:

   .. prompt:: bash #

      ceph config-key set mgr/cephadm/pause true

#. Retrieve or create the "auth entry" for the new Manager:

   .. prompt:: bash #

      ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

#. Retrieve the Monitor's configuration:

   .. prompt:: bash #

      ceph config generate-minimal-conf

#. Retrieve the container image:

   .. prompt:: bash #

      ceph config get "mgr.hostname.smfvfd" container_image

#. Create a file called ``config-json.json``, which contains the information
   necessary to deploy the daemon:

   .. code-block:: json

      {
        "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
        "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
      }

#. Deploy the Manager daemon:

   .. prompt:: bash #

      cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json
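After the new Manager is up and ``cephadm`` is able to manage the cluster
again, remember that the scheduler was paused in the first step. A minimal
sketch of undoing that, assuming nothing else was changed (it simply mirrors
the pause command above)::

    # confirm that the freshly deployed Manager daemon is running
    ceph -s
    # re-enable the cephadm scheduler that was paused earlier
    ceph config-key set mgr/cephadm/pause false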
Capturing Core Dumps
---------------------

A Ceph cluster that uses ``cephadm`` can be configured to capture core dumps.
The initial capture and processing of the coredump is performed by
`systemd-coredump
<https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.

To enable coredump handling, run the following command:

.. prompt:: bash #

   ulimit -c unlimited

.. note::

   Core dumps are not namespaced by the kernel. This means that core dumps are
   written to ``/var/lib/systemd/coredump`` on the container host. The ``ulimit
   -c unlimited`` setting will persist only until the system is rebooted.

Wait for the crash to happen again. To simulate the crash of a daemon, run for
example ``killall -3 ceph-mon``.
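To confirm on the container host that a dump was actually captured, the
``coredumpctl`` utility that ships with ``systemd-coredump`` can be used; the
``ceph-mon`` match below is only an example::

    # list captured core dumps whose command name matches ceph-mon
    coredumpctl list ceph-mon
    # the compressed core files themselves are stored here
    ls -lh /var/lib/systemd/coredump/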
Running the Debugger with cephadm
@@ -383,45 +399,58 @@ Running the Debugger with cephadm
Running a single debugging session
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initiate a debugging session by using the ``cephadm shell`` command.
From within the shell container we need to install the debugger and debuginfo
packages. To debug a core file captured by systemd, run the following:

#. Start the shell session:

   .. prompt:: bash #

      cephadm shell --mount /var/lib/systemd/coredump

#. From within the shell session, run the following commands:

   .. prompt:: bash #

      dnf install ceph-debuginfo gdb zstd

   .. prompt:: bash #

      unzstd /var/lib/systemd/coredump/core.ceph-*.zst

   .. prompt:: bash #

      gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-*.zst

#. Run debugger commands at gdb's prompt:

   .. prompt:: bash (gdb)

      bt

   ::

      #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
      #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
      #3  0x0000563085ca3d7e in main ()
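Other standard gdb commands can be issued at the same prompt; none of these are
specific to Ceph::

    (gdb) info threads           # list the threads present in the core
    (gdb) thread apply all bt    # print a backtrace for every thread
    (gdb) frame 2                # select a frame from the current backtrace
    (gdb) quit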
Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``cephadm shell``, as in the example above, any changes made to the
container that is spawned by the shell command are ephemeral. After the shell
session exits, the files that were downloaded and installed cease to be
available. You can simply re-run the same commands every time ``cephadm
shell`` is invoked, but in order to save time and resources one can create a
new container image and use it for repeated debugging sessions.

In the following example, we create a simple file that will construct the
container image. The command below uses podman but it is expected to work
correctly even if ``podman`` is replaced with ``docker``::

    cat >Containerfile <<EOF
    ARG BASE_IMG=quay.io/ceph/ceph:v18
@@ -432,16 +461,17 @@ if ``podman`` is replaced with ``docker``.
    podman build -t ceph:debugging -f Containerfile .
    # pass --build-arg=BASE_IMG=<your image> to customize the base image

The above file creates a new local image named ``ceph:debugging``. This image
can be used on the same machine that built it. The image can also be pushed to
a container repository or saved and copied to a node running other Ceph
containers. Consult the ``podman`` or ``docker`` documentation for more
information about the container workflow.
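For example, one way to move the image to another node without going through a
container registry is to save and load it; ``other-host`` below is a
placeholder::

    # export the local debugging image to a tar archive
    podman save -o ceph-debugging.tar ceph:debugging
    # copy the archive to the target node and import it there
    scp ceph-debugging.tar other-host:
    ssh other-host podman load -i ceph-debugging.tar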

After the image has been built, it can be used to initiate repeated debugging
sessions. By using an image in this way, you avoid the trouble of having to
re-install the debug tools and debuginfo packages every time you need to run a
debug session. To debug a core file using this image, in the same way as
previously described, run:

.. prompt:: bash #

@@ -451,29 +481,31 @@ run:
Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~

The gdb debugger can attach to running processes to debug them. This can be
achieved with a containerized process by using the debug image and attaching it
to the same PID namespace in which the process to be debugged resides.

This requires running a container command with some custom arguments. We can
generate a script that can debug a process in a running container.

.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh

This creates a script that includes the container command that ``cephadm``
would use to create a shell. Modify the script by removing the ``--init``
argument and replacing it with the argument that joins to the namespace used
for a running container. For example, assume we want to debug the Manager and
have determined that the Manager is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
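The edit can be made with any text editor, or with a one-liner along these
lines (a sketch; the container name is the example name used above)::

    # swap --init for the argument that joins the target container's PID namespace
    sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh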

We can run our debugging container with ``sh /tmp/debug.sh``. Within the shell,
we can run commands such as ``ps`` to get the PID of the Manager process. In
the following example this is ``2``. While running gdb, we can attach to the
running process:

.. prompt:: bash (gdb)
