Commit 4a426ea

Merge pull request ceph#54022 from zdover23/wip-doc-2023-10-15-rados-troubleshooting-troubleshooting-osd-3-of-x
doc/rados: Edit troubleshooting-osd (3 of x)

Reviewed-by: Anthony D'Atri <[email protected]>
2 parents 9b94342 + 01b5aa5 commit 4a426ea

doc/rados/troubleshooting/troubleshooting-osd.rst

Lines changed: 123 additions & 94 deletions
@@ -544,33 +544,39 @@ Recovery Throttling
 -------------------
 
 Depending upon your configuration, Ceph may reduce recovery rates to maintain
-performance or it may increase recovery rates to the point that recovery
-impacts OSD performance. Check to see if the OSD is recovering.
+client or OSD performance, or it may increase recovery rates to the point that
+recovery impacts client or OSD performance. Check to see if the client or OSD
+is recovering.
+
 
 Kernel Version
 --------------
 
-Check the kernel version you are running. Older kernels may not receive
-new backports that Ceph depends upon for better performance.
+Check the kernel version that you are running. Older kernels may lack updates
+that improve Ceph performance.
+
 
 Kernel Issues with SyncFS
 -------------------------
 
-Try running one OSD per host to see if performance improves. Old kernels
-might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
+If you have kernel issues with SyncFS, try running one OSD per host to see if
+performance improves. Old kernels might not have a recent enough version of
+``glibc`` to support ``syncfs(2)``.
+
 
 Filesystem Issues
 -----------------
 
-Currently, we recommend deploying clusters with the BlueStore back end.
-When running a pre-Luminous release or if you have a specific reason to deploy
-OSDs with the previous Filestore backend, we recommend ``XFS``.
+In post-Luminous releases, we recommend deploying clusters with the BlueStore
+back end. When running a pre-Luminous release, or if you have a specific
+reason to deploy OSDs with the previous Filestore backend, we recommend
+``XFS``.
 
 We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has
-many attractive features, but bugs may lead to
-performance issues and spurious ENOSPC errors. We do not recommend
-``ext4`` for Filestore OSDs because ``xattr`` limitations break support for long
-object names, which are needed for RGW.
+many attractive features, but bugs may lead to performance issues and spurious
+ENOSPC errors. We do not recommend ``ext4`` for Filestore OSDs because
+``xattr`` limitations break support for long object names, which are needed for
+RGW.
 
 For more information, see `Filesystem Recommendations`_.
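
The revised text above tells the reader to check whether recovery is underway
before blaming slow OSDs. As an illustrative sketch only (not part of this
commit), on a reasonably recent release you might check and, if needed,
throttle recovery roughly as follows; the option names are real Ceph options,
but the values are placeholders and defaults differ between releases:

   # Look for "recovery" or "backfill" activity in the cluster status.
   ceph -s
   ceph pg stat

   # Confirm the kernel in use on the OSD host (see "Kernel Version" above).
   uname -r

   # If recovery competes with client I/O, dial the recovery knobs down
   # (example values only; record the previous values before changing them).
   ceph config set osd osd_max_backfills 1
   ceph config set osd osd_recovery_max_active 1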

@@ -579,31 +585,32 @@ For more information, see `Filesystem Recommendations`_.
 Insufficient RAM
 ----------------
 
-We recommend a *minimum* of 4GB of RAM per OSD daemon and suggest rounding up
-from 6-8GB. You may notice that during normal operations, ``ceph-osd``
-processes only use a fraction of that amount.
-Unused RAM makes it tempting to use the excess RAM for co-resident
-applications or to skimp on each node's memory capacity. However,
-when OSDs experience recovery their memory utilization spikes. If
-there is insufficient RAM available, OSD performance will slow considerably
-and the daemons may even crash or be killed by the Linux ``OOM Killer``.
+We recommend a *minimum* of 4GB of RAM per OSD daemon and we suggest rounding
+up from 6GB to 8GB. During normal operations, you may notice that ``ceph-osd``
+processes use only a fraction of that amount. You might be tempted to use the
+excess RAM for co-resident applications or to skimp on each node's memory
+capacity. However, when OSDs experience recovery their memory utilization
+spikes. If there is insufficient RAM available during recovery, OSD performance
+will slow considerably and the daemons may even crash or be killed by the Linux
+``OOM Killer``.
+
 
 Blocked Requests or Slow Requests
 ---------------------------------
 
-If a ``ceph-osd`` daemon is slow to respond to a request, messages will be logged
-noting ops that are taking too long. The warning threshold
+When a ``ceph-osd`` daemon is slow to respond to a request, the cluster log
+receives messages reporting ops that are taking too long. The warning threshold
 defaults to 30 seconds and is configurable via the ``osd_op_complaint_time``
-setting. When this happens, the cluster log will receive messages.
+setting.
 
 Legacy versions of Ceph complain about ``old requests``::
 
-   osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+   osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
 
-New versions of Ceph complain about ``slow requests``::
+Newer versions of Ceph complain about ``slow requests``::
 
-   {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
-   {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+   {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+   {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
 
 Possible causes include:
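
The rewritten paragraph above names ``osd_op_complaint_time`` as the warning
threshold for slow requests. As a hedged illustration (not part of this
commit), on releases that support the centralized config store the threshold
can be inspected and temporarily raised, and current complaints also surface
in the health output; the value 60 below is an arbitrary example:

   # Show the current complaint threshold, in seconds (default 30).
   ceph config get osd osd_op_complaint_time

   # Temporarily raise it while investigating a known-slow device.
   ceph config set osd osd_op_complaint_time 60

   # Slow or blocked ops are also summarized in the health output.
   ceph health detail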

@@ -623,123 +630,143 @@ Possible solutions:
 Debugging Slow Requests
 -----------------------
 
-If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id> dump_ops_in_flight``,
-you will see a set of operations and a list of events each operation went
-through. These are briefly described below.
+If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id>
+dump_ops_in_flight``, you will see a set of operations and a list of events
+each operation went through. These are briefly described below.
 
 Events from the Messenger layer:
 
-- ``header_read``: When the messenger first started reading the message off the wire.
-- ``throttled``: When the messenger tried to acquire memory throttle space to read
+- ``header_read``: The time that the messenger first started reading the message off the wire.
+- ``throttled``: The time that the messenger tried to acquire memory throttle space to read
   the message into memory.
-- ``all_read``: When the messenger finished reading the message off the wire.
-- ``dispatched``: When the messenger gave the message to the OSD.
+- ``all_read``: The time that the messenger finished reading the message off the wire.
+- ``dispatched``: The time that the messenger gave the message to the OSD.
 - ``initiated``: This is identical to ``header_read``. The existence of both is a
   historical oddity.
 
 Events from the OSD as it processes ops:
 
 - ``queued_for_pg``: The op has been put into the queue for processing by its PG.
-- ``reached_pg``: The PG has started doing the op.
-- ``waiting for \*``: The op is waiting for some other work to complete before it
-  can proceed (e.g. a new OSDMap; for its object target to scrub; for the PG to
-  finish peering; all as specified in the message).
+- ``reached_pg``: The PG has started performing the op.
+- ``waiting for \*``: The op is waiting for some other work to complete before
+  it can proceed (for example, a new OSDMap; the scrubbing of its object
+  target; the completion of a PG's peering; all as specified in the message).
 - ``started``: The op has been accepted as something the OSD should do and
   is now being performed.
 - ``waiting for subops from``: The op has been sent to replica OSDs.
 
 Events from ```Filestore```:
 
 - ``commit_queued_for_journal_write``: The op has been given to the FileStore.
-- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting
+- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and is waiting
   to be persisted (as the next disk write).
 - ``journaled_completion_queued``: The op was journaled to disk and its callback
-  queued for invocation.
+  has been queued for invocation.
 
 Events from the OSD after data has been given to underlying storage:
 
-- ``op_commit``: The op has been committed (i.e. written to journal) by the
+- ``op_commit``: The op has been committed (that is, written to journal) by the
   primary OSD.
-- ``op_applied``: The op has been `write()'en <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (i.e. applied in memory but not flushed out to disk) on the primary.
+- ``op_applied``: The op has been `write()'en
+  <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (that is,
+  applied in memory but not flushed out to disk) on the primary.
 - ``sub_op_applied``: ``op_applied``, but for a replica's "subop".
 - ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools).
 - ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it
   hears about the above, but for a particular replica (i.e. ``<X>``).
 - ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops).
 
-Many of these events are seemingly redundant, but cross important boundaries in
-the internal code (such as passing data across locks into new threads).
+Some of these events may appear redundant, but they cross important boundaries
+in the internal code (such as passing data across locks into new threads).
+
 
 Flapping OSDs
 =============
 
-When OSDs peer and check heartbeats, they use the cluster (back-end)
-network when it's available. See `Monitor/OSD Interaction`_ for details.
+"Flapping" is the term for the phenomenon of an OSD being repeatedly marked
+``up`` and then ``down`` in rapid succession. This section explains how to
+recognize flapping, and how to mitigate it.
 
-We have traditionally recommended separate *public* (front-end) and *private*
-(cluster / back-end / replication) networks:
+When OSDs peer and check heartbeats, they use the cluster (back-end) network
+when it is available. See `Monitor/OSD Interaction`_ for details.
 
-#. Segregation of heartbeat and replication / recovery traffic (private)
-   from client and OSD <-> mon traffic (public). This helps keep one
-   from DoS-ing the other, which could in turn result in a cascading failure.
+The upstream Ceph community has traditionally recommended separate *public*
+(front-end) and *private* (cluster / back-end / replication) networks. This
+provides the following benefits:
+
+#. Segregation of (1) heartbeat traffic and replication/recovery traffic
+   (private) from (2) traffic from clients and between OSDs and monitors
+   (public). This helps keep one stream of traffic from DoS-ing the other,
+   which could in turn result in a cascading failure.
 
 #. Additional throughput for both public and private traffic.
 
-When common networking technologies were 100Mb/s and 1Gb/s, this separation
-was often critical. With today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s
-networks, the above capacity concerns are often diminished or even obviated.
-For example, if your OSD nodes have two network ports, dedicating one to
-the public and the other to the private network means no path redundancy.
-This degrades your ability to weather network maintenance and failures without
-significant cluster or client impact. Consider instead using both links
-for just a public network: with bonding (LACP) or equal-cost routing (e.g. FRR)
-you reap the benefits of increased throughput headroom, fault tolerance, and
-reduced OSD flapping.
+In the past, when common networking technologies were measured in a range
+encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with
+today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns
+are often diminished or even obviated. For example, if your OSD nodes have two
+network ports, dedicating one to the public and the other to the private
+network means that you have no path redundancy. This degrades your ability to
+endure network maintenance and network failures without significant cluster or
+client impact. In situations like this, consider instead using both links for
+only a public network: with bonding (LACP) or equal-cost routing (for example,
+FRR) you reap the benefits of increased throughput headroom, fault tolerance,
+and reduced OSD flapping.
 
 When a private network (or even a single host link) fails or degrades while the
-public network operates normally, OSDs may not handle this situation well. What
-happens is that OSDs use the public network to report each other ``down`` to
-the monitors, while marking themselves ``up``. The monitors then send out,
-again on the public network, an updated cluster map with affected OSDs marked
-`down`. These OSDs reply to the monitors "I'm not dead yet!", and the cycle
-repeats. We call this scenario 'flapping`, and it can be difficult to isolate
-and remediate. With no private network, this irksome dynamic is avoided:
-OSDs are generally either ``up`` or ``down`` without flapping.
-
-If something does cause OSDs to 'flap' (repeatedly getting marked ``down`` and
+public network continues operating normally, OSDs may not handle this situation
+well. In such situations, OSDs use the public network to report each other
+``down`` to the monitors, while marking themselves ``up``. The monitors then
+send out-- again on the public network--an updated cluster map with the
+affected OSDs marked `down`. These OSDs reply to the monitors "I'm not dead
+yet!", and the cycle repeats. We call this scenario 'flapping`, and it can be
+difficult to isolate and remediate. Without a private network, this irksome
+dynamic is avoided: OSDs are generally either ``up`` or ``down`` without
+flapping.
+
+If something does cause OSDs to 'flap' (repeatedly being marked ``down`` and
 then ``up`` again), you can force the monitors to halt the flapping by
-temporarily freezing their states::
+temporarily freezing their states:
 
-   ceph osd set noup # prevent OSDs from getting marked up
-   ceph osd set nodown # prevent OSDs from getting marked down
+.. prompt:: bash
 
-These flags are recorded in the osdmap::
+   ceph osd set noup # prevent OSDs from getting marked up
+   ceph osd set nodown # prevent OSDs from getting marked down
 
-   ceph osd dump | grep flags
-   flags no-up,no-down
+These flags are recorded in the osdmap:
 
-You can clear the flags with::
+.. prompt:: bash
 
-   ceph osd unset noup
-   ceph osd unset nodown
+   ceph osd dump | grep flags
 
-Two other flags are supported, ``noin`` and ``noout``, which prevent
-booting OSDs from being marked ``in`` (allocated data) or protect OSDs
-from eventually being marked ``out`` (regardless of what the current value for
-``mon_osd_down_out_interval`` is).
+::
 
-.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
-   sense that once the flags are cleared, the action they were blocking
-   should occur shortly after. The ``noin`` flag, on the other hand,
-   prevents OSDs from being marked ``in`` on boot, and any daemons that
-   started while the flag was set will remain that way.
+   flags no-up,no-down
 
-.. note:: The causes and effects of flapping can be somewhat mitigated through
-   careful adjustments to the ``mon_osd_down_out_subtree_limit``,
+You can clear these flags with:
+
+.. prompt:: bash
+
+   ceph osd unset noup
+   ceph osd unset nodown
+
+Two other flags are available, ``noin`` and ``noout``, which prevent booting
+OSDs from being marked ``in`` (allocated data) or protect OSDs from eventually
+being marked ``out`` (regardless of the current value of
+``mon_osd_down_out_interval``).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the sense that
+   after the flags are cleared, the action that they were blocking should be
+   possible shortly thereafter. But the ``noin`` flag prevents OSDs from being
+   marked ``in`` on boot, and any daemons that started while the flag was set
+   will remain that way.
+
+.. note:: The causes and effects of flapping can be mitigated somewhat by
+   making careful adjustments to ``mon_osd_down_out_subtree_limit``,
    ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``.
    Derivation of optimal settings depends on cluster size, topology, and the
-   Ceph release in use. Their interactions are subtle and beyond the scope of
-   this document.
+   Ceph release in use. The interaction of all of these factors is subtle and
+   is beyond the scope of this document.
 
 
 .. _iostat: https://en.wikipedia.org/wiki/Iostat
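
The new text above walks through the events reported by ``ceph daemon osd.<id>
dump_historic_ops`` and introduces the ``noin``/``noout`` flags. As an
illustration only (not part of this commit): the historic-ops dump is JSON, so
it can be skimmed with ``jq`` to find slow operations and the event timeline
described above (the ``.ops[]`` field names are an assumption and may differ
between releases), and ``noout`` is commonly set around planned maintenance so
that a briefly unreachable OSD does not trigger unnecessary rebalancing:

   # Summarize recent slow ops on osd.0: description and total duration.
   ceph daemon osd.0 dump_historic_ops | jq '.ops[] | {description, duration}'

   # Before planned network or host maintenance, keep OSDs from being marked
   # out; clear the flag once the work is finished.
   ceph osd set noout
   ceph osd unset noout
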
@@ -749,7 +776,9 @@ from eventually being marked ``out`` (regardless of what the current value for
 .. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
 .. _Monitor Config Reference: ../../configuration/mon-config-ref
 .. _monitoring your OSDs: ../../operations/monitoring-osd-pg
+
 .. _monitoring OSDs: ../../operations/monitoring-osd-pg/#monitoring-osds
+
 .. _subscribe to the ceph-devel email list: mailto:[email protected]?body=subscribe+ceph-devel
 .. _unsubscribe from the ceph-devel email list: mailto:[email protected]?body=unsubscribe+ceph-devel
 .. _subscribe to the ceph-users email list: mailto:[email protected]
