Depending upon your configuration, Ceph may reduce recovery rates to maintain
client or OSD performance, or it may increase recovery rates to the point that
recovery impacts client or OSD performance. Check to see if the client or OSD
is recovering.

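One way to check for recovery activity is to query the cluster status. The following is a sketch; the exact output format varies by release:

```shell
# Show overall cluster health; during recovery, the status output
# includes recovery throughput and counts of degraded/misplaced objects.
ceph -s

# Summarize placement-group states; recovering PGs appear here.
ceph pg stat
```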
Kernel Version
--------------

Check the kernel version that you are running. Older kernels may lack updates
that improve Ceph performance.
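For example, you can print the running kernel release on each node:

```shell
# Print the kernel release string of the running kernel.
uname -r
```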

Kernel Issues with SyncFS
-------------------------

If you have kernel issues with SyncFS, try running one OSD per host to see if
performance improves. Old kernels might not have a recent enough version of
``glibc`` to support ``syncfs(2)``.

Filesystem Issues
-----------------

In post-Luminous releases, we recommend deploying clusters with the BlueStore
back end. When running a pre-Luminous release, or if you have a specific
reason to deploy OSDs with the previous Filestore backend, we recommend
``XFS``.

We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has
many attractive features, but bugs may lead to performance issues and spurious
ENOSPC errors. We do not recommend ``ext4`` for Filestore OSDs because
``xattr`` limitations break support for long object names, which are needed for
RGW.

For more information, see `Filesystem Recommendations`_.

Insufficient RAM
----------------

We recommend a *minimum* of 4GB of RAM per OSD daemon and we suggest rounding
up from 6GB to 8GB. During normal operations, you may notice that ``ceph-osd``
processes use only a fraction of that amount. You might be tempted to use the
excess RAM for co-resident applications or to skimp on each node's memory
capacity. However, when OSDs experience recovery their memory utilization
spikes. If there is insufficient RAM available during recovery, OSD performance
will slow considerably and the daemons may even crash or be killed by the Linux
``OOM Killer``.
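On recent releases that use BlueStore, the amount of RAM each OSD tries to consume is governed by the ``osd_memory_target`` option (4GB by default). A sketch of inspecting and raising it, assuming the centralized config database is in use:

```shell
# Show the current per-OSD memory target, in bytes.
ceph config get osd osd_memory_target

# Raise the target to 8 GiB if your nodes have RAM to spare.
ceph config set osd osd_memory_target 8589934592
```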

Blocked Requests or Slow Requests
---------------------------------

When a ``ceph-osd`` daemon is slow to respond to a request, the cluster log
receives messages reporting ops that are taking too long. The warning threshold
defaults to 30 seconds and is configurable via the ``osd_op_complaint_time``
setting.
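For example, assuming the centralized config database is in use, you can inspect or adjust the threshold like so:

```shell
# Show the current complaint threshold, in seconds (default: 30).
ceph config get osd osd_op_complaint_time

# Raise the threshold, for example on a cluster with known-slow media.
ceph config set osd osd_op_complaint_time 60
```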
Legacy versions of Ceph complain about ``old requests``::

   osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

Newer versions of Ceph complain about ``slow requests``::

   {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
   {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]

Possible causes include:

Possible solutions:

Debugging Slow Requests
-----------------------

If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id>
dump_ops_in_flight``, you will see a set of operations and a list of events
each operation went through. These are briefly described below.

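For example, run these commands on the host where the OSD's admin socket lives (substitute your OSD id for ``0``; the JSON field layout varies somewhat across releases):

```shell
# Recently completed ops, each with its per-event timeline.
ceph daemon osd.0 dump_historic_ops

# Ops currently being processed.
ceph daemon osd.0 dump_ops_in_flight
```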
Events from the Messenger layer:

- ``header_read``: The time that the messenger first started reading the message off the wire.
- ``throttled``: The time that the messenger tried to acquire memory throttle space to read
  the message into memory.
- ``all_read``: The time that the messenger finished reading the message off the wire.
- ``dispatched``: The time that the messenger gave the message to the OSD.
- ``initiated``: This is identical to ``header_read``. The existence of both is a
  historical oddity.

Events from the OSD as it processes ops:

- ``queued_for_pg``: The op has been put into the queue for processing by its PG.
- ``reached_pg``: The PG has started performing the op.
- ``waiting for \*``: The op is waiting for some other work to complete before
  it can proceed (for example, a new OSDMap; the scrubbing of its object
  target; the completion of a PG's peering; all as specified in the message).
- ``started``: The op has been accepted as something the OSD should do and
  is now being performed.
- ``waiting for subops from``: The op has been sent to replica OSDs.

Events from ``Filestore``:

- ``commit_queued_for_journal_write``: The op has been given to the FileStore.
- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and is waiting
  to be persisted (as the next disk write).
- ``journaled_completion_queued``: The op was journaled to disk and its callback
  has been queued for invocation.

Events from the OSD after data has been given to underlying storage:

- ``op_commit``: The op has been committed (that is, written to journal) by the
  primary OSD.
- ``op_applied``: The op has been `write()'en
  <https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (that is,
  applied in memory but not flushed out to disk) on the primary.
- ``sub_op_applied``: ``op_applied``, but for a replica's "subop".
- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools).
- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it
  hears about the above, but for a particular replica (that is, ``<X>``).
- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops).

Some of these events may appear redundant, but they cross important boundaries
in the internal code (such as passing data across locks into new threads).

Flapping OSDs
=============

"Flapping" is the term for the phenomenon of an OSD being repeatedly marked
``up`` and then ``down`` in rapid succession. This section explains how to
recognize flapping, and how to mitigate it.

When OSDs peer and check heartbeats, they use the cluster (back-end) network
when it is available. See `Monitor/OSD Interaction`_ for details.

The upstream Ceph community has traditionally recommended separate *public*
(front-end) and *private* (cluster / back-end / replication) networks. This
provides the following benefits:

#. Segregation of (1) heartbeat traffic and replication/recovery traffic
   (private) from (2) traffic from clients and between OSDs and monitors
   (public). This helps keep one stream of traffic from DoS-ing the other,
   which could in turn result in a cascading failure.

#. Additional throughput for both public and private traffic.

In the past, when common networking technologies were measured in a range
encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with
today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns
are often diminished or even obviated. For example, if your OSD nodes have two
network ports, dedicating one to the public and the other to the private
network means that you have no path redundancy. This degrades your ability to
endure network maintenance and network failures without significant cluster or
client impact. In situations like this, consider instead using both links for
only a public network: with bonding (LACP) or equal-cost routing (for example,
FRR) you reap the benefits of increased throughput headroom, fault tolerance,
and reduced OSD flapping.

When a private network (or even a single host link) fails or degrades while the
public network continues operating normally, OSDs may not handle this situation
well. In such situations, OSDs use the public network to report each other
``down`` to the monitors, while marking themselves ``up``. The monitors then
send out (again on the public network) an updated cluster map with the
affected OSDs marked ``down``. These OSDs reply to the monitors "I'm not dead
yet!", and the cycle repeats. We call this scenario "flapping", and it can be
difficult to isolate and remediate. Without a private network, this irksome
dynamic is avoided: OSDs are generally either ``up`` or ``down`` without
flapping.

If something does cause OSDs to "flap" (repeatedly being marked ``down`` and
then ``up`` again), you can force the monitors to halt the flapping by
temporarily freezing their states:

.. prompt:: bash

   ceph osd set noup      # prevent OSDs from getting marked up
   ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap:

.. prompt:: bash

   ceph osd dump | grep flags

::

   flags no-up,no-down

You can clear these flags with:

.. prompt:: bash

   ceph osd unset noup
   ceph osd unset nodown

Two other flags are available, ``noin`` and ``noout``, which prevent booting
OSDs from being marked ``in`` (allocated data) or protect OSDs from eventually
being marked ``out`` (regardless of the current value of
``mon_osd_down_out_interval``).

.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the sense that
   after the flags are cleared, the action that they were blocking should be
   possible shortly thereafter. But the ``noin`` flag prevents OSDs from being
   marked ``in`` on boot, and any daemons that started while the flag was set
   will remain that way.

.. note:: The causes and effects of flapping can be mitigated somewhat by
   making careful adjustments to ``mon_osd_down_out_subtree_limit``,
   ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``.
   Derivation of optimal settings depends on cluster size, topology, and the
   Ceph release in use. The interaction of all of these factors is subtle and
   is beyond the scope of this document.

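If you do experiment with those settings, a sketch of inspecting the current values first (assuming the centralized config database of recent releases):

```shell
# Show the current values before changing anything.
ceph config get mon mon_osd_down_out_subtree_limit
ceph config get mon mon_osd_reporter_subtree_level
ceph config get mon mon_osd_min_down_reporters
```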
.. _iostat: https://en.wikipedia.org/wiki/Iostat