@@ -738,3 +738,50 @@ Print active connections and their TCP round trip time and retransmission counte

   248 89 1 mgr.0 863 1677 0
   3 86 2 mon.0 230 278 0

Tracking Data Availability Score of a Cluster
=============================================

Ceph internally tracks the data availability of each pool in a cluster.
To check the data availability score of each pool in a cluster, run the
following command:

.. prompt:: bash $

   ceph osd pool availability-status

Example output:

.. code-block:: none

   POOL           UPTIME  DOWNTIME  NUMFAILURES  MTBF  MTTR  SCORE     AVAILABLE
   rbd            2m      21s       1            2m    21s   0.888889  1
   .mgr           86s     0s        0            0s    0s    1         1
   cephfs.a.meta  77s     0s        0            0s    0s    1         1
   cephfs.a.data  76s     0s        0            0s    0s    1         1

A pool is considered ``unavailable`` when at least one placement group (PG)
in the pool becomes inactive or when the pool contains at least one unfound
object. Otherwise, the pool is considered ``available``. Depending on the
previous and current state of the pool, Ceph updates the ``uptime`` and
``downtime`` values as follows:

================  ================  ==============  ================
Previous State    Current State     Uptime Update   Downtime Update
================  ================  ==============  ================
Available         Available         +diff time      no update
Available         Unavailable       +diff time      no update
Unavailable       Available         +diff time      no update
Unavailable       Unavailable       no update       +diff time
================  ================  ==============  ================

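The update rules in the table can be expressed as a short Python sketch. The ``PoolAvailability`` class and the ``update`` function are hypothetical names used only for illustration, not Ceph's implementation. The rule that a failure is counted on each transition from available to unavailable is an assumption based on the ``NUMFAILURES`` column; this page does not define it.

```python
from dataclasses import dataclass


@dataclass
class PoolAvailability:
    """Hypothetical per-pool counters mirroring the UPTIME, DOWNTIME and
    NUMFAILURES columns of ``ceph osd pool availability-status``."""
    uptime: float = 0.0      # seconds
    downtime: float = 0.0    # seconds
    num_failures: int = 0
    available: bool = True   # state observed at the previous check


def update(stats: PoolAvailability, now_available: bool, diff: float) -> None:
    """Apply one status check according to the uptime/downtime table.

    ``diff`` is the time, in seconds, elapsed since the previous check.
    """
    if stats.available or now_available:
        # Available->Available, Available->Unavailable and
        # Unavailable->Available all add the elapsed time to uptime.
        stats.uptime += diff
    else:
        # Only Unavailable->Unavailable adds the elapsed time to downtime.
        stats.downtime += diff
    if stats.available and not now_available:
        # Assumption: one failure per Available->Unavailable transition.
        stats.num_failures += 1
    stats.available = now_available
```

Starting from an available pool, five checks at 5-second intervals that observe unavailable, unavailable, unavailable, available, available yield 15 s of uptime, 10 s of downtime and one failure.
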
Here, *diff time* is the time elapsed since the previous status check.
From the updated ``uptime`` and ``downtime`` values, Ceph calculates the
Mean Time Between Failures (MTBF) and the Mean Time To Recover (MTTR) for
each pool. The availability score is then calculated as the ratio of MTBF
to the total time.

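A minimal Python sketch of the score calculation follows. It assumes that MTBF and MTTR are the uptime and downtime divided by the number of failures, and that "total time" means MTBF + MTTR. Both definitions are assumptions rather than a description of Ceph's code. The zero-failure case is chosen to match the example output, where pools with no failures report ``0s`` for MTBF and MTTR and a score of 1. ``availability_score`` is a hypothetical helper name.

```python
def availability_score(uptime: float, downtime: float, num_failures: int):
    """Return (mtbf, mttr, score) for one pool, all times in seconds.

    Assumed definitions (not confirmed by Ceph's source):
      mtbf  = uptime / num_failures
      mttr  = downtime / num_failures
      score = mtbf / (mtbf + mttr)
    """
    if num_failures == 0:
        # Matches the example output: no failures -> 0s, 0s, score 1.
        return 0.0, 0.0, 1.0
    mtbf = uptime / num_failures    # mean time between failures
    mttr = downtime / num_failures  # mean time to recover
    total = mtbf + mttr             # one average up/down cycle
    # Guard against a zero-length history to avoid dividing by zero.
    score = mtbf / total if total > 0 else 1.0
    return mtbf, mttr, score
```

For instance, 168 s of uptime and 21 s of downtime across a single failure give a score of 168 / 189, approximately 0.888889. Because MTBF and MTTR share the same divisor, under these assumptions the score reduces to uptime / (uptime + downtime) whenever at least one failure has been recorded.
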
The score is updated every five seconds, and this interval is currently
not configurable. A change in pool state that occurs and is reverted
between two consecutive checks is therefore not captured by this feature.
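
The effect of the five-second interval can be illustrated with a small Python simulation. ``sampled_states`` is a hypothetical helper, and the outage intervals passed to it are made-up inputs, not data reported by Ceph. It only models periodic sampling; it does not reproduce how Ceph determines PG or object state.

```python
def sampled_states(outages, duration, interval=5.0):
    """Return the availability observed at each periodic check.

    ``outages`` is a list of hypothetical (start, end) intervals, in
    seconds, during which the pool is unavailable. A check at time t sees
    the pool as unavailable only if t falls inside one of those intervals.
    """
    observed = []
    t = interval
    while t <= duration:
        down = any(start <= t < end for start, end in outages)
        observed.append(not down)
        t += interval
    return observed
```

With an outage from 6 s to 9 s, every check (at 5, 10, 15 and 20 seconds) sees the pool as available, so the outage is never recorded. An outage from 9 s to 12 s spans the check at 10 s and is recorded.
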