Merge pull request ceph#63181 from shraddhaag/wip-shraddhaag-availability-docs

anthonyeleven · web-flow · commit b2e9f38fdddd · 2025-05-20T10:57:17.000-04:00
docs: add release notes and docs for availability score feature
diff --git a/PendingReleaseNotes b/PendingReleaseNotes
@@ -147,6 +147,14 @@
   `s3:GetObjectRetention` are also considered when fetching the source object.
   Replication of tags is controlled by the `s3:GetObject(Version)Tagging` permission.
 
+* RADOS: A new command, `ceph osd pool availability-status`, has been added that allows
+  users to view the availability score for each pool in a cluster. A pool is considered
+  unavailable if any PG in the pool is not in the active state or if the pool has unfound
+  objects. Otherwise the pool is considered available. The score is updated every
+  5 seconds. This feature is in tech preview.
+  Related trackers:
+   - https://tracker.ceph.com/issues/67777
+
 >=19.2.1
 
 * CephFS: Command `fs subvolume create` now allows tagging subvolumes through option
diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst
@@ -738,3 +738,50 @@ Print active connections and their TCP round trip time and retransmission counte
 
 	248     89      1       mgr.0   863     1677    0
 	3       86      2       mon.0   230     278     0
+
+Tracking Data Availability Score of a Cluster
+=============================================
+
+Ceph internally tracks the data availability of each pool in a cluster.
+To check the data availability score of each pool, run the following
+command:
+
+
+.. prompt:: bash $
+
+   ceph osd pool availability-status
+
+Example output:  
+
+::
+
+   POOL       	UPTIME  DOWNTIME  NUMFAILURES  MTBF  MTTR  SCORE 	AVAILABLE
+   rbd             2m     21s        	1	     2m   21s  0.888889      	1
+   .mgr          	86s    	0s        	0	     0s	  0s     	1      	1
+   cephfs.a.meta 	77s    	0s        	0	     0s	  0s     	1      	1
+   cephfs.a.data 	76s    	0s        	0	     0s	  0s     	1      	1
+
+A pool is considered ``unavailable`` when at least one PG in the pool
+is inactive or the pool has at least one unfound object. Otherwise
+the pool is considered ``available``. On each check, the ``uptime``
+and ``downtime`` values are updated according to the pool's previous
+and current state:
+
+================ =============== =============== =================
+ Previous State   Current State   Uptime Update   Downtime Update 
+================ =============== =============== =================
+ Available        Available       +diff time      no update    
+ Available        Unavailable     +diff time      no update
+ Unavailable      Available       +diff time      no update 
+ Unavailable      Unavailable     no update       +diff time 
+================ =============== =============== =================
+
+From the updated ``uptime`` and ``downtime`` values, Ceph calculates
+the Mean Time Between Failures (MTBF) and Mean Time To Recover (MTTR)
+for each pool. The availability score is then the ratio of MTBF to
+the total time (MTBF plus MTTR).
+
+The score is updated every five seconds. This interval is currently
+not configurable. A transient change in a pool's state that occurs
+and reverts between two consecutive checks is not captured by this
+feature.
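
The bookkeeping described in the new monitoring.rst section can be sketched in a few lines of Python. This is an illustrative model only, not the Ceph implementation: the class `PoolAvailability` and its method names are hypothetical, the definitions MTBF = uptime / failures and MTTR = downtime / failures are assumptions that agree with the example output, "total time" is assumed to mean MTBF + MTTR, and counting a failure on each Available to Unavailable transition is likewise an assumption. The uptime and downtime updates follow the state table exactly as written in the patch.

```python
from dataclasses import dataclass


@dataclass
class PoolAvailability:
    """Hypothetical per-pool counters mirroring the columns of the example output."""
    uptime: float = 0.0      # seconds
    downtime: float = 0.0    # seconds
    num_failures: int = 0
    available: bool = True

    def observe(self, now_available: bool, diff: float) -> None:
        """Apply one row of the state table for `diff` elapsed seconds."""
        if not self.available and not now_available:
            self.downtime += diff   # Unavailable -> Unavailable
        else:
            self.uptime += diff     # the other three rows of the table
        if self.available and not now_available:
            # Assumption: a failure is counted when a pool becomes unavailable.
            self.num_failures += 1
        self.available = now_available

    def mtbf(self) -> float:
        # Assumption: mean time between failures = uptime / failures.
        return self.uptime / self.num_failures if self.num_failures else 0.0

    def mttr(self) -> float:
        # Assumption: mean time to recover = downtime / failures.
        return self.downtime / self.num_failures if self.num_failures else 0.0

    def score(self) -> float:
        # Assumption: "total time" is MTBF + MTTR; a pool with no failures scores 1.
        total = self.mtbf() + self.mttr()
        return self.mtbf() / total if total > 0 else 1.0
```

With made-up durations of 168 seconds of uptime followed by 21 seconds of downtime and one failure, `score()` evaluates to 168 / 189, which rounds to 0.888889. A pool that has never failed reports MTBF and MTTR of 0 and a score of 1, as in the `.mgr` row of the example output.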