Merge pull request #28280 from chaitanyaenr/etcd-network-latency

ahardin-rh · web-flow · commit 36419dd4e94e · 2020-12-22T09:51:33.000-05:00
Add etcd network peer latency recommendation
diff --git a/modules/recommended-etcd-practices.adoc b/modules/recommended-etcd-practices.adoc
@@ -54,3 +54,11 @@ $ sudo docker run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale
 
 The output reports whether the disk is fast enough to host etcd by comparing the 99th percentile 
 of the fsync metric captured from the run to see if it is less than 10ms.
+
+Because etcd replicates requests among all members, its performance depends strongly on network
+input/output (I/O) latency. High network latency causes etcd heartbeats to take longer than the
+election timeout, which triggers leader elections that disrupt the cluster. A key metric to
+monitor on a deployed {product-title} cluster is the 99th percentile of etcd network peer latency
+on each etcd cluster member. Use Prometheus to track the metric. The `histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[2m]))`
+query reports the 99th percentile round-trip time for etcd to finish replicating client requests
+between the members; it should be less than 50 ms.
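
The 50 ms check described in the added paragraph could be automated roughly as follows. This is a minimal sketch and not part of the commit: it assumes the query result has already been fetched as the JSON body of a Prometheus instant query, and the names `slow_peers` and `THRESHOLD_SECONDS` are hypothetical, introduced only for illustration.

```python
import json
import math

# 50 ms ceiling taken from the recommendation in the text above.
THRESHOLD_SECONDS = 0.050


def slow_peers(response_body, threshold=THRESHOLD_SECONDS):
    """Return series whose p99 peer round-trip time is not below `threshold`.

    `response_body` is the JSON text of a Prometheus instant-query response.
    The result maps each series' sorted label tuple to its value in seconds.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")

    slow = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series.get("metric", {}).items()))
        _timestamp, raw_value = series["value"]
        rtt = float(raw_value)  # Prometheus encodes sample values as strings
        if math.isnan(rtt):
            continue  # no samples in the window for this series
        if rtt >= threshold:
            slow[labels] = rtt
    return slow
```

Keying the result by the full label set avoids assuming which labels a particular cluster attaches to the metric. An empty dictionary means every reported series is under the threshold.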