Merge pull request #47415 from tmalove/etcd-3716-hardware-reco

lpettyjo · web-flow · commit 20e42e0d276b · 2022-07-13T08:02:53.000-04:00
[OSDOCS-3716]: Add etcd hardware recommendations
diff --git a/modules/recommended-etcd-practices.adoc b/modules/recommended-etcd-practices.adoc
@@ -18,6 +18,20 @@ For more information about defragmenting etcd, see the "Defragmenting etcd data"
 
 Because etcd writes data to disk and persists proposals on disk, its performance depends on disk performance. Slow disks and disk activity from other processes can cause long fsync latencies. Those latencies can cause etcd to miss heartbeats, not commit new proposals to the disk on time, and ultimately experience request timeouts and temporary leader loss. Run etcd on machines that are backed by SSD or NVMe disks with low latency and high throughput. Consider single-level cell (SLC) solid-state drives (SSDs), which provide 1 bit per memory cell, are durable and reliable, and are ideal for write-intensive workloads.
 
+The following drive characteristics and practices provide optimal etcd performance:
+
+* Low latency to support fast read operations.
+* High-bandwidth writes for faster compactions and defragmentation.
+* High-bandwidth reads for faster recovery from failures.
+* Solid-state drives at a minimum; NVMe drives are preferred.
+* Server-grade hardware from various manufacturers for increased reliability.
+* RAID 0 technology for increased performance.
+* Dedicated etcd drives. Do not place log files or other heavy workloads on etcd drives.
+
+Avoid NAS or SAN setups and spinning drives. Always benchmark disks by using utilities such as `fio`. Continuously monitor cluster performance as the cluster scales.
+
+IMPORTANT: Avoid using the Network File System (NFS) protocol.
+
 Some key metrics to monitor on a deployed {product-title} cluster are p99 of etcd disk write ahead log duration and the number of etcd leader changes. Use Prometheus to track these metrics.
 
 * The `etcd_disk_wal_fsync_duration_seconds_bucket` metric reports the etcd disk fsync duration.
diff --git a/scalability_and_performance/recommended-host-practices.adoc b/scalability_and_performance/recommended-host-practices.adoc
@@ -29,6 +29,10 @@ include::modules/increasing-aws-flavor-size.adoc[leveloffset=+2]
 
 include::modules/recommended-etcd-practices.adoc[leveloffset=+1]
 
+[role="_additional-resources"]
+.Additional resources
+* link:https://access.redhat.com/solutions/4885641[How to use `fio` to check etcd disk performance in {product-title}]
+
 include::modules/etcd-defrag.adoc[leveloffset=+1]
 
 include::modules/infrastructure-components.adoc[leveloffset=+1]
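
Reviewer note on the "Always benchmark using utilities such as `fio`" guidance added above: the following is a minimal sketch, not part of this change, of how a `fio` result could be checked programmatically. It assumes the benchmark was run with JSON output (for example `fio --rw=write --ioengine=sync --fdatasync=1 --directory=<dir> --size=22m --bs=2300 --name=etcd_perf --output-format=json`), that the result exposes the fdatasync percentiles under `jobs[0].sync.lat_ns.percentile`, and that 10 ms is the p99 target. The function names, the threshold constant, and the sample data are hypothetical.

```python
import json

# Assumption: 10 ms is used here as the p99 fdatasync target for an etcd disk.
P99_THRESHOLD_MS = 10.0


def p99_fdatasync_ms(fio_result: dict) -> float:
    """Return the 99th percentile fdatasync latency of the first fio job, in ms.

    Assumes the layout of fio's JSON output when the job runs with
    --fdatasync=1: jobs[0].sync.lat_ns.percentile["99.000000"] in nanoseconds.
    """
    percentiles = fio_result["jobs"][0]["sync"]["lat_ns"]["percentile"]
    return percentiles["99.000000"] / 1_000_000  # nanoseconds -> milliseconds


def disk_is_fast_enough(fio_result: dict,
                        threshold_ms: float = P99_THRESHOLD_MS) -> bool:
    """Return True if the p99 fdatasync latency is below the threshold."""
    return p99_fdatasync_ms(fio_result) < threshold_ms


# Hypothetical, heavily trimmed sample standing in for real fio output.
sample = json.loads(
    '{"jobs": [{"sync": {"lat_ns": {"percentile": {"99.000000": 2343936}}}}]}'
)

print(f"p99 fdatasync: {p99_fdatasync_ms(sample):.2f} ms")
print("disk fast enough:", disk_is_fast_enough(sample))
```

In practice the `sample` dictionary would be replaced by `json.load()` on the file that `fio` wrote. The same check can be kept alongside the Prometheus metrics that the module already recommends.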