
Commit 66da887

Add more info re: disk slowness detected log msg (#21070)
* Add more info re: `disk slowness detected` log msg

Fixes DOC-8286
1 parent db94c73 commit 66da887

5 files changed: +35 lines, -0 lines

src/current/v24.1/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
- Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

#### Disk utilization is different across nodes in the cluster

This is expected behavior.
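
The guidance added above suggests two quick checks an operator can run against a node that may be stalling. The following is a minimal sketch, not part of the documented procedure: the log directory, host address, and certificate directory are placeholders, and the exact log file that receives the `STORAGE` channel depends on your logging configuration.

```shell
# Look for the disk-slowness message in the node's log directory.
# <log-dir> is a placeholder for the node's configured log directory.
grep -R "disk slowness detected" <log-dir>/

# Inspect the current stall threshold before considering any tuning.
# The change above recommends replacing the storage rather than raising this value.
cockroach sql --host=<node-address> --certs-dir=<certs-dir> \
  --execute="SHOW CLUSTER SETTING storage.max_sync_duration;"
```
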

src/current/v24.3/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
- Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

#### Disk utilization is different across nodes in the cluster

This is expected behavior.

src/current/v25.2/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

{% include_cached new-in.html version="v25.2" %} {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}

#### Disk utilization is different across nodes in the cluster

src/current/v25.3/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

{% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}

#### Disk utilization is different across nodes in the cluster

src/current/v25.4/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

{% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}

#### Disk utilization is different across nodes in the cluster
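
Each version of the page now points operators toward draining and decommissioning the affected node instead of tuning around failing hardware. A minimal sketch of that workflow is shown below; the node ID, addresses, and certificate directory are hypothetical, and the linked node-shutdown page remains the authoritative procedure.

```shell
# Drain the affected node so it stops accepting client traffic and sheds its leases.
# All values are placeholders.
cockroach node drain --host=<address-of-affected-node> --certs-dir=<certs-dir>

# Decommission it (node ID 4 is hypothetical) so its replicas are rebuilt on
# healthy nodes, then replace the underlying storage before reusing the machine.
cockroach node decommission 4 --host=<address-of-any-live-node> --certs-dir=<certs-dir>
```
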
