From f8a9e70e86e96ff3014fb10448793843fffc2d56 Mon Sep 17 00:00:00 2001
From: Rich Loveland
Date: Mon, 10 Nov 2025 13:39:59 -0500
Subject: [PATCH 1/2] Add more info re: `disk slowness detected` log msg

Fixes DOC-8286
---
 src/current/v25.4/cluster-setup-troubleshooting.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/current/v25.4/cluster-setup-troubleshooting.md b/src/current/v25.4/cluster-setup-troubleshooting.md
index c6f774750cd..9972c334d98 100644
--- a/src/current/v25.4/cluster-setup-troubleshooting.md
+++ b/src/current/v25.4/cluster-setup-troubleshooting.md
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:
 
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:
 
 - During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.
 
+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), it is an early sign of severe I/O slowness and usually means that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning the detection threshold, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}
 
 #### Disk utilization is different across nodes in the cluster

From 85b81b4b8a16f48abf6fc331862e79935696f72f Mon Sep 17 00:00:00 2001
From: Rich Loveland
Date: Wed, 12 Nov 2025 16:37:46 -0500
Subject: [PATCH 2/2] Backport changes to supported versions v24.1+

---
 src/current/v24.1/cluster-setup-troubleshooting.md | 7 +++++++
 src/current/v24.3/cluster-setup-troubleshooting.md | 7 +++++++
 src/current/v25.2/cluster-setup-troubleshooting.md | 7 +++++++
 src/current/v25.3/cluster-setup-troubleshooting.md | 7 +++++++
 4 files changed, 28 insertions(+)

diff --git a/src/current/v24.1/cluster-setup-troubleshooting.md b/src/current/v24.1/cluster-setup-troubleshooting.md
index 8286bae7c86..d2ffad2d8d8 100644
--- a/src/current/v24.1/cluster-setup-troubleshooting.md
+++ b/src/current/v24.1/cluster-setup-troubleshooting.md
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
 - Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:
 
 - During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.
 
+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), it is an early sign of severe I/O slowness and usually means that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning the detection threshold, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 #### Disk utilization is different across nodes in the cluster
 
 This is expected behavior.
diff --git a/src/current/v24.3/cluster-setup-troubleshooting.md b/src/current/v24.3/cluster-setup-troubleshooting.md
index 8286bae7c86..eef6ebcd543 100644
--- a/src/current/v24.3/cluster-setup-troubleshooting.md
+++ b/src/current/v24.3/cluster-setup-troubleshooting.md
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
 - Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:
 
 - During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.
 
+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), it is an early sign of severe I/O slowness and usually means that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning the detection threshold, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 #### Disk utilization is different across nodes in the cluster
 
 This is expected behavior.
diff --git a/src/current/v25.2/cluster-setup-troubleshooting.md b/src/current/v25.2/cluster-setup-troubleshooting.md
index 9df807f685a..971298022c0 100644
--- a/src/current/v25.2/cluster-setup-troubleshooting.md
+++ b/src/current/v25.2/cluster-setup-troubleshooting.md
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:
 
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:
 
 - During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.
 
+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), it is an early sign of severe I/O slowness and usually means that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning the detection threshold, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 {% include_cached new-in.html version="v25.2" %} {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}
 
 #### Disk utilization is different across nodes in the cluster
diff --git a/src/current/v25.3/cluster-setup-troubleshooting.md b/src/current/v25.3/cluster-setup-troubleshooting.md
index c6f774750cd..9972c334d98 100644
--- a/src/current/v25.3/cluster-setup-troubleshooting.md
+++ b/src/current/v25.3/cluster-setup-troubleshooting.md
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:
 
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:
 
 - During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.
 
+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), it is an early sign of severe I/O slowness and usually means that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning the detection threshold, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}
 
 #### Disk utilization is different across nodes in the cluster
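
As a practical companion to the guidance added in these hunks, the following is a minimal triage sketch, not part of the patch itself. It is illustrative only: the log path, node address, and node ID are placeholders, and it assumes a secure cluster with certificates in `certs`.

```shell
# Look for the STORAGE-channel messages on the affected node. Log file names
# and locations depend on your logging configuration; the path is a placeholder.
grep -n 'disk slowness detected' /path/to/cockroach-data/logs/cockroach-*.log

# Check the current disk-stall detection threshold. Raising it generally
# prolongs unavailability rather than fixing the slow disk.
cockroach sql --certs-dir=certs --host=<any-live-node> \
  --execute="SHOW CLUSTER SETTING storage.max_sync_duration;"

# If the disk cannot be repaired in place, drain and decommission the node
# (node ID 3 is a placeholder), then replace the underlying storage.
cockroach node drain 3 --certs-dir=certs --host=<any-live-node>
cockroach node decommission 3 --certs-dir=certs --host=<any-live-node>
```

Depending on your version, `cockroach node decommission` may drain the target node as part of the process; follow the node-shutdown page linked in the patch for the exact sequence.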