
Commit 66da887

Add more info re: disk slowness detected log msg (#21070)
* Add more info re: `disk slowness detected` log msg

Fixes DOC-8286
1 parent db94c73 commit 66da887

5 files changed: +35 lines, -0 lines

src/current/v24.1/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
- Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

#### Disk utilization is different across nodes in the cluster

This is expected behavior.
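
The guidance added above suggests two quick checks an operator can run against a node that may be stalling. The following is a minimal sketch, not part of the documented procedure: the log directory, host address, and certificate directory are placeholders, and the exact log file that receives the `STORAGE` channel depends on your logging configuration.

```shell
# Look for the disk-slowness message in the node's log directory.
# <log-dir> is a placeholder for the node's configured log directory.
grep -R "disk slowness detected" <log-dir>/

# Inspect the current stall threshold before considering any tuning.
# The change above recommends replacing the storage rather than raising this value.
cockroach sql --host=<node-address> --certs-dir=<certs-dir> \
  --execute="SHOW CLUSTER SETTING storage.max_sync_duration;"
```
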

src/current/v24.3/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
- Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

#### Disk utilization is different across nodes in the cluster

This is expected behavior.

src/current/v25.2/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

{% include_cached new-in.html version="v25.2" %} {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}

#### Disk utilization is different across nodes in the cluster

src/current/v25.3/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

{% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}

#### Disk utilization is different across nodes in the cluster

src/current/v25.4/cluster-setup-troubleshooting.md

Lines changed: 7 additions & 0 deletions
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:

- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
- [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`

Causes of disk stalls include:

@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:

- During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.

+If you see messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage), they are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness will tend to degrade the entire cluster. Do not raise the stall thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.

{% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}

#### Disk utilization is different across nodes in the cluster
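
Each version of the page now points operators toward draining and decommissioning the affected node instead of tuning around failing hardware. A minimal sketch of that workflow is shown below; the node ID, addresses, and certificate directory are hypothetical, and the linked node-shutdown page remains the authoritative procedure.

```shell
# Drain the affected node so it stops accepting client traffic and sheds its leases.
# All values are placeholders.
cockroach node drain --host=<address-of-affected-node> --certs-dir=<certs-dir>

# Decommission it (node ID 4 is hypothetical) so its replicas are rebuilt on
# healthy nodes, then replace the underlying storage before reusing the machine.
cockroach node decommission 4 --host=<address-of-any-live-node> --certs-dir=<certs-dir>
```
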
