143775: structlogging: update hot range logging for app tenants r=angles-n-daemons a=angles-n-daemons
structlogging: update hot range logging for app tenants
Hot range logging currently runs as a task on each SQL node in the cluster. Its job is to call its status server, collect the hot ranges from it, and log those ranges.
This works well for a traditional deployment, since each SQL node corresponds to a status server within the same process, so all of the status servers, and therefore all of the nodes with stores, are guaranteed to be covered by this operation.
```
*legend*
[] = node
sql = sql server
status = status server
[sql ──────> status]
[sql ──────> status]
[sql ──────> status]
```
For multi-tenant deployments, this assumption no longer holds. In this topology, SQL pods connect to a randomly chosen KV pod when making calls to the status server, which means that we can no longer guarantee that the SQL pods will collectively reach every status server.
```
[sql] ──────> [status]
[sql] ─┐      [status]
       └────> [status]
              [status]
              [status]
```
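To get a rough sense of how much can be missed, here is a back-of-the-envelope sketch. It assumes each SQL pod picks one KV node uniformly and independently at random, which is a simplification and not a description of the actual connection logic; `expectedCovered` is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// expectedCovered returns the expected number of distinct KV nodes reached
// when sqlPods pods each pick one of kvNodes nodes uniformly and
// independently at random (an assumption made for this sketch only).
func expectedCovered(sqlPods, kvNodes int) float64 {
	if kvNodes <= 0 {
		return 0
	}
	// Probability that one particular node is picked by none of the pods.
	missProb := math.Pow(1-1/float64(kvNodes), float64(sqlPods))
	return float64(kvNodes) * (1 - missProb)
}

func main() {
	// Topology from the diagram above: 2 SQL pods, 5 status servers.
	fmt.Printf("expected nodes covered: %.2f of 5\n", expectedCovered(2, 5))
}
```

Under that assumption, the two SQL pods in the diagram reach fewer than two of the five status servers on average, so most hot ranges would never be logged.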
To fix this, the proposed changeset converts the hot range logger into a job, which fans out to all of the nodes. This solution does not scale quite as well, but it no longer misses nodes when logging hot ranges.
```
[sql]  ┌────> [status]
[sql] ─┼────> [status]
       ├────> [status]
       ├────> [status]
       └────> [status]
```
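The fan-out shape can be sketched as follows. This is a minimal illustration, not the code in this change: `statusServer`, `hotRange`, `fakeStatus`, and `collectHotRanges` are hypothetical names, and the real job goes through CockroachDB's own status APIs rather than an in-memory interface.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// hotRange is a hypothetical stand-in for a single hot range report.
type hotRange struct {
	NodeID  int
	RangeID int
	QPS     float64
}

// statusServer is a hypothetical interface, not the real CockroachDB API.
type statusServer interface {
	HotRanges() ([]hotRange, error)
}

// fakeStatus simulates a node that reports exactly one hot range.
type fakeStatus struct{ nodeID int }

func (f fakeStatus) HotRanges() ([]hotRange, error) {
	return []hotRange{{NodeID: f.nodeID, RangeID: 100 + f.nodeID, QPS: float64(10 * f.nodeID)}}, nil
}

// collectHotRanges queries every node concurrently, merges the results, and
// returns them sorted by descending QPS along with any per-node errors.
func collectHotRanges(nodes []statusServer) ([]hotRange, []error) {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		all  []hotRange
		errs []error
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(n statusServer) {
			defer wg.Done()
			ranges, err := n.HotRanges()
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				// Keep going: one failing node should not hide the others.
				errs = append(errs, err)
				return
			}
			all = append(all, ranges...)
		}(n)
	}
	wg.Wait()
	sort.Slice(all, func(i, j int) bool { return all[i].QPS > all[j].QPS })
	return all, errs
}

func main() {
	nodes := []statusServer{
		fakeStatus{nodeID: 1}, fakeStatus{nodeID: 2}, fakeStatus{nodeID: 3},
		fakeStatus{nodeID: 4}, fakeStatus{nodeID: 5},
	}
	ranges, errs := collectHotRanges(nodes)
	fmt.Printf("collected %d hot ranges from %d nodes (%d errors)\n", len(ranges), len(nodes), len(errs))
}
```

The cost is one request per node on every collection, which is the scaling trade-off mentioned above; in exchange, coverage no longer depends on which node a SQL pod happens to be connected to.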
Fixes: #143527
Epic: CRDB-43150
Release note (bug fix): fixes an issue where multi-tenant hot range logging did not log all the hot ranges.
143933: storage: use block properties implementation defined in cockroachkvs r=jbowens a=jbowens
The MVCC time interval block properties implementation has been copied into pebble/cockroachkvs so that the Pebble metamorphic tests may make use of it. Use that implementation and remove the pkg/storage copy.
Epic: none
Release note: none
Co-authored-by: Brian Dillmann <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
docs/generated/metrics/metrics.html: 12 additions, 0 deletions
@@ -1306,6 +1306,18 @@
 <tr><td>APPLICATION</td><td>jobs.history_retention.resume_completed</td><td>Number of history_retention jobs which successfully resumed to completion</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
 <tr><td>APPLICATION</td><td>jobs.history_retention.resume_failed</td><td>Number of history_retention jobs which failed with a non-retriable error</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
 <tr><td>APPLICATION</td><td>jobs.history_retention.resume_retry_error</td><td>Number of history_retention jobs which failed with a retriable error</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.currently_idle</td><td>Number of hot_ranges_logger jobs currently considered Idle and can be freely shut down</td><td>jobs</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.currently_paused</td><td>Number of hot_ranges_logger jobs currently considered Paused</td><td>jobs</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.currently_running</td><td>Number of hot_ranges_logger jobs currently running in Resume or OnFailOrCancel state</td><td>jobs</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.expired_pts_records</td><td>Number of expired protected timestamp records owned by hot_ranges_logger jobs</td><td>records</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.fail_or_cancel_completed</td><td>Number of hot_ranges_logger jobs which successfully completed their failure or cancelation process</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.fail_or_cancel_failed</td><td>Number of hot_ranges_logger jobs which failed with a non-retriable error on their failure or cancelation process</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.fail_or_cancel_retry_error</td><td>Number of hot_ranges_logger jobs which failed with a retriable error on their failure or cancelation process</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.protected_age_sec</td><td>The age of the oldest PTS record protected by hot_ranges_logger jobs</td><td>seconds</td><td>GAUGE</td><td>SECONDS</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.protected_record_count</td><td>Number of protected timestamp records held by hot_ranges_logger jobs</td><td>records</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.resume_completed</td><td>Number of hot_ranges_logger jobs which successfully resumed to completion</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.resume_failed</td><td>Number of hot_ranges_logger jobs which failed with a non-retriable error</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>APPLICATION</td><td>jobs.hot_ranges_logger.resume_retry_error</td><td>Number of hot_ranges_logger jobs which failed with a retriable error</td><td>jobs</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
 <tr><td>APPLICATION</td><td>jobs.import.currently_idle</td><td>Number of import jobs currently considered Idle and can be freely shut down</td><td>jobs</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
 <tr><td>APPLICATION</td><td>jobs.import.currently_paused</td><td>Number of import jobs currently considered Paused</td><td>jobs</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
 <tr><td>APPLICATION</td><td>jobs.import.currently_running</td><td>Number of import jobs currently running in Resume or OnFailOrCancel state</td><td>jobs</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>