144518: kvserver/closedts: add metrics for policy refresher r=arulajmani a=wenyihu6
**kvserver: add kv.closed_timestamp.policy_change**
Previously, it was difficult to measure how often closed timestamp policies changed for ranges, which is important because such changes can trigger additional range updates sent via the side transport.
This commit adds a metric to track the number of policy changes on replicas.
Part of: #143890
Release note: none
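
As a rough illustration only, the sketch below shows how a counter like this could be defined with CockroachDB's `pkg/util/metric` package and bumped when a replica's policy actually changes. The `StoreMetrics` field and the `recordPolicyUpdate` helper are hypothetical names, not the actual implementation.

```go
package kvserver

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// Hypothetical metadata for the new counter; the Name and Help strings match
// the generated docs in the diff below.
var metaClosedTimestampPolicyChange = metric.Metadata{
	Name:        "kv.closed_timestamp.policy_change",
	Help:        "Number of times closed timestamp policy change occurred on ranges",
	Measurement: "Events",
	Unit:        metric.Unit_COUNT,
}

// StoreMetrics is heavily abridged; only the new counter is shown.
type StoreMetrics struct {
	ClosedTimestampPolicyChange *metric.Counter
}

func newStoreMetrics() *StoreMetrics {
	return &StoreMetrics{
		ClosedTimestampPolicyChange: metric.NewCounter(metaClosedTimestampPolicyChange),
	}
}

// recordPolicyUpdate (hypothetical) bumps the counter only when the policy
// handed down by the refresher differs from the replica's current one.
func (sm *StoreMetrics) recordPolicyUpdate(prev, cur int32) {
	if prev != cur {
		sm.ClosedTimestampPolicyChange.Inc(1)
	}
}
```

Since the metric is a counter exported with NON_NEGATIVE_DERIVATIVE, dashboards would typically chart its rate rather than its raw value.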
---
**kvserver: add more metrics for policies**
Previously, it was difficult to determine how many ranges fell into each latency-based policy bucket. This commit adds 18 new metrics to StoreMetrics to track the number of ranges per policy bucket on each store.
Part of: #143890
Release note: none
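
A minimal sketch of the shape these gauges might take, assuming the store periodically re-counts its replicas by policy; the abridged bucket list and helper names below are illustrative, not the actual code.

```go
package kvserver

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// Abridged, hypothetical list of policy bucket names; the full set of 18
// appears in the generated docs in the diff below.
var policyBucketNames = []string{
	"lag_by_cluster_setting",
	"lead_for_global_reads_with_no_latency_info",
	"lead_for_global_reads_latency_less_than_20ms",
	// ... remaining latency buckets up to equal_or_greater_than_300ms elided.
}

// newPolicyGauges builds one gauge per policy bucket.
func newPolicyGauges() []*metric.Gauge {
	gauges := make([]*metric.Gauge, len(policyBucketNames))
	for i, name := range policyBucketNames {
		gauges[i] = metric.NewGauge(metric.Metadata{
			Name:        "kv.closed_timestamp.policy." + name,
			Help:        "Number of ranges with " + name + " closed timestamp policy",
			Measurement: "Ranges",
			Unit:        metric.Unit_COUNT,
		})
	}
	return gauges
}

// updatePolicyCounts (hypothetical) would be called from the store's metrics
// update loop with a fresh per-policy census of its replicas.
func updatePolicyCounts(gauges []*metric.Gauge, counts []int64) {
	for i, c := range counts {
		gauges[i].Update(c)
	}
}
```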
---
**kvserver: add kv.closed_timestamp.policy_latency_info_missing**
When a replica refreshes its policies, it looks up its peer replicas' latency
info via a map passed down by the PolicyRefresher, which in turn periodically
pulls node latency info from the RPCContext. If latency data for a node is
missing, a hardcoded default maximum RTT of 150ms is used.
Previously, it was hard to tell when this was happening. This commit adds a
metric to track how often the closed timestamp policy refresh falls back to the
default RTT due to missing node latency info. A high count might indicate that
the latency cache isn't refreshed frequently enough, suggesting we should
consider lowering kv.closed_timestamp.policy_latency_refresh_interval.
Resolves: #143890
Release note: none
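
A minimal sketch of the fallback described above, assuming the PolicyRefresher hands each replica a node-to-latency snapshot; the function signature, parameter names, and the `onMissing` callback are hypothetical.

```go
package kvserver

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Hardcoded default used when a peer node's latency is unknown.
const defaultMaxNetworkRTT = 150 * time.Millisecond

// maxRTTForReplicas picks the largest observed RTT to any peer replica's node,
// falling back to defaultMaxNetworkRTT whenever a node is absent from the
// latency snapshot.
func maxRTTForReplicas(
	latencies map[roachpb.NodeID]time.Duration, // snapshot handed down by the PolicyRefresher
	replicas []roachpb.ReplicaDescriptor,
	onMissing func(), // e.g. bumps kv.closed_timestamp.policy_latency_info_missing
) time.Duration {
	var maxRTT time.Duration
	anyMissing := false
	for _, r := range replicas {
		rtt, ok := latencies[r.NodeID]
		if !ok {
			anyMissing = true
			rtt = defaultMaxNetworkRTT
		}
		if rtt > maxRTT {
			maxRTT = rtt
		}
	}
	if anyMissing {
		// Counted once per refresh that had to fall back, matching the metric's
		// "for one or more replicas" wording.
		onMissing()
	}
	return maxRTT
}
```

In this shape, the new counter would be passed in as the `onMissing` callback, so each refresh that hits a gap in the latency map is counted once.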
Co-authored-by: wenyihu6 <[email protected]>
docs/generated/metrics/metrics.html: 20 additions & 0 deletions
@@ -193,6 +193,26 @@
 <tr><td>STORAGE</td><td>kv.allocator.load_based_replica_rebalancing.missing_stats_for_existing_store</td><td>The number times the allocator was missing the qps stats for the existing store</td><td>Attempts</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
 <tr><td>STORAGE</td><td>kv.allocator.load_based_replica_rebalancing.should_transfer</td><td>The number times the allocator determined that the replica should be rebalanced to another store for better load distribution</td><td>Attempts</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
 <tr><td>STORAGE</td><td>kv.closed_timestamp.max_behind_nanos</td><td>Largest latency between realtime and replica max closed timestamp</td><td>Nanoseconds</td><td>GAUGE</td><td>NANOSECONDS</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lag_by_cluster_setting</td><td>Number of ranges with LAG_BY_CLUSTER_SETTING closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_equal_or_greater_than_300ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_100ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_100MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_120ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_120MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_140ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_140MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_160ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_160MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_180ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_180MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_200ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_200MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_20ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_220ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_220MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_240ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_240MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_260ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_260MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_280ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_280MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_300ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_300MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_40ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_40MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_60ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_60MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_latency_less_than_80ms</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_80MS closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy.lead_for_global_reads_with_no_latency_info</td><td>Number of ranges with LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO closed timestamp policy</td><td>Ranges</td><td>GAUGE</td><td>COUNT</td><td>AVG</td><td>NONE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy_change</td><td>Number of times closed timestamp policy change occurred on ranges</td><td>Events</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
+<tr><td>STORAGE</td><td>kv.closed_timestamp.policy_latency_info_missing</td><td>Number of times closed timestamp policy refresh had to use hardcoded network RTT due to missing node latency info for one or more replicas</td><td>Events</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
 <tr><td>STORAGE</td><td>kv.concurrency.avg_lock_hold_duration_nanos</td><td>Average lock hold duration across locks currently held in lock tables. Does not include replicated locks (intents) that are not held in memory</td><td>Nanoseconds</td><td>GAUGE</td><td>NANOSECONDS</td><td>AVG</td><td>NONE</td></tr>
 <tr><td>STORAGE</td><td>kv.concurrency.avg_lock_wait_duration_nanos</td><td>Average lock wait duration across requests currently waiting in lock wait-queues</td><td>Nanoseconds</td><td>GAUGE</td><td>NANOSECONDS</td><td>AVG</td><td>NONE</td></tr>
 <tr><td>STORAGE</td><td>kv.concurrency.latch_conflict_wait_durations</td><td>Durations in nanoseconds spent on latch acquisition waiting for conflicts with other latches</td><td>Nanoseconds</td><td>HISTOGRAM</td><td>NANOSECONDS</td><td>AVG</td><td>NONE</td></tr>