Commit 62c9d63

craig[bot] and wenyihu6 committed
Merge #142721

142721: kvserver/closedts: auto-tune closed ts r=arulajmani a=wenyihu6

**kvserver/closedts: add latency based closed timestamp policies**

This patch adds latency-based closed timestamp policies to ctpb.LatencyBasedRangeClosedTimestampPolicy.

Part of: #59680

Release note: none

---

**kvserver/closedts: auto-tune closed ts**

For global table ranges, we aim to ensure follower replicas can serve present-time reads by calculating a target closed timestamp that leads the current time (using the `LEAD_FOR_GLOBAL_READS` policy). The goal of this estimate is to be confident that, by the time all followers have received this closed timestamp, it is greater than the present time, so they can serve reads without redirecting to the leaseholder. Note that this is an estimate, not an exact lead time: it is still possible for some follower nodes to lag far behind.

Calculation:

```
raft_propagation_time = max_network_rtt*1.5 + raft_overhead (20ms)
side_propagation_time = max_network_rtt*0.5 + side_transport_close_interval (200ms by default)
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)
closed_ts_at_sender   = now (sender's current clock) + max_clock_offset (500ms by default)
                        + buffer (25ms) + closed_ts_propagation
```

Currently, `max_network_rtt` is hardcoded at 150ms, leading to a default 800ms lead time.

Problems with the current implementation:

- Tuning trade-offs: a higher lead timestamp leads to higher commit-wait times, which in turn hurts write latency. A lower lead timestamp may result in a closed timestamp that lags present time, causing follower replicas to redirect reads to the leaseholder, which in turn hurts read latency.
- Cluster-wide setting: kv.closed_timestamp.lead_for_global_reads_override applies to the entire cluster, not per range. For geographically distant replicas, a short lead time may not be enough for full propagation. However, increasing the lead time cluster-wide can harm write performance for replicas that are close together and don't require as much time to replicate.

This commit wires the newly added latency-based policies into the side transport senders. For the side transport, nodes create a map of closed timestamps for each policy without consulting leaseholders. Senders pass this map to leaseholders, which decide whether to "bump" their closed timestamp and which policy to use. Senders then forward this information to receivers, which opt ranges in and pick closed timestamps based on the policies those ranges opted into.

Currently, side transport senders create only two policies, offering limited control over closed timestamps based on range-specific latencies. The new design introduces 16 additional policies, each representing a different network latency bucket that ranges fall into. Instead of relying on a hardcoded 150ms latency for closed timestamp calculations, ranges now use the appropriate bucket based on their observed network conditions.

Periodically, the policy refresher iterates over all leaseholders on a node, updating cached closed timestamp policies based on the observed latency between the leaseholder replica and the furthest replica. This policy is then used by the side transport and by the raft-side closed timestamp computation: instead of the hardcoded 150ms network round-trip latency, a dynamic latency is used, derived from the latency-based closed timestamp policy the range holds.

kv.closed_timestamp.lead_for_global_reads_auto_tune.enabled can now be used to auto-tune the lead time that global table ranges use to publish closed timestamps. The kv.closed_timestamp.lead_for_global_reads_override cluster setting takes precedence over this one. If auto-tuning is disabled or no latency data is observed, the system falls back to the hardcoded computed lead time.

Resolves: #59680

Release note: none

Co-authored-by: wenyihu6 <[email protected]>
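To make the formula above concrete, here is a minimal, self-contained Go sketch of the lead-time calculation. The function name, parameter list, and comments are illustrative assumptions, not the actual computeLeadTimeForGlobalReads helper in pkg/kv/kvserver/closedts:

```go
package main

import (
	"fmt"
	"time"
)

// leadTimeForGlobalReads is an illustrative sketch of the lead-time formula
// from the commit message above; it is not the actual CockroachDB helper.
func leadTimeForGlobalReads(maxNetworkRTT, maxClockOffset, sideTransportCloseInterval time.Duration) time.Duration {
	const raftOverhead = 20 * time.Millisecond
	const bufferTime = 25 * time.Millisecond

	// raft_propagation_time = max_network_rtt*1.5 + raft_overhead
	raftPropagationTime := maxNetworkRTT*3/2 + raftOverhead
	// side_propagation_time = max_network_rtt*0.5 + side_transport_close_interval
	sidePropagationTime := maxNetworkRTT/2 + sideTransportCloseInterval

	// closed_ts_propagation = max(raft_propagation_time, side_propagation_time)
	propagation := raftPropagationTime
	if sidePropagationTime > propagation {
		propagation = sidePropagationTime
	}
	// Lead time = max_clock_offset + buffer + closed_ts_propagation.
	return maxClockOffset + bufferTime + propagation
}

func main() {
	// With the hardcoded 150ms RTT and the defaults above, this reproduces
	// the 800ms default lead time mentioned in the commit message.
	fmt.Println(leadTimeForGlobalReads(150*time.Millisecond, 500*time.Millisecond, 200*time.Millisecond)) // 800ms
}
```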
2 parents 160b0e5 + 5dd4f88 · commit 62c9d63

File tree

14 files changed: +955 −33 lines changed


pkg/ccl/multiregionccl/BUILD.bazel

Lines changed: 2 additions & 0 deletions
```diff
@@ -61,6 +61,8 @@ go_test(
         "//pkg/kv/kvpb",
         "//pkg/kv/kvserver",
         "//pkg/kv/kvserver/allocator/allocatorimpl",
+        "//pkg/kv/kvserver/closedts",
+        "//pkg/kv/kvserver/closedts/ctpb",
         "//pkg/roachpb",
         "//pkg/rpc",
         "//pkg/security/securityassets",
```

pkg/ccl/multiregionccl/multiregion_test.go

Lines changed: 110 additions & 0 deletions
```diff
@@ -6,14 +6,23 @@
 package multiregionccl_test
 
 import (
+	"context"
 	"testing"
+	"time"
 
 	"github.com/cockroachdb/cockroach/pkg/base"
 	"github.com/cockroachdb/cockroach/pkg/ccl"
 	"github.com/cockroachdb/cockroach/pkg/ccl/multiregionccl/multiregionccltestutils"
+	"github.com/cockroachdb/cockroach/pkg/keys"
+	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts"
+	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts/ctpb"
+	"github.com/cockroachdb/cockroach/pkg/roachpb"
+	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
+	"github.com/cockroachdb/cockroach/pkg/testutils"
 	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
 	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
+	"github.com/cockroachdb/errors"
 	"github.com/stretchr/testify/require"
 )
 
@@ -150,3 +159,104 @@ func TestGlobalReadsAfterEnterpriseDisabled(t *testing.T) {
 	_, err = sqlDB.Exec(`ALTER TABLE t2 CONFIGURE ZONE USING global_reads = false`)
 	require.NoError(t, err)
 }
+
+// This test verifies closed timestamp policy behavior with and without the
+// auto-tuning. With the auto-tuning enabled, we expect latency-based policies
+// on global tables. Without it, no latency-based policies should exist.
+func TestReplicaClosedTSPolicyWithPolicyRefresher(t *testing.T) {
+	defer leaktest.AfterTest(t)()
+	defer log.Scope(t).Close(t)
+	ctx := context.Background()
+	st := cluster.MakeTestingClusterSettings()
+
+	// Helper function to check if a policy is a newly introduced latency-based policy.
+	isLatencyBasedPolicy := func(policy ctpb.RangeClosedTimestampPolicy) bool {
+		return policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
+			policy <= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS
+	}
+
+	// Set small intervals for faster testing.
+	closedts.LeadForGlobalReadsAutoTuneEnabled.Override(ctx, &st.SV, true)
+	closedts.RangeClosedTimestampPolicyRefreshInterval.Override(ctx, &st.SV, 5*time.Millisecond)
+	closedts.RangeClosedTimestampPolicyLatencyRefreshInterval.Override(ctx, &st.SV, 5*time.Millisecond)
+
+	// Create a multi-region cluster with manual replication.
+	numServers := 3
+	tc, sqlDB, cleanup := multiregionccltestutils.TestingCreateMultiRegionCluster(
+		t, numServers, base.TestingKnobs{}, multiregionccltestutils.WithReplicationMode(base.ReplicationManual),
+		multiregionccltestutils.WithSettings(st),
+	)
+	defer cleanup()
+
+	// Create a multi-region database and global table.
+	_, err := sqlDB.Exec(`CREATE DATABASE t PRIMARY REGION "us-east1" REGIONS "us-east2", "us-east3"`)
+	require.NoError(t, err)
+	_, err = sqlDB.Exec(`CREATE TABLE t.test_table (k INT PRIMARY KEY) LOCALITY GLOBAL`)
+	require.NoError(t, err)
+
+	// Look up the table ID and get its key prefix.
+	var tableID uint32
+	err = sqlDB.QueryRow(`SELECT id from system.namespace WHERE name='test_table'`).Scan(&tableID)
+	require.NoError(t, err)
+	tablePrefix := keys.MustAddr(keys.SystemSQLCodec.TablePrefix(tableID))
+	// Split the range at the table prefix and replicate it across all nodes.
+	tc.SplitRangeOrFatal(t, tablePrefix.AsRawKey())
+	tc.AddVotersOrFatal(t, tablePrefix.AsRawKey(), tc.Target(1), tc.Target(2))
+
+	// Get the store and replica for testing.
+	store := tc.GetFirstStoreFromServer(t, 0)
+	replica := store.LookupReplica(roachpb.RKey(tablePrefix.AsRawKey()))
+	require.NotNil(t, replica)
+
+	// Wait for the closed timestamp policies to stabilize and verify the
+	// expected behavior.
+	testutils.SucceedsSoon(t, func() error {
+		snapshot := store.GetStoreConfig().ClosedTimestampSender.GetSnapshot()
+		if len(snapshot.ClosedTimestamps) != int(ctpb.MAX_CLOSED_TIMESTAMP_POLICY) {
+			return errors.Errorf("expected %d closed timestamps, got %d",
+				ctpb.MAX_CLOSED_TIMESTAMP_POLICY, len(snapshot.ClosedTimestamps))
+		}
+
+		// Ensure there are ranges to check.
+		if len(snapshot.AddedOrUpdated) == 0 {
+			return errors.Errorf("no ranges")
+		}
+
+		// With policy refresher enabled: expect to find latency-based policy.
+		for _, policy := range snapshot.AddedOrUpdated {
+			if policy.RangeID != replica.GetRangeID() {
+				continue
+			}
+			if isLatencyBasedPolicy(policy.Policy) {
+				return nil
+			}
+		}
+		return errors.Errorf("no ranges with latency based policy")
+	})
+
+	// Without policy refresher: expect to NOT find latency-based policy.
+	_, err = sqlDB.Exec(`SET CLUSTER SETTING kv.closed_timestamp.lead_for_global_reads_auto_tune.enabled = false`)
+	require.NoError(t, err)
+
+	// Wait for policy refresher to disable and verify no latency-based policies exist.
+	testutils.SucceedsSoon(t, func() error {
+		snapshot := store.GetStoreConfig().ClosedTimestampSender.GetSnapshot()
+		if len(snapshot.ClosedTimestamps) != int(ctpb.MAX_CLOSED_TIMESTAMP_POLICY) {
+			return errors.Errorf("expected %d closed timestamps, got %d",
+				ctpb.MAX_CLOSED_TIMESTAMP_POLICY, len(snapshot.ClosedTimestamps))
+		}
+
+		// Ensure there are ranges to check.
+		if len(snapshot.AddedOrUpdated) == 0 {
+			return errors.Errorf("no ranges")
+		}
+
+		// Verify no latency-based policies remain.
+		for _, policy := range snapshot.AddedOrUpdated {
+			if isLatencyBasedPolicy(policy.Policy) {
+				return errors.Errorf("range %d has latency based policy %s", policy.RangeID, policy.Policy)
+			}
+		}
+		return nil
+	})
+}
```

pkg/kv/kvserver/closedts/BUILD.bazel

Lines changed: 5 additions & 1 deletion
```diff
@@ -4,6 +4,7 @@ go_library(
     name = "closedts",
     srcs = [
         "policy.go",
+        "policy_calculation.go",
         "setting.go",
     ],
     importpath = "github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts",
@@ -17,7 +18,10 @@ go_library(
 
 go_test(
     name = "closedts_test",
-    srcs = ["policy_test.go"],
+    srcs = [
+        "policy_calculation_test.go",
+        "policy_test.go",
+    ],
     embed = [":closedts"],
     deps = [
         "//pkg/kv/kvserver/closedts/ctpb",
```

pkg/kv/kvserver/closedts/ctpb/service.proto

Lines changed: 21 additions & 2 deletions
```diff
@@ -36,10 +36,29 @@ enum RangeClosedTimestampPolicy {
   // configured to lead present time such that all followers of the range are
   // able to serve consistent, present time reads.
   //
-  // Lead policy for global reads with no locality information.
+  // Lead policy for global reads with no locality information. It uses a
+  // hardcoded network latency 150ms by default.
   LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO = 1;
+  // Lead policy for ranges with global reads policy. The following policies are
+  // selected based on max leaseholder-to-follower latency.
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS = 2;   // [0,20)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_40MS = 3;   // [20,40)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_60MS = 4;   // [40,60)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_80MS = 5;   // [60,80)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_100MS = 6;  // [80,100)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_120MS = 7;  // [100,120)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_140MS = 8;  // [120,140)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_160MS = 9;  // [140,160)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_180MS = 10; // [160,180)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_200MS = 11; // [180,200)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_220MS = 12; // [200,220)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_240MS = 13; // [220,240)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_260MS = 14; // [240,260)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_280MS = 15; // [260,280)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_300MS = 16; // [280,300)ms
+  LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS = 17; // >=300ms
   // Sentinel value for slice sizing.
-  MAX_CLOSED_TIMESTAMP_POLICY = 2;
+  MAX_CLOSED_TIMESTAMP_POLICY = 18;
 }
 
 // Update contains information about (the advancement of) closed timestamps for
```
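For example, a range whose leaseholder observes a 135ms round trip to its furthest follower lands in the [120,140)ms bucket, i.e. LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_140MS (value 8); the FindBucketBasedOnNetworkRTT helper added in policy_calculation.go below implements exactly this mapping.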

pkg/kv/kvserver/closedts/policy.go

Lines changed: 7 additions & 5 deletions
```diff
@@ -13,7 +13,8 @@ import (
 )
 
 const (
-	DefaultMaxNetworkRTT = 150 * time.Millisecond
+	DefaultMaxNetworkRTT             = 150 * time.Millisecond
+	closedTimestampPolicyBucketWidth = 20 * time.Millisecond
 )
 
 // computeLeadTimeForGlobalReads calculates how far ahead of the current time a
@@ -108,17 +109,18 @@ func TargetForPolicy(
 	policy ctpb.RangeClosedTimestampPolicy,
 ) hlc.Timestamp {
 	var targetOffsetTime time.Duration
-	switch policy {
-	case ctpb.LAG_BY_CLUSTER_SETTING:
+	switch {
+	case policy == ctpb.LAG_BY_CLUSTER_SETTING:
 		// Simple calculation: lag now by desired duration.
 		targetOffsetTime = -lagTargetDuration
-	case ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO:
+	case policy >= ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO &&
+		policy <= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS:
 		// Override entirely with cluster setting, if necessary.
 		if leadTargetOverride != 0 {
 			targetOffsetTime = leadTargetOverride
 			break
 		}
-		targetOffsetTime = computeLeadTimeForGlobalReads(DefaultMaxNetworkRTT,
+		targetOffsetTime = computeLeadTimeForGlobalReads(computeNetworkRTTBasedOnPolicy(policy),
 			maxClockOffset, sideTransportCloseInterval)
 	default:
 		panic("unexpected RangeClosedTimestampPolicy")
```
pkg/kv/kvserver/closedts/policy_calculation.go

Lines changed: 49 additions & 0 deletions

```diff
@@ -0,0 +1,49 @@
+// Copyright 2025 The Cockroach Authors.
+//
+// Use of this software is governed by the CockroachDB Software License
+// included in the /LICENSE file.
+
+package closedts
+
+import (
+	"fmt"
+	"math"
+	"time"
+
+	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts/ctpb"
+)
+
+// FindBucketBasedOnNetworkRTT maps a network RTT to a closed timestamp policy
+// bucket.
+func FindBucketBasedOnNetworkRTT(networkRTT time.Duration) ctpb.RangeClosedTimestampPolicy {
+	// If maxLatency is negative (i.e. no peer latency is provided), return
+	// LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO.
+	if networkRTT < 0 {
+		return ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO
+	}
+	if networkRTT >= 300*time.Millisecond {
+		return ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS
+	}
+	// Divide RTT by policy interval, add 1 for zero-based indexing, and offset by
+	// the base policy enum value.
+	bucketNum := int32(math.Floor(float64(networkRTT)/float64(closedTimestampPolicyBucketWidth))) + 1
+	return ctpb.RangeClosedTimestampPolicy(bucketNum + int32(ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO))
+}
+
+// computeNetworkRTTBasedOnPolicy converts a closed timestamp policy to an estimated
+// network RTT.
+func computeNetworkRTTBasedOnPolicy(policy ctpb.RangeClosedTimestampPolicy) time.Duration {
+	switch {
+	case policy == ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO:
+		// If no latency info is available, return the default max RTT.
+		return DefaultMaxNetworkRTT
+	case policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
+		policy <= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS:
+		// For known latency buckets, we return the midpoint RTT for the bucket.
+		// The midpoint RTT for bucket N is (N+0.5)*interval.
+		bucket := int(policy) - int(ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO) - 1
+		return time.Duration((float64(bucket) + 0.5) * float64(closedTimestampPolicyBucketWidth.Nanoseconds()))
+	default:
+		panic(fmt.Sprintf("unknown closed timestamp policy: %s", policy))
+	}
+}
```
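As a usage sketch of the exported helper above (a hypothetical external caller; the commit's real tests live in policy_calculation_test.go per the BUILD.bazel change, and they embed the package rather than import it):

```go
package closedts_test

import (
	"fmt"
	"time"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts"
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts/ctpb"
)

func ExampleFindBucketBasedOnNetworkRTT() {
	// 135ms falls in the [120,140)ms bucket: floor(135/20)+1 = 7 buckets past
	// the base enum value LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO (1).
	policy := closedts.FindBucketBasedOnNetworkRTT(135 * time.Millisecond)
	fmt.Println(policy == ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_140MS)

	// A negative RTT means no latency information is available.
	fmt.Println(closedts.FindBucketBasedOnNetworkRTT(-1) == ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO)
	// Output:
	// true
	// true
}
```

Going the other way, the unexported computeNetworkRTTBasedOnPolicy maps LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_140MS back to the bucket midpoint of 130ms ((6+0.5)×20ms), which TargetForPolicy then feeds into computeLeadTimeForGlobalReads in place of the hardcoded 150ms.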
