Skip to content

Commit 67b3c08

Browse files
craig[bot]Abhinav1299wenyihu6shriramters
committed
149001: pkg/cli: Implement redaction validation and user confirmation for debug zip uploads r=Abhinav1299 a=Abhinav1299 This change improves the safety and user awareness regarding sensitive data in debug zip uploads to Datadog. - Added `validateRedactionStatus` function to check the redaction status of debug zips based on the `debug_zip_command_flags.txt` file. - Introduced user prompts for confirmation when redaction is unclear or not enabled. - Updated `runDebugZipUpload` to validate redaction status before proceeding with uploads. Part of: CRDB-49203 Epic: None Release Notes: None 149632: asim: port second batch of testing framework changes r=tbg a=wenyihu6 **Note that this is based on top of #149582 Epic: [CRDB-25222](https://cockroachlabs.atlassian.net/browse/CRDB-25222) ---- **asim: remove replica_weights and lease_weights** This commit removes replica_weights and lease_weights entirely since future commits will add replica_placement which serves a similar but better purpose. Epic: none Release note: none --- **asim: add "print" command to output initial cluster state** This commit adds a print command to the data driven test that outputs the cluster state just before the simulation starts, helping future commits verify correctness against the initial state. Epic: none Release note: none --- **asim: add replica_placement option to gen_ranges** This commit adds a placement_type=replica_placement command to gen_ranges. It takes the next line as a ratio of replica placement combinations. For example, {s1:*,s2,s3:NON_VOTER}:1 {s4:*,s5,s6}:1 means half the ranges will have s1 as the leaseholder, s2, and s3 as a non-voter; the other half of ranges will have s4 as leaseholder, s5, and s6. Epic: none Release note: none --- **asim: refactor the code to use rebalancing snapshot rate** Previously, asim used ReplicaAddRate for SimulationSettings, which was an artificial approximation and differed from the real cluster behaviour with snapshot-based calculation. This patch refactors it to use the actual snapshot rate instead. There is no behavioral change, since SnapshotRate = AddRate / (1024*1024). So (RangeSize/(1024*1024))/(AddRate)=RangeSize/(1024*1024*AddRate)=RangeSize/SnapshotRate. NB: there's an existing bug here; the calculation is missing a final multiplication by time.Second, and snapshot rate is defined per second. A future commit will address this. Epic: none Release note: none --- **asim: add time.Second to snapshot rate delay calculation** Previously, the delay calculation used snapshot_rate (per second) but forgot to multiply the result by time.Second, resulting in an incorrectly small delay in nanoseconds. This commit fixes the bug by applying the correct time unit. Epic: none Release note: none --- **asim: change defaultRebalancingSnapshotRate to 32MiB/s** This commit updates defaultRebalancingSnapshotRate from 16MiB/s to 32MiB/s to match with the default value of the kv.snapshot_rebalance.max_rate cluster setting. Epic: none Release note: none --- **asim: use << 20 instead of 1024*1024** This commit updates defaultRangeSizeSplitThreshold to use bit shifting instead of multiplication for convention. Epic: none Release note: none --- **asim: fix bug with missing ticket in controller** Previously, tickets were incorrectly deleted immediately after checking for their existence, even if they weren't completed yet. This caused tickets to be missing on subsequent checks. This commit fixes the issue by keeping the tickets around. Epic: none Release note: none --- **asim: remove unused return value in checkPendingTicket** This commit removes an unused return value from checkPendingTicket(), as it is not used anywhere. Epic: none Release note: none --- **asim: add SetSimulationSettings as a mutation event** This commit adds SetSimulationSettings as a delayed mutation event, enabling data-driven tests to modify simulation settings in the middle of the simulation. Note that this is left unused as of now. This will support future changes like updating the rebalance mode during a run. Epic: none Release note: none --- **asim: add ChangeReplicasOp as an operation** This commit adds ChangeReplicasOp, enabling the simulator’s store rebalancer to enqueue replica change operations. It is currently unused; future commits will integrate it with mma to enqueue replica changes. Epic: none Release note: none --- **asim: support store_byte_capacity option with gen_cluster** This commit adds support for the store_byte_capacity option in the gen_cluster command, enabling data-driven tests to change configuration of store capacity in bytes. Epic: none Release note: none --- **asim: support region and nodes_per_region with gen_cluster** Previously, gen_cluster did not support specifying node localities, making constraint testing difficult; users had to add nodes manually with add_node. This commit adds region and nodes_per_region options to gen_cluster. Note taht len(region) == len(nodes_per_region). For example, region=(a,b,c) and nodes_per_region=(1,1,2) assigns 1 node to a, 1 to b, and 2 to c. No new tests are added yet, but future commits will utilize this feature. Epic: none Release note: none --- **asim: integrate cluster setting properly** Previously, new cluster settings were created ad hoc throughout asim, making it hard to change cluster settings. In addition, it was also hard to distinguish between simulation and cluster settings. This commit adds a cluster settings field to the simulation settings struct, initializing it once and passing it throughout. Note that many simulation settings that should be cluster settings still remain; a TODO is left to clean this up in the future. Epic: none Release note: none --- **asim: fix snapshot rate delay integer division** Previously, integer division was used when computing the delay for adding a replica, resulting in shorter than expected delays. This patch improves accuracy in tracking the delay by calculating a ratio with float. (Note: This issue was just raised in Slack, not from the prototype.) Epic: none Release note: none 149638: sql: prevent auth failure when no roles are granted during sync r=souravcrl a=shriramters Previously, the EnsureUserOnlyBelongsToRoles function would construct and execute a GRANT statement even if the list of roles to grant was empty after filtering for roles that exist in the database. This was inadequate because it caused authentication to fail for users logging in via external providers (LDAP, JWT, OIDC). If a user belonged to external groups that did not map to any existing roles in CockroachDB, the function would generate an invalid SQL statement (GRANT TO <user>), resulting in a syntax error that blocked the user's session. To address this, this patch modifies EnsureUserOnlyBelongsToRoles to only build and execute the GRANT statement if at least one valid, existing role is being granted. This prevents the syntax error and allows users to log in successfully even if their external group memberships do not result in any new role grants. This commit also adds unit tests for `EnsureUserOnlyBelongsToRoles()`. Fixes: #143878 Release note (bug fix): Fixed a bug where database login could fail during LDAP, JWT, or OIDC authentication if the user's external group memberships did not correspond to any existing roles in the database. The login will now succeed, and no roles will be granted or revoked in this scenario. Co-authored-by: Abhinav1299 <[email protected]> Co-authored-by: wenyihu6 <[email protected]> Co-authored-by: Shriram Ravindranathan <[email protected]>
4 parents b5252fb + c47845a + 12e5ed2 + c27794a commit 67b3c08

35 files changed

+861
-196
lines changed
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
--concurrency=15 --cpu-profile-duration=5s --files-from=2025-06-15 08:45:34 --files-until=2025-06-18 08:45:34 --include-goroutine-stacks=true --include-range-info=true --include-running-job-traces=true --redact=true

pkg/cli/zip.go

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,10 @@ type zipRequest struct {
4848
pathName string
4949
}
5050

51+
const (
52+
debugZipCommandFlagsFileName = "debug_zip_command_flags.txt"
53+
)
54+
5155
type debugZipContext struct {
5256
z *zipper
5357
clusterPrinter *zipReporter
@@ -415,7 +419,7 @@ done
415419
return filter
416420
})
417421

418-
if err := z.createRaw(s, zc.prefix+"/debug_zip_command_flags.txt", []byte(flags)); err != nil {
422+
if err := z.createRaw(s, zc.prefix+"/"+debugZipCommandFlagsFileName, []byte(flags)); err != nil {
419423
return err
420424
}
421425

pkg/cli/zip_test.go

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1236,7 +1236,7 @@ func TestCommandFlags(t *testing.T) {
12361236
}
12371237

12381238
for _, f := range r.File {
1239-
if f.Name == "debug/debug_zip_command_flags.txt" {
1239+
if f.Name == "debug/"+debugZipCommandFlagsFileName {
12401240
rc, err := f.Open()
12411241
if err != nil {
12421242
t.Fatal(err)
@@ -1253,7 +1253,7 @@ func TestCommandFlags(t *testing.T) {
12531253
return
12541254
}
12551255
}
1256-
assert.Fail(t, "debug/debug_zip_command_flags.txt is not generated")
1256+
assert.Fail(t, "debug/"+debugZipCommandFlagsFileName+" is not generated")
12571257

12581258
if err = r.Close(); err != nil {
12591259
t.Fatal(err)

pkg/cli/zip_upload.go

Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
package cli
77

88
import (
9+
"bufio"
910
"bytes"
1011
"compress/gzip"
1112
"context"
@@ -37,6 +38,7 @@ import (
3738
"github.com/cockroachdb/cockroach/pkg/util/system"
3839
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
3940
"github.com/cockroachdb/errors"
41+
"github.com/cockroachdb/errors/oserror"
4042
"github.com/spf13/cobra"
4143
"golang.org/x/oauth2/google"
4244
"google.golang.org/api/option"
@@ -202,13 +204,114 @@ func uploadJSONFile(fileName string, message any, uuid string) error {
202204
// pipeline which enriches the logs with more fields.
203205
var defaultDDTags = []string{"service:CRDB-SH", "env:debug", "source:cockroachdb"}
204206

207+
// buildRedactionWarning creates a warning message about sensitive data in debug zips.
208+
// It includes a common list of sensitive data types that may be present.
209+
func buildRedactionWarning(prefix string) string {
210+
return prefix +
211+
"This means it may contain sensitive data including:\n" +
212+
" • Personally Identifiable Information (PII)\n" +
213+
" • Database credentials and connection strings\n" +
214+
" • Internal cluster details\n" +
215+
" • Potentially sensitive log data\n\n" +
216+
"It is advisable to only upload redacted debug zips to Datadog.\n"
217+
}
218+
219+
// promptUserForConfirmationImpl shows a warning message and prompts the user for confirmation.
220+
// It returns nil if the user confirms, or an error if they decline or if there's an input error.
221+
// In dry-run mode, it shows the warning but skips the prompt.
222+
func promptUserForConfirmationImpl(warningMsg string) error {
223+
// Skip interactive prompt in dry-run mode
224+
if debugZipUploadOpts.dryRun {
225+
fmt.Fprintf(os.Stderr, "%s", warningMsg)
226+
fmt.Fprintf(os.Stderr, "DRY RUN: Would prompt for confirmation here.\n")
227+
return nil
228+
}
229+
230+
fmt.Fprintf(os.Stderr, "%s", warningMsg)
231+
fmt.Fprintf(os.Stderr, "Do you want to continue with the upload? (y/N): ")
232+
233+
reader := bufio.NewReader(os.Stdin)
234+
line, err := reader.ReadString('\n')
235+
if err != nil {
236+
return fmt.Errorf("failed to read user input: %w", err)
237+
}
238+
239+
line = strings.ToLower(strings.TrimSpace(line))
240+
if len(line) == 0 {
241+
line = "n" // Default to 'no' when user presses enter without input
242+
}
243+
244+
if line == "y" || line == "yes" {
245+
fmt.Fprintf(os.Stderr, "Proceeding with upload...\n")
246+
return nil
247+
}
248+
249+
return fmt.Errorf("upload aborted")
250+
}
251+
252+
// promptUserForConfirmation is a variable that can be mocked in tests
253+
var promptUserForConfirmation = promptUserForConfirmationImpl
254+
255+
// validateRedactionStatus checks if the debug zip was created with redaction enabled
256+
// by examining the debugZipCommandFlagsFileName file in the system tenant.
257+
// If redaction is not enabled, it warns the user and prompts for confirmation.
258+
//
259+
// User experience examples:
260+
//
261+
// 1. Redacted zip (--redact=true): Proceeds silently
262+
// 2. Unredacted zip (--redact=false): Shows warning and prompts:
263+
// "⚠️ WARNING: Your debug zip was created WITHOUT redaction..."
264+
// "Do you want to continue with the upload? (y/N): "
265+
// 3. Unknown redaction status: Shows warning and prompts similarly
266+
//
267+
// The function respects dry-run mode by showing warnings without prompting.
268+
func validateRedactionStatus(debugDirPath string) error {
269+
flagsFilePath := path.Join(debugDirPath, debugZipCommandFlagsFileName)
270+
271+
flagsContent, err := os.ReadFile(flagsFilePath)
272+
if err != nil {
273+
if oserror.IsNotExist(err) {
274+
// File doesn't exist - we can't determine redaction status, so warn and prompt
275+
warningMsg := buildRedactionWarning(
276+
"⚠️ WARNING: The debug zip redaction status is unclear.\n")
277+
278+
return promptUserForConfirmation(warningMsg)
279+
}
280+
return fmt.Errorf("⚠️ error: the debug zip redaction status is unclear, err: %w", err)
281+
}
282+
283+
flagsStr := string(flagsContent)
284+
285+
if strings.Contains(flagsStr, "--redact=true") {
286+
return nil
287+
}
288+
289+
var warningMsg string
290+
if strings.Contains(flagsStr, "--redact=false") {
291+
warningMsg = buildRedactionWarning(
292+
"⚠️ WARNING: The debug zip was created WITHOUT redaction.\n",
293+
)
294+
} else {
295+
warningMsg = buildRedactionWarning(
296+
"⚠️ WARNING: The debug zip redaction status is unclear.\n",
297+
)
298+
}
299+
300+
return promptUserForConfirmation(warningMsg)
301+
}
302+
205303
func runDebugZipUpload(cmd *cobra.Command, args []string) error {
206304
runtime.GOMAXPROCS(system.NumCPU())
207305

208306
if err := validateZipUploadReadiness(); err != nil {
209307
return err
210308
}
211309

310+
// Check redaction status before proceeding with upload
311+
if err := validateRedactionStatus(args[0]); err != nil {
312+
return err
313+
}
314+
212315
// a unique ID for this upload session. This should be used to tag all the artifacts uploaded in this session
213316
uploadID := newUploadID(debugZipUploadOpts.clusterName, getCurrentTime())
214317

pkg/cli/zip_upload_test.go

Lines changed: 126 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -164,6 +164,11 @@ func TestUploadZipEndToEnd(t *testing.T) {
164164

165165
defer testutils.TestingHook(&gcsLogUpload, writeLogsToGCSHook)()
166166

167+
// Mock the interactive prompt to always accept (return nil) for end-to-end tests
168+
defer testutils.TestingHook(&promptUserForConfirmation, func(warningMsg string) error {
169+
return nil // Always accept in end-to-end tests
170+
})()
171+
167172
datadriven.Walk(t, "testdata/upload", func(t *testing.T, path string) {
168173
datadriven.RunTest(t, path, func(t *testing.T, d *datadriven.TestData) string {
169174
c := NewCLITest(TestCLIParams{})
@@ -285,6 +290,127 @@ func TestAppendUserTags(t *testing.T) {
285290
}
286291
}
287292

293+
func TestValidateRedactionStatus(t *testing.T) {
294+
defer leaktest.AfterTest(t)()
295+
296+
tests := []struct {
297+
name string
298+
flagsContent string
299+
fileExists bool
300+
dryRun bool
301+
expectError bool
302+
errorContains string
303+
description string
304+
}{
305+
{
306+
name: "redacted zip with other flags proceeds silently",
307+
flagsContent: " --concurrency=15 --cpu-profile-duration=5s --files-from=2025-06-15 08:45:34 --redact=true --include-range-info=true",
308+
fileExists: true,
309+
dryRun: false,
310+
expectError: false,
311+
description: "When --redact=true is found, no warnings should be shown and upload proceeds",
312+
},
313+
{
314+
name: "missing file user declines",
315+
flagsContent: "",
316+
fileExists: false,
317+
dryRun: false,
318+
expectError: true,
319+
errorContains: "upload aborted",
320+
description: "When file is missing and user declines, should return cancellation error",
321+
},
322+
{
323+
name: "missing file user accepts",
324+
flagsContent: "",
325+
fileExists: false,
326+
dryRun: false,
327+
expectError: false,
328+
errorContains: "user accepts",
329+
description: "When file is missing and user accepts, should proceed without error",
330+
},
331+
{
332+
name: "unredacted zip user declines",
333+
flagsContent: " --concurrency=1 --redact=false --nodes=1",
334+
fileExists: true,
335+
dryRun: false,
336+
expectError: true,
337+
errorContains: "upload aborted",
338+
description: "When --redact=false and user declines, should return cancellation error",
339+
},
340+
{
341+
name: "unredacted zip user accepts",
342+
flagsContent: " --concurrency=1 --redact=false --nodes=1",
343+
fileExists: true,
344+
dryRun: false,
345+
expectError: false,
346+
errorContains: "user accepts",
347+
description: "When --redact=false and user accepts, should proceed without error",
348+
},
349+
{
350+
name: "unclear redaction status user declines",
351+
flagsContent: " --concurrency=1 --nodes=1",
352+
fileExists: true,
353+
dryRun: false,
354+
expectError: true,
355+
errorContains: "upload aborted",
356+
description: "When no --redact flag and user declines, should return cancellation error",
357+
},
358+
{
359+
name: "unclear redaction status user accepts",
360+
flagsContent: " --concurrency=1 --nodes=1",
361+
fileExists: true,
362+
dryRun: false,
363+
expectError: false,
364+
errorContains: "user accepts",
365+
description: "When no --redact flag and user accepts, should proceed without error",
366+
},
367+
}
368+
369+
for _, test := range tests {
370+
t.Run(test.name, func(t *testing.T) {
371+
tempDir := t.TempDir()
372+
373+
if test.fileExists {
374+
flagsFile := path.Join(tempDir, debugZipCommandFlagsFileName)
375+
if err := os.WriteFile(flagsFile, []byte(test.flagsContent), 0644); err != nil {
376+
t.Fatal(err)
377+
}
378+
}
379+
380+
originalDryRun := debugZipUploadOpts.dryRun
381+
debugZipUploadOpts.dryRun = test.dryRun
382+
defer func() { debugZipUploadOpts.dryRun = originalDryRun }()
383+
384+
if test.errorContains == "upload aborted" || test.errorContains == "user accepts" {
385+
originalPrompt := promptUserForConfirmation
386+
switch test.errorContains {
387+
case "upload aborted":
388+
promptUserForConfirmation = func(warningMsg string) error {
389+
require.Contains(t, warningMsg, "WARNING:")
390+
return fmt.Errorf("upload aborted")
391+
}
392+
case "user accepts":
393+
promptUserForConfirmation = func(warningMsg string) error {
394+
require.Contains(t, warningMsg, "WARNING:")
395+
return nil
396+
}
397+
}
398+
defer func() { promptUserForConfirmation = originalPrompt }()
399+
}
400+
401+
// Test validation
402+
err := validateRedactionStatus(tempDir)
403+
404+
if test.expectError {
405+
require.Error(t, err, fmt.Sprintf("Test case: %s", test.description))
406+
require.Contains(t, err.Error(), test.errorContains, fmt.Sprintf("Test case: %s", test.description))
407+
} else {
408+
require.NoError(t, err, fmt.Sprintf("Test case: %s", test.description))
409+
}
410+
})
411+
}
412+
}
413+
288414
func TestZipUploadArtifactTypes(t *testing.T) {
289415
defer leaktest.AfterTest(t)()
290416

pkg/kv/kvserver/asim/config/BUILD.bazel

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,4 +5,5 @@ go_library(
55
srcs = ["settings.go"],
66
importpath = "github.com/cockroachdb/cockroach/pkg/kv/kvserver/asim/config",
77
visibility = ["//visibility:public"],
8+
deps = ["//pkg/settings/cluster"],
89
)

pkg/kv/kvserver/asim/config/settings.go

Lines changed: 21 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -5,15 +5,19 @@
55

66
package config
77

8-
import "time"
8+
import (
9+
"time"
10+
11+
"github.com/cockroachdb/cockroach/pkg/settings/cluster"
12+
)
913

1014
const (
1115
defaultTickInteval = 500 * time.Millisecond
1216
defaultMetricsInterval = 10 * time.Second
1317
defaultReplicaChangeBaseDelay = 100 * time.Millisecond
14-
defaultReplicaAddDelayFactor = 16
18+
defaultRebalancingSnapshotRate = 32 << 20 // 32MiB/s
1519
defaultSplitQueueDelay = 100 * time.Millisecond
16-
defaultRangeSizeSplitThreshold = 512 * 1024 * 1024 // 512mb
20+
defaultRangeSizeSplitThreshold = 512 << 20 // 512MiB
1721
defaultRangeRebalanceThreshold = 0.05
1822
defaultPacerLoopInterval = 10 * time.Minute
1923
defaultPacerMinIterInterval = 10 * time.Millisecond
@@ -56,13 +60,11 @@ type SimulationSettings struct {
5660
// (add,remove). It accounts for a fixed overhead of initiating a replica
5761
// movement.
5862
ReplicaChangeBaseDelay time.Duration
59-
// ReplicaAddRate is the factor applied to the range size (MB) when
60-
// calculating how long a replica addition will take for a given range
61-
// size. For adding a replica to a new store, the delay is calculated as
62-
// ReplicaChangeBaseDelay + (RangeSize(MB) * ReplicaAddRate) milliseconds.
63-
// This is analogous to the rate at which a store will ingest snapshots for
64-
// up replication.
65-
ReplicaAddRate float64
63+
// RebalancingSnapshotRate is rate at which newly added replicas will be
64+
// added based on the range size. e.g., When the range size is 32MiB, and the
65+
// RebalancingSnapshotRate is 32 << 20, the delay for adding a replica wil be
66+
// 1 second + ReplicaChangeBaseDelay.
67+
RebalancingSnapshotRate int64
6668
// SplitQueueDelay is the delay that range splits take to complete.
6769
SplitQueueDelay time.Duration
6870
// RangeSizeSplitThreshold is the threshold in MB, below which ranges will
@@ -116,6 +118,11 @@ type SimulationSettings struct {
116118
LeaseQueueEnabled bool
117119
// SplitQueueEnabled controls whether the split queue is enabled.
118120
SplitQueueEnabled bool
121+
// st is used to update cockroach cluster settings.
122+
//
123+
// TODO(wenyihu6): Remove any non-simulation settings from this struct and
124+
// instead override the settings below.
125+
ST *cluster.Settings
119126
}
120127

121128
// DefaultSimulationSettings returns a set of default settings for simulation.
@@ -126,7 +133,7 @@ func DefaultSimulationSettings() *SimulationSettings {
126133
MetricsInterval: defaultMetricsInterval,
127134
Seed: defaultSeed,
128135
ReplicaChangeBaseDelay: defaultReplicaChangeBaseDelay,
129-
ReplicaAddRate: defaultReplicaAddDelayFactor,
136+
RebalancingSnapshotRate: defaultRebalancingSnapshotRate,
130137
SplitQueueDelay: defaultSplitQueueDelay,
131138
RangeSizeSplitThreshold: defaultRangeSizeSplitThreshold,
132139
RangeRebalanceThreshold: defaultRangeRebalanceThreshold,
@@ -145,6 +152,7 @@ func DefaultSimulationSettings() *SimulationSettings {
145152
ReplicateQueueEnabled: true,
146153
LeaseQueueEnabled: true,
147154
SplitQueueEnabled: true,
155+
ST: cluster.MakeClusterSettings(),
148156
}
149157
}
150158

@@ -154,7 +162,8 @@ func (s *SimulationSettings) ReplicaChangeDelayFn() func(rangeSize int64, add bo
154162
return func(rangeSize int64, add bool) time.Duration {
155163
delay := s.ReplicaChangeBaseDelay
156164
if add {
157-
delay += (time.Duration(rangeSize/(1024*1024)) / time.Duration(s.ReplicaAddRate))
165+
estimatedTimeToAddReplica := float64(rangeSize) / float64(s.RebalancingSnapshotRate)
166+
delay += time.Duration(estimatedTimeToAddReplica) * time.Second
158167
}
159168
return delay
160169
}

0 commit comments

Comments
 (0)