Commit 4fa3974

craig[bot], tbg, and TheComputerM committed
146990: roachtest: unconditionally save clusters that show raft fatal errors r=tbg a=tbg

When a cluster's logs contain a raft panic, it will be extended (by a week), volume snapshots will be taken, and the cluster will not be destroyed. This gives us the artifacts for a thorough investigation.

Verified manually via:

```
run --local acceptance/invariant-check-detection/failed=true
```

Here is the (editorialized) output:

```
test-teardown: 2025/05/20 08:15:15 cluster.go:2559: running cmd `([ -d logs ] && grep -RE '^...` on nodes [:1-4]; details in run_081515.744363000_n1-4_d-logs-grep-RE-Fraft.log
test-teardown: 2025/05/20 08:15:16 cluster.go:2995: extending cluster by 168h0m0s
test-teardown: 2025/05/20 08:15:16 cluster.go:1104: saving cluster local [tag:] (4 nodes) for debugging (--debug specified)
test-teardown: 2025/05/20 08:15:16 test_impl.go:478: test failure #2: full stack retained in failure_2.log: (test_runner.go:1705).maybeSaveClusterDueToInvariantProblems: invariant problem - snap name invariant-problem-local-8897676895823393049: logs/foo.log:F250502 11:37:20.387424 1036 raft/raft.go:2411 ⋮ [T1,Vsystem,n1,s1,r155/1:?/Table/113/1/{43/578…-51/201…}?] 80 match(30115) is out of range [lastIndex(30114)]. Was the raft log corrupted, truncated, or lost?
```

Closes #145953.
Informs #146617.
Informs #138028.
Fixes #146355.

Epic: none

147683: pkg/util/log: parse otlp sink from yaml config r=TheComputerM a=TheComputerM

OpenTelemetry is now an industry standard for o11y and is more efficient than other log sinks currently available.

This commit only defines basic configuration options for the OTLP sink, like address, insecure, and compression, and adds logic to parse them from the YAML config. The actual sink implementation will follow in a future commit.

Informs: #143049

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Mudit Somani <[email protected]>
3 parents 808f7f1 + 96baece + 5f5bd38 commit 4fa3974

15 files changed: 417 additions and 12 deletions

docs/generated/logsinks.md

Lines changed: 66 additions & 0 deletions
@@ -8,6 +8,8 @@ The supported log output sink types are documented below.

 - [Output to HTTP servers.](#output-to-http-servers.)

+- [Output to OpenTelemetry compatible collectors.](#output-to-opentelemetry-compatible-collectors.)
+
 - [Standard error stream](#standard-error-stream)


@@ -242,6 +244,70 @@ Configuration options shared across all sink types:



+<a name="output-to-opentelemetry-compatible-collectors.">
+
+## Sink type: Output to OpenTelemetry compatible collectors.
+
+
+This sink type causes logging data to be sent over the network through gRPC to
+a collector that can ingest log data in an [OTLP](https://opentelemetry.io) format.
+
+The configuration key under the `sinks` key in the YAML
+configuration is `otlp-servers`. Example configuration:
+
+// sinks:
+// otlp-servers:
+// health:
+// channels: HEALTH
+// address: 127.0.0.1:4317
+
+Every new server sink configured automatically inherits the configuration set in the `otlp-defaults` section.
+
+For example:
+
+// otlp-defaults:
+// redactable: false # default: disable redaction markers
+// sinks:
+// otlp-servers:
+// health:
+// channels: HEALTH
+// # This sink has redactable set to false,
+// # as the setting is inherited from otlp-defaults
+// # unless overridden here.
+
+The default output format for OTLP sinks is
+`json`. [Other supported formats.](log-formats.html)
+
+{{site.data.alerts.callout_info}}
+Run `cockroach debug check-log-config` to verify the effect of defaults inheritance.
+{{site.data.alerts.end}}
+
+
+Type-specific configuration options:
+
+| Field | Description |
+|--|--|
+| `channels` | the list of logging channels that use this sink. See the [channel selection configuration](#channel-format) section for details. |
+| `address` | the network address of the gRPC endpoint for ingestion of logs on your OpenTelemetry Collector/Platform. The host/address and port parts are separated with a colon. |
+| `insecure` | Disables transport security for the underlying gRPC connection. Inherited from `otlp-defaults.insecure` if not specified. |
+| `compression` | can be "none" or "gzip" to enable gzip compression. Set to "gzip" by default. Inherited from `otlp-defaults.compression` if not specified. |
+
+
+Configuration options shared across all sink types:
+
+| Field | Description |
+|--|--|
+| `filter` | specifies the default minimum severity for log events to be emitted to this sink, when not otherwise specified by the 'channels' sink attribute. |
+| `format` | the entry format to use. |
+| `format-options` | additional options for the format. |
+| `redact` | whether to strip sensitive information before log events are emitted to this sink. |
+| `redactable` | whether to keep redaction markers in the sink's output. The presence of redaction markers makes it possible to strip sensitive data reliably. |
+| `exit-on-error` | whether the logging system should terminate the process if an error is encountered while writing to this sink. |
+| `auditable` | translated to tweaks to the other settings for this sink during validation. For example, it enables `exit-on-error` and changes the format of files from `crdb-v1` to `crdb-v1-count`. |
+| `buffering` | configures buffering for this log sink, or NONE to explicitly disable. See the [common buffering configuration](#buffering-config) section for details. |
+
+
+
 <a name="standard-error-stream">

 ## Sink type: Standard error stream
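
The defaults inheritance described in the new documentation can be sketched roughly as follows. The types `otlpDefaults` and `otlpSink` and the helper `applyDefaults` are hypothetical stand-ins for illustration, not the types in `pkg/util/log/logconfig`; the default values (`insecure: false`, `compression: gzip`) are the ones that appear in this commit's test expectations.

```go
package main

import "fmt"

// otlpDefaults is a hypothetical stand-in for the otlp-defaults section.
type otlpDefaults struct {
	Insecure    bool
	Compression string
}

// otlpSink is a hypothetical stand-in for one entry under otlp-servers.
// Pointer fields distinguish "not specified" (nil) from an explicit value.
type otlpSink struct {
	Channels    string
	Address     string
	Insecure    *bool
	Compression *string
}

// applyDefaults fills every unset field of s from d and leaves explicitly
// configured values untouched.
func applyDefaults(s *otlpSink, d otlpDefaults) {
	if s.Insecure == nil {
		v := d.Insecure
		s.Insecure = &v
	}
	if s.Compression == nil {
		v := d.Compression
		s.Compression = &v
	}
}

func main() {
	defaults := otlpDefaults{Insecure: false, Compression: "gzip"}
	none := "none"
	// The sink overrides compression but says nothing about insecure.
	health := otlpSink{Channels: "HEALTH", Address: "127.0.0.1:4317", Compression: &none}
	applyDefaults(&health, defaults)
	fmt.Println(*health.Insecure, *health.Compression) // false none
}
```

Using pointers for optional fields is one common way to tell an omitted YAML key apart from a key explicitly set to its zero value; whether the real parser does this is not shown in the diff.
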

pkg/cli/log_flags_test.go

Lines changed: 12 additions & 0 deletions
@@ -57,6 +57,17 @@ func TestSetupLogging(t *testing.T) {
 		`flush-trigger-size: 1.0MiB, ` +
 		`max-buffer-size: 50MiB, ` +
 		`format: newline}}`
+	const defaultOtlpConfig = `otlp-defaults: {` +
+		`insecure: false, ` +
+		`compression: gzip, ` +
+		`filter: INFO, ` +
+		`format: json, ` +
+		`redactable: true, ` +
+		`exit-on-error: false, ` +
+		`buffering: {max-staleness: 5s, ` +
+		`flush-trigger-size: 1.0MiB, ` +
+		`max-buffer-size: 50MiB, ` +
+		`format: newline}}`
 	stdFileDefaultsRe := regexp.MustCompile(
 		`file-defaults: \{` +
 		`dir: (?P<path>[^,]+), ` +
@@ -189,6 +200,7 @@ func TestSetupLogging(t *testing.T) {
 	// Shorten the configuration for legibility during reviews of test changes.
 	actual = strings.ReplaceAll(actual, defaultFluentConfig, "<fluentDefaults>")
 	actual = strings.ReplaceAll(actual, defaultHTTPConfig, "<httpDefaults>")
+	actual = strings.ReplaceAll(actual, defaultOtlpConfig, "<otlpDefaults>")
 	actual = stdFileDefaultsRe.ReplaceAllString(actual, "<stdFileDefaults($path)>")
 	actual = fileDefaultsNoMaxSizeRe.ReplaceAllString(actual, "<fileDefaultsNoMaxSize($path)>")
 	actual = strings.ReplaceAll(actual, fileDefaultsNoDir, "<fileDefaultsNoDir>")
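
The placeholder shortening used by this test combines literal replacement with a regexp that has a named capture group. The sketch below shows the mechanism with deliberately shortened, hypothetical versions of the constant and the pattern; `shorten` is an illustrative helper, not a function from the test file.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// defaultOtlpConfig is a shortened, hypothetical version of the constant
// added in the test.
const defaultOtlpConfig = `otlp-defaults: {insecure: false, compression: gzip}`

// fileDefaultsRe is a simplified, hypothetical pattern with a named group.
var fileDefaultsRe = regexp.MustCompile(`file-defaults: \{dir: (?P<path>[^,}]+)\}`)

// shorten collapses known default stanzas into short placeholders.
func shorten(actual string) string {
	actual = strings.ReplaceAll(actual, defaultOtlpConfig, "<otlpDefaults>")
	// In the replacement template, $path expands to the text captured by
	// the named group "path".
	return fileDefaultsRe.ReplaceAllString(actual, "<stdFileDefaults($path)>")
}

func main() {
	in := `config: {file-defaults: {dir: /mypath}, otlp-defaults: {insecure: false, compression: gzip}}`
	fmt.Println(shorten(in))
	// Prints: config: {<stdFileDefaults(/mypath)>, <otlpDefaults>}
}
```

This is why every expected output in `pkg/cli/testdata/logflags` gains exactly one `<otlpDefaults>,` line: the full default stanza is now always present in the rendered configuration and is collapsed to the placeholder.
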

pkg/cli/testdata/logflags

Lines changed: 22 additions & 0 deletions
@@ -15,6 +15,7 @@ start
 config: {<stdFileDefaults(<defaultLogDir>)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -51,6 +52,7 @@ start-single-node
 config: {<stdFileDefaults(<defaultLogDir>)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -91,6 +93,7 @@ sql
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrEnabledWarningNoRedaction>}}

 run
@@ -99,6 +102,7 @@ init
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrEnabledWarningNoRedaction>}}


@@ -112,6 +116,7 @@ bank
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrEnabledInfoNoRedaction>}}


@@ -123,6 +128,7 @@ demo
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrCfg(FATAL,false)>}}


@@ -139,6 +145,7 @@ start
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrEnabledInfoNoRedaction>}}


@@ -152,6 +159,7 @@ start
 config: {<stdFileDefaults(/pathA/logs)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -190,6 +198,7 @@ start
 config: {<stdFileDefaults(/mypath)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -229,6 +238,7 @@ start
 config: {<stdFileDefaults(/pathA/logs)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -273,6 +283,7 @@ start
 config: {<stdFileDefaults(/mypath)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -310,6 +321,7 @@ start
 config: {<stdFileDefaults(<defaultLogDir>)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -348,6 +360,7 @@ start
 config: {<stdFileDefaults(<defaultLogDir>)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -396,6 +409,7 @@ start
 config: {<stdFileDefaults(<defaultLogDir>)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -455,6 +469,7 @@ start
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrEnabledInfoNoRedaction>}}


@@ -466,6 +481,7 @@ start
 config: {<stdFileDefaults(/mypath)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -505,6 +521,7 @@ start
 config: {<stdFileDefaults(/pathA)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -544,6 +561,7 @@ init
 config: {<fileDefaultsNoMaxSize(/mypath)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: {channels: {INFO: all},
 dir: /mypath,
 file-permissions: "0640",
@@ -563,6 +581,7 @@ start
 config: {<stdFileDefaults(<defaultLogDir>)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -600,6 +619,7 @@ start
 config: {<stdFileDefaults(<defaultLogDir>)>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {file-groups: {default: <fileCfg(INFO: [DEV,
 OPS],
 WARNING: [HEALTH,
@@ -637,6 +657,7 @@ init
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrEnabledInfoNoRedaction>}}

 # Default when no severity is specified is WARNING.
@@ -647,6 +668,7 @@ init
 config: {<fileDefaultsNoDir>,
 <fluentDefaults>,
 <httpDefaults>,
+<otlpDefaults>,
 sinks: {<stderrEnabledWarningNoRedaction>}}


pkg/cmd/roachtest/cluster.go

Lines changed: 6 additions & 0 deletions
@@ -1109,6 +1109,12 @@ func (c *clusterImpl) Save(ctx context.Context, msg string, l *logger.Logger) {
 	c.destroyState.mu.Unlock()
 }

+func (c *clusterImpl) saved() bool {
+	c.destroyState.mu.Lock()
+	defer c.destroyState.mu.Unlock()
+	return c.destroyState.mu.saved
+}
+
 var errClusterNotFound = errors.New("cluster not found")

 // validateCluster takes a cluster and checks that the reality corresponds to
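
The new `saved()` accessor follows the usual pattern of reading a flag under the same mutex that guards writes to it. A self-contained sketch of that pattern is below; `clusterState`, `save`, and `isSaved` are hypothetical, pared-down names, and the standard library `sync.Mutex` stands in for whatever mutex type `destroyState.mu` actually embeds.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterState is a hypothetical, pared-down stand-in for clusterImpl.
// The flag lives next to the mutex that protects it.
type clusterState struct {
	mu struct {
		sync.Mutex
		saved bool
	}
}

// save marks the cluster as saved while holding the lock.
func (c *clusterState) save() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.mu.saved = true
}

// isSaved reads the flag under the same lock, so it never races with save.
func (c *clusterState) isSaved() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.mu.saved
}

func main() {
	var c clusterState
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		c.save()
	}()
	wg.Wait()
	fmt.Println(c.isSaved()) // true: Wait returns only after save has run
}
```
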

pkg/cmd/roachtest/run.go

Lines changed: 14 additions & 6 deletions
@@ -152,12 +152,20 @@ func runTests(register func(registry.Registry), filter *registry.TestFilter) err
 	ctx, cancel := context.WithCancel(context.Background())
 	defer cancel()
 	CtrlC(ctx, l, cancel, cr)
-	// Install goroutine leak checker and run it at the end of the entire test
-	// run. If a test is leaking a goroutine, then it will likely be still around.
-	// We could diff goroutine snapshots before/after each executed test, but that
-	// could yield false positives; e.g., user-specified test teardown goroutines
-	// may still be running long after the test has completed.
-	defer leaktest.AfterTest(l)()
+	if false {
+		// Install goroutine leak checker and run it at the end of the entire test
+		// run. If a test is leaking a goroutine, then it will likely be still around.
+		// We could diff goroutine snapshots before/after each executed test, but that
+		// could yield false positives; e.g., user-specified test teardown goroutines
+		// may still be running long after the test has completed.
+		//
+		// NB: we currently don't do this since it's been firing for a long time and
+		// nobody has cleaned up the leaks. While there are leaks, the leaktest
+		// output pollutes stdout and makes roachtest annoying to use.
+		//
+		// Tracking issue: https://github.com/cockroachdb/cockroach/issues/148196
+		defer leaktest.AfterTest(l)()
+	}

 	// We allow roachprod users to set a default auth-mode through the
 	// ROACHPROD_DEFAULT_AUTH_MODE env var. However, roachtests shouldn't
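
Wrapping the `defer` in `if false` disables the leak check because a `defer` statement only registers its call when the statement itself executes. The sketch below illustrates that behavior with a hypothetical `run` helper and a boolean parameter in place of the hard-coded `false`.

```go
package main

import "fmt"

// run records the order of events. The defer inside the branch is only
// registered when checkLeaks is true; when registered, it fires as the
// enclosing function returns, not when the if block ends.
func run(checkLeaks bool) []string {
	var events []string
	func() {
		if checkLeaks {
			defer func() { events = append(events, "leak check") }()
		}
		events = append(events, "tests")
	}()
	return events
}

func main() {
	fmt.Println(run(true))  // [tests leak check]
	fmt.Println(run(false)) // [tests]
}
```

Keeping the code inside a dead branch, rather than deleting it, means it still compiles and can be re-enabled once the tracking issue is resolved.
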
