db: improve metrics and logs with multiple disks being used by a DB #5378
base: master
Conversation
e1d6b17 to 4c5809a (Compare)
Reviewable status: 0 of 19 files reviewed, 2 unresolved discussions (waiting on @annrpom)
record/log_writer_latency_test.go line 17 at r2 (raw file):
)

func TestWALMetricsWithErrorfsLatency(t *testing.T) {
Is this test reliable? In general, relying on timings leads to flakes. Can you run it with --exec 'stress -p 64' and with --exec 'stress -p 64' --race?
More generally, this is a LOT of testing code. ~1500 lines of code for a relatively simple addition - these tests will need to be maintained, sometimes debugged. I don't think it's worth it. Some things are hard to add unit tests for and that's ok, especially when it's an observability thing that has no bearing on correctness.
record/log_writer.go line 837 at r2 (raw file):
writeStart := crtime.NowMono()
_, err = w.w.Write(data)
if writeLatency := writeStart.Elapsed(); w.flusher.walFileMetrics.WriteLatency != nil {
[nit] the assignment here is strange (it's not used in the condition). I'd just inline writeStart.Elapsed().
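For illustration, the inlined form might look like the sketch below; the Observe call stands in for whatever the surrounding code does with the measured latency and is an assumption, not the PR's code:

```go
writeStart := crtime.NowMono()
_, err = w.w.Write(data)
if w.flusher.walFileMetrics.WriteLatency != nil {
	// Assumption: the histogram records the elapsed write time (nanoseconds) directly.
	w.flusher.walFileMetrics.WriteLatency.Observe(float64(writeStart.Elapsed()))
}
```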
I haven't reviewed the test code yet, but in a cursory glance I found the sheer volume a little daunting. We try to hold our test code to a similar standard of simplicity as production code, often uncoupling test data/cases from test code that performs the mechanics. I'm wondering if we can test these changes with less code, perhaps leaning on either datadriven tests (uncoupling data and test code) or randomized testing (which tends to achieve higher coverage with less test code to maintain).
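As a concrete illustration of the datadriven shape being suggested, here is a minimal sketch; the command names, testdata path, and bodies are all hypothetical, not code from this PR:

```go
package record_test

import (
	"testing"

	"github.com/cockroachdb/datadriven"
)

// TestWALMetricsDataDriven sketches a datadriven test: the test cases and
// expected output live in a separate file, and this function only performs
// the mechanics for each command.
func TestWALMetricsDataDriven(t *testing.T) {
	datadriven.RunTest(t, "testdata/wal_metrics", func(t *testing.T, d *datadriven.TestData) string {
		switch d.Cmd {
		case "write":
			// Apply the writes described in d.Input to a test log writer.
			return ""
		case "metrics":
			// Render the collected metrics; the framework diffs this against
			// the expected output recorded in the test file.
			return "TODO"
		default:
			d.Fatalf(t, "unknown command %q", d.Cmd)
			return ""
		}
	})
}
```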
Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @annrpom and @xxmplus)
metrics.go line 568 at r2 (raw file):
CloseLatency   prometheus.Histogram // File close operations
StatLatency    prometheus.Histogram // File stat operations
OpenDirLatency prometheus.Histogram // Directory open operations
I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.
will do.
Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)
metrics.go line 568 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I don't think we need per-operation histograms, but rather one histogram for the primary and one histogram for the secondary, reporting every operation against the individual file.
These ops have quite different latency profiles and access frequencies. If we consolidate them into one histogram, we will lose visibility into the less frequent ops. Is that still ok?
record/log_writer_latency_test.go line 17 at r2 (raw file):
Previously, RaduBerinde wrote…
Is this test reliable? In general, relying on timings leads to flakes. Can you run it with
--exec 'stress -p 64' and with --exec 'stress -p 64' --race? More generally, this is a LOT of testing code. ~1500 lines of code for a relatively simple addition - these tests will need to be maintained, sometimes debugged. I don't think it's worth it. Some things are hard to add unit tests for and that's ok, especially when it's an observability thing that has no bearing on correctness.
I ran it against stress but not race. And ok, I will remove some verbose tests.
I had a chance to sync up with Jackson. While these ops have different latency profiles, we care more about the impact of file ops as a whole. I will update the code to consolidate them and also not increase the number of metrics.
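For context, a rough sketch of the consolidated shape being agreed on here: one file-op latency histogram per WAL directory rather than per operation. All names are illustrative, not the PR's actual types:

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// WALFileMetrics is an illustrative sketch (not the PR's actual type): one
// latency histogram per WAL directory, fed by every file operation (create,
// write, sync, close, stat) against files in that directory.
type WALFileMetrics struct {
	PrimaryFileOpLatency   prometheus.Histogram
	SecondaryFileOpLatency prometheus.Histogram
}

// observe records one file operation's duration, in nanoseconds, into the
// histogram for the directory the operation ran against.
func (m *WALFileMetrics) observe(secondary bool, d time.Duration) {
	h := m.PrimaryFileOpLatency
	if secondary {
		h = m.SecondaryFileOpLatency
	}
	if h != nil {
		h.Observe(float64(d))
	}
}
```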
2350d6b to 2d250f8 (Compare)
Log full path to WAL file in disk slow info so it's explicit whether the WAL file is on the primary or the secondary. Fixes cockroachdb#5328
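For readers wiring this up downstream, a hedged sketch of how an event listener might surface that path; it assumes the DiskSlowInfo fields exposed by recent Pebble versions (Path, Duration) and is not code from this PR:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// exampleListener logs disk-slow events. With this PR, the reported path is
// the full path to the WAL file, which makes it explicit whether the slow
// file lives in the primary or the secondary (failover) directory.
func exampleListener() *pebble.EventListener {
	return &pebble.EventListener{
		DiskSlow: func(info pebble.DiskSlowInfo) {
			log.Printf("disk slow: %s took %s", info.Path, info.Duration)
		},
	}
}
```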
2d250f8 to 3d8fdf3 (Compare)
@jbowens reviewed 1 of 2 files at r1, 1 of 17 files at r4, all commit messages.
Reviewable status: 2 of 19 files reviewed, 7 unresolved discussions (waiting on @annrpom, @RaduBerinde, and @xxmplus)
open.go line 295 at r4 (raw file):
walFileOpHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "pebble_wal_file_op_duration_nanoseconds",
	Help: "Histogram of WAL file operation latencies in nanoseconds",
nit: I think we should just exclude the Name and Help sections; they'll be unused in the context of Cockroach and may confuse
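A minimal sketch of what the nit suggests, as a fragment of the quoted snippet: keep only the buckets, since CockroachDB registers and labels the metric itself. WALFileOpLatencyBuckets is the bucket slice named later in this review; its use here is an assumption about the PR at this revision:

```go
walFileOpHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{
	// Name and Help omitted; the embedding system supplies its own metadata.
	Buckets: WALFileOpLatencyBuckets,
})
```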
metrics.go line 578 at r4 (raw file):
prometheus.LinearBuckets(0.0, 0.01, 50),                 // 0 to 0.5ms in 10μs increments
prometheus.ExponentialBucketsRange(1.0, 10000.0, 50)..., // 1ms to 10s
)
I think we can unify the histogram configuration, consolidating FsyncLatencyBuckets and WALFileOpLatencyBuckets
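As a sketch of that consolidation, both histograms could share one bucket slice; the boundaries below are illustrative and not Pebble's actual FsyncLatencyBuckets values:

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// sharedLatencyBuckets stands in for a single bucket layout shared by the
// fsync and WAL file-op histograms (illustrative boundaries only).
var sharedLatencyBuckets = append(
	prometheus.LinearBuckets(0, float64(100*time.Microsecond), 50),
	prometheus.ExponentialBucketsRange(float64(10*time.Millisecond), float64(10*time.Second), 50)...,
)

// newWALHistograms builds both histograms from the same bucket configuration.
func newWALHistograms() (fsync, fileOp prometheus.Histogram) {
	fsync = prometheus.NewHistogram(prometheus.HistogramOpts{Buckets: sharedLatencyBuckets})
	fileOp = prometheus.NewHistogram(prometheus.HistogramOpts{Buckets: sharedLatencyBuckets})
	return fsync, fileOp
}
```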
db.go line 383 at r4 (raw file):
// protected by mu.
fsyncLatency prometheus.Histogram
// Updated whenever a wal.Writer is closed.
nit: can we keep the comment that metrics.LogWriterMetrics is updated when a WAL writer is closed?
wal/failover_writer.go line 463 at r4 (raw file):
// WAL file operation latency histogram
walFileOpHistogram record.WALFileOpHistogram
directoryType      record.DirectoryType
the failover writer handles writing to both the primary and the secondary. I think we need access to both histograms, and to be able to construct child LogWriters that use one or the other depending on which device the child LogWriter is writing to.
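A rough sketch of the shape described here; the names are hypothetical, not the PR's actual fields:

```go
package wal

import "github.com/prometheus/client_golang/prometheus"

// failoverWriterSketch holds the file-op histograms for both WAL directories,
// so each child LogWriter can be constructed with the one matching the
// directory it writes to.
type failoverWriterSketch struct {
	primaryFileOpHistogram   prometheus.Histogram
	secondaryFileOpHistogram prometheus.Histogram
}

// histogramFor picks the histogram a child LogWriter should record into,
// based on which directory (device) that child writes to.
func (w *failoverWriterSketch) histogramFor(secondary bool) prometheus.Histogram {
	if secondary {
		return w.secondaryFileOpHistogram
	}
	return w.primaryFileOpHistogram
}
```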
3d8fdf3 to 63a4485 (Compare)
Reviewable status: 2 of 19 files reviewed, 7 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)
db.go line 383 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: can we keep the comment that metrics.LogWriterMetrics is updated when a WAL writer is closed?
ack. added to WALMetrics definition in metrics.go
metrics.go line 578 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I think we can unify the histogram configuration, consolidating FsyncLatencyBuckets and WALFileOpLatencyBuckets
ack. done.
open.go line 295 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: I think we should just exclude the Name and Help sections; they'll be unused in the context of Cockroach and may confuse
removed.
wal/failover_writer.go line 463 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
the failover handles writing to both the primary and secondary. I think we need access to both histograms and be able to construct child LogWriters that use one or the other depending on which device the child LogWriter is writing to.
ack.
63a4485 to 84b5370 (Compare)
Potential Bug(s) Detected
The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.
Next Steps:
Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
Claude just gave new meaning to "my two cents"
Enhance the log writer to instrument all filesystem operations done for writing the log file in the latency histogram (create, write, sync, etc.).
84b5370 to 4f01c47 (Compare)
I believe the AI review bot found a real bug. The PR has been updated accordingly.
Log full path to WAL file in disk slow info so it's explicit whether the WAL file is on the primary or the secondary.
Enhance the log writer to instrument all filesystem operations done for writing the log file in the latency histogram (create, write, sync, etc.).
Fixes #5328
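To make the instrumentation pattern concrete, here is a hedged, self-contained sketch of timing a single file operation into a latency histogram; it is illustrative and not the PR's actual log_writer.go code:

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// timedOp runs one filesystem operation against the WAL file (create, write,
// sync, close, etc.) and observes its duration, in nanoseconds, into the
// per-directory latency histogram, if one is configured.
func timedOp(h prometheus.Histogram, op func() error) error {
	start := time.Now()
	err := op()
	if h != nil {
		h.Observe(float64(time.Since(start)))
	}
	return err
}
```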