feat(storer): add comprehensive metrics for ReserveSample performance #5177
Conversation
nugaon
left a comment
IMO it would be better if it sent metrics in batches wherever possible and avoided redundant metric data.
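The batching idea can be sketched as follows. This is a hedged illustration, not the PR's code: the struct and function names (`workerStats`, `runWorker`, `flush`) are hypothetical, and in the real code `flush` would call the Prometheus counters' `Add()` once per worker instead of updating a map.

```go
package main

import "fmt"

// workerStats accumulates counts locally within one worker so shared
// metrics are updated once per worker instead of once per chunk.
type workerStats struct {
	iterated   int
	loadFailed int
}

// runWorker simulates a worker iterating n chunks, failing to load
// every tenth one, and returns its locally batched stats.
func runWorker(n int) workerStats {
	s := workerStats{}
	for i := 0; i < n; i++ {
		s.iterated++
		if i%10 == 0 {
			s.loadFailed++
		}
	}
	return s
}

// flush applies the batched counts in one step; the real code would
// call the metrics' Add() here rather than updating a map.
func flush(total map[string]int, s workerStats) {
	total["chunks_iterated"] += s.iterated
	total["chunks_load_failed"] += s.loadFailed
}

func main() {
	total := map[string]int{}
	flush(total, runWorker(100))
	fmt.Println(total["chunks_iterated"], total["chunks_load_failed"]) // 100 10
}
```

This trades per-chunk `Inc()` calls for a single `Add()` per worker, which also reduces contention on the shared counters.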
pkg/storer/sample.go (Outdated)

    stampValidDuration := time.Since(stampValidStart)
    stats.ValidStampDuration += stampValidDuration
    db.metrics.ReserveSampleStampValidations.Inc()
    db.metrics.ReserveSampleStampValidDuration.WithLabelValues("valid").Observe(stampValidDuration.Seconds())
Seconds is too low a resolution.
pkg/storer/sample.go (Outdated)

    select {
    case chunkC <- ch:
        stats.TotalIterated++
        db.metrics.ReserveSampleChunksIterated.Inc()
It could be more performant if stats.TotalIterated was only sent at the end.
A better approach would be to remove TotalIterated since it uses a mutex. Using Inc() on a counter with atomic operations is much faster than a mutex. The overhead of writing to an atomic variable is negligible compared to the total amount of current work.
TotalIterated is incremented without a mutex; the mutex is used only at the end of the goroutine, when the local value is added to the common aggregate. I'm not sure which one is really faster (atomic operations still have hardware-level contention, and there are many increments), but only one of them should be kept, since they measure the same thing.
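The two patterns under discussion can be sketched side by side. This is an illustration under assumed names (`countAtomic`, `countBatched`), not the PR's actual code: one counter incremented atomically per item versus a goroutine-local counter merged under the mutex once at the end.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countAtomic increments a shared counter from several goroutines
// using sync/atomic, one atomic add per item.
func countAtomic(goroutines, perGoroutine int) int64 {
	var n int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				atomic.AddInt64(&n, 1)
			}
		}()
	}
	wg.Wait()
	return n
}

// countBatched keeps a goroutine-local counter and takes the mutex
// once per goroutine at the end, the pattern the existing stats
// aggregation uses.
func countBatched(goroutines, perGoroutine int) int64 {
	var mu sync.Mutex
	var n int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := int64(0)
			for i := 0; i < perGoroutine; i++ {
				local++
			}
			mu.Lock()
			n += local
			mu.Unlock()
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(countAtomic(8, 1000), countBatched(8, 1000)) // 8000 8000
}
```

Both produce the same total; the batched version touches shared state once per goroutine, the atomic version once per item with cache-line contention on the hot path.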
pkg/storer/sample.go (Outdated)

    chunk, err := db.ChunkStore().Get(ctx, chItem.Address)
    if err != nil {
        wstat.ChunkLoadFailed++
        db.metrics.ReserveSampleChunksLoadFailed.Inc()
Same as for TotalIterated.
pkg/storer/sample.go (Outdated)

    }

    wstat.ChunkLoadDuration += time.Since(chunkLoadStart)
    db.metrics.ReserveSampleChunksLoaded.Inc()
This is TotalIterated - (BelowBalanceIgnored + RogueChunk + ChunkLoadFailed). There are metrics for all of them.
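The identity the comment relies on can be written out explicitly; the loaded count is derivable from counters that already exist (for example at query time), so a dedicated metric is redundant. The function and parameter names below just mirror the stats fields and are illustrative.

```go
package main

import "fmt"

// loadedChunks derives the "chunks loaded" count from the existing
// counters instead of maintaining a separate metric for it.
func loadedChunks(totalIterated, belowBalanceIgnored, rogueChunk, chunkLoadFailed int) int {
	return totalIterated - (belowBalanceIgnored + rogueChunk + chunkLoadFailed)
}

func main() {
	fmt.Println(loadedChunks(1000, 50, 3, 7)) // 940
}
```

With Prometheus, the same derivation can also be done in a query expression over the four counters, avoiding the extra series entirely.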
pkg/storer/sample.go (Outdated)

    wstat.TaddrDuration += time.Since(taddrStart)
    taddrDuration := time.Since(taddrStart)
    wstat.TaddrDuration += taddrDuration
    db.metrics.ReserveSampleTaddrDuration.WithLabelValues(chItem.ChunkType.String()).Observe(taddrDuration.Seconds())
Seconds resolution is too low. Milliseconds or nanoseconds would be better.
pkg/storer/sample.go (Outdated)

    stats.ValidStampDuration += time.Since(start)
    stampValidDuration := time.Since(stampValidStart)
    stats.ValidStampDuration += stampValidDuration
    db.metrics.ReserveSampleStampValidations.Inc()
It is stats.SampleInserts.
pkg/storer/sample.go (Outdated)

    if err != nil {
        status = "failure"
    }
    db.metrics.ReserveSampleDuration.WithLabelValues(status, fmt.Sprintf("%d", workers)).Observe(duration.Seconds())
Below, you send workers as a separate metric.
pkg/storer/sample.go (Outdated)

    return err
    }
    wstat.TaddrDuration += time.Since(taddrStart)
    taddrDuration := time.Since(taddrStart)
You created a new variable for it, but you don't use it anywhere other than in the line below.
    },
    []string{"metric"},
    ),
    ReserveSampleLastRunTimestamp: prometheus.NewGauge(
Why is it important? We have metrics for the reserve sampling duration, and there is an approximate time for it when the stat metrics crawling happens.
With the exact finish timestamp, we can calculate the precise start time and detect delays in the game timing. This enables monitoring whether sampling starts on schedule within the expected phase. It also simplifies time-based calculations without needing to parse logs.
pkg/storer/metrics.go (Outdated)

    Subsystem: subsystem,
    Name: "reserve_sample_duration_seconds",
    Help: "Duration of ReserveSample operations in seconds.",
    Buckets: []float64{30, 60, 120, 300, 600, 900, 1200, 1500, 1800, 2400, 3000, 3600, 4800},
The range is really large here, from 30 seconds to 80 minutes; neither end seems realistic.
From 3 min as the lower to 30 min as the upper limit?
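The suggested 3–30 min range could be expressed as explicit bucket boundaries in seconds. The minute marks below are illustrative, not agreed values; the Prometheus Go client also offers `prometheus.LinearBuckets` and `prometheus.ExponentialBuckets` helpers for generating such lists.

```go
package main

import "fmt"

// minuteBuckets builds histogram bucket boundaries (in seconds) at the
// given minute marks, a narrower range than the PR's 30 s to 80 min.
func minuteBuckets(minutes ...float64) []float64 {
	out := make([]float64, len(minutes))
	for i, m := range minutes {
		out[i] = m * 60
	}
	return out
}

func main() {
	fmt.Println(minuteBuckets(3, 5, 10, 15, 20, 25, 30)) // [180 300 600 900 1200 1500 1800]
}
```

Keeping the boundaries in seconds preserves the metric's `_seconds` naming convention while tightening the observable range.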
Checklist
Description
Adds additional metrics for ReserveSample to measure its performance.
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):