Member

@gacevicljubisa gacevicljubisa commented Aug 1, 2025

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Adds additional metrics for ReserveSample to measure its performance.

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

@gacevicljubisa gacevicljubisa marked this pull request as ready for review August 4, 2025 08:32
@gacevicljubisa gacevicljubisa requested a review from janos August 10, 2025 15:13
Member

@nugaon nugaon left a comment

IMO it would be better to send metrics in batches wherever possible and to avoid redundant metric data.

stampValidDuration := time.Since(stampValidStart)
stats.ValidStampDuration += stampValidDuration
db.metrics.ReserveSampleStampValidations.Inc()
db.metrics.ReserveSampleStampValidDuration.WithLabelValues("valid").Observe(stampValidDuration.Seconds())
Member

@nugaon nugaon Aug 11, 2025

Seconds is too low a resolution.

select {
case chunkC <- ch:
stats.TotalIterated++
db.metrics.ReserveSampleChunksIterated.Inc()
Member

It could be more performant if stats.TotalIterated were sent just once at the end.

Member Author

A better approach would be to remove TotalIterated since it uses a mutex. Using Inc() on a counter with atomic operations is much faster than a mutex. The overhead of writing to an atomic variable is negligible compared to the total amount of current work.

Member

TotalIterated is incremented without a mutex; the mutex is used only at the end of the goroutine, when the value is added to the shared, aggregated stats. I am not sure which one is actually faster (atomic operations still have hardware-level contention, and there are many increments), but only one of them should be kept since they count the same thing.
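For illustration, the batching option discussed here can be sketched as follows. This is a minimal sketch, not the code from the PR: `sharedCounter` and `countBatched` are hypothetical names, and the atomic counter only stands in for a metrics counter that accepts a bulk add (a Prometheus `Counter` offers `Add(float64)` for that purpose). Each worker increments a plain local variable and publishes its total once when it finishes.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sharedCounter is a hypothetical stand-in for a metrics counter.
// Every Add call costs one atomic write on shared memory.
type sharedCounter struct{ n uint64 }

func (c *sharedCounter) Add(v uint64)  { atomic.AddUint64(&c.n, v) }
func (c *sharedCounter) Value() uint64 { return atomic.LoadUint64(&c.n) }

// countBatched starts `workers` goroutines that each handle `perWorker` items.
// A worker counts in a local variable (no synchronization) and touches the
// shared counter exactly once, when it is done.
func countBatched(workers, perWorker int) uint64 {
	var (
		wg    sync.WaitGroup
		total sharedCounter
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local uint64
			for i := 0; i < perWorker; i++ {
				local++ // plain increment, private to this goroutine
			}
			total.Add(local) // single shared write per worker
		}()
	}
	wg.Wait()
	return total.Value()
}

func main() {
	fmt.Println(countBatched(4, 1000)) // prints 4000
}
```

The trade-off is visibility: with batching the shared value only moves when a worker finishes, whereas a per-item increment lets the metric advance while the run is still in progress.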

chunk, err := db.ChunkStore().Get(ctx, chItem.Address)
if err != nil {
wstat.ChunkLoadFailed++
db.metrics.ReserveSampleChunksLoadFailed.Inc()
Member

Same as with TotalIterated.

}

wstat.ChunkLoadDuration += time.Since(chunkLoadStart)
db.metrics.ReserveSampleChunksLoaded.Inc()
Member

this is TotalIterated - (BelowBalanceIgnored + RogueChunk + ChunkLoadFailed). There are metrics for all of them.
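The relation above can be written out as a small worked example. The `sampleStats` struct below is a hypothetical stand-in that only borrows the counter names from the comment, and the numbers are made up for illustration.

```go
package main

import "fmt"

// sampleStats is a hypothetical stand-in holding the counters named above.
type sampleStats struct {
	TotalIterated       int64
	BelowBalanceIgnored int64
	RogueChunk          int64
	ChunkLoadFailed     int64
}

// chunksLoaded derives the number of loaded chunks from the other counters,
// so it does not need a metric of its own.
func (s sampleStats) chunksLoaded() int64 {
	return s.TotalIterated - (s.BelowBalanceIgnored + s.RogueChunk + s.ChunkLoadFailed)
}

func main() {
	s := sampleStats{
		TotalIterated:       1000,
		BelowBalanceIgnored: 40,
		RogueChunk:          7,
		ChunkLoadFailed:     3,
	}
	fmt.Println(s.chunksLoaded()) // prints 950
}
```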

wstat.TaddrDuration += time.Since(taddrStart)
taddrDuration := time.Since(taddrStart)
wstat.TaddrDuration += taddrDuration
db.metrics.ReserveSampleTaddrDuration.WithLabelValues(chItem.ChunkType.String()).Observe(taddrDuration.Seconds())
Member

@nugaon nugaon Aug 11, 2025

Seconds resolution is too low. Milliseconds or nanoseconds would be better.

stats.ValidStampDuration += time.Since(start)
stampValidDuration := time.Since(stampValidStart)
stats.ValidStampDuration += stampValidDuration
db.metrics.ReserveSampleStampValidations.Inc()
Member

This is the same as stats.SampleInserts.

if err != nil {
status = "failure"
}
db.metrics.ReserveSampleDuration.WithLabelValues(status, fmt.Sprintf("%d", workers)).Observe(duration.Seconds())
Member

You already send workers as a separate metric below.

return err
}
wstat.TaddrDuration += time.Since(taddrStart)
taddrDuration := time.Since(taddrStart)
Member

You created a new variable for it, but it is only used in the line below.

},
[]string{"metric"},
),
ReserveSampleLastRunTimestamp: prometheus.NewGauge(
Member

Why is this important? We already have metrics for the reserve sampling duration, and an approximate time is available from when the stat metrics are crawled.

Member Author

With the exact finish timestamp, we can calculate the precise start time and detect delays in the game timing. This enables monitoring whether sampling starts on schedule within the expected phase. It also simplifies time-based calculations without needing to parse logs.

Subsystem: subsystem,
Name: "reserve_sample_duration_seconds",
Help: "Duration of ReserveSample operations in seconds.",
Buckets: []float64{30, 60, 120, 300, 600, 900, 1200, 1500, 1800, 2400, 3000, 3600, 4800},
Member

The range is really large here, 30 seconds to 80 minutes; neither end seems realistic.

Member Author

From 3 min to 30 min as the upper limit?
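One way such a range could be expressed is with evenly spaced buckets. The sketch below assumes a 3-minute step, which is an illustrative choice and not something settled in this thread. `linearBuckets` is a local, hypothetical helper so the example runs without dependencies; the Prometheus Go client offers `prometheus.LinearBuckets(start, width, count)` with the same shape.

```go
package main

import "fmt"

// linearBuckets returns `count` upper bounds, starting at `start` and
// increasing by `width` each step.
func linearBuckets(start, width float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start + float64(i)*width
	}
	return buckets
}

func main() {
	// 180 s (3 min) up to 1800 s (30 min) in 3-minute steps.
	fmt.Println(linearBuckets(180, 180, 10))
	// prints [180 360 540 720 900 1080 1260 1440 1620 1800]
}
```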

@gacevicljubisa gacevicljubisa merged commit 8e62840 into master Oct 1, 2025
15 checks passed
@gacevicljubisa gacevicljubisa deleted the feat/add-reserve-sample-metrics branch October 1, 2025 12:13
@bcsorvasi bcsorvasi added this to the v2.7.0 milestone Oct 8, 2025