@nugaon nugaon commented Jul 16, 2025

Bee Network Metrics Enhancement Session Summary

Code Changes and Investigations

Neighborhood Metrics Implementation

This PR enhances the salud package with new metrics that track neighborhood-specific performance:

  • Added two new Prometheus metrics in pkg/salud/metrics.go:
    • NeighborhoodAvgDur: Tracks average response duration specifically for neighborhood peers
    • NeighborCount: Tracks the count of neighborhood peers

Node Spinup Metrics

  • Defined node metrics in pkg/node/metrics.go:
    • WarmupDuration: Histogram measuring time for node warmup to complete
    • FullSyncDuration: Histogram measuring time for full sync to complete
  • Implemented metrics collection in pkg/node/node.go:
    • Initialize metrics at node startup
    • Start measuring warmup time when node begins initialization
    • Record warmup duration when node stabilizes
    • Track full sync duration after warmup completes
    • Define full sync criteria based on:
      • Zero sync rate from puller service
      • Healthy status from salud service
      • Reserve size reaching threshold (half of reserve capacity)
    • Continuously check sync status and record full sync duration when criteria are met

Grafana Visualization Proposal

Below is a Grafana dashboard design to visualize the newly introduced metrics:

Node Spinup Performance Dashboard

  • Metrics:
    • bee_init_warmup_duration_seconds
    • bee_salud_neighbors
    • bee_salud_neighborhood_dur
    • bee_storer_reserve_size_within_radius
  • Visualization: Scatter plot matrix
  • Description: Correlation between spinup time, neighbor count, neighborhood duration and reserve size

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

@nugaon nugaon requested a review from gacevicljubisa July 24, 2025 14:27
return
case <-syncCheckTicker.C:
synced := isFullySynced()
logger.Debug("sync status check", "synced", synced, "reserveSize", localStore.ReserveSize(), "threshold", reserveTreshold, "syncRate", pullerService.SyncRate())
@gacevicljubisa gacevicljubisa Jul 30, 2025

Maybe change the log level to Trace? Otherwise it will spam every second until ReserveSize reaches the threshold. Alternatively, we could increase the check interval to 2 seconds.

Member Author

I increased the check interval to 2 seconds, because debug level is already the most verbose.

@nugaon nugaon marked this pull request as ready for review August 13, 2025 08:02
wg sync.WaitGroup
totaldur float64
peers []peer
neighborhoodPeers []peer
Contributor

As I see it, this is used more like a counter. Why a slice of peer?

@nugaon nugaon requested a review from gacevicljubisa August 18, 2025 11:05
@nugaon nugaon merged commit 2378edd into master Aug 21, 2025
15 checks passed
@nugaon nugaon deleted the feat/spinup-metrics branch August 21, 2025 14:00
@bcsorvasi bcsorvasi added this to the v2.7.0 milestone Oct 8, 2025
6 participants