Skip to content

Conversation

@arzonus
Copy link
Contributor

@arzonus arzonus commented Nov 14, 2025

What changed?
Added metrics to track shard assignment distribution and handover latency in the shard distributor service. This includes:

  • New ShardDistributorShardAssignmentDistributionLatency metric measuring time from shard assignment to distribution
  • New ShardDistributorShardHandoverLatency metric measuring handover time between executors
  • HandoverType enum (GRACEFUL/EMERGENCY) to distinguish handover types
  • Enhanced ShardAssignment to track previous executor heartbeat time and handover metadata
  • Store methods to persist and retrieve shard statistics
  • Extracted shard statistics update logic from datastore to leader processor for better separation of concerns

Why?
To provide visibility into shard distribution performance and identify potential issues with shard handovers, particularly distinguishing between graceful and emergency handovers.

How did you test it?
Added comprehensive unit tests covering metric emission logic and shard statistics tracking.

Potential risks
Additional storage operations during heartbeat processing (mitigated by running metrics emission in background goroutine).

Release notes
Added shard handover latency metrics for monitoring distribution performance.

Documentation Changes
None required.

@arzonus arzonus changed the title feat(sharddistributor): add shard handover latency metrics feat(shard-distributor): add shard handover latency metrics Nov 14, 2025
@arzonus arzonus marked this pull request as draft November 18, 2025 08:47
@arzonus arzonus force-pushed the add-shard-handover-latency-metric branch 4 times, most recently from 14c44de to 53a34b4 Compare November 19, 2025 11:35
@arzonus arzonus force-pushed the add-shard-handover-latency-metric branch from 53a34b4 to 096cd69 Compare November 24, 2025 12:18
@arzonus arzonus marked this pull request as ready for review November 24, 2025 12:21
Comment on lines +123 to +126
if len(newAssignedShardIDs) == 0 {
// no handovers happened, nothing to do
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(newAssignedShardIDs) == 0 {
// no handovers happened, nothing to do
return
}

Nit: This is not necessary, the loop below will just not run if the list is empty

Comment on lines +350 to +358
for shardID, stats := range request.ShardStats {
// Update the executor's assigned_state key.
shardStatsKey := etcdkeys.BuildShardKey(s.prefix, namespace, shardID, etcdkeys.ShardStatisticsKey)
value, err := json.Marshal(stats)
if err != nil {
return fmt.Errorf("marshal assigned shards for shard %s: %w", shardID, err)
}
ops = append(ops, clientv3.OpPut(shardStatsKey, string(value)))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will add an action per shard. ETCD cannot handle that - we should do this outside of the main write loop, like we do with shard statistics.

ETCd can by default, at most handle 128 operations in a single transaction. This is configurable, but we cannot have one per shard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants