Skip to content

Add syncer metrics for federation, transform and queue wait time#1295

Merged
tpantelis merged 2 commits intosubmariner-io:develfrom
msft-paddy14:users/padgupta/add_metrics_syncer
Feb 27, 2026
Merged

Add syncer metrics for federation, transform and queue wait time#1295
tpantelis merged 2 commits intosubmariner-io:develfrom
msft-paddy14:users/padgupta/add_metrics_syncer

Conversation

@msft-paddy14
Copy link
Contributor

@msft-paddy14 msft-paddy14 commented Feb 20, 2026

Adds optional Prometheus histogram metrics to ResourceSyncer for observing federation latency, transform latency, queue wait time, and queue depth distribution.
Fixes #1294

Summary by CodeRabbit

  • New Features

    • Centralized, configurable metrics subsystem for syncers with optional custom collectors and default buckets.
    • Instrumentation added for enqueue/dequeue, transform, federation durations, queue wait time, queue length, and per-operation counters.
    • Work‑queue length query exposed for observability.
    • Configurable max log verbosity added for syncer components.
  • Tests

    • Tests updated to initialize and validate the new metrics configuration and collectors.

@submariner-bot
Copy link
Contributor

🤖 Created branch: z_pr1295/msft-paddy14/users/padgupta/add_metrics_syncer

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces per-resource Prometheus gauge fields with a centralized MetricsConfig and internal syncerMetrics subsystem; adds resolver/registration/unregister logic and instruments enqueue/dequeue, queue length, queue wait, transform duration, and federation duration across the sync pipeline.

Changes

Cohort / File(s) Summary
Metrics core
pkg/syncer/metrics.go
Add MetricsConfig, syncerMetrics, resolver functions (ResolveMetricsConfig, newSyncerMetrics), default buckets, register/unregister, and helpers to inc/observe metrics for sync counter, federation/transform durations, queue wait, and queue length.
Config & wiring
pkg/syncer/types.go, pkg/syncer/resource_syncer.go, pkg/syncer/broker/syncer.go
Replace SyncCounterOpts/SyncCounter fields with Metrics MetricsConfig; add internal metrics *syncerMetrics to resourceSyncer; wire initialization via resolved Metrics and unregister on shutdown; update constructor/signature usage.
Instrumentation (notifications & processing)
pkg/syncer/notifications.go, pkg/syncer/process_resource.go
Track enqueue timestamps and sample queue length at enqueue; record dequeue timestamps and queue wait; measure/emit transform and federation duration metrics around Transform/Distribute/Delete flows.
Workqueue API
pkg/workqueue/queue.go
Add Len() int to the workqueue Interface to allow sampling current queue length.
Broker tests
pkg/syncer/broker/syncer_test.go
Update test config to set metric opts under ResourceConfigs[0].Metrics.SyncCounterOpts instead of the removed top-level field.
Resource syncer tests
pkg/syncer/resource_syncer_test.go, pkg/syncer/transform_func_test.go
Update tests to use d.config.Metrics.* paths for metric opts/collectors; add transform-duration histogram setup in tests.

Sequence Diagram(s)

sequenceDiagram
    participant Producer as Event Producer
    participant WorkQ as WorkQueue
    participant Syncer as resourceSyncer
    participant Transformer as Transformer
    participant APIServer as API Server
    participant Prom as Prometheus

    Producer->>WorkQ: enqueue(key)
    Note right of WorkQ: metrics.trackEnqueue(key)\nworkqueue.Len() -> observeQueueLength
    WorkQ->>Syncer: deliver(key)
    Syncer->>Syncer: recordDequeue(key)\nobserveQueueWaitDurationMs
    Syncer->>Transformer: Transform(resource)
    Transformer-->>Syncer: transformed
    Syncer->>Prom: observeTransformDurationMs(direction,op,name,duration)
    Syncer->>APIServer: Distribute/Delete (federation)
    APIServer-->>Syncer: result
    Syncer->>Prom: observeFederationDurationMs(direction,op,name,duration)
    Syncer->>Prom: incSyncCounter(direction,op,name)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

release-note-needed

Suggested reviewers

  • Oats87
  • skitt
  • sridhargaddam
  • vthapar
  • tpantelis

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Actionable Comments Resolved ❌ Error Metrics initialized at line 147 of newResourceSyncer are never unregistered if early error return occurs at line 173, leaving orphaned Prometheus collectors. Implement defer-based cleanup in newResourceSyncer() to unregister metrics on error paths before returning early.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding Prometheus metrics for federation, transform, and queue wait time tracking to the syncer.
Linked Issues check ✅ Passed All requirements from issue #1294 are met: QueueLatencyMs/queue length histogram tracking [#1294], FederationLatencyMs measurement [#1294], TransformationLatencyMs tracking [#1294], and SyncCounter/SyncCounterOpts pattern adherence [#1294].
Out of Scope Changes check ✅ Passed All changes are directly scoped to metrics implementation: new MetricsConfig type, metrics resolution logic, instrumentation in processing phases, and a queue length interface method are all necessary for the stated objectives.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@msft-paddy14 msft-paddy14 marked this pull request as ready for review February 20, 2026 16:12
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
pkg/syncer/metrics.go (1)

157-162: duration.Milliseconds() truncates sub-millisecond durations to 0 — all fast transforms will cluster in the lowest bucket.

time.Duration.Milliseconds() returns an int64 (whole milliseconds). Transform calls that complete in < 1 ms will always observe 0, making the histogram useless for fast paths.

♻️ Proposed fix — use fractional milliseconds
-    }).Observe(float64(duration.Milliseconds()))
+    }).Observe(float64(duration.Nanoseconds()) / 1e6)

(Apply to all four Observe call sites.)

Also applies to: 169-173, 181-183, 191-193

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/metrics.go` around lines 157 - 162, The histogram Observe calls
use duration.Milliseconds() which truncates to whole ms and yields 0 for sub-ms
durations; change each Observe at m.federationDurationMs.With(...).Observe(...)
(and the other three Observe call sites noted) to pass a fractional millisecond
value, e.g. compute float64(duration)/float64(time.Millisecond) or
duration.Seconds()*1000.0 and pass that float64 into Observe so sub-millisecond
timings are represented accurately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/syncer/broker/syncer.go`:
- Around line 262-263: Both syncers receive the same rc.Metrics pointer causing
duplicate Prometheus registrations when Opts are non-nil; fix by resolving
Opts→collector once in broker/syncer.go before creating both syncers: in the
code that calls newResourceSyncer (for localSyncer and remoteSyncer) pre-create
the HistogramVecs referenced by rc.Metrics (e.g., for FederationDurationMsOpts
and any other *Opts fields) and assign them into rc.Metrics (so
Metrics.*HistogramVec is non-nil), mirroring the SyncCounterOpts pattern used
for SyncCounter; alternatively, if you prefer to change newSyncerMetrics, update
it to call prometheus.Register and handle prometheus.AlreadyRegisteredError to
return the existing collector instead of panicking.

In `@pkg/syncer/metrics.go`:
- Around line 80-82: The code mutates the caller-owned *prometheus.HistogramOpts
by assigning default buckets directly (e.g.,
cfg.FederationDurationMsOpts.Buckets = defaultDurationMsBuckets), which can
produce surprising side-effects; instead, make a local copy of each opts before
modifying and pass that copy to NewHistogramVec (for example copy the struct
referenced by cfg.FederationDurationMsOpts into a local variable, set
localOpts.Buckets = defaultDurationMsBuckets, then use localOpts when creating
the histogram), and apply the same pattern to the other three opts blocks
referenced in the file (the analogous blocks at lines around 92-94, 106-108,
118-120).
- Around line 77-127: prometheus.MustRegister panics when the same HistogramVec
is registered twice for shared MetricsConfig; in newSyncerMetrics (and
specifically the federationDurationMs, transformDurationMs, queueWaitDurationMs,
queueLength creation blocks) replace prometheus.MustRegister with
prometheus.Register, check for prometheus.AlreadyRegisteredError and if present
reuse the existing collector (type-assert the Err.ExistingCollector to
*prometheus.HistogramVec and assign it to m.<metric>) so you don't take
ownership or re-create the metric; keep the existing behavior of setting m.* =
cfg.* when direct pointers are provided and only create/register when opts are
used.

---

Nitpick comments:
In `@pkg/syncer/metrics.go`:
- Around line 157-162: The histogram Observe calls use duration.Milliseconds()
which truncates to whole ms and yields 0 for sub-ms durations; change each
Observe at m.federationDurationMs.With(...).Observe(...) (and the other three
Observe call sites noted) to pass a fractional millisecond value, e.g. compute
float64(duration)/float64(time.Millisecond) or duration.Seconds()*1000.0 and
pass that float64 into Observe so sub-millisecond timings are represented
accurately.

@msft-paddy14 msft-paddy14 force-pushed the users/padgupta/add_metrics_syncer branch from bb7f49f to c0c56eb Compare February 23, 2026 05:28
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/syncer/broker/syncer.go (1)

319-379: Bucket-defaulting logic is duplicated between resolveMetricsConfig and newSyncerMetrics.

Both functions independently copy opts, apply default buckets if nil, and call prometheus.NewHistogramVec. Extracting a small helper (e.g., an exported ResolveMetricsConfig in pkg/syncer/metrics.go that creates and registers the live collectors once) would remove the duplication and give direct NewResourceSyncer callers the same safe single-registration path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/broker/syncer.go` around lines 319 - 379, The bucket-defaulting
and histogram creation/registration logic duplicated between
resolveMetricsConfig and newSyncerMetrics should be extracted into a single
helper (exported ResolveMetricsConfig in pkg/syncer/metrics.go) that accepts a
*syncer.MetricsConfig, copies it, applies
DefaultDurationMsBuckets/DefaultQueueLengthBuckets when opts.Buckets == nil,
creates prometheus.NewHistogramVec collectors and prometheus.MustRegister them,
clears the Ops fields (e.g., FederationDurationMsOpts, TransformDurationMsOpts,
QueueWaitDurationMsOpts, QueueLengthOpts) and returns the resolved
*syncer.MetricsConfig; replace the logic in resolveMetricsConfig and
newSyncerMetrics to call this new ResolveMetricsConfig so all callers share the
same single-registration path.
pkg/syncer/metrics.go (1)

160-202: Sub-millisecond durations all record as 0.0 — use fractional ms for better resolution.

float64(duration.Milliseconds()) truncates to integer ms. For fast operations (e.g., pure in-memory transforms completing in ~100µs), every observation is 0, making the low end of the histogram useless. Switching to float64(duration.Nanoseconds()) / 1e6 produces fractional ms without any unit change.

♻️ Proposed fix — fractional ms precision
-).Observe(float64(duration.Milliseconds()))
+).Observe(float64(duration.Nanoseconds()) / 1e6)

(Apply to all three Observe call sites in observeFederationDurationMs, observeTransformDurationMs, and observeQueueWaitDurationMs.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/metrics.go` around lines 160 - 202, The histogram observations use
duration.Milliseconds() which truncates to integer ms, so change the Observe
calls in observeFederationDurationMs, observeTransformDurationMs, and
observeQueueWaitDurationMs to record fractional milliseconds by converting
nanoseconds to ms (use float64(duration.Nanoseconds())/1e6) instead of
float64(duration.Milliseconds()); leave observeQueueLength unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/syncer/broker/syncer.go`:
- Around line 341-374: The resolver is only nil-ing the *Opts on the local copy,
so calling NewSyncer again with the same SyncerConfig will recreate and
MustRegister the same metrics and panic; in resolveMetricsConfig, after creating
each prometheus.NewHistogramVec and registering it (for example when setting
resolved.TransformDurationMs, resolved.QueueWaitDurationMs,
resolved.QueueLength, etc.), also clear the corresponding Opts on the original
cfg.Metrics (e.g., set cfg.Metrics.TransformDurationMsOpts = nil,
cfg.Metrics.QueueWaitDurationMsOpts = nil, cfg.Metrics.QueueLengthOpts = nil) so
repeated NewSyncer calls won’t attempt to recreate/register the same metric
names. Ensure you reference resolveMetricsConfig and the cfg.Metrics fields
(TransformDurationMsOpts, QueueWaitDurationMsOpts, QueueLengthOpts) when
applying this change.

In `@pkg/syncer/metrics.go`:
- Around line 54-56: DefaultDurationMsBuckets and DefaultQueueLengthBuckets are
exported mutable slice vars that can be mutated by callers and shared via slice
header assignment; change the code to avoid sharing the backing array by copying
these slices when assigning them (e.g., when setting opts.Buckets or in
resolveMetricsConfig) instead of directly assigning the exported vars, or
alternatively make the vars unexported and expose safe accessors; ensure every
assignment uses a copy (e.g., make([]float64, len(...)) + copy) so each
histogram gets its own backing array.

---

Duplicate comments:
In `@pkg/syncer/metrics.go`:
- Around line 81-135: newSyncerMetrics currently calls prometheus.MustRegister
which panics if the same *MetricsConfig with only *Opts is reused; change the
registration logic in newSyncerMetrics to use prometheus.Register, detect
prometheus.AlreadyRegisteredError, extract the ExistingCollector and reuse it
for m.federationDurationMs / m.transformDurationMs / m.queueWaitDurationMs /
m.queueLength as appropriate, and write the resolved HistogramVec back into the
caller-owned cfg (e.g., set cfg.FederationDurationMs, cfg.TransformDurationMs,
cfg.QueueWaitDurationMs, cfg.QueueLength) so subsequent NewResourceSyncer calls
with the same *MetricsConfig don’t re-create and re-register collectors;
reference resolveMetricsConfig for the broker-safe behavior and ensure the code
handles both pre-existing *HistogramVec and *Opts paths.

---

Nitpick comments:
In `@pkg/syncer/broker/syncer.go`:
- Around line 319-379: The bucket-defaulting and histogram creation/registration
logic duplicated between resolveMetricsConfig and newSyncerMetrics should be
extracted into a single helper (exported ResolveMetricsConfig in
pkg/syncer/metrics.go) that accepts a *syncer.MetricsConfig, copies it, applies
DefaultDurationMsBuckets/DefaultQueueLengthBuckets when opts.Buckets == nil,
creates prometheus.NewHistogramVec collectors and prometheus.MustRegister them,
clears the Ops fields (e.g., FederationDurationMsOpts, TransformDurationMsOpts,
QueueWaitDurationMsOpts, QueueLengthOpts) and returns the resolved
*syncer.MetricsConfig; replace the logic in resolveMetricsConfig and
newSyncerMetrics to call this new ResolveMetricsConfig so all callers share the
same single-registration path.

In `@pkg/syncer/metrics.go`:
- Around line 160-202: The histogram observations use duration.Milliseconds()
which truncates to integer ms, so change the Observe calls in
observeFederationDurationMs, observeTransformDurationMs, and
observeQueueWaitDurationMs to record fractional milliseconds by converting
nanoseconds to ms (use float64(duration.Nanoseconds())/1e6) instead of
float64(duration.Milliseconds()); leave observeQueueLength unchanged.

Copy link
Contributor

@tpantelis tpantelis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would nice to refactor SyncCounter/SyncCounterOpts into MetricsConfig but that can be done in a separate PR if you prefer..

WorkQueueConfig *workqueue.Config

// Metrics if specified, configures optional Prometheus metrics for federation, transform, and queue operations.
Metrics *MetricsConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since MetricsConfig contains all pointer fields, this really doesn't need to be a pointer field. Not a big deal but it would simplify code a bit not having to deal with another possibly nil pointer.

Suggested change
Metrics *MetricsConfig
Metrics MetricsConfig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have converted to value instead of pointer. Still kept pointer in functions as there were golanggci-lint errors for too large objects being coped.

@msft-paddy14
Copy link
Contributor Author

It would nice to refactor SyncCounter/SyncCounterOpts into MetricsConfig but that can be done in a separate PR if you prefer..

I thought about doing this but it's exposed as a field in ResourceConfig and lighthouse agent is already consuming it. Would be a breaking change for the upgrade to new version then, right? I do agree it will be cleaner to have everything together though. Can do it in a separate PR to avoid issues as you suggested

@tpantelis
Copy link
Contributor

It would nice to refactor SyncCounter/SyncCounterOpts into MetricsConfig but that can be done in a separate PR if you prefer..

I thought about doing this but it's exposed as a field in ResourceConfig and lighthouse agent is already consuming it. Would be a breaking change for the upgrade to new version then, right?

yeah that's fine - we can adjust lighthouse code when we bump the Go dependency.

@msft-paddy14 msft-paddy14 force-pushed the users/padgupta/add_metrics_syncer branch from c0c56eb to dd81e22 Compare February 25, 2026 07:27
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/syncer/broker/syncer.go (1)

216-216: ⚠️ Potential issue | 🟠 Major

ResolveMetricsConfig can panic on repeated NewSyncer calls with reused config.

At Line 216, metrics := syncer.ResolveMetricsConfig(&rc.Metrics) resolves/registers collectors from a copy, but the resolved state is not persisted. If NewSyncer is called again with the same SyncerConfig and metric names, duplicate prometheus.MustRegister attempts can panic. This affects both syncers receiving metrics at Line 246 and Line 286.

Suggested fix (cache resolved metrics in the resource config)
-		metrics := syncer.ResolveMetricsConfig(&rc.Metrics)
+		metrics := syncer.ResolveMetricsConfig(&rc.Metrics)
+		rc.Metrics = metrics
...
-			Metrics:             metrics,
+			Metrics:             rc.Metrics,
...
-			Metrics:             metrics,
+			Metrics:             rc.Metrics,
#!/bin/bash
set -euo pipefail

echo "1) Verify resolver behavior (copy + registration):"
rg -n -C2 'func ResolveMetricsConfig|resolved := \*cfg|MustRegister\(|resolveHistogram\(' pkg/syncer/metrics.go

echo
echo "2) Verify NewSyncer resolves from rc.Metrics without persisting:"
rg -n -C2 'ResolveMetricsConfig\(&rc\.Metrics\)|Metrics:\s+metrics' pkg/syncer/broker/syncer.go

echo
echo "3) Confirm whether resolved metrics are written back to rc.Metrics:"
rg -n 'rc\.Metrics\s*=' pkg/syncer/broker/syncer.go || true

Also applies to: 246-246, 286-286

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/broker/syncer.go` at line 216, ResolveMetricsConfig currently
registers Prometheus collectors on a copy and doesn't persist the resolved
state, causing prometheus.MustRegister to panic on repeated NewSyncer calls with
the same SyncerConfig; to fix, after calling metrics :=
syncer.ResolveMetricsConfig(&rc.Metrics) in NewSyncer, write the resolved
metrics back into the original config (assign rc.Metrics = metrics or store the
resolved sub-struct/flags on rc.Metrics) so subsequent NewSyncer invocations
reuse the cached/resolved metrics instead of re-registering, ensuring you
reference ResolveMetricsConfig, rc.Metrics, NewSyncer, and
prometheus.MustRegister in the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/syncer/broker/syncer.go`:
- Line 216: ResolveMetricsConfig currently registers Prometheus collectors on a
copy and doesn't persist the resolved state, causing prometheus.MustRegister to
panic on repeated NewSyncer calls with the same SyncerConfig; to fix, after
calling metrics := syncer.ResolveMetricsConfig(&rc.Metrics) in NewSyncer, write
the resolved metrics back into the original config (assign rc.Metrics = metrics
or store the resolved sub-struct/flags on rc.Metrics) so subsequent NewSyncer
invocations reuse the cached/resolved metrics instead of re-registering,
ensuring you reference ResolveMetricsConfig, rc.Metrics, NewSyncer, and
prometheus.MustRegister in the change.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0c56eb and dd81e22.

📒 Files selected for processing (9)
  • pkg/syncer/broker/syncer.go
  • pkg/syncer/broker/syncer_test.go
  • pkg/syncer/metrics.go
  • pkg/syncer/notifications.go
  • pkg/syncer/process_resource.go
  • pkg/syncer/resource_syncer.go
  • pkg/syncer/resource_syncer_test.go
  • pkg/syncer/types.go
  • pkg/workqueue/queue.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/syncer/notifications.go
  • pkg/workqueue/queue.go
  • pkg/syncer/metrics.go
  • pkg/syncer/process_resource.go

@msft-paddy14 msft-paddy14 force-pushed the users/padgupta/add_metrics_syncer branch from dd81e22 to 72158cc Compare February 25, 2026 08:36
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/syncer/metrics.go (2)

62-65: Exported var slices are mutable — callers can silently corrupt defaults.

DefaultDurationMsBuckets and DefaultQueueLengthBuckets are assigned by reference into histogram opts (line 83). Any external mutation of these slices corrupts all histograms that share the backing array. Consider copying on assign inside resolveHistogram:

🛡️ Proposed fix — copy on assign
 	o := **opts
 	if o.Buckets == nil {
-		o.Buckets = defaultBuckets
+		buckets := make([]float64, len(defaultBuckets))
+		copy(buckets, defaultBuckets)
+		o.Buckets = buckets
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/metrics.go` around lines 62 - 65, DefaultDurationMsBuckets and
DefaultQueueLengthBuckets are exported mutable slices currently assigned by
reference into histogram options in resolveHistogram, which lets callers mutate
shared backing arrays; modify resolveHistogram (and any place that inserts these
defaults into prometheus.HistogramOpts or similar) to use a defensive copy of
the slice (e.g., make a new slice and copy values) before assigning so each
histogram gets its own backing array and external mutations cannot corrupt
defaults.

74-90: prometheus.MustRegister will panic if the same metric name is registered twice.

If a caller creates two standalone ResourceSyncer instances with the same *Opts (without pre-resolving via ResolveMetricsConfig), the second newSyncerMetrics call hits MustRegister with a duplicate collector name and panics. The broker path avoids this by pre-resolving, but the standalone path does not.

Consider using prometheus.Register with AlreadyRegisteredError handling to make this robust in all call patterns:

🛡️ Proposed fix
-	*collector = prometheus.NewHistogramVec(o, labels)
-	prometheus.MustRegister(*collector)
+	vec := prometheus.NewHistogramVec(o, labels)
+	if err := prometheus.Register(vec); err != nil {
+		var are prometheus.AlreadyRegisteredError
+		if !errors.As(err, &are) {
+			panic(err)
+		}
+		*collector = are.ExistingCollector.(*prometheus.HistogramVec)
+	} else {
+		*collector = vec
+	}

(Same pattern needed for SyncCounter in ResolveMetricsConfig at line 112.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/metrics.go` around lines 74 - 90, The resolveHistogram function
currently calls prometheus.MustRegister which will panic on duplicate metric
registration; change it to use prometheus.Register and handle
prometheus.AlreadyRegisteredError: attempt to register the new HistogramVec, and
if Register returns an AlreadyRegisteredError, extract the existing collector
(type assert to *prometheus.HistogramVec) and use that instead of the newly
created one; also ensure you properly set *collector to the resolved instance
and still nil out *opts. Apply the same pattern for the SyncCounter registration
in ResolveMetricsConfig (the SyncCounter registration path referenced in the
review) so duplicate registrations return the already-registered collector
rather than panicking.
🧹 Nitpick comments (1)
pkg/syncer/metrics.go (1)

243-259: Re-enqueue overwrites the timestamp, under-reporting queue wait time.

When the workqueue deduplicates items, trackEnqueue overwrites the original enqueue timestamp with a newer one (line 248). The recorded wait duration will reflect only the time since the last enqueue, not the first. This is acceptable for dedup-style queues (the item was already pending), but it's worth documenting this semantic so operators interpreting the metric aren't surprised.

Consider adding a brief comment:

📝 Suggested doc comment
 func (m *syncerMetrics) trackEnqueue(key string) {
 	if m == nil || m.QueueWaitDurationMs == nil {
 		return
 	}
-
+	// If the key is already pending (workqueue dedup), the timestamp is
+	// overwritten so the recorded wait reflects the latest enqueue.
 	m.enqueueTimes.Store(key, time.Now())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/metrics.go` around lines 243 - 259, Add a brief doc comment above
the trackEnqueue function explaining that enqueueTimes.Store(key, time.Now())
overwrites any previous timestamp on re-enqueue, so the metric observed by
recordDequeue via observeQueueWaitDurationMs will report time since the last
enqueue (not the first), which is the intended/dedup semantics for this
workqueue; reference the trackEnqueue, recordDequeue, enqueueTimes and
observeQueueWaitDurationMs symbols in the comment for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/syncer/metrics.go`:
- Around line 124-135: newSyncerMetrics currently loses ownership information
when ResolveMetricsConfig was pre-run (broker path) because markOwnership skips
ownership if the original cfg.Opts is nil; update newSyncerMetrics so that when
cfg != nil and cfg.Collectors != nil but cfg.Opts == nil you explicitly mark the
resolved config as owning the collectors (e.g., set resolved.OwnsCollectors =
true) before returning, so markOwnership/newSyncerMetrics ensures unregister()
will remove collectors for broker-created/resolved configs; touch
newSyncerMetrics, ResolveMetricsConfig call site and ensure the logic respects
markOwnership/unregister behavior for ResourceSyncerConfig.Metrics.

---

Duplicate comments:
In `@pkg/syncer/metrics.go`:
- Around line 62-65: DefaultDurationMsBuckets and DefaultQueueLengthBuckets are
exported mutable slices currently assigned by reference into histogram options
in resolveHistogram, which lets callers mutate shared backing arrays; modify
resolveHistogram (and any place that inserts these defaults into
prometheus.HistogramOpts or similar) to use a defensive copy of the slice (e.g.,
make a new slice and copy values) before assigning so each histogram gets its
own backing array and external mutations cannot corrupt defaults.
- Around line 74-90: The resolveHistogram function currently calls
prometheus.MustRegister which will panic on duplicate metric registration;
change it to use prometheus.Register and handle
prometheus.AlreadyRegisteredError: attempt to register the new HistogramVec, and
if Register returns an AlreadyRegisteredError, extract the existing collector
(type assert to *prometheus.HistogramVec) and use that instead of the newly
created one; also ensure you properly set *collector to the resolved instance
and still nil out *opts. Apply the same pattern for the SyncCounter registration
in ResolveMetricsConfig (the SyncCounter registration path referenced in the
review) so duplicate registrations return the already-registered collector
rather than panicking.

---

Nitpick comments:
In `@pkg/syncer/metrics.go`:
- Around line 243-259: Add a brief doc comment above the trackEnqueue function
explaining that enqueueTimes.Store(key, time.Now()) overwrites any previous
timestamp on re-enqueue, so the metric observed by recordDequeue via
observeQueueWaitDurationMs will report time since the last enqueue (not the
first), which is the intended/dedup semantics for this workqueue; reference the
trackEnqueue, recordDequeue, enqueueTimes and observeQueueWaitDurationMs symbols
in the comment for clarity.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd81e22 and 72158cc.

📒 Files selected for processing (9)
  • pkg/syncer/broker/syncer.go
  • pkg/syncer/broker/syncer_test.go
  • pkg/syncer/metrics.go
  • pkg/syncer/notifications.go
  • pkg/syncer/process_resource.go
  • pkg/syncer/resource_syncer.go
  • pkg/syncer/resource_syncer_test.go
  • pkg/syncer/types.go
  • pkg/workqueue/queue.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/syncer/notifications.go
  • pkg/syncer/resource_syncer_test.go
  • pkg/syncer/process_resource.go


BeforeEach(func() {
d.config.SyncCounterOpts = &prometheus.GaugeOpts{
d.config.Metrics.SyncCounterOpts = &prometheus.GaugeOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To cover the new metrics:

Suggested change
d.config.Metrics.SyncCounterOpts = &prometheus.GaugeOpts{
d.config.Metrics = syncer.MetricsConfig{
SyncCounterOpts: &prometheus.GaugeOpts{
Name: "sync-counter",
},
FederationDurationMsOpts: &prometheus.HistogramOpts{
Name: "federation",
},
QueueWaitDurationMsOpts: &prometheus.HistogramOpts{
Name: "queue-wait",
},
QueueLengthOpts: &prometheus.HistogramOpts{
Name: "queue-length",
},
}

To cover TransformDurationMs, add

t.config.Metrics = syncer.MetricsConfig{
    TransformDurationMsOpts: &prometheus.HistogramOpts{
	    Name: "transform",
	},
}

to BeforeEach in newTransformFunctionTestDriver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks


func (m *syncerMetrics) unregister() {
if m == nil {
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is never reached in unit tests because syncerMetrics is never nil. As such, I think we should remove the m == nil checks here and in other methods. My philosophy generally is that if it's impossible to devise a test to cover a line then the line probably shouldn't be there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. It's anyway not a public method so can't be abused as easily. Removed

// registration.
func ResolveMetricsConfig(cfg *MetricsConfig) MetricsConfig {
if cfg == nil {
return MetricsConfig{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All call sites pass in a valid pointer so this line is not covered by unit tests. Options:

  1. Add a unit test that specifically calls this with nil
  2. Don't check the pointer and assume the caller passes non-nil
  3. Change cfg param to a non-pointer value

I think 1 is overkill. 2 is reasonable but I prefer 3. The hugeParam linter will complain but this function is used in setup and copying is negligible so we can disable it via comment above line 97:

//nolint:hugeParam // This function is used in setup so copying MetricsConfig is negligible.

Also, the linter says "consider passing it by pointer" so it's just a hint anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have changed this but kept markOwnership as pointer and not doing nil check as it's not public. since it modifies one parameter I thought we can keep both pointers there for simplicity

Signed-off-by: Paddy <padgupta@microsoft.com>
@msft-paddy14 msft-paddy14 force-pushed the users/padgupta/add_metrics_syncer branch from 72158cc to 9d3ecce Compare February 27, 2026 18:44
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/syncer/metrics.go (1)

99-120: ⚠️ Potential issue | 🔴 Critical

ResolveMetricsConfig + ownership inference can still panic/leak across recreate cycles.

ResolveMetricsConfig uses prometheus.MustRegister (Line 110 and via Line 87), while newSyncerMetrics ownership is inferred from original *Opts pointers (Lines 135-153). With pre-resolved/shared configs (broker path), Opts are nil so unregister() becomes a no-op, and repeated construction with the same metric names can panic on duplicate registration.

#!/bin/bash
set -euo pipefail

echo "== metrics registration and ownership paths =="
rg -n "func ResolveMetricsConfig|MustRegister|func resolveHistogram|func newSyncerMetrics|func markOwnership" pkg/syncer/metrics.go -n -A4 -B2

echo
echo "== broker call site resolves metrics per NewSyncer invocation =="
rg -n "func NewSyncer\\(config SyncerConfig\\)|ResolveMetricsConfig\\(rc\\.Metrics\\)|Metrics:\\s+metrics" pkg/syncer/broker/syncer.go -n -A3 -B2

echo
echo "== expected verification =="
echo "1) MustRegister is used in resolver paths."
echo "2) Broker resolves metrics from rc.Metrics each NewSyncer call."
echo "3) Ownership is tied to original Opts, which are nil for pre-resolved configs."

Also applies to: 123-126, 135-153

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/syncer/metrics.go` around lines 99 - 120, ResolveMetricsConfig currently
calls prometheus.MustRegister (and resolveHistogram does similar) and later
ownership/unregister logic in newSyncerMetrics relies on original *Opts
pointers; when opts are nil (pre-resolved/shared configs) unregister becomes a
no-op and duplicate registration panics on recreate. Fix by changing
ResolveMetricsConfig/resolveHistogram to only create/register a collector when
the associated *Opts is non-nil, and when registering use prometheus.Register
(not MustRegister) with handling for prometheus.AlreadyRegisteredError: if
AlreadyRegisteredError is returned, reuse the existing collector instead of
creating a duplicate; additionally ensure you record ownership (e.g., set a new
owned flag or keep the created collector reference on the resolved struct) so
unregister() in newSyncerMetrics can detect and unregister only collectors this
instance created. Ensure references to SyncCounterOpts,
FederationDurationMsOpts, TransformDurationMsOpts, QueueWaitDurationMsOpts,
QueueLengthOpts, ResolveMetricsConfig, resolveHistogram, and newSyncerMetrics
are updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/syncer/resource_syncer.go`:
- Line 147: The constructor assigns syncer.metrics = newSyncerMetrics(...) which
registers collectors but doesn’t unregister them if newResourceSyncer returns
early; modify newResourceSyncer so that after creating syncer.metrics you
register a deferred cleanup that unregisters those metrics (or call a provided
Unregister/Close method on syncer.metrics) on any subsequent error-return path
before returning the error (ensure the defer is cancelled/cleared on success or
only runs on error), referencing the newResourceSyncer function and the
syncer.metrics/newSyncerMetrics symbols so collectors are always unregistered
when construction fails.

---

Duplicate comments:
In `@pkg/syncer/metrics.go`:
- Around line 99-120: ResolveMetricsConfig currently calls
prometheus.MustRegister (and resolveHistogram does similar) and later
ownership/unregister logic in newSyncerMetrics relies on original *Opts
pointers; when opts are nil (pre-resolved/shared configs) unregister becomes a
no-op and duplicate registration panics on recreate. Fix by changing
ResolveMetricsConfig/resolveHistogram to only create/register a collector when
the associated *Opts is non-nil, and when registering use prometheus.Register
(not MustRegister) with handling for prometheus.AlreadyRegisteredError: if
AlreadyRegisteredError is returned, reuse the existing collector instead of
creating a duplicate; additionally ensure you record ownership (e.g., set a new
owned flag or keep the created collector reference on the resolved struct) so
unregister() in newSyncerMetrics can detect and unregister only collectors this
instance created. Ensure references to SyncCounterOpts,
FederationDurationMsOpts, TransformDurationMsOpts, QueueWaitDurationMsOpts,
QueueLengthOpts, ResolveMetricsConfig, resolveHistogram, and newSyncerMetrics
are updated accordingly.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72158cc and 9d3ecce.

📒 Files selected for processing (10)
  • pkg/syncer/broker/syncer.go
  • pkg/syncer/broker/syncer_test.go
  • pkg/syncer/metrics.go
  • pkg/syncer/notifications.go
  • pkg/syncer/process_resource.go
  • pkg/syncer/resource_syncer.go
  • pkg/syncer/resource_syncer_test.go
  • pkg/syncer/transform_func_test.go
  • pkg/syncer/types.go
  • pkg/workqueue/queue.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/workqueue/queue.go
  • pkg/syncer/broker/syncer_test.go
  • pkg/syncer/notifications.go
  • pkg/syncer/types.go

@tpantelis
Copy link
Contributor

@coderabbitai approve

@coderabbitai
Copy link

coderabbitai bot commented Feb 27, 2026

✅ Actions performed

Comments resolved and changes approved.

@tpantelis tpantelis enabled auto-merge (rebase) February 27, 2026 19:17
@tpantelis tpantelis merged commit f09ac2b into submariner-io:devel Feb 27, 2026
18 checks passed
@submariner-bot
Copy link
Contributor

🤖 Closed branches: [z_pr1295/msft-paddy14/users/padgupta/add_metrics_syncer]

skitt added a commit to skitt/lighthouse that referenced this pull request Mar 4, 2026
submariner-io/admiral#1295 moved
SyncCounterOpts into a MetricsConfig struct; this adjusts accordingly.

Signed-off-by: Stephen Kitt <skitt@redhat.com>
skitt added a commit to submariner-io/lighthouse that referenced this pull request Mar 6, 2026
submariner-io/admiral#1295 moved
SyncCounterOpts into a MetricsConfig struct; this adjusts accordingly.

Signed-off-by: Stephen Kitt <skitt@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Metrics support for federation, transform, and queue operations in Resource Syncer

3 participants