Add syncer metrics for federation, transform and queue wait time#1295
Conversation
🤖 Created branch: z_pr1295/msft-paddy14/users/padgupta/add_metrics_syncer
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough
Replaces per-resource Prometheus gauge fields with a centralized MetricsConfig and internal syncerMetrics subsystem; adds resolver/registration/unregister logic and instruments enqueue/dequeue, queue length, queue wait, transform duration, and federation duration across the sync pipeline.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Producer as Event Producer
participant WorkQ as WorkQueue
participant Syncer as resourceSyncer
participant Transformer as Transformer
participant APIServer as API Server
participant Prom as Prometheus
Producer->>WorkQ: enqueue(key)
Note right of WorkQ: metrics.trackEnqueue(key)\nworkqueue.Len() -> observeQueueLength
WorkQ->>Syncer: deliver(key)
Syncer->>Syncer: recordDequeue(key)\nobserveQueueWaitDurationMs
Syncer->>Transformer: Transform(resource)
Transformer-->>Syncer: transformed
Syncer->>Prom: observeTransformDurationMs(direction,op,name,duration)
Syncer->>APIServer: Distribute/Delete (federation)
APIServer-->>Syncer: result
Syncer->>Prom: observeFederationDurationMs(direction,op,name,duration)
Syncer->>Prom: incSyncCounter(direction,op,name)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Caution: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🧹 Nitpick comments (1)
pkg/syncer/metrics.go (1)
157-162: `duration.Milliseconds()` truncates sub-millisecond durations to 0 — all fast transforms will cluster in the lowest bucket.
`time.Duration.Milliseconds()` returns an `int64` (whole milliseconds). Transform calls that complete in < 1 ms will always observe 0, making the histogram useless for fast paths.
♻️ Proposed fix — use fractional milliseconds
- }).Observe(float64(duration.Milliseconds()))
+ }).Observe(float64(duration.Nanoseconds()) / 1e6)
(Apply to all four `Observe` call sites.)
Also applies to: 169-173, 181-183, 191-193
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/metrics.go` around lines 157 - 162, The histogram Observe calls use duration.Milliseconds() which truncates to whole ms and yields 0 for sub-ms durations; change each Observe at m.federationDurationMs.With(...).Observe(...) (and the other three Observe call sites noted) to pass a fractional millisecond value, e.g. compute float64(duration)/float64(time.Millisecond) or duration.Seconds()*1000.0 and pass that float64 into Observe so sub-millisecond timings are represented accurately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/syncer/broker/syncer.go`:
- Around line 262-263: Both syncers receive the same rc.Metrics pointer causing
duplicate Prometheus registrations when Opts are non-nil; fix by resolving
Opts→collector once in broker/syncer.go before creating both syncers: in the
code that calls newResourceSyncer (for localSyncer and remoteSyncer) pre-create
the HistogramVecs referenced by rc.Metrics (e.g., for FederationDurationMsOpts
and any other *Opts fields) and assign them into rc.Metrics (so
Metrics.*HistogramVec is non-nil), mirroring the SyncCounterOpts pattern used
for SyncCounter; alternatively, if you prefer to change newSyncerMetrics, update
it to call prometheus.Register and handle prometheus.AlreadyRegisteredError to
return the existing collector instead of panicking.
In `@pkg/syncer/metrics.go`:
- Around line 80-82: The code mutates the caller-owned *prometheus.HistogramOpts
by assigning default buckets directly (e.g.,
cfg.FederationDurationMsOpts.Buckets = defaultDurationMsBuckets), which can
produce surprising side-effects; instead, make a local copy of each opts before
modifying and pass that copy to NewHistogramVec (for example copy the struct
referenced by cfg.FederationDurationMsOpts into a local variable, set
localOpts.Buckets = defaultDurationMsBuckets, then use localOpts when creating
the histogram), and apply the same pattern to the other three opts blocks
referenced in the file (the analogous blocks at lines around 92-94, 106-108,
118-120).
- Around line 77-127: prometheus.MustRegister panics when the same HistogramVec
is registered twice for shared MetricsConfig; in newSyncerMetrics (and
specifically the federationDurationMs, transformDurationMs, queueWaitDurationMs,
queueLength creation blocks) replace prometheus.MustRegister with
prometheus.Register, check for prometheus.AlreadyRegisteredError and if present
reuse the existing collector (type-assert the Err.ExistingCollector to
*prometheus.HistogramVec and assign it to m.<metric>) so you don't take
ownership or re-create the metric; keep the existing behavior of setting m.* =
cfg.* when direct pointers are provided and only create/register when opts are
used.
---
Nitpick comments:
In `@pkg/syncer/metrics.go`:
- Around line 157-162: The histogram Observe calls use duration.Milliseconds()
which truncates to whole ms and yields 0 for sub-ms durations; change each
Observe at m.federationDurationMs.With(...).Observe(...) (and the other three
Observe call sites noted) to pass a fractional millisecond value, e.g. compute
float64(duration)/float64(time.Millisecond) or duration.Seconds()*1000.0 and
pass that float64 into Observe so sub-millisecond timings are represented
accurately.
bb7f49f to c0c56eb
Actionable comments posted: 2
🧹 Nitpick comments (2)
pkg/syncer/broker/syncer.go (1)
319-379: Bucket-defaulting logic is duplicated between `resolveMetricsConfig` and `newSyncerMetrics`.
Both functions independently copy opts, apply default buckets if nil, and call `prometheus.NewHistogramVec`. Extracting a small helper (e.g., an exported `ResolveMetricsConfig` in `pkg/syncer/metrics.go` that creates and registers the live collectors once) would remove the duplication and give direct `NewResourceSyncer` callers the same safe single-registration path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/broker/syncer.go` around lines 319 - 379, the bucket-defaulting and histogram creation/registration logic duplicated between resolveMetricsConfig and newSyncerMetrics should be extracted into a single helper (exported ResolveMetricsConfig in pkg/syncer/metrics.go) that accepts a *syncer.MetricsConfig, copies it, applies DefaultDurationMsBuckets/DefaultQueueLengthBuckets when opts.Buckets == nil, creates prometheus.NewHistogramVec collectors and prometheus.MustRegister them, clears the Opts fields (e.g., FederationDurationMsOpts, TransformDurationMsOpts, QueueWaitDurationMsOpts, QueueLengthOpts) and returns the resolved *syncer.MetricsConfig; replace the logic in resolveMetricsConfig and newSyncerMetrics to call this new ResolveMetricsConfig so all callers share the same single-registration path.
pkg/syncer/metrics.go (1)
160-202: Sub-millisecond durations all record as 0.0 — use fractional ms for better resolution.
`float64(duration.Milliseconds())` truncates to integer ms. For fast operations (e.g., pure in-memory transforms completing in ~100µs), every observation is 0, making the low end of the histogram useless. Switching to `float64(duration.Nanoseconds()) / 1e6` produces fractional ms without any unit change.
♻️ Proposed fix — fractional ms precision
-).Observe(float64(duration.Milliseconds()))
+).Observe(float64(duration.Nanoseconds()) / 1e6)
(Apply to all three `Observe` call sites in `observeFederationDurationMs`, `observeTransformDurationMs`, and `observeQueueWaitDurationMs`.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/metrics.go` around lines 160 - 202, The histogram observations use duration.Milliseconds() which truncates to integer ms, so change the Observe calls in observeFederationDurationMs, observeTransformDurationMs, and observeQueueWaitDurationMs to record fractional milliseconds by converting nanoseconds to ms (use float64(duration.Nanoseconds())/1e6) instead of float64(duration.Milliseconds()); leave observeQueueLength unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/syncer/broker/syncer.go`:
- Around line 341-374: The resolver is only nil-ing the *Opts on the local copy,
so calling NewSyncer again with the same SyncerConfig will recreate and
MustRegister the same metrics and panic; in resolveMetricsConfig, after creating
each prometheus.NewHistogramVec and registering it (for example when setting
resolved.TransformDurationMs, resolved.QueueWaitDurationMs,
resolved.QueueLength, etc.), also clear the corresponding Opts on the original
cfg.Metrics (e.g., set cfg.Metrics.TransformDurationMsOpts = nil,
cfg.Metrics.QueueWaitDurationMsOpts = nil, cfg.Metrics.QueueLengthOpts = nil) so
repeated NewSyncer calls won’t attempt to recreate/register the same metric
names. Ensure you reference resolveMetricsConfig and the cfg.Metrics fields
(TransformDurationMsOpts, QueueWaitDurationMsOpts, QueueLengthOpts) when
applying this change.
In `@pkg/syncer/metrics.go`:
- Around line 54-56: DefaultDurationMsBuckets and DefaultQueueLengthBuckets are
exported mutable slice vars that can be mutated by callers and shared via slice
header assignment; change the code to avoid sharing the backing array by copying
these slices when assigning them (e.g., when setting opts.Buckets or in
resolveMetricsConfig) instead of directly assigning the exported vars, or
alternatively make the vars unexported and expose safe accessors; ensure every
assignment uses a copy (e.g., make([]float64, len(...)) + copy) so each
histogram gets its own backing array.
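The defensive-copy fix for the shared default bucket slices can be sketched like this (a minimal standalone example; the variable and function names are hypothetical stand-ins for the ones in `pkg/syncer/metrics.go`):

```go
package main

import "fmt"

// defaultDurationMsBuckets stands in for the exported default bucket slice;
// handing it out by reference would let one caller mutate every histogram
// that shares the backing array.
var defaultDurationMsBuckets = []float64{1, 5, 10, 50, 100, 500}

// bucketsOrDefault returns the caller's buckets, or a private copy of the
// defaults so later mutations cannot leak across configs.
func bucketsOrDefault(buckets []float64) []float64 {
	if buckets != nil {
		return buckets
	}

	cp := make([]float64, len(defaultDurationMsBuckets))
	copy(cp, defaultDurationMsBuckets)

	return cp
}

func main() {
	b := bucketsOrDefault(nil)
	b[0] = 9999 // mutating the returned slice...

	fmt.Println(defaultDurationMsBuckets[0]) // ...leaves the shared default at 1
}
```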
---
Duplicate comments:
In `@pkg/syncer/metrics.go`:
- Around line 81-135: newSyncerMetrics currently calls prometheus.MustRegister
which panics if the same *MetricsConfig with only *Opts is reused; change the
registration logic in newSyncerMetrics to use prometheus.Register, detect
prometheus.AlreadyRegisteredError, extract the ExistingCollector and reuse it
for m.federationDurationMs / m.transformDurationMs / m.queueWaitDurationMs /
m.queueLength as appropriate, and write the resolved HistogramVec back into the
caller-owned cfg (e.g., set cfg.FederationDurationMs, cfg.TransformDurationMs,
cfg.QueueWaitDurationMs, cfg.QueueLength) so subsequent NewResourceSyncer calls
with the same *MetricsConfig don’t re-create and re-register collectors;
reference resolveMetricsConfig for the broker-safe behavior and ensure the code
handles both pre-existing *HistogramVec and *Opts paths.
---
Nitpick comments:
In `@pkg/syncer/broker/syncer.go`:
- Around line 319-379: The bucket-defaulting and histogram creation/registration
logic duplicated between resolveMetricsConfig and newSyncerMetrics should be
extracted into a single helper (exported ResolveMetricsConfig in
pkg/syncer/metrics.go) that accepts a *syncer.MetricsConfig, copies it, applies
DefaultDurationMsBuckets/DefaultQueueLengthBuckets when opts.Buckets == nil,
creates prometheus.NewHistogramVec collectors and prometheus.MustRegister them,
clears the Ops fields (e.g., FederationDurationMsOpts, TransformDurationMsOpts,
QueueWaitDurationMsOpts, QueueLengthOpts) and returns the resolved
*syncer.MetricsConfig; replace the logic in resolveMetricsConfig and
newSyncerMetrics to call this new ResolveMetricsConfig so all callers share the
same single-registration path.
In `@pkg/syncer/metrics.go`:
- Around line 160-202: The histogram observations use duration.Milliseconds()
which truncates to integer ms, so change the Observe calls in
observeFederationDurationMs, observeTransformDurationMs, and
observeQueueWaitDurationMs to record fractional milliseconds by converting
nanoseconds to ms (use float64(duration.Nanoseconds())/1e6) instead of
float64(duration.Milliseconds()); leave observeQueueLength unchanged.
tpantelis
left a comment
It would be nice to refactor SyncCounter/SyncCounterOpts into MetricsConfig but that can be done in a separate PR if you prefer.
pkg/syncer/types.go (Outdated)
 WorkQueueConfig *workqueue.Config

 // Metrics if specified, configures optional Prometheus metrics for federation, transform, and queue operations.
 Metrics *MetricsConfig
Since MetricsConfig contains all pointer fields, this really doesn't need to be a pointer field. Not a big deal but it would simplify code a bit not having to deal with another possibly nil pointer.
- Metrics *MetricsConfig
+ Metrics MetricsConfig
Have converted to value instead of pointer. Still kept pointers in functions as there were golangci-lint errors for too-large objects being copied.
I thought about doing this but it's exposed as a field in ResourceConfig and the lighthouse agent is already consuming it. It would be a breaking change for the upgrade to the new version then, right? I do agree it would be cleaner to have everything together though. Can do it in a separate PR to avoid issues, as you suggested.
Yeah, that's fine - we can adjust lighthouse code when we bump the Go dependency.
c0c56eb to dd81e22
♻️ Duplicate comments (1)
pkg/syncer/broker/syncer.go (1)
216-216: ⚠️ Potential issue | 🟠 Major
`ResolveMetricsConfig` can panic on repeated `NewSyncer` calls with reused config.
At Line 216, `metrics := syncer.ResolveMetricsConfig(&rc.Metrics)` resolves/registers collectors from a copy, but the resolved state is not persisted. If `NewSyncer` is called again with the same `SyncerConfig` and metric names, duplicate `prometheus.MustRegister` attempts can panic. This affects both syncers receiving `metrics` at Line 246 and Line 286.
Suggested fix (cache resolved metrics in the resource config)
- metrics := syncer.ResolveMetricsConfig(&rc.Metrics)
+ metrics := syncer.ResolveMetricsConfig(&rc.Metrics)
+ rc.Metrics = metrics
 ...
- Metrics: metrics,
+ Metrics: rc.Metrics,
 ...
- Metrics: metrics,
+ Metrics: rc.Metrics,
#!/bin/bash
set -euo pipefail
echo "1) Verify resolver behavior (copy + registration):"
rg -n -C2 'func ResolveMetricsConfig|resolved := \*cfg|MustRegister\(|resolveHistogram\(' pkg/syncer/metrics.go
echo
echo "2) Verify NewSyncer resolves from rc.Metrics without persisting:"
rg -n -C2 'ResolveMetricsConfig\(&rc\.Metrics\)|Metrics:\s+metrics' pkg/syncer/broker/syncer.go
echo
echo "3) Confirm whether resolved metrics are written back to rc.Metrics:"
rg -n 'rc\.Metrics\s*=' pkg/syncer/broker/syncer.go || true
Also applies to: 246-246, 286-286
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/broker/syncer.go` at line 216, ResolveMetricsConfig currently registers Prometheus collectors on a copy and doesn't persist the resolved state, causing prometheus.MustRegister to panic on repeated NewSyncer calls with the same SyncerConfig; to fix, after calling metrics := syncer.ResolveMetricsConfig(&rc.Metrics) in NewSyncer, write the resolved metrics back into the original config (assign rc.Metrics = metrics or store the resolved sub-struct/flags on rc.Metrics) so subsequent NewSyncer invocations reuse the cached/resolved metrics instead of re-registering, ensuring you reference ResolveMetricsConfig, rc.Metrics, NewSyncer, and prometheus.MustRegister in the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/syncer/broker/syncer.go`:
- Line 216: ResolveMetricsConfig currently registers Prometheus collectors on a
copy and doesn't persist the resolved state, causing prometheus.MustRegister to
panic on repeated NewSyncer calls with the same SyncerConfig; to fix, after
calling metrics := syncer.ResolveMetricsConfig(&rc.Metrics) in NewSyncer, write
the resolved metrics back into the original config (assign rc.Metrics = metrics
or store the resolved sub-struct/flags on rc.Metrics) so subsequent NewSyncer
invocations reuse the cached/resolved metrics instead of re-registering,
ensuring you reference ResolveMetricsConfig, rc.Metrics, NewSyncer, and
prometheus.MustRegister in the change.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- pkg/syncer/broker/syncer.go
- pkg/syncer/broker/syncer_test.go
- pkg/syncer/metrics.go
- pkg/syncer/notifications.go
- pkg/syncer/process_resource.go
- pkg/syncer/resource_syncer.go
- pkg/syncer/resource_syncer_test.go
- pkg/syncer/types.go
- pkg/workqueue/queue.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/syncer/notifications.go
- pkg/workqueue/queue.go
- pkg/syncer/metrics.go
- pkg/syncer/process_resource.go
dd81e22 to 72158cc
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/syncer/metrics.go (2)
62-65: Exported `var` slices are mutable — callers can silently corrupt defaults.
`DefaultDurationMsBuckets` and `DefaultQueueLengthBuckets` are assigned by reference into histogram opts (line 83). Any external mutation of these slices corrupts all histograms that share the backing array. Consider copying on assign inside `resolveHistogram`:
🛡️ Proposed fix — copy on assign
 o := **opts
 if o.Buckets == nil {
-	o.Buckets = defaultBuckets
+	buckets := make([]float64, len(defaultBuckets))
+	copy(buckets, defaultBuckets)
+	o.Buckets = buckets
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/metrics.go` around lines 62 - 65, DefaultDurationMsBuckets and DefaultQueueLengthBuckets are exported mutable slices currently assigned by reference into histogram options in resolveHistogram, which lets callers mutate shared backing arrays; modify resolveHistogram (and any place that inserts these defaults into prometheus.HistogramOpts or similar) to use a defensive copy of the slice (e.g., make a new slice and copy values) before assigning so each histogram gets its own backing array and external mutations cannot corrupt defaults.
74-90: `prometheus.MustRegister` will panic if the same metric name is registered twice.
If a caller creates two standalone `ResourceSyncer` instances with the same `*Opts` (without pre-resolving via `ResolveMetricsConfig`), the second `newSyncerMetrics` call hits `MustRegister` with a duplicate collector name and panics. The broker path avoids this by pre-resolving, but the standalone path does not.
Consider using `prometheus.Register` with `AlreadyRegisteredError` handling to make this robust in all call patterns:
🛡️ Proposed fix
- *collector = prometheus.NewHistogramVec(o, labels)
- prometheus.MustRegister(*collector)
+ vec := prometheus.NewHistogramVec(o, labels)
+ if err := prometheus.Register(vec); err != nil {
+	var are prometheus.AlreadyRegisteredError
+	if !errors.As(err, &are) {
+		panic(err)
+	}
+	*collector = are.ExistingCollector.(*prometheus.HistogramVec)
+ } else {
+	*collector = vec
+ }
(Same pattern needed for `SyncCounter` in `ResolveMetricsConfig` at line 112.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/metrics.go` around lines 74 - 90, The resolveHistogram function currently calls prometheus.MustRegister which will panic on duplicate metric registration; change it to use prometheus.Register and handle prometheus.AlreadyRegisteredError: attempt to register the new HistogramVec, and if Register returns an AlreadyRegisteredError, extract the existing collector (type assert to *prometheus.HistogramVec) and use that instead of the newly created one; also ensure you properly set *collector to the resolved instance and still nil out *opts. Apply the same pattern for the SyncCounter registration in ResolveMetricsConfig (the SyncCounter registration path referenced in the review) so duplicate registrations return the already-registered collector rather than panicking.
🧹 Nitpick comments (1)
pkg/syncer/metrics.go (1)
243-259: Re-enqueue overwrites the timestamp, under-reporting queue wait time.
When the workqueue deduplicates items, `trackEnqueue` overwrites the original enqueue timestamp with a newer one (line 248). The recorded wait duration will reflect only the time since the last enqueue, not the first. This is acceptable for dedup-style queues (the item was already pending), but it's worth documenting this semantic so operators interpreting the metric aren't surprised.
Consider adding a brief comment:
📝 Suggested doc comment
 func (m *syncerMetrics) trackEnqueue(key string) {
 	if m == nil || m.QueueWaitDurationMs == nil {
 		return
 	}
-
+	// If the key is already pending (workqueue dedup), the timestamp is
+	// overwritten so the recorded wait reflects the latest enqueue.
 	m.enqueueTimes.Store(key, time.Now())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/metrics.go` around lines 243 - 259, Add a brief doc comment above the trackEnqueue function explaining that enqueueTimes.Store(key, time.Now()) overwrites any previous timestamp on re-enqueue, so the metric observed by recordDequeue via observeQueueWaitDurationMs will report time since the last enqueue (not the first), which is the intended/dedup semantics for this workqueue; reference the trackEnqueue, recordDequeue, enqueueTimes and observeQueueWaitDurationMs symbols in the comment for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/syncer/metrics.go`:
- Around line 124-135: newSyncerMetrics currently loses ownership information
when ResolveMetricsConfig was pre-run (broker path) because markOwnership skips
ownership if the original cfg.Opts is nil; update newSyncerMetrics so that when
cfg != nil and cfg.Collectors != nil but cfg.Opts == nil you explicitly mark the
resolved config as owning the collectors (e.g., set resolved.OwnsCollectors =
true) before returning, so markOwnership/newSyncerMetrics ensures unregister()
will remove collectors for broker-created/resolved configs; touch
newSyncerMetrics, ResolveMetricsConfig call site and ensure the logic respects
markOwnership/unregister behavior for ResourceSyncerConfig.Metrics.
---
Duplicate comments:
In `@pkg/syncer/metrics.go`:
- Around line 62-65: DefaultDurationMsBuckets and DefaultQueueLengthBuckets are
exported mutable slices currently assigned by reference into histogram options
in resolveHistogram, which lets callers mutate shared backing arrays; modify
resolveHistogram (and any place that inserts these defaults into
prometheus.HistogramOpts or similar) to use a defensive copy of the slice (e.g.,
make a new slice and copy values) before assigning so each histogram gets its
own backing array and external mutations cannot corrupt defaults.
- Around line 74-90: The resolveHistogram function currently calls
prometheus.MustRegister which will panic on duplicate metric registration;
change it to use prometheus.Register and handle
prometheus.AlreadyRegisteredError: attempt to register the new HistogramVec, and
if Register returns an AlreadyRegisteredError, extract the existing collector
(type assert to *prometheus.HistogramVec) and use that instead of the newly
created one; also ensure you properly set *collector to the resolved instance
and still nil out *opts. Apply the same pattern for the SyncCounter registration
in ResolveMetricsConfig (the SyncCounter registration path referenced in the
review) so duplicate registrations return the already-registered collector
rather than panicking.
---
Nitpick comments:
In `@pkg/syncer/metrics.go`:
- Around line 243-259: Add a brief doc comment above the trackEnqueue function
explaining that enqueueTimes.Store(key, time.Now()) overwrites any previous
timestamp on re-enqueue, so the metric observed by recordDequeue via
observeQueueWaitDurationMs will report time since the last enqueue (not the
first), which is the intended/dedup semantics for this workqueue; reference the
trackEnqueue, recordDequeue, enqueueTimes and observeQueueWaitDurationMs symbols
in the comment for clarity.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- pkg/syncer/broker/syncer.go
- pkg/syncer/broker/syncer_test.go
- pkg/syncer/metrics.go
- pkg/syncer/notifications.go
- pkg/syncer/process_resource.go
- pkg/syncer/resource_syncer.go
- pkg/syncer/resource_syncer_test.go
- pkg/syncer/types.go
- pkg/workqueue/queue.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/syncer/notifications.go
- pkg/syncer/resource_syncer_test.go
- pkg/syncer/process_resource.go
pkg/syncer/resource_syncer_test.go (Outdated)
 BeforeEach(func() {
- d.config.SyncCounterOpts = &prometheus.GaugeOpts{
+ d.config.Metrics.SyncCounterOpts = &prometheus.GaugeOpts{
To cover the new metrics:
- d.config.Metrics.SyncCounterOpts = &prometheus.GaugeOpts{
+ d.config.Metrics = syncer.MetricsConfig{
+ 	SyncCounterOpts: &prometheus.GaugeOpts{
+ 		Name: "sync-counter",
+ 	},
+ 	FederationDurationMsOpts: &prometheus.HistogramOpts{
+ 		Name: "federation",
+ 	},
+ 	QueueWaitDurationMsOpts: &prometheus.HistogramOpts{
+ 		Name: "queue-wait",
+ 	},
+ 	QueueLengthOpts: &prometheus.HistogramOpts{
+ 		Name: "queue-length",
+ 	},
+ }
To cover TransformDurationMs, add
t.config.Metrics = syncer.MetricsConfig{
TransformDurationMsOpts: &prometheus.HistogramOpts{
Name: "transform",
},
}
to BeforeEach in newTransformFunctionTestDriver.
pkg/syncer/metrics.go (Outdated)
 func (m *syncerMetrics) unregister() {
 	if m == nil {
 		return
This line is never reached in unit tests because syncerMetrics is never nil. As such, I think we should remove the m == nil checks here and in other methods. My philosophy generally is that if it's impossible to devise a test to cover a line then the line probably shouldn't be there.
Fair point. It's anyway not a public method so can't be abused as easily. Removed
pkg/syncer/metrics.go (Outdated)
 // registration.
 func ResolveMetricsConfig(cfg *MetricsConfig) MetricsConfig {
 	if cfg == nil {
 		return MetricsConfig{}
All call sites pass in a valid pointer so this line is not covered by unit tests. Options:
- Add a unit test that specifically calls this with nil
- Don't check the pointer and assume the caller passes non-nil
- Change the `cfg` param to a non-pointer value
I think 1 is overkill. 2 is reasonable but I prefer 3. The hugeParam linter will complain but this function is used in setup and copying is negligible so we can disable it via comment above line 97:
//nolint:hugeParam // This function is used in setup so copying MetricsConfig is negligible.
Also, the linter says "consider passing it by pointer" so it's just a hint anyway.
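To illustrate where such a suppression directive sits (a minimal compilable sketch; the trivial `MetricsConfig` struct here is a stand-in, not the real type):

```go
package main

import "fmt"

type MetricsConfig struct{ name string }

// The hugeParam linter hint can be suppressed where pass-by-value is
// deliberate; the nolint directive goes on the line directly above the
// function it applies to.
//
//nolint:hugeParam // This function is used in setup so copying MetricsConfig is negligible.
func ResolveMetricsConfig(cfg MetricsConfig) MetricsConfig {
	return cfg
}

func main() {
	fmt.Println(ResolveMetricsConfig(MetricsConfig{name: "demo"}).name)
}
```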
Have changed this but kept markOwnership as a pointer and not doing a nil check as it's not public. Since it modifies one parameter I thought we can keep both pointers there for simplicity.
Signed-off-by: Paddy <padgupta@microsoft.com>
72158cc to 9d3ecce
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/syncer/metrics.go (1)
99-120: ⚠️ Potential issue | 🔴 Critical
`ResolveMetricsConfig` + ownership inference can still panic/leak across recreate cycles.
`ResolveMetricsConfig` uses `prometheus.MustRegister` (Line 110 and via Line 87), while `newSyncerMetrics` ownership is inferred from original `*Opts` pointers (Lines 135-153). With pre-resolved/shared configs (broker path), `Opts` are nil so `unregister()` becomes a no-op, and repeated construction with the same metric names can panic on duplicate registration.
#!/bin/bash
set -euo pipefail
echo "== metrics registration and ownership paths =="
rg -n "func ResolveMetricsConfig|MustRegister|func resolveHistogram|func newSyncerMetrics|func markOwnership" pkg/syncer/metrics.go -n -A4 -B2
echo
echo "== broker call site resolves metrics per NewSyncer invocation =="
rg -n "func NewSyncer\\(config SyncerConfig\\)|ResolveMetricsConfig\\(rc\\.Metrics\\)|Metrics:\\s+metrics" pkg/syncer/broker/syncer.go -n -A3 -B2
echo
echo "== expected verification =="
echo "1) MustRegister is used in resolver paths."
echo "2) Broker resolves metrics from rc.Metrics each NewSyncer call."
echo "3) Ownership is tied to original Opts, which are nil for pre-resolved configs."
Also applies to: 123-126, 135-153
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/syncer/metrics.go` around lines 99 - 120, ResolveMetricsConfig currently calls prometheus.MustRegister (and resolveHistogram does similar) and later ownership/unregister logic in newSyncerMetrics relies on original *Opts pointers; when opts are nil (pre-resolved/shared configs) unregister becomes a no-op and duplicate registration panics on recreate. Fix by changing ResolveMetricsConfig/resolveHistogram to only create/register a collector when the associated *Opts is non-nil, and when registering use prometheus.Register (not MustRegister) with handling for prometheus.AlreadyRegisteredError: if AlreadyRegisteredError is returned, reuse the existing collector instead of creating a duplicate; additionally ensure you record ownership (e.g., set a new owned flag or keep the created collector reference on the resolved struct) so unregister() in newSyncerMetrics can detect and unregister only collectors this instance created. Ensure references to SyncCounterOpts, FederationDurationMsOpts, TransformDurationMsOpts, QueueWaitDurationMsOpts, QueueLengthOpts, ResolveMetricsConfig, resolveHistogram, and newSyncerMetrics are updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/syncer/resource_syncer.go`:
- Line 147: The constructor assigns syncer.metrics = newSyncerMetrics(...) which
registers collectors but doesn’t unregister them if newResourceSyncer returns
early; modify newResourceSyncer so that after creating syncer.metrics you
register a deferred cleanup that unregisters those metrics (or call a provided
Unregister/Close method on syncer.metrics) on any subsequent error-return path
before returning the error (ensure the defer is cancelled/cleared on success or
only runs on error), referencing the newResourceSyncer function and the
syncer.metrics/newSyncerMetrics symbols so collectors are always unregistered
when construction fails.
---
Duplicate comments:
In `@pkg/syncer/metrics.go`:
- Around line 99-120: ResolveMetricsConfig currently calls
prometheus.MustRegister (and resolveHistogram does similar) and later
ownership/unregister logic in newSyncerMetrics relies on original *Opts
pointers; when opts are nil (pre-resolved/shared configs) unregister becomes a
no-op and duplicate registration panics on recreate. Fix by changing
ResolveMetricsConfig/resolveHistogram to only create/register a collector when
the associated *Opts is non-nil, and when registering use prometheus.Register
(not MustRegister) with handling for prometheus.AlreadyRegisteredError: if
AlreadyRegisteredError is returned, reuse the existing collector instead of
creating a duplicate; additionally ensure you record ownership (e.g., set a new
owned flag or keep the created collector reference on the resolved struct) so
unregister() in newSyncerMetrics can detect and unregister only collectors this
instance created. Ensure references to SyncCounterOpts,
FederationDurationMsOpts, TransformDurationMsOpts, QueueWaitDurationMsOpts,
QueueLengthOpts, ResolveMetricsConfig, resolveHistogram, and newSyncerMetrics
are updated accordingly.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- pkg/syncer/broker/syncer.go
- pkg/syncer/broker/syncer_test.go
- pkg/syncer/metrics.go
- pkg/syncer/notifications.go
- pkg/syncer/process_resource.go
- pkg/syncer/resource_syncer.go
- pkg/syncer/resource_syncer_test.go
- pkg/syncer/transform_func_test.go
- pkg/syncer/types.go
- pkg/workqueue/queue.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/workqueue/queue.go
- pkg/syncer/broker/syncer_test.go
- pkg/syncer/notifications.go
- pkg/syncer/types.go
@coderabbitai approve
✅ Actions performed: Comments resolved and changes approved.
🤖 Closed branches: [z_pr1295/msft-paddy14/users/padgupta/add_metrics_syncer]
submariner-io/admiral#1295 moved SyncCounterOpts into a MetricsConfig struct; this adjusts accordingly. Signed-off-by: Stephen Kitt <skitt@redhat.com>
Adds optional Prometheus histogram metrics to ResourceSyncer for observing federation latency, transform latency, queue wait time, and queue depth distribution.
Fixes #1294
Summary by CodeRabbit
New Features
Tests