prevent duplicate targets after OTel Collector Prometheus relabeling by mike9421 · Pull Request #4381 · open-telemetry/opentelemetry-operator

mike9421 · 2025-09-25T12:27:09Z

Description:

Extract hash calculation fixes into a separate PR

Link to tracking Issue(s):

Testing:

Documentation:

mike9421 · 2025-09-25T12:50:18Z

Hi @swiatekm @jaronoff97 This is the bugfix-related PR that I've split out from the original changes.

During testing, I noticed that the unit test TestNamespaceLabelUpdate (opentelemetry-operator/cmd/otel-allocator/internal/watcher/promOperator_test.go, line 954 in 287b175) is flaky: it sometimes passes and sometimes fails. I suspect this is due to race conditions caused by asynchronous event processing and cache synchronization delays.

If this flaky test issue persists during your review, would it be appropriate to include a fix for this unit test in this PR as well?

swiatekm · 2025-09-25T16:48:18Z

> If this flaky test issue persists during your review, would it be appropriate to include a fix for this unit test in this PR as well?

If you are able to figure out why it fails, I'd love a fix!

mike9421 · 2025-09-26T09:21:48Z

Hello @swiatekm,
I've analyzed this test failure and haven't found any obvious issues in the code logic. My suspicion is that it is caused by event processing delays in the CI environment.

Evidence:
Timeout 60s (mike9421@140357f): tests pass consistently.
Timeout 30s (mike9421@780c6e8): tests fail consistently (no code changes).

swiatekm · 2025-09-26T11:18:27Z

@mike9421 Feel free to include that timeout increase in this PR, then, and we'll see if it helps universally.

swiatekm

This largely looks good to me. What I'd like to understand as well is the performance impact. Can you run this benchmark first on main, and then on this PR, and post the benchstat comparison results?

cmd/otel-allocator/internal/prehook/relabel_test.go

mike9421 · 2025-09-26T17:06:45Z

> This largely looks good to me. What I'd like to understand as well is the performance impact. Can you run this benchmark first on main, and then on this PR, and post the benchstat comparison results?

Hi @swiatekm I've run benchmark tests before and after the changes to compare performance impact:

Before changes: a7855fa8.txt
After changes: db181c3d5cfec9b8a28328889e34006141f4e63e.txt

Summary:

Basic target processing (BenchmarkProcessTargets): Performance remains relatively stable across different target counts (1K-800K targets)
With relabel config (BenchmarkProcessTargetsWithRelabelConfig): Shows some performance degradation, already noticeable at small target counts (roughly 11-18% slower for 1K targets in these runs)

Benchmark output:

before: a7855fa.txt

goos: darwin
goarch: arm64
cpu: Apple M2
BenchmarkProcessTargets/least-weighted/1000-8         	     741	   1723669 ns/op
BenchmarkProcessTargets/consistent-hashing/1000-8     	     792	   1596023 ns/op
BenchmarkProcessTargets/per-node/1000-8               	     774	   1576435 ns/op
BenchmarkProcessTargets/least-weighted/10000-8        	      88	  13645884 ns/op
BenchmarkProcessTargets/consistent-hashing/10000-8    	      92	  13781715 ns/op
BenchmarkProcessTargets/per-node/10000-8              	      92	  13357022 ns/op
BenchmarkProcessTargets/least-weighted/100000-8       	       8	 138047641 ns/op
BenchmarkProcessTargets/consistent-hashing/100000-8   	       8	 142912250 ns/op
BenchmarkProcessTargets/per-node/100000-8             	       7	 145062518 ns/op
BenchmarkProcessTargets/least-weighted/800000-8       	       1	1972748500 ns/op
BenchmarkProcessTargets/consistent-hashing/800000-8   	       1	2443097916 ns/op
BenchmarkProcessTargets/per-node/800000-8             	       1	2378607000 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/1000-8         	     650	   1782508 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/1000-8     	     697	   1769206 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/1000-8               	     697	   1767336 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/10000-8        	      70	  15840392 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/10000-8    	      76	  16009201 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/10000-8              	      74	  16251981 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/100000-8             	       6	 186242056 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/100000-8       	       7	 155009708 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/100000-8   	       7	 155289196 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/800000-8       	       1	2916311083 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/800000-8   	       1	2635351875 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/800000-8             	       1	2457772750 ns/op
PASS
ok  	command-line-arguments	49.259s

after: db181c3.txt

goos: darwin
goarch: arm64
cpu: Apple M2
BenchmarkProcessTargets/per-node/1000-8      	     776	   1844361 ns/op
BenchmarkProcessTargets/least-weighted/1000-8         	     793	   1505621 ns/op
BenchmarkProcessTargets/consistent-hashing/1000-8     	     804	   1530055 ns/op
BenchmarkProcessTargets/least-weighted/10000-8        	      78	  13633648 ns/op
BenchmarkProcessTargets/consistent-hashing/10000-8    	      92	  13473093 ns/op
BenchmarkProcessTargets/per-node/10000-8              	      92	  13527862 ns/op
BenchmarkProcessTargets/least-weighted/100000-8       	       6	 173648896 ns/op
BenchmarkProcessTargets/consistent-hashing/100000-8   	       7	 154749196 ns/op
BenchmarkProcessTargets/per-node/100000-8             	       8	 138934964 ns/op
BenchmarkProcessTargets/least-weighted/800000-8       	       1	2345997042 ns/op
BenchmarkProcessTargets/consistent-hashing/800000-8   	       1	2246322416 ns/op
BenchmarkProcessTargets/per-node/800000-8             	       1	2365535333 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/1000-8         	     513	   2099554 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/1000-8     	     610	   1980358 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/1000-8               	     619	   1964083 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/10000-8        	      63	  17078673 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/10000-8    	      73	  19457858 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/10000-8              	      70	  17340674 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/100000-8       	       6	 182351708 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/100000-8   	       6	 172529250 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/100000-8             	       6	 178715729 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/800000-8       	       1	2393839333 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/800000-8   	       1	3335741666 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/800000-8             	       1	3170912458 ns/op
PASS
ok  	command-line-arguments	47.172s
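
To make the comparison between the two outputs concrete, the following sketch computes the relative change for the three 1000-target BenchmarkProcessTargetsWithRelabelConfig rows. The ns/op figures are copied from the outputs above; `pctChange` is a helper introduced here for illustration. Each figure comes from one line of output per benchmark, so the results are indicative only.

```go
package main

import "fmt"

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// ns/op for BenchmarkProcessTargetsWithRelabelConfig at 1000 targets,
	// taken from the "before" and "after" outputs in this comment.
	rows := []struct {
		name          string
		before, after float64
	}{
		{"least-weighted", 1782508, 2099554},
		{"consistent-hashing", 1769206, 1980358},
		{"per-node", 1767336, 1964083},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %+.1f%%\n", r.name, pctChange(r.before, r.after))
	}
}
```

This prints +17.8%, +11.9%, and +11.1% for the three strategies. A tool such as benchstat over repeated runs gives a statistically sounder picture than a one-off calculation like this.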

swiatekm · 2025-09-27T18:19:10Z

I did 10 runs of both main and this branch, here's the benchstat:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
                                                             │ bench_main.txt │          bench_branch.txt           │
                                                             │     sec/op     │   sec/op     vs base                │
ProcessTargets/least-weighted/1000-32                             1.082m ± 2%   1.051m ± 3%   -2.85% (p=0.043 n=10)
ProcessTargets/consistent-hashing/1000-32                         1.091m ± 3%   1.076m ± 3%   -1.42% (p=0.043 n=10)
ProcessTargets/per-node/1000-32                                   1.083m ± 2%   1.076m ± 2%        ~ (p=0.481 n=10)
ProcessTargets/least-weighted/10000-32                            6.959m ± 1%   6.812m ± 2%   -2.11% (p=0.043 n=10)
ProcessTargets/consistent-hashing/10000-32                        6.834m ± 2%   6.860m ± 2%        ~ (p=0.853 n=10)
ProcessTargets/per-node/10000-32                                  6.925m ± 4%   6.795m ± 2%        ~ (p=0.165 n=10)
ProcessTargets/per-node/100000-32                                 67.67m ± 6%   68.01m ± 9%        ~ (p=0.684 n=10)
ProcessTargets/least-weighted/100000-32                           68.18m ± 5%   67.24m ± 1%        ~ (p=0.218 n=10)
ProcessTargets/consistent-hashing/100000-32                       67.53m ± 4%   67.24m ± 4%        ~ (p=0.739 n=10)
ProcessTargets/least-weighted/800000-32                           668.8m ± 3%   664.4m ± 3%        ~ (p=0.315 n=10)
ProcessTargets/consistent-hashing/800000-32                       651.9m ± 8%   662.3m ± 4%        ~ (p=0.481 n=10)
ProcessTargets/per-node/800000-32                                 657.9m ± 4%   672.1m ± 6%        ~ (p=0.280 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-32            1.547m ± 2%   1.669m ± 2%   +7.88% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-32        1.532m ± 1%   1.657m ± 2%   +8.16% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-32                  1.543m ± 2%   1.666m ± 2%   +7.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-32       11.00m ± 1%   12.77m ± 3%  +16.06% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-32                 11.06m ± 1%   12.79m ± 1%  +15.62% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-32           11.09m ± 2%   12.81m ± 2%  +15.51% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-32          109.8m ± 2%   122.0m ± 2%  +11.07% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-32      110.3m ± 1%   121.0m ± 3%   +9.75% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-32                109.9m ± 2%   124.4m ± 2%  +13.19% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-32          867.3m ± 2%   993.5m ± 6%  +14.55% (p=0.001 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-32      873.4m ± 2%   965.0m ± 2%  +10.48% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-32                867.7m ± 2%   982.0m ± 3%  +13.18% (p=0.000 n=10)
geomean                                                           29.32m        30.92m        +5.49%

                                                             │ bench_main.txt │           bench_branch.txt           │
                                                             │      B/op      │     B/op      vs base                │
ProcessTargets/least-weighted/1000-32                            4.167Mi ± 0%   4.197Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/1000-32                        4.167Mi ± 0%   4.197Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/1000-32                                  4.167Mi ± 0%   4.197Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/least-weighted/10000-32                           41.60Mi ± 0%   41.90Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/10000-32                       41.60Mi ± 0%   41.90Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/10000-32                                 41.60Mi ± 0%   41.90Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/100000-32                                415.9Mi ± 0%   418.9Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/least-weighted/100000-32                          415.9Mi ± 0%   418.9Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/100000-32                      415.9Mi ± 0%   418.9Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/least-weighted/800000-32                          3.280Gi ± 0%   3.304Gi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/800000-32                      3.280Gi ± 0%   3.304Gi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/800000-32                                3.280Gi ± 0%   3.304Gi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-32           4.456Mi ± 0%   5.211Mi ± 0%  +16.95% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-32       4.456Mi ± 0%   5.211Mi ± 0%  +16.95% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-32                 4.456Mi ± 0%   5.211Mi ± 0%  +16.95% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-32      44.51Mi ± 0%   52.07Mi ± 0%  +16.97% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-32                44.51Mi ± 0%   52.07Mi ± 0%  +16.97% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-32          44.51Mi ± 0%   52.07Mi ± 0%  +16.97% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-32         445.1Mi ± 0%   520.7Mi ± 0%  +16.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-32     445.1Mi ± 0%   520.7Mi ± 0%  +16.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-32               445.1Mi ± 0%   520.7Mi ± 0%  +16.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-32         3.492Gi ± 0%   4.082Gi ± 0%  +16.90% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-32     3.492Gi ± 0%   4.082Gi ± 0%  +16.90% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-32               3.492Gi ± 0%   4.082Gi ± 0%  +16.90% (p=0.000 n=10)
geomean                                                          128.9Mi        140.0Mi        +8.54%

                                                             │ bench_main.txt │           bench_branch.txt            │
                                                             │   allocs/op    │  allocs/op   vs base                  │
ProcessTargets/least-weighted/1000-32                             3.395k ± 0%   3.395k ± 0%        ~ (p=1.000 n=10)
ProcessTargets/consistent-hashing/1000-32                         3.395k ± 0%   3.395k ± 0%        ~ (p=1.000 n=10) ¹
ProcessTargets/per-node/1000-32                                   3.395k ± 0%   3.395k ± 0%        ~ (p=1.000 n=10) ¹
ProcessTargets/least-weighted/10000-32                            33.77k ± 0%   33.77k ± 0%        ~ (p=0.700 n=10)
ProcessTargets/consistent-hashing/10000-32                        33.77k ± 0%   33.77k ± 0%        ~ (p=0.211 n=10)
ProcessTargets/per-node/10000-32                                  33.77k ± 0%   33.77k ± 0%        ~ (p=1.000 n=10)
ProcessTargets/per-node/100000-32                                 336.5k ± 0%   336.5k ± 0%        ~ (p=0.838 n=10)
ProcessTargets/least-weighted/100000-32                           336.5k ± 0%   336.5k ± 0%        ~ (p=0.084 n=10)
ProcessTargets/consistent-hashing/100000-32                       336.5k ± 0%   336.5k ± 0%        ~ (p=0.206 n=10)
ProcessTargets/least-weighted/800000-32                           2.694M ± 0%   2.695M ± 0%        ~ (p=0.853 n=10)
ProcessTargets/consistent-hashing/800000-32                       2.694M ± 0%   2.694M ± 0%        ~ (p=0.593 n=10)
ProcessTargets/per-node/800000-32                                 2.694M ± 0%   2.694M ± 0%        ~ (p=0.529 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-32            6.388k ± 0%   7.388k ± 0%  +15.65% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-32        6.388k ± 0%   7.388k ± 0%  +15.65% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-32                  6.388k ± 0%   7.388k ± 0%  +15.65% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-32       63.74k ± 0%   73.74k ± 0%  +15.69% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-32                 63.74k ± 0%   73.74k ± 0%  +15.69% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-32           63.74k ± 0%   73.74k ± 0%  +15.69% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-32          636.3k ± 0%   736.3k ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-32      636.3k ± 0%   736.3k ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-32                636.3k ± 0%   736.3k ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-32          5.091M ± 0%   5.891M ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-32      5.091M ± 0%   5.891M ± 0%  +15.71% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-32                5.091M ± 0%   5.891M ± 0%  +15.71% (p=0.000 n=10)
geomean                                                           138.7k        149.2k        +7.56%

So we have roughly an 8-16% increase in CPU time and a 16-17% increase in memory (bytes and allocations per op) for targets with relabeling enabled. This is a fair amount, but considering this is fixing a bug and we're going to improve the overall efficiency of the whole system by doing target relabeling in the target allocator in a subsequent change, I think this is acceptable. WDYT @jaronoff97 @pavolloffay?
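
One way to read the memory columns is per target rather than per operation. The sketch below divides the before/after difference by the target count for the 1000-target ProcessTargetsWithRelabelConfig rows. The inputs are the rounded values from the benchstat tables above, so the byte figure is approximate; `perTarget` is a helper introduced here for illustration.

```go
package main

import "fmt"

const mib = 1024 * 1024

// perTarget divides the before/after difference by the number of targets.
func perTarget(before, after float64, targets int) float64 {
	return (after - before) / float64(targets)
}

func main() {
	// ProcessTargetsWithRelabelConfig, 1000 targets, rounded values from the benchstat tables.
	extraAllocs := perTarget(6388, 7388, 1000)          // allocs/op: 6.388k -> 7.388k
	extraBytes := perTarget(4.456*mib, 5.211*mib, 1000) // B/op: 4.456Mi -> 5.211Mi

	fmt.Printf("extra allocs per target: %.2f\n", extraAllocs)
	fmt.Printf("extra bytes per target:  %.0f\n", extraBytes)
}
```

With these inputs the result is about one extra allocation and roughly 790 bytes per target. The same one-allocation-per-target difference appears in the 10K, 100K, and 800K rows (63.74k to 73.74k, 636.3k to 736.3k, 5.091M to 5.891M).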

liushui123456 · 2025-10-01T15:45:03Z

This issue is rendering Target Allocator unusable for us as well, could you please review this MR or propose a workaround in the meantime? @swiatekm

swiatekm · 2025-10-01T16:33:16Z

> This issue is rendering Target Allocator unusable for us as well, could you please review this MR or propose a workaround in the meantime? @swiatekm

There is no workaround other than not rewriting the address label. The bug only happens as a result of that.
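
The thread does not show the fix itself, so the following is only an illustration of how rewriting an address label can interact with hash-based target identity. It is not the operator's code. It assumes that "address label" refers to Prometheus's `__address__` label, and every name in it (`Labels`, `hashLabels`, `rewriteAddress`, `dedupe`) is hypothetical. Two discovered targets differ only in port, and a relabel step rewrites both to the same address. Keyed by the original labels they look distinct; keyed by the relabeled labels they collapse into one.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net"
	"sort"
)

// Labels is a hypothetical stand-in for a target's label set.
type Labels map[string]string

// hashLabels returns an order-independent FNV-1a hash of a label set.
func hashLabels(l Labels) uint64 {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator between name and value
		h.Write([]byte(l[k]))
		h.Write([]byte{0xff}) // separator between pairs
	}
	return h.Sum64()
}

// rewriteAddress is a hypothetical relabel step that forces the port in
// __address__ to 9100, leaving all other labels unchanged.
func rewriteAddress(l Labels) Labels {
	out := make(Labels, len(l))
	for k, v := range l {
		out[k] = v
	}
	if host, _, err := net.SplitHostPort(out["__address__"]); err == nil {
		out["__address__"] = net.JoinHostPort(host, "9100")
	}
	return out
}

// dedupe keeps the first target seen for each key.
func dedupe(targets []Labels, key func(Labels) uint64) []Labels {
	seen := make(map[uint64]bool)
	var out []Labels
	for _, t := range targets {
		if h := key(t); !seen[h] {
			seen[h] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	discovered := []Labels{
		{"__address__": "10.0.0.1:8080", "job": "app"},
		{"__address__": "10.0.0.1:9090", "job": "app"},
	}
	byOriginal := dedupe(discovered, hashLabels)
	byRelabeled := dedupe(discovered, func(l Labels) uint64 {
		return hashLabels(rewriteAddress(l))
	})
	fmt.Println(len(byOriginal), len(byRelabeled))
}
```

In this toy setup, identity based on pre-relabel labels keeps both entries even though they end up scraping the same endpoint, while identity based on post-relabel labels keeps one. Whether this matches the exact mechanism fixed in the PR is an assumption.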

For this PR, I've already approved it, but it has a performance impact, so I'd like opinions from my fellow maintainers before merging it. The earliest it might be released is two weeks from now, in 0.137.0, so there isn't any real hurry.

jaronoff97

I think the perf hit is worth the actual functionality fix. We've done a fair amount of work in making the TA more efficient, so I think a minor regression is acceptable.

mike9421 added 2 commits September 25, 2025 19:14

fix(target-allocator): prevent duplicate targets after OTel Collector Prometheus relabeling (7ee9bb5)

test(target-allocator): add unit tests for target hash calculation fixes (272e694)

mike9421 requested a review from a team as a code owner September 25, 2025 12:27

swiatekm reviewed Sep 26, 2025

mike9421 added 2 commits September 26, 2025 22:48

test: increase timeout for TestNamespaceLabelUpdate to avoid flaky failures (6f60b4f)

fix(target): correctly set relabeledLabels and enhance test coverage (db181c3)

swiatekm approved these changes Sep 27, 2025

swiatekm requested a review from a team October 1, 2025 16:31

jaronoff97 approved these changes Oct 7, 2025

swiatekm merged commit 5f53647 into open-telemetry:main Oct 7, 2025
49 checks passed

swiatekm merged 4 commits into open-telemetry:main from mike9421:fix-targetallocator-hash-assignment