prevent duplicate targets after OTel Collector Prometheus relabeling #4381

Merged
swiatekm merged 4 commits into open-telemetry:main from mike9421:fix-targetallocator-hash-assignment
Oct 7, 2025

Conversation

@mike9421
Contributor

Description:

Extract hash calculation fixes into a separate PR

Link to tracking Issue(s):

Testing:

Documentation:

@mike9421 mike9421 requested a review from a team as a code owner September 25, 2025 12:27
@mike9421
Contributor Author

Hi @swiatekm @jaronoff97 This is the bugfix-related PR that I've split out from the original changes.

During testing, I noticed that the unit test TestNamespaceLabelUpdate is flaky: it sometimes passes and sometimes fails. I suspect this is due to race conditions caused by asynchronous event processing and cache synchronization delays.

If this flaky test issue persists during your review, would it be appropriate to include a fix for this unit test in this PR as well?

@swiatekm
Contributor

> If this flaky test issue persists during your review, would it be appropriate to include a fix for this unit test in this PR as well?

If you are able to figure out why it fails, I'd love a fix!

@mike9421
Contributor Author

Hello @swiatekm
I've analyzed this test failure and haven't found any obvious issues in the code logic. My suspicion is that this is caused by event processing delays in the CI environment.

Evidence:
Timeout 60s (mike9421@140357f): tests pass consistently.
Timeout 30s (mike9421@780c6e8): tests fail consistently (no code changes).

@swiatekm
Contributor

@mike9421 Feel free to include that timeout increase in this PR, then, and we'll see if it helps universally.

Contributor

@swiatekm swiatekm left a comment

This largely looks good to me. What I'd like to understand as well is the performance impact. Can you run this benchmark first on main, and then on this PR, and post the benchstat comparison results?

@mike9421
Contributor Author

mike9421 commented Sep 26, 2025

> This largely looks good to me. What I'd like to understand as well is the performance impact. Can you run this benchmark first on main, and then on this PR, and post the benchstat comparison results?

Hi @swiatekm I've run benchmark tests before and after the changes to compare performance impact:

Before changes: a7855fa8.txt
After changes: db181c3d5cfec9b8a28328889e34006141f4e63e.txt

Summary:

  • Basic target processing (BenchmarkProcessTargets): Performance remains relatively stable across different target counts (1K-800K targets)
  • With relabel config (BenchmarkProcessTargetsWithRelabelConfig): Shows some performance degradation, particularly noticeable in smaller target counts (~15-20% slower for 1K targets)

Benchmark output:

goos: darwin
goarch: arm64
cpu: Apple M2
BenchmarkProcessTargets/least-weighted/1000-8         	     741	   1723669 ns/op
BenchmarkProcessTargets/consistent-hashing/1000-8     	     792	   1596023 ns/op
BenchmarkProcessTargets/per-node/1000-8               	     774	   1576435 ns/op
BenchmarkProcessTargets/least-weighted/10000-8        	      88	  13645884 ns/op
BenchmarkProcessTargets/consistent-hashing/10000-8    	      92	  13781715 ns/op
BenchmarkProcessTargets/per-node/10000-8              	      92	  13357022 ns/op
BenchmarkProcessTargets/least-weighted/100000-8       	       8	 138047641 ns/op
BenchmarkProcessTargets/consistent-hashing/100000-8   	       8	 142912250 ns/op
BenchmarkProcessTargets/per-node/100000-8             	       7	 145062518 ns/op
BenchmarkProcessTargets/least-weighted/800000-8       	       1	1972748500 ns/op
BenchmarkProcessTargets/consistent-hashing/800000-8   	       1	2443097916 ns/op
BenchmarkProcessTargets/per-node/800000-8             	       1	2378607000 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/1000-8         	     650	   1782508 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/1000-8     	     697	   1769206 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/1000-8               	     697	   1767336 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/10000-8        	      70	  15840392 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/10000-8    	      76	  16009201 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/10000-8              	      74	  16251981 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/100000-8             	       6	 186242056 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/100000-8       	       7	 155009708 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/100000-8   	       7	 155289196 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/800000-8       	       1	2916311083 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/800000-8   	       1	2635351875 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/800000-8             	       1	2457772750 ns/op
PASS
ok  	command-line-arguments	49.259s
goos: darwin
goarch: arm64
cpu: Apple M2
BenchmarkProcessTargets/per-node/1000-8      	     776	   1844361 ns/op
BenchmarkProcessTargets/least-weighted/1000-8         	     793	   1505621 ns/op
BenchmarkProcessTargets/consistent-hashing/1000-8     	     804	   1530055 ns/op
BenchmarkProcessTargets/least-weighted/10000-8        	      78	  13633648 ns/op
BenchmarkProcessTargets/consistent-hashing/10000-8    	      92	  13473093 ns/op
BenchmarkProcessTargets/per-node/10000-8              	      92	  13527862 ns/op
BenchmarkProcessTargets/least-weighted/100000-8       	       6	 173648896 ns/op
BenchmarkProcessTargets/consistent-hashing/100000-8   	       7	 154749196 ns/op
BenchmarkProcessTargets/per-node/100000-8             	       8	 138934964 ns/op
BenchmarkProcessTargets/least-weighted/800000-8       	       1	2345997042 ns/op
BenchmarkProcessTargets/consistent-hashing/800000-8   	       1	2246322416 ns/op
BenchmarkProcessTargets/per-node/800000-8             	       1	2365535333 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/1000-8         	     513	   2099554 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/1000-8     	     610	   1980358 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/1000-8               	     619	   1964083 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/10000-8        	      63	  17078673 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/10000-8    	      73	  19457858 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/10000-8              	      70	  17340674 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/100000-8       	       6	 182351708 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/100000-8   	       6	 172529250 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/100000-8             	       6	 178715729 ns/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted/800000-8       	       1	2393839333 ns/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing/800000-8   	       1	3335741666 ns/op
BenchmarkProcessTargetsWithRelabelConfig/per-node/800000-8             	       1	3170912458 ns/op
PASS
ok  	command-line-arguments	47.172s
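
The relative slowdown for the 1K relabel case can be checked directly from the two runs above. This is a small arithmetic sketch that assumes the first run is "before" and the second is "after"; the helper pctChange is introduced here for illustration only.

```go
package main

import "fmt"

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// ns/op for BenchmarkProcessTargetsWithRelabelConfig at 1000 targets,
	// copied from the two runs above (first run assumed to be "before").
	rows := []struct {
		name          string
		before, after float64
	}{
		{"least-weighted", 1782508, 2099554},
		{"consistent-hashing", 1769206, 1980358},
		{"per-node", 1767336, 1964083},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %+.1f%%\n", r.name, pctChange(r.before, r.after))
	}
}
```

On these numbers the slowdown comes out at roughly 11% to 18%. Each figure is from a single run, so the differences are noisy.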

@swiatekm
Contributor

I did 10 runs of both main and this branch, here's the benchstat:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
                                                             │ bench_main.txt │          bench_branch.txt           │
                                                             │     sec/op     │   sec/op     vs base                │
ProcessTargets/least-weighted/1000-32                             1.082m ± 2%   1.051m ± 3%   -2.85% (p=0.043 n=10)
ProcessTargets/consistent-hashing/1000-32                         1.091m ± 3%   1.076m ± 3%   -1.42% (p=0.043 n=10)
ProcessTargets/per-node/1000-32                                   1.083m ± 2%   1.076m ± 2%        ~ (p=0.481 n=10)
ProcessTargets/least-weighted/10000-32                            6.959m ± 1%   6.812m ± 2%   -2.11% (p=0.043 n=10)
ProcessTargets/consistent-hashing/10000-32                        6.834m ± 2%   6.860m ± 2%        ~ (p=0.853 n=10)
ProcessTargets/per-node/10000-32                                  6.925m ± 4%   6.795m ± 2%        ~ (p=0.165 n=10)
ProcessTargets/per-node/100000-32                                 67.67m ± 6%   68.01m ± 9%        ~ (p=0.684 n=10)
ProcessTargets/least-weighted/100000-32                           68.18m ± 5%   67.24m ± 1%        ~ (p=0.218 n=10)
ProcessTargets/consistent-hashing/100000-32                       67.53m ± 4%   67.24m ± 4%        ~ (p=0.739 n=10)
ProcessTargets/least-weighted/800000-32                           668.8m ± 3%   664.4m ± 3%        ~ (p=0.315 n=10)
ProcessTargets/consistent-hashing/800000-32                       651.9m ± 8%   662.3m ± 4%        ~ (p=0.481 n=10)
ProcessTargets/per-node/800000-32                                 657.9m ± 4%   672.1m ± 6%        ~ (p=0.280 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-32            1.547m ± 2%   1.669m ± 2%   +7.88% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-32        1.532m ± 1%   1.657m ± 2%   +8.16% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-32                  1.543m ± 2%   1.666m ± 2%   +7.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-32       11.00m ± 1%   12.77m ± 3%  +16.06% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-32                 11.06m ± 1%   12.79m ± 1%  +15.62% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-32           11.09m ± 2%   12.81m ± 2%  +15.51% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-32          109.8m ± 2%   122.0m ± 2%  +11.07% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-32      110.3m ± 1%   121.0m ± 3%   +9.75% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-32                109.9m ± 2%   124.4m ± 2%  +13.19% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-32          867.3m ± 2%   993.5m ± 6%  +14.55% (p=0.001 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-32      873.4m ± 2%   965.0m ± 2%  +10.48% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-32                867.7m ± 2%   982.0m ± 3%  +13.18% (p=0.000 n=10)
geomean                                                           29.32m        30.92m        +5.49%

                                                             │ bench_main.txt │           bench_branch.txt           │
                                                             │      B/op      │     B/op      vs base                │
ProcessTargets/least-weighted/1000-32                            4.167Mi ± 0%   4.197Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/1000-32                        4.167Mi ± 0%   4.197Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/1000-32                                  4.167Mi ± 0%   4.197Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/least-weighted/10000-32                           41.60Mi ± 0%   41.90Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/10000-32                       41.60Mi ± 0%   41.90Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/10000-32                                 41.60Mi ± 0%   41.90Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/100000-32                                415.9Mi ± 0%   418.9Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/least-weighted/100000-32                          415.9Mi ± 0%   418.9Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/100000-32                      415.9Mi ± 0%   418.9Mi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/least-weighted/800000-32                          3.280Gi ± 0%   3.304Gi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/consistent-hashing/800000-32                      3.280Gi ± 0%   3.304Gi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargets/per-node/800000-32                                3.280Gi ± 0%   3.304Gi ± 0%   +0.73% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-32           4.456Mi ± 0%   5.211Mi ± 0%  +16.95% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-32       4.456Mi ± 0%   5.211Mi ± 0%  +16.95% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-32                 4.456Mi ± 0%   5.211Mi ± 0%  +16.95% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-32      44.51Mi ± 0%   52.07Mi ± 0%  +16.97% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-32                44.51Mi ± 0%   52.07Mi ± 0%  +16.97% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-32          44.51Mi ± 0%   52.07Mi ± 0%  +16.97% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-32         445.1Mi ± 0%   520.7Mi ± 0%  +16.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-32     445.1Mi ± 0%   520.7Mi ± 0%  +16.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-32               445.1Mi ± 0%   520.7Mi ± 0%  +16.98% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-32         3.492Gi ± 0%   4.082Gi ± 0%  +16.90% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-32     3.492Gi ± 0%   4.082Gi ± 0%  +16.90% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-32               3.492Gi ± 0%   4.082Gi ± 0%  +16.90% (p=0.000 n=10)
geomean                                                          128.9Mi        140.0Mi        +8.54%

                                                             │ bench_main.txt │           bench_branch.txt            │
                                                             │   allocs/op    │  allocs/op   vs base                  │
ProcessTargets/least-weighted/1000-32                             3.395k ± 0%   3.395k ± 0%        ~ (p=1.000 n=10)
ProcessTargets/consistent-hashing/1000-32                         3.395k ± 0%   3.395k ± 0%        ~ (p=1.000 n=10) ¹
ProcessTargets/per-node/1000-32                                   3.395k ± 0%   3.395k ± 0%        ~ (p=1.000 n=10) ¹
ProcessTargets/least-weighted/10000-32                            33.77k ± 0%   33.77k ± 0%        ~ (p=0.700 n=10)
ProcessTargets/consistent-hashing/10000-32                        33.77k ± 0%   33.77k ± 0%        ~ (p=0.211 n=10)
ProcessTargets/per-node/10000-32                                  33.77k ± 0%   33.77k ± 0%        ~ (p=1.000 n=10)
ProcessTargets/per-node/100000-32                                 336.5k ± 0%   336.5k ± 0%        ~ (p=0.838 n=10)
ProcessTargets/least-weighted/100000-32                           336.5k ± 0%   336.5k ± 0%        ~ (p=0.084 n=10)
ProcessTargets/consistent-hashing/100000-32                       336.5k ± 0%   336.5k ± 0%        ~ (p=0.206 n=10)
ProcessTargets/least-weighted/800000-32                           2.694M ± 0%   2.695M ± 0%        ~ (p=0.853 n=10)
ProcessTargets/consistent-hashing/800000-32                       2.694M ± 0%   2.694M ± 0%        ~ (p=0.593 n=10)
ProcessTargets/per-node/800000-32                                 2.694M ± 0%   2.694M ± 0%        ~ (p=0.529 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-32            6.388k ± 0%   7.388k ± 0%  +15.65% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-32        6.388k ± 0%   7.388k ± 0%  +15.65% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-32                  6.388k ± 0%   7.388k ± 0%  +15.65% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-32       63.74k ± 0%   73.74k ± 0%  +15.69% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-32                 63.74k ± 0%   73.74k ± 0%  +15.69% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-32           63.74k ± 0%   73.74k ± 0%  +15.69% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-32          636.3k ± 0%   736.3k ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-32      636.3k ± 0%   736.3k ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-32                636.3k ± 0%   736.3k ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-32          5.091M ± 0%   5.891M ± 0%  +15.72% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-32      5.091M ± 0%   5.891M ± 0%  +15.71% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-32                5.091M ± 0%   5.891M ± 0%  +15.71% (p=0.000 n=10)
geomean                                                           138.7k        149.2k        +7.56%

So we have around a 15% increase in CPU usage and memory for targets with relabeling enabled. This is a fair amount, but considering this is fixing a bug and we're going to improve the overall efficiency of the whole system by doing target relabeling in the target allocator in a subsequent change, I think this is acceptable. WDYT @jaronoff97 @pavolloffay ?
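
The benchstat tables also allow a rough per-target reading of the overhead. The sketch below divides the differences in the 1000-target ProcessTargetsWithRelabelConfig rows by the target count. The helper perTarget is hypothetical, and the inputs are the values as rounded by benchstat, so the results are approximate.

```go
package main

import "fmt"

// perTarget divides the difference between two totals by the target count.
func perTarget(before, after, targets float64) float64 {
	return (after - before) / targets
}

func main() {
	const mib = 1024 * 1024
	// Values from the 1000-target relabel rows above (rounded by benchstat).
	extraAllocs := perTarget(6388, 7388, 1000)          // allocs/op: 6.388k -> 7.388k
	extraBytes := perTarget(4.456*mib, 5.211*mib, 1000) // B/op: 4.456Mi -> 5.211Mi
	fmt.Printf("extra allocs per target: %.2f\n", extraAllocs)
	fmt.Printf("extra bytes per target:  %.0f\n", extraBytes)
}
```

Read this way, the change costs about one extra allocation and a little under 800 bytes per relabeled target. The larger target counts in the table show the same ratio.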

@liushui123456

This issue is rendering Target Allocator unusable for us as well. Could you please review this MR or propose a workaround in the meantime? @swiatekm

@swiatekm swiatekm requested a review from a team October 1, 2025 16:31
@swiatekm
Contributor

swiatekm commented Oct 1, 2025

> This issue is rendering Target Allocator unusable for us as well, could you please review this MR or propose a workaround in the meantime? @swiatekm

There is no workaround other than not rewriting the __address__ label. The bug only happens as a result of that rewrite.

For this PR, I've already approved it, but it has a performance impact, so I'd like opinions from my fellow maintainers before merging it. The earliest it might be released is two weeks from now, in 0.137.0, so there isn't any real hurry.

Contributor

@jaronoff97 jaronoff97 left a comment

I think the perf hit is worth the actual functionality fix. We've done a fair amount of work in making the TA more efficient, so I think a minor regression is acceptable.

@swiatekm swiatekm merged commit 5f53647 into open-telemetry:main Oct 7, 2025
49 checks passed