
Prevent duplicate LH EndpointSlices resources #1757

Merged
tpantelis merged 1 commit into submariner-io:devel from tpantelis:duplicate_eps
Apr 10, 2025

Conversation

@tpantelis
Contributor

@tpantelis tpantelis commented Apr 7, 2025

We observed a case in QE where there were two LH EPS resources that were created less than a second apart for the same ClusterIP service. The intention is that there's a single EPS created using the GenerateName functionality. As such, when updating the resource, we don't know the generated name so we look it up by its identifying labels via CreateOrUpdate. This expects that there's only one resource with the identifying labels and fails otherwise.
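The failure mode described above can be sketched in a few lines of Go. This is a simplified, self-contained illustration of the label-based lookup, not the actual Admiral CreateOrUpdate code; the type, function, and label names are hypothetical:

```go
package main

import "fmt"

// eps stands in for an EndpointSlice; only the labels matter here.
type eps struct {
	name   string
	labels map[string]string
}

// findByLabels returns the single resource matching the identifying
// labels. Zero matches means "create a new one via GenerateName";
// more than one match is an error, which mirrors why two duplicate
// EPS resources make the update path fail.
func findByLabels(all []eps, want map[string]string) (*eps, error) {
	var matches []*eps
	for i := range all {
		ok := true
		for k, v := range want {
			if all[i].labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			matches = append(matches, &all[i])
		}
	}
	switch len(matches) {
	case 0:
		return nil, nil // not found: caller creates with GenerateName
	case 1:
		return matches[0], nil
	default:
		return nil, fmt.Errorf("found %d resources with labels %v, expected at most 1",
			len(matches), want)
	}
}

func main() {
	labels := map[string]string{"source-service": "nginx"} // illustrative label key
	dupes := []eps{
		{name: "nginx-abc12", labels: labels},
		{name: "nginx-def34", labels: labels},
	}
	if _, err := findByLabels(dupes, labels); err != nil {
		fmt.Println("update fails:", err)
	}
}
```

With a single matching resource the lookup succeeds; with the two near-simultaneous duplicates observed in QE, every subsequent update fails.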

Since there's a single ServiceEndpointSliceController per service and it has a single-threaded work queue, there should never be two EPS resources. The only scenario I can see is two ServiceEndpointSliceController instances running concurrently in the small window between stopping the current controller and starting a new one when a service is updated. When stopping, we drain the work queue, which waits for all queued items to finish processing, so we shouldn't have two instances running concurrently. However, the drain has a 5-second deadline, so if it times out, a task may still be in flight. It's therefore possible that an in-progress create-or-update operation was delayed for more than 5 seconds and was still running when the new controller instance started, resulting in both instances creating an EPS resource. I developed a unit test that reproduces this scenario.

To close this window, we need the previous controller instance to be completely stopped before we start a new instance. So when calling AwaitStopped on the resource syncer, pass in a timed Context and propagate the returned error so the operation is retried.

Depends on submariner-io/admiral#1101

@submariner-bot
Contributor

🤖 Created branch: z_pr1757/tpantelis/duplicate_eps
🚀 Full E2E won't run until the "ready-to-test" label is applied. I will add it automatically once the PR has 2 approvals, or you can add it manually.

@tpantelis tpantelis moved this from Todo to In Review in Submariner 0.21 Apr 7, 2025
@tpantelis tpantelis force-pushed the duplicate_eps branch 2 times, most recently from 76a7bee to deaf1ad on April 7, 2025 23:10
@github-actions

github-actions bot commented Apr 8, 2025

This PR/issue depends on:


err := obj.(*ServiceEndpointSliceController).stop(ctx)
if err != nil {
return errors.Wrapf(err, "failed to stop previous EndpointSlice controller for %q", key)
Contributor

In what scenarios can this fail? Is it safe to delete if we failed to stop the previous one? Any cleanup missed?

Contributor Author

> In what scenarios can this fail?

If stop times out while awaiting the syncer queue to drain. It may mean it's stuck trying to create an EPS. We propagate the error so the operation is retried, giving the stuck process a chance to complete, successfully or not.

> Is it safe to delete if we failed to stop the previous one?

I'm not clear: safe to delete what?

> Any cleanup missed?

There's no more cleanup here.

}

-go c.queue.Run(c.stopCh, c.processNextGateway)
+go c.queue.Run(c.processNextGateway)
Contributor

@vthapar vthapar Apr 9, 2025

Would've preferred this to be a separate commit, as it's a bit unintuitive why there's a change in the gateway controller for EndpointSlices. Not an issue, I was just a bit confused by this.

Contributor Author

Yeah, sorry. This was due to the admiral PR removing the stopCh param; it was another API change made by that PR.

Contributor

@vthapar vthapar left a comment

Looks good, just one question about the changes.

@tpantelis tpantelis merged commit 3251cff into submariner-io:devel Apr 10, 2025
23 checks passed
@submariner-bot
Contributor

🤖 Closed branches: [z_pr1757/tpantelis/duplicate_eps]

tpantelis added a commit to tpantelis/submariner-website that referenced this pull request Apr 11, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to submariner-io/submariner-website that referenced this pull request Apr 15, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
@tpantelis tpantelis deleted the duplicate_eps branch April 24, 2025 21:18
tpantelis added a commit to tpantelis/submariner-website that referenced this pull request Jun 12, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to submariner-io/submariner-website that referenced this pull request Jun 12, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to tpantelis/submariner-website that referenced this pull request Aug 14, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to submariner-io/submariner-website that referenced this pull request Aug 14, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>