
Prevent duplicate LH EndpointSlices resources #1757

Merged
tpantelis merged 1 commit into submariner-io:devel from tpantelis:duplicate_eps
Apr 10, 2025

Conversation

@tpantelis
Contributor

@tpantelis tpantelis commented Apr 7, 2025

We observed a case in QE where there were two LH EPS resources that were created less than a second apart for the same ClusterIP service. The intention is that there's a single EPS created using the GenerateName functionality. As such, when updating the resource, we don't know the generated name so we look it up by its identifying labels via CreateOrUpdate. This expects that there's only one resource with the identifying labels and fails otherwise.
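The failure mode described above can be sketched in a few lines of Go. This is a simplified, self-contained illustration of the label-based lookup, not the actual Admiral CreateOrUpdate code; the type, function, and label names are hypothetical:

```go
package main

import "fmt"

// eps stands in for an EndpointSlice; only the labels matter here.
type eps struct {
	name   string
	labels map[string]string
}

// findByLabels returns the single resource matching the identifying
// labels. Zero matches means "create a new one via GenerateName";
// more than one match is an error, which mirrors why two duplicate
// EPS resources make the update path fail.
func findByLabels(all []eps, want map[string]string) (*eps, error) {
	var matches []*eps
	for i := range all {
		ok := true
		for k, v := range want {
			if all[i].labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			matches = append(matches, &all[i])
		}
	}
	switch len(matches) {
	case 0:
		return nil, nil // not found: caller creates with GenerateName
	case 1:
		return matches[0], nil
	default:
		return nil, fmt.Errorf("found %d resources with labels %v, expected at most 1",
			len(matches), want)
	}
}

func main() {
	labels := map[string]string{"source-service": "nginx"} // illustrative label key
	dupes := []eps{
		{name: "nginx-abc12", labels: labels},
		{name: "nginx-def34", labels: labels},
	}
	if _, err := findByLabels(dupes, labels); err != nil {
		fmt.Println("update fails:", err)
	}
}
```

With a single matching resource the lookup succeeds; with the two near-simultaneous duplicates observed in QE, every subsequent update fails.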

Since there's a single ServiceEndpointSliceController per service and it has a single-threaded work queue, there should never be two EPS resources. The only scenario I can see is two ServiceEndpointSliceController instances running concurrently in the small window between stopping the current controller and starting a new one when a service is updated. When stopping, we drain the work queue, which waits for all queued items to finish processing, so we shouldn't have two instances running concurrently. However, the drain has a 5-second deadline, so if it times out, a task may still be in flight. It's therefore possible that an in-progress create-or-update operation was delayed for more than 5 seconds and was still running when the new controller instance started, resulting in both instances creating an EPS resource. I developed a unit test that reproduces this scenario.

To close this window, we need the previous controller instance to be completely stopped before we start a new instance. So when calling AwaitStopped on the resource syncer, pass in a timed Context and propagate the returned error so the operation is retried.

Depends on submariner-io/admiral#1101

@submariner-bot
Contributor

🤖 Created branch: z_pr1757/tpantelis/duplicate_eps
🚀 Full E2E won't run until the "ready-to-test" label is applied. I will add it automatically once the PR has 2 approvals, or you can add it manually.

@tpantelis tpantelis moved this from Todo to In Review in Submariner 0.21 Apr 7, 2025
@tpantelis tpantelis force-pushed the duplicate_eps branch 2 times, most recently from 76a7bee to deaf1ad on April 7, 2025 23:10
@github-actions

github-actions bot commented Apr 8, 2025

This PR/issue depends on:


err := obj.(*ServiceEndpointSliceController).stop(ctx)
if err != nil {
return errors.Wrapf(err, "failed to stop previous EndpointSlice controller for %q", key)
Contributor

In what scenarios can this fail? Is it safe to delete if we failed to stop the previous one? Any cleanup missed?

Contributor Author

> In what scenarios can this fail?

If stop times out while awaiting the syncer queue to drain. It may mean it's stuck trying to create an EPS. We propagate the error so the operation is retried, giving the stuck process a chance to complete, successfully or not.

> Is it safe to delete if we failed to stop the previous one?

I'm not clear: safe to delete what?

> Any cleanup missed?

There's no more cleanup here.

}

-go c.queue.Run(c.stopCh, c.processNextGateway)
+go c.queue.Run(c.processNextGateway)
Contributor

@vthapar vthapar Apr 9, 2025

Would've preferred this to be a separate commit, as it's a bit unintuitive why there's a change in the gateway controller for EndpointSlices. Not an issue, I was just a bit confused by this.

Contributor Author

Yeah, sorry. This was due to the admiral PR removing the stopCh param; it was another API change made by that PR.

Contributor

@vthapar vthapar left a comment

Looks good, just one question about the changes.

@tpantelis tpantelis merged commit 3251cff into submariner-io:devel Apr 10, 2025
23 checks passed
@submariner-bot
Contributor

🤖 Closed branches: [z_pr1757/tpantelis/duplicate_eps]

tpantelis added a commit to tpantelis/submariner-website that referenced this pull request Apr 11, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to submariner-io/submariner-website that referenced this pull request Apr 15, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
@tpantelis tpantelis deleted the duplicate_eps branch April 24, 2025 21:18
tpantelis added a commit to tpantelis/submariner-website that referenced this pull request Jun 12, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to submariner-io/submariner-website that referenced this pull request Jun 12, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to tpantelis/submariner-website that referenced this pull request Aug 14, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
tpantelis added a commit to submariner-io/submariner-website that referenced this pull request Aug 14, 2025
submariner-io/lighthouse#1757

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>