Skip to content

Automated backport of #1757: Prevent duplicate LH EndpointSlices resources#1763

Merged
skitt merged 1 commit intosubmariner-io:release-0.20from
tpantelis:automated-backport-of-#1757-upstream-release-0.20
Apr 10, 2025
Merged

Automated backport of #1757: Prevent duplicate LH EndpointSlices resources#1763
skitt merged 1 commit intosubmariner-io:release-0.20from
tpantelis:automated-backport-of-#1757-upstream-release-0.20

Conversation

@tpantelis
Copy link
Contributor

@tpantelis tpantelis commented Apr 10, 2025

Backport of #1757 on release-0.20.

#1757: Prevent duplicate LH EndpointSlices resources

For details on the backport process, see the backport requests page.

Depends on submariner-io/shipyard#1912

We observed a case in QE where there were two LH EPS resources that were
created less than a second apart for the same ClusterIP service. The
intention is that there's a single EPS created using the GenerateName
functionality. As such, when updating the resource, we don't know the
generated name so we look it up by its identifying labels via
CreateOrUpdate. This expects that there's only one resource with the
identifying labels and fails otherwise.

Since there's a single ServiceEndpointSliceController per service and
it has a single-threaded work queue, there should never be two EPS
resources. The only possible scenario that I can see is if there were
two ServiceEndpointSliceController instances running concurrently
in the small window between stopping the current controller and
starting a new one when a service is updated. When stopping, we drain
the work queue which will wait for all queued items to complete
processing so we shouldn't have two instances running concurrently.
However, the drain has a 5 sec deadline so, if it times out, there may
still be a task being processed. So it's possible there was a current
create-or-update operation that was delayed for more than 5 sec and
remained running long enough when the new controller instance was started,
resulting in both creating an EPS resource. I developed a unit test that
reproduces this scenario.

To close this window, we need the previous controller instance to be
completely stopped before we start a new instance. So when calling
AwaitStopped on the resource syncer, pass in a timed Context and
propagate the returned error so the operation is retried.

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
@submariner-bot
Copy link
Contributor

🤖 Created branch: z_pr1763/tpantelis/automated-backport-of-#1757-upstream-release-0.20
🚀 Full E2E won't run until the "ready-to-test" label is applied. I will add it automatically once the PR has 2 approvals, or you can add it manually.

@tpantelis
Copy link
Contributor Author

@skitt Check branch dependencies is failing:

This project depends on github.com/submariner-io/admiral v0.20.1-0.20250409042613-8f51af59e8a8
but github.com/submariner-io/admiral branch devel does not contain commit 8f51af59e8a8

The release-0.20 branch of admiral does contain commit 8f51af59e8a8 but it's looking on the devel branch. In the script it seems ${GITHUB_BASE_REF} is empty. I assume github is supposed to set this when it calls the action.

@skitt
Copy link
Member

skitt commented Apr 10, 2025

@skitt Check branch dependencies is failing:

This project depends on github.com/submariner-io/admiral v0.20.1-0.20250409042613-8f51af59e8a8
but github.com/submariner-io/admiral branch devel does not contain commit 8f51af59e8a8

The release-0.20 branch of admiral does contain commit 8f51af59e8a8 but it's looking on the devel branch. In the script it seems ${GITHUB_BASE_REF} is empty. I assume github is supposed to set this when it calls the action.

It’s set correctly but isn’t being passed down through .dapper! See https://github.com/submariner-io/lighthouse/actions/runs/14383806384/job/40334038266?pr=1767

@skitt skitt requested a review from mkolesnik as a code owner April 10, 2025 15:16
@skitt skitt enabled auto-merge (rebase) April 10, 2025 15:17
@tpantelis
Copy link
Contributor Author

It’s set correctly but isn’t being passed down through .dapper! See https://github.com/submariner-io/lighthouse/actions/runs/14383806384/job/40334038266?pr=1767

How did this ever work....

@skitt
Copy link
Member

skitt commented Apr 10, 2025

It’s set correctly but isn’t being passed down through .dapper! See https://github.com/submariner-io/lighthouse/actions/runs/14383806384/job/40334038266?pr=1767

How did this ever work....

We’ve apparently never needed non-release dependencies in release branches.

@skitt
Copy link
Member

skitt commented Apr 10, 2025

submariner-io/shipyard#1911 will fix this properly (once backported) but I’ve added an equivalent commit here.

@tpantelis
Copy link
Contributor Author

submariner-io/shipyard#1911 will fix this properly (once backported) but I’ve added an equivalent commit here.

Ok - let's wait for shipyard instead of merging the temporary commit. Thanks for looking into this. I'm not sure I would've found that subtle issue 😄

@submariner-bot submariner-bot added the ready-to-test When a PR is ready for full E2E testing label Apr 10, 2025
@skitt skitt force-pushed the automated-backport-of-#1757-upstream-release-0.20 branch from 17c7a6b to d1ba5fd Compare April 10, 2025 16:02
@skitt
Copy link
Member

skitt commented Apr 10, 2025

I'm not sure I would've found that subtle issue 😄

I wonder if we shouldn’t re-evaluate the use of Dapper-style builds — they’re convenient for kickstarting local development but they create a significant maintenance burden with a very low bus factor!

@tpantelis
Copy link
Contributor Author

tpantelis commented Apr 10, 2025

I wonder if we shouldn’t re-evaluate the use of Dapper-style builds — they’re convenient for kickstarting local development but they create a significant maintenance burden with a very low bus factor!

yeah Dapper can be a pain. I sometimes have to bypass Dapper locally when something fails with Dapper but works outside of it.

Or maybe just use it where needed. In this case, the "Check branch dependencies" job shouldn't need to run thru Dapper.

@github-actions
Copy link

@skitt skitt merged commit a8f1ae0 into submariner-io:release-0.20 Apr 10, 2025
50 of 51 checks passed
@submariner-bot
Copy link
Contributor

🤖 Closed branches: [z_pr1763/tpantelis/automated-backport-of-#1757-upstream-release-0.20]

@dfarrell07
Copy link
Member

@tpantelis tpantelis deleted the automated-backport-of-#1757-upstream-release-0.20 branch April 24, 2025 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automated-backport ready-to-test When a PR is ready for full E2E testing release-note-handled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants