Automated backport of #1757: Prevent duplicate LH EndpointSlices resources (#1763)
Conversation
We observed a case in QE where two LH EPS resources were created less than a second apart for the same ClusterIP service. The intention is that a single EPS is created using the GenerateName functionality. As such, when updating the resource, we don't know the generated name, so we look it up by its identifying labels via CreateOrUpdate. This expects exactly one resource with the identifying labels and fails otherwise.

Since there's a single ServiceEndpointSliceController per service and it has a single-threaded work queue, there should never be two EPS resources. The only possible scenario I can see is two ServiceEndpointSliceController instances running concurrently in the small window between stopping the current controller and starting a new one when a service is updated. When stopping, we drain the work queue, which waits for all queued items to complete processing, so we shouldn't have two instances running concurrently. However, the drain has a 5-second deadline, so if it times out there may still be a task being processed. It's therefore possible that an in-flight create-or-update operation was delayed for more than 5 seconds and was still running when the new controller instance started, resulting in both instances creating an EPS resource. I developed a unit test that reproduces this scenario.

To close this window, the previous controller instance needs to be completely stopped before we start a new one. So when calling AwaitStopped on the resource syncer, pass in a timed Context and propagate the returned error so the operation is retried.

Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
🤖 Created branch: z_pr1763/tpantelis/automated-backport-of-#1757-upstream-release-0.20
@skitt Check branch dependencies is failing: the release-0.20 branch of admiral does contain commit 8f51af59e8a8, but it's looking on the devel branch. In the script it seems
It's set correctly but isn't being passed down through
How did this ever work...
We've apparently never needed non-release dependencies in release branches.
submariner-io/shipyard#1911 will fix this properly (once backported), but I've added an equivalent commit here.
OK, let's wait for shipyard instead of merging the temporary commit. Thanks for looking into this; I'm not sure I would've found that subtle issue 😄
Force-pushed from 17c7a6b to d1ba5fd
I wonder if we shouldn't re-evaluate the use of Dapper-style builds: they're convenient for kickstarting local development, but they create a significant maintenance burden with a very low bus factor!
Yeah, Dapper can be a pain. I sometimes have to bypass it locally when something fails under Dapper but works outside of it. Or maybe just use it where needed; in this case, the "Check branch dependencies" job shouldn't need to run through Dapper.
🤖 Closed branches: [z_pr1763/tpantelis/automated-backport-of-#1757-upstream-release-0.20]
Backport of #1757 on release-0.20.
#1757: Prevent duplicate LH EndpointSlices resources
For details on the backport process, see the backport requests page.
Depends on submariner-io/shipyard#1912