Prevent duplicate LH EndpointSlices resources #1757
tpantelis merged 1 commit into submariner-io:devel
Conversation
🤖 Created branch: z_pr1757/tpantelis/duplicate_eps
Force-pushed from 76a7bee to deaf1ad
|
This PR/issue depends on:
- submariner-io/admiral#1101
```go
err := obj.(*ServiceEndpointSliceController).stop(ctx)
if err != nil {
	return errors.Wrapf(err, "failed to stop previous EndpointSlice controller for %q", key)
}
```
In what scenarios can this fail? Is it safe to delete if failed to stop previous? Any cleanup missed?
> In what scenarios can this fail?
If stop times out while awaiting the syncer queue to drain. It may mean it's stuck trying to create an EPS. We propagate the error so it is retried and give the stuck process a chance to complete, successfully or not.
> Is it safe to delete if failed to stop previous?
I'm not clear - safe to delete what?
> Any cleanup missed?
There's no more cleanup here.
```diff
-go c.queue.Run(c.stopCh, c.processNextGateway)
+go c.queue.Run(c.processNextGateway)
```
Would've preferred this to be a different commit, as it's a bit unintuitive why we have a change in the gateway controller for EndpointSlices. Not an issue, just confused me a bit.
Yeah, sorry, this was due to the change in the admiral PR to remove the stopCh param. It was another API change made by that PR.
vthapar left a comment
Looks good, just one question about the changes.
🤖 Closed branches: [z_pr1757/tpantelis/duplicate_eps]
submariner-io/lighthouse#1757 Signed-off-by: Tom Pantelis <tompantelis@gmail.com>
We observed a case in QE where there were two LH EPS resources that were created less than a second apart for the same ClusterIP service. The intention is that there's a single EPS created using the GenerateName functionality. As such, when updating the resource, we don't know the generated name so we look it up by its identifying labels via CreateOrUpdate. This expects that there's only one resource with the identifying labels and fails otherwise.
Since there's a single ServiceEndpointSliceController per service and it has a single-threaded work queue, there should never be two EPS resources. The only possible scenario that I can see is if there were two ServiceEndpointSliceController instances running concurrently in the small window between stopping the current controller and starting a new one when a service is updated. When stopping, we drain the work queue which will wait for all queued items to complete processing so we shouldn't have two instances running concurrently. However, the drain has a 5 sec deadline so, if it times out, there may still be a task being processed. So it's possible there was a current create-or-update operation that was delayed for more than 5 sec and remained running long enough when the new controller instance was started, resulting in both creating an EPS resource. I developed a unit test that reproduces this scenario.
To close this window, we need the previous controller instance to be completely stopped before we start a new instance. So when calling AwaitStopped on the resource syncer, pass in a timed Context and propagate the returned error so the operation is retried.
Depends on submariner-io/admiral#1101