-
Notifications
You must be signed in to change notification settings - Fork 366
fix: full sync during restart results in loss of dataplane traffic #2489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a bug where full sync during controller restart could result in data plane traffic loss. The issue occurred when the controller performed full sync before all CRD resources were processed, potentially sending incomplete configuration to the data plane.
- Implements a readiness manager to track when all relevant resources have been reconciled at least once
- Integrates readiness tracking across all controllers to mark resources as processed
- Delays full sync operations until initial reconciliation is complete
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/manager/readiness/manager.go | New readiness manager implementation to track resource processing completion |
| internal/manager/run.go | Integration of readiness manager with provider initialization and controller setup |
| internal/manager/controllers.go | Registration of readiness GVKs and integration with all controllers |
| internal/provider/adc/adc.go | Modified to wait for readiness before starting sync operations |
| internal/provider/provider.go | Added NeedLeaderElection interface method |
| internal/types/k8s.go | Added GvkOf function and renamed v1 import to netv1 to avoid conflicts |
| internal/controller/indexer/indexer.go | Added ControllerName index and reordered indexer setup |
| test/e2e/gatewayapi/httproute.go | Added E2E test for HTTPRoute sync during startup |
| test/e2e/crds/consumer.go | Added E2E test for Consumer sync during startup |
| test/e2e/apisix/route.go | Added E2E test for ApisixRoute sync during startup |
Comments suppressed due to low confidence (2)
test/e2e/gatewayapi/httproute.go:945
- The test description mentions 'ApisixRoute' but this is testing HTTPRoute. The description should be 'Should sync HTTPRoute during startup' to match the actual test behavior.
Eventually(func() string {
test/e2e/gatewayapi/httproute.go:946
- The test step description mentions 'ApisixRoute' but this is testing HTTPRoute. The description should be 'apply HTTPRoute' to match the actual resource being applied.
_, err := s.GetResourceYaml("HTTPRoutePolicy", "http-route-policy-0")
| // Reconcile FIXME: implement the reconcile logic (For now, it dose nothing other than directly accepting) | ||
| func (r *ApisixConsumerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { | ||
| r.Log.Info("reconcile", "request", req.NamespacedName) | ||
| r.Readier.Done(&apiv2.ApisixConsumer{}, req.NamespacedName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more reasonable to execute the packaging within defer func
ronething
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
others lgtm.
Type of change:
What this PR does / why we need it:
background
The controller watches CRDs and incrementally builds an in-memory cache of dataplane resources. It periodically performs a full (asynchronous) sync from this cache to the dataplane, without waiting for the cache to be fully populated.
On startup, if the number of CRDs is large, the first full sync may happen before all resources are processed, resulting in partial configuration being pushed to the dataplane. This can cause temporary 404 errors and traffic disruption.
Reproduction Steps:
Apply 5,000 ApisixRoute to the kubernetes cluster.
Restart the Ingress controller.
While the Ingress controller is starting, attempt to access all routes via the data plane. Some routes will return 404 Not Found errors.
Pre-submission checklist: