-
Notifications
You must be signed in to change notification settings - Fork 51
Fix router controllers dropping tombstone delete events #730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical reliability issue in the router controllers where delete events, particularly those arriving as Kubernetes tombstones during API server disruptions, were being ignored. By updating the key generation logic in the event enqueueing mechanisms, the system can now properly process these deletion signals, ensuring that the router's state remains consistent and accurate even under transient network or API server issues. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/assign @hzxuzhonghu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This is a great pull request that correctly fixes an important bug where tombstone delete events were being dropped. The change to use cache.DeletionHandlingMetaNamespaceKeyFunc is the right approach. I appreciate the thoroughness in applying this fix across all relevant controllers and adding comprehensive, table-driven unit tests for each change to validate the fix for both normal objects and tombstones. The code is clean and the changes are low-risk. I have one suggestion to improve the maintainability of the new test code.
3e6ef56 to
0a67eac
Compare
d37b3a1 to
c4a9340
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
7b46c65 to
a83e3f2
Compare
Router controllers were using cache.MetaNamespaceKeyFunc which fails on cache.DeletedFinalStateUnknown (tombstone) objects delivered during informer watch reconnection. This caused delete events to be silently dropped, leaving stale routes in the datastore. Changed to cache.DeletionHandlingMetaNamespaceKeyFunc in: - ModelRouteController.enqueueModelRoute - ModelServerController.enqueueModelServer - ModelServerController.enqueuePod - GatewayController.enqueueGateway - HTTPRouteController.enqueueHTTPRoute - InferencePoolController.enqueueInferencePool Added table-driven tests covering both normal objects and tombstones. Signed-off-by: WHOIM1205 <[email protected]>
a83e3f2 to
999a776
Compare
FAUST-BENCHOU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets wait for maintainers' review
| |------------|------|---------| | ||
| | https://ghcr.io/volcano-sh/charts/kthena | networking | 1.0.0 | | ||
| | https://ghcr.io/volcano-sh/charts/kthena | workload | 1.0.0 | | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
? why all instead of partly in
#731 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the table was invalid as a whole (column mismatch) and was breaking gen check
removing it entirely was the minimal and safest fix to unblock cl rather than keeping a partially broken structure
| @./test/e2e/setup.sh | ||
| @echo "Running E2E tests sequentially..." | ||
| @KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list ./... | grep /test/e2e) -v -timeout=15m | ||
| @KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list -f '{{if or .TestGoFiles .XTestGoFiles}}{{.ImportPath}}{{end}}' ./... | grep /test/e2e | grep -v '^$$') -v -timeout=15m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#731 (comment)
? so wdum in this one.why different again in above pr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change only avoids failing cl on packages with (no test files) actual e2e tests are still executed and will fail the build if they fail ,the goal is to avoid false negatives not to relax real test coverage
| {{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }} | ||
|
|
||
| {{ template "chart.requirementsSection" . }} | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont need this one anymore or what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this template is still needed the change only affects generated docs to fix invalid markdown it doesn’t remove or deprecate the chart template
|
|
||
| func (c *GatewayController) enqueueGateway(obj interface{}) { | ||
| key, err := cache.MetaNamespaceKeyFunc(obj) | ||
| key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
acceptable(since we also get a DeleteFunc in NewHTTPRouteController or else).
kthena/pkg/kthena-router/controller/httproute_controller.go
Lines 60 to 64 in a8c9193
| controller.registration, _ = httpRouteInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ | |
| AddFunc: controller.enqueueHTTPRoute, | |
| UpdateFunc: func(old, new interface{}) { controller.enqueueHTTPRoute(new) }, | |
| DeleteFunc: controller.enqueueHTTPRoute, | |
| }) |
k8s standard FYI:
https://github.com/kubernetes/client-go/blob/f651faf89451a2a3263d06653561101c26675659/examples/workqueue/main.go#L177-L202
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for confirming
Fix: Handle Tombstone Delete Events in Router Controllers
Description
Router controllers were silently dropping delete events when informers delivered
DeletedFinalStateUnknowntombstones during watch reconnects.This happens under normal Kubernetes conditions (API server restarts, network blips)
and leads to stale router state.
The issue was caused by using
cache.MetaNamespaceKeyFunc, which cannot handletombstone objects.
What’s Fixed
cache.MetaNamespaceKeyFuncwithcache.DeletionHandlingMetaNamespaceKeyFuncin all router controllerenqueue*handlers.objects or tombstones.
Impact
Code Changes
Changes are limited to router controller enqueue paths:
pkg/kthena-router/controller/modelroute_controller.gopkg/kthena-router/controller/modelserver_controller.goenqueueModelServerenqueuePodpkg/kthena-router/controller/gateway_controller.gopkg/kthena-router/controller/httproute_controller.gopkg/kthena-router/controller/inferencepool_controller.goAll changes are mechanical, low-risk, and follow standard Kubernetes controller patterns.
Tests Added
Table-driven Go tests were added for each affected enqueue function:
TestEnqueueModelRouteTestEnqueueModelServerTestEnqueuePodTestEnqueueGatewayTestEnqueueHTTPRouteTestEnqueueInferencePoolEach test validates:
DeletedFinalStateUnknown) delete handlingall the test cases are passed locally :

How to Verify