Skip to content

Conversation

@WHOIM1205
Copy link
Contributor

Fix: Handle Tombstone Delete Events in Router Controllers

Description

Router controllers were silently dropping delete events when informers delivered
DeletedFinalStateUnknown tombstones during watch reconnects.
This happens under normal Kubernetes conditions (API server restarts, network blips)
and leads to stale router state.

The issue was caused by using cache.MetaNamespaceKeyFunc, which cannot handle
tombstone objects.


What’s Fixed

  • replaced cache.MetaNamespaceKeyFunc with
    cache.DeletionHandlingMetaNamespaceKeyFunc in all router controller enqueue* handlers.
  • this ensures delete events are correctly enqueued whether they arrive as normal
    objects or tombstones.

Impact

  • prevents silent loss of delete events
  • avoids permanent stale routes and backends after controller restarts
  • no behavior change for normal (non-delete) events
  • improves router correctness and reliability in production clusters

Code Changes

Changes are limited to router controller enqueue paths:

  • pkg/kthena-router/controller/modelroute_controller.go
  • pkg/kthena-router/controller/modelserver_controller.go
    • enqueueModelServer
    • enqueuePod
  • pkg/kthena-router/controller/gateway_controller.go
  • pkg/kthena-router/controller/httproute_controller.go
  • pkg/kthena-router/controller/inferencepool_controller.go

All changes are mechanical, low-risk, and follow standard Kubernetes controller patterns.


Tests Added

Table-driven Go tests were added for each affected enqueue function:

  • TestEnqueueModelRoute
  • TestEnqueueModelServer
  • TestEnqueuePod
  • TestEnqueueGateway
  • TestEnqueueHTTPRoute
  • TestEnqueueInferencePool

Each test validates:

  • Normal object enqueue behavior (unchanged)
  • Tombstone (DeletedFinalStateUnknown) delete handling

all the test cases are passed locally :
image


How to Verify

go test -v ./pkg/kthena-router/controller/ -run "Enqueue"

@gemini-code-assist
Copy link

Summary of Changes

Hello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical reliability issue in the router controllers where delete events, particularly those arriving as Kubernetes tombstones during API server disruptions, were being ignored. By updating the key generation logic in the event enqueueing mechanisms, the system can now properly process these deletion signals, ensuring that the router's state remains consistent and accurate even under transient network or API server issues.

Highlights

  • Critical Bug Fix: Router controllers no longer silently drop delete events (tombstones) during Kubernetes watch reconnects, preventing stale routing states.
  • Enhanced Event Handling: Replaced cache.MetaNamespaceKeyFunc with cache.DeletionHandlingMetaNamespaceKeyFunc across all affected enqueue* handlers to correctly process DeletedFinalStateUnknown objects.
  • Comprehensive Testing: New table-driven Go tests have been added for each modified enqueue function, specifically validating both normal object and tombstone delete event handling.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@WHOIM1205
Copy link
Contributor Author

/assign @hzxuzhonghu

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a great pull request that correctly fixes an important bug where tombstone delete events were being dropped. The change to use cache.DeletionHandlingMetaNamespaceKeyFunc is the right approach. I appreciate the thoroughness in applying this fix across all relevant controllers and adding comprehensive, table-driven unit tests for each change to validate the fix for both normal objects and tombstones. The code is clean and the changes are low-risk. I have one suggestion to improve the maintainability of the new test code.

@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch from 3e6ef56 to 0a67eac Compare January 31, 2026 19:58
@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch 3 times, most recently from d37b3a1 to c4a9340 Compare January 31, 2026 20:20
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch 2 times, most recently from 7b46c65 to a83e3f2 Compare January 31, 2026 20:38
Router controllers were using cache.MetaNamespaceKeyFunc which fails
on cache.DeletedFinalStateUnknown (tombstone) objects delivered during
informer watch reconnection. This caused delete events to be silently
dropped, leaving stale routes in the datastore.

Changed to cache.DeletionHandlingMetaNamespaceKeyFunc in:
- ModelRouteController.enqueueModelRoute
- ModelServerController.enqueueModelServer
- ModelServerController.enqueuePod
- GatewayController.enqueueGateway
- HTTPRouteController.enqueueHTTPRoute
- InferencePoolController.enqueueInferencePool

Added table-driven tests covering both normal objects and tombstones.

Signed-off-by: WHOIM1205 <[email protected]>
@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch from a83e3f2 to 999a776 Compare January 31, 2026 21:04
Copy link
Contributor

@FAUST-BENCHOU FAUST-BENCHOU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets wait for maintainers' review

|------------|------|---------|
| https://ghcr.io/volcano-sh/charts/kthena | networking | 1.0.0 |
| https://ghcr.io/volcano-sh/charts/kthena | workload | 1.0.0 |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? why all instead of partly in
#731 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the table was invalid as a whole (column mismatch) and was breaking gen check
removing it entirely was the minimal and safest fix to unblock cl rather than keeping a partially broken structure

@./test/e2e/setup.sh
@echo "Running E2E tests sequentially..."
@KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list ./... | grep /test/e2e) -v -timeout=15m
@KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list -f '{{if or .TestGoFiles .XTestGoFiles}}{{.ImportPath}}{{end}}' ./... | grep /test/e2e | grep -v '^$$') -v -timeout=15m
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#731 (comment)
? so wdum in this one.why different again in above pr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change only avoids failing cl on packages with (no test files) actual e2e tests are still executed and will fail the build if they fail ,the goal is to avoid false negatives not to relax real test coverage

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}

{{ template "chart.requirementsSection" . }}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need this one anymore or what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this template is still needed the change only affects generated docs to fix invalid markdown it doesn’t remove or deprecate the chart template


func (c *GatewayController) enqueueGateway(obj interface{}) {
key, err := cache.MetaNamespaceKeyFunc(obj)
key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acceptable(since we also get a DeleteFunc in NewHTTPRouteController or else).

controller.registration, _ = httpRouteInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: controller.enqueueHTTPRoute,
UpdateFunc: func(old, new interface{}) { controller.enqueueHTTPRoute(new) },
DeleteFunc: controller.enqueueHTTPRoute,
})

k8s standard FYI:
https://github.com/kubernetes/client-go/blob/f651faf89451a2a3263d06653561101c26675659/examples/workqueue/main.go#L177-L202

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for confirming

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants