Skip to content

Conversation

@WHOIM1205
Copy link
Contributor

Summary

This PR fixes a nil pointer panic in the PD disaggregated scheduling path that can happen when a pod is deleted while scheduling is in progress

The change is small and defensive and it avoids crashing the router in a very realistic race scenario


What I Fixed

While looking at the PDGroup scheduling flow I noticed that
getPrefillPodsForDecodeGroup assumes the underlying Pod inside PodInfo
is always present

In practice, this is not always true If a pod gets deleted while a scheduling
cycle is still running the PodInfo can still exist but pod.Pod can be nil
Accessing labels in this case causes a panic

This PR adds a simple nil check and safely returns instead of crashing


Why This Matters

Pod deletion during scheduling is normal in real clusters:

  • scale downs
  • rolling updates
  • manual pod deletes

Before this fix, hitting this race would:

  • panic the router
  • fail all in-flight requests
  • require a restart to recover

After this change the scheduler just skips the stale pod entry and continues
normally


Code Changes

  • pkg/kthena-router/datastore/model_server.go
    Added a defensive nil guard in getPrefillPodsForDecodeGroup

  • pkg/kthena-router/datastore/pdgroup_pods_test.go
    Added a small table-driven test covering nil pod cases


Tests

  • Added TestGetPrefillPodsForDecodeGroupNilPodInfo
    • covers nil PodInfo
    • covers PodInfo with a nil Pod
  • All existing PDGroup tests are passing
  • No behavior change in the normal scheduling path
  • all the test cases are passed locally
image

@gemini-code-assist
Copy link

Summary of Changes

Hello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical stability issue in the PD disaggregated scheduling path by introducing a defensive nil check. The change prevents a router panic that previously occurred if a pod was deleted during a scheduling cycle, causing a PodInfo to reference a nil Pod. This enhancement significantly improves the robustness of the scheduler, allowing it to gracefully handle stale pod entries and continue operations without crashing, thereby preventing service disruptions.

Highlights

  • Nil Pointer Panic Fix: Implemented a defensive nil check in the getPrefillPodsForDecodeGroup function within model_server.go to prevent a nil pointer panic that could occur if a pod is deleted mid-scheduling cycle, leading to a PodInfo with a nil Pod.
  • Enhanced Test Coverage: Added a new table-driven test, TestGetPrefillPodsForDecodeGroupNilPodInfo, to specifically cover scenarios where PodInfo itself is nil or contains a nil Pod, ensuring the fix works as expected and prevents future panics.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a critical nil pointer panic that can occur when a pod is deleted during a scheduling cycle. The fix involves adding a defensive nil check in getPrefillPodsForDecodeGroup to handle cases where the PodInfo or its underlying Pod is nil. This is a good and necessary change to improve the router's stability. A new table-driven test has also been added to verify the fix, covering both nil PodInfo and PodInfo with a nil Pod.

My review includes one suggestion to check for other potentially vulnerable locations in the code that might suffer from the same race condition, to ensure the fix is comprehensive.

Comment on lines +159 to +161
if pod == nil || pod.Pod == nil {
return nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This is a good defensive check to prevent a nil pointer panic. Based on the PR description, this race condition is a real possibility. If pod.Pod can indeed be nil while the PodInfo object still exists, there might be other areas in the code that are vulnerable to the same panic.

For instance, in pkg/kthena-router/datastore/store.go:

  • DeletePod calls ms.removePodFromPDGroups(podName, pod.Pod.Labels) which would panic if pod.Pod is nil.
  • updatePodMetrics and updatePodModels also access pod.Pod without a nil check.

To make the fix more comprehensive, consider adding similar guards to these other locations to protect against a nil pod.Pod.

Copy link
Contributor Author

@WHOIM1205 WHOIM1205 Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion
I’ve kept this PR focused on the specific panic reported here but I agree it’s worth reviewing other paths separately

@WHOIM1205
Copy link
Contributor Author

/assign @hzxuzhonghu

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: WHOIM1205 <[email protected]>
@WHOIM1205 WHOIM1205 force-pushed the fix/nil-pointer-pdgroup-scheduling branch from c75b8ba to e2b33b3 Compare January 30, 2026 19:52
Copy link
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WHOIM1205 From my perspective it is not possible even when pod is deleted that podInfo.Pod be nil. It is referencing a pointer, even the object is deleted, it can not be freed unless the reference freed.

Can you reproduce a nil case?

@WHOIM1205
Copy link
Contributor Author

@WHOIM1205 From my perspective it is not possible even when pod is deleted that podInfo.Pod be nil. It is referencing a pointer, even the object is deleted, it can not be freed unless the reference freed.

Can you reproduce a nil case?

The intent here isn’t that the pod object itself is freed while referenced but that podInfo can outlive the underlying Pod during delete/update handling in that window the pod field can be missing even though the podInfo is still being used by the scheduling path
I haven’t been able to reproduce this reliably in a live cluster yet the test models this edge case defensively to avoid a hard panic if it occurs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants