Fix nil pointer panic when PD pods are deleted during scheduling #722

WHOIM1205 · 2026-01-30T19:24:43Z

Summary

This PR fixes a nil pointer panic in the PD disaggregated scheduling path that can happen when a pod is deleted while scheduling is in progress.

The change is small and defensive, and it avoids crashing the router in a realistic race scenario.

What I Fixed

While looking at the PDGroup scheduling flow, I noticed that
getPrefillPodsForDecodeGroup assumes the underlying Pod inside PodInfo
is always present.

In practice, this is not always true. If a pod gets deleted while a scheduling
cycle is still running, the PodInfo can still exist but pod.Pod can be nil.
Accessing labels in this case causes a panic.

This PR adds a simple nil check and returns safely instead of crashing.
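
A minimal sketch of the failure mode and the guard, assuming simplified stand-ins for the router's types. `Pod`, `PodInfo`, and `prefillLabels` below are hypothetical and only mirror the shape described above; the real getPrefillPodsForDecodeGroup has a different signature and does more work.

```go
package main

import "fmt"

// Pod and PodInfo are simplified, hypothetical stand-ins for the router's
// datastore types; only the fields needed for the illustration are kept.
type Pod struct {
	Labels map[string]string
}

type PodInfo struct {
	Pod *Pod
}

// prefillLabels is a hypothetical helper that mirrors the guarded access:
// it returns nil instead of dereferencing a missing PodInfo or Pod.
func prefillLabels(info *PodInfo) map[string]string {
	if info == nil || info.Pod == nil {
		return nil // stale entry: skip it rather than panic
	}
	return info.Pod.Labels
}

func main() {
	live := &PodInfo{Pod: &Pod{Labels: map[string]string{"role": "decode"}}}
	stale := &PodInfo{} // the PodInfo exists, but its Pod is nil

	fmt.Println(prefillLabels(live)["role"])
	fmt.Println(prefillLabels(stale) == nil)
	fmt.Println(prefillLabels(nil) == nil)
}
```

Without the guard, evaluating `stale.Pod.Labels` dereferences a nil pointer and panics at runtime, which is the crash this PR is meant to avoid.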

Why This Matters

Pod deletion during scheduling is normal in real clusters:

- scale-downs
- rolling updates
- manual pod deletes

Before this fix, hitting this race would:

- panic the router
- fail all in-flight requests
- require a restart to recover

After this change, the scheduler just skips the stale pod entry and continues
normally.

Code Changes

pkg/kthena-router/datastore/model_server.go
- Added a defensive nil guard in getPrefillPodsForDecodeGroup
pkg/kthena-router/datastore/pdgroup_pods_test.go
- Added a small table-driven test covering nil pod cases

Tests

Added TestGetPrefillPodsForDecodeGroupNilPodInfo
- covers nil PodInfo
- covers PodInfo with a nil Pod
All existing PDGroup tests pass
No behavior change in the normal scheduling path
All test cases pass locally

Signed-off-by: WHOIM1205 <[email protected]>

gemini-code-assist · 2026-01-30T19:24:59Z

Summary of Changes

Hello @WHOIM1205, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical stability issue in the PD disaggregated scheduling path by introducing a defensive nil check. The change prevents a router panic that previously occurred if a pod was deleted during a scheduling cycle, causing a PodInfo to reference a nil Pod. This enhancement significantly improves the robustness of the scheduler, allowing it to gracefully handle stale pod entries and continue operations without crashing, thereby preventing service disruptions.

Highlights

Nil Pointer Panic Fix: Implemented a defensive nil check in the getPrefillPodsForDecodeGroup function within model_server.go to prevent a nil pointer panic that could occur if a pod is deleted mid-scheduling cycle, leading to a PodInfo with a nil Pod.
Enhanced Test Coverage: Added a new table-driven test, TestGetPrefillPodsForDecodeGroupNilPodInfo, to specifically cover scenarios where PodInfo itself is nil or contains a nil Pod, ensuring the fix works as expected and prevents future panics.


gemini-code-assist

Code Review

This pull request addresses a critical nil pointer panic that can occur when a pod is deleted during a scheduling cycle. The fix involves adding a defensive nil check in getPrefillPodsForDecodeGroup to handle cases where the PodInfo or its underlying Pod is nil. This is a good and necessary change to improve the router's stability. A new table-driven test has also been added to verify the fix, covering both nil PodInfo and PodInfo with a nil Pod.

My review includes one suggestion to check for other potentially vulnerable locations in the code that might suffer from the same race condition, to ensure the fix is comprehensive.

gemini-code-assist · 2026-01-30T19:26:04Z

pkg/kthena-router/datastore/model_server.go

+	if pod == nil || pod.Pod == nil {
+		return nil
+	}


This is a good defensive check to prevent a nil pointer panic. Based on the PR description, this race condition is a real possibility. If pod.Pod can indeed be nil while the PodInfo object still exists, there might be other areas in the code that are vulnerable to the same panic.

For instance, in pkg/kthena-router/datastore/store.go:

DeletePod calls ms.removePodFromPDGroups(podName, pod.Pod.Labels) which would panic if pod.Pod is nil.

updatePodMetrics and updatePodModels also access pod.Pod without a nil check.

To make the fix more comprehensive, consider adding similar guards to these other locations to protect against a nil pod.Pod.
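
One way to apply the same guard consistently is a nil-safe accessor, so call sites never dereference pod.Pod directly. The sketch below is illustrative only: `SafeLabels`, `removeFromGroups`, and the `group` label key are hypothetical names, and the types are simplified stand-ins rather than the router's actual definitions.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the router's datastore types.
type Pod struct {
	Labels map[string]string
}

type PodInfo struct {
	Pod *Pod
}

// SafeLabels is a hypothetical accessor. Go permits calling a method on a
// nil pointer receiver, so it is safe for a nil PodInfo and for a nil Pod.
func (p *PodInfo) SafeLabels() map[string]string {
	if p == nil || p.Pod == nil {
		return nil
	}
	return p.Pod.Labels
}

// removeFromGroups stands in for a call site that needs a pod's labels.
// Reading from a nil map is safe in Go and simply reports a missing key.
func removeFromGroups(groups map[string][]string, podName string, labels map[string]string) {
	group, ok := labels["group"] // "group" is an assumed label key
	if !ok {
		return
	}
	kept := make([]string, 0, len(groups[group]))
	for _, name := range groups[group] {
		if name != podName {
			kept = append(kept, name)
		}
	}
	groups[group] = kept
}

func main() {
	groups := map[string][]string{"g1": {"pod-a", "pod-b"}}
	live := &PodInfo{Pod: &Pod{Labels: map[string]string{"group": "g1"}}}
	stale := &PodInfo{} // Pod is nil

	removeFromGroups(groups, "pod-a", live.SafeLabels())
	removeFromGroups(groups, "pod-b", stale.SafeLabels()) // no panic, no change

	fmt.Println(groups["g1"])
}
```

The trade-off is that a stale entry is silently skipped, so a caller that needs to clean up group membership for such a pod would need another source for its labels.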

Thanks for the suggestion.
I’ve kept this PR focused on the specific panic reported here, but I agree it’s worth reviewing other paths separately.

WHOIM1205 · 2026-01-30T19:26:11Z

/assign @hzxuzhonghu

volcano-sh-bot · 2026-01-30T19:51:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: WHOIM1205 <[email protected]>

hzxuzhonghu

@WHOIM1205 From my perspective, podInfo.Pod cannot become nil even when the pod is deleted. It holds a pointer: even if the object is deleted from the store, it cannot be freed until the reference is released.

Can you reproduce a nil case?
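
The pointer-lifetime argument can be illustrated with a small sketch. It assumes a simplified store modeled as a map of `*PodInfo`; `Pod`, `PodInfo`, and `snapshotThenDelete` are hypothetical names, not the router's actual code.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the router's datastore types.
type Pod struct {
	Name string
}

type PodInfo struct {
	Pod *Pod
}

// snapshotThenDelete models a scheduling cycle taking a reference to an
// entry just before a delete event removes that entry from the store.
func snapshotThenDelete(store map[string]*PodInfo, name string) *PodInfo {
	held := store[name]
	delete(store, name)
	return held
}

func main() {
	store := map[string]*PodInfo{
		"pod-a": &PodInfo{Pod: &Pod{Name: "pod-a"}},
	}

	held := snapshotThenDelete(store, "pod-a")

	// The entry is gone from the store, but the held PodInfo and its Pod
	// are still reachable, so the garbage collector keeps them alive.
	_, stillStored := store["pod-a"]
	fmt.Println(stillStored)
	fmt.Println(held.Pod != nil)

	// The field becomes nil only if some code assigns nil to it (or never
	// set it in the first place).
	held.Pod = nil
	fmt.Println(held.Pod == nil)
}
```

Under these assumptions, deleting the store entry alone never produces a nil Pod field; a nil case would require a code path that constructs a PodInfo without a Pod or clears the field explicitly.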

WHOIM1205 · 2026-01-31T09:16:44Z

> @WHOIM1205 From my perspective, podInfo.Pod cannot become nil even when the pod is deleted. It holds a pointer: even if the object is deleted from the store, it cannot be freed until the reference is released.
>
> Can you reproduce a nil case?

The intent here isn’t that the pod object itself is freed while referenced, but that podInfo can outlive the underlying Pod during delete/update handling. In that window, the pod field can be missing even though the podInfo is still being used by the scheduling path.
I haven’t been able to reproduce this reliably in a live cluster yet. The test models this edge case defensively to avoid a hard panic if it occurs.

Fix nil pointer in getPrefillPodsForDecodeGroup

642d348

Signed-off-by: WHOIM1205 <[email protected]>

volcano-sh-bot requested review from LiZhenCheng9527 and git-malu January 30, 2026 19:24

volcano-sh-bot added the size/M label Jan 30, 2026

gemini-code-assist bot reviewed Jan 30, 2026

volcano-sh-bot assigned hzxuzhonghu Jan 30, 2026

gofmt pdgroup pods test

e2b33b3

Signed-off-by: WHOIM1205 <[email protected]>

WHOIM1205 force-pushed the fix/nil-pointer-pdgroup-scheduling branch from c75b8ba to e2b33b3 January 30, 2026 19:52

hzxuzhonghu requested changes Jan 31, 2026