Fix nil pointer panic when PD pods are deleted during scheduling #722
Signed-off-by: WHOIM1205 <[email protected]>
Summary of Changes
Hello @WHOIM1205, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical stability issue in the PD disaggregated scheduling path by introducing a defensive nil check. The change prevents a router panic that previously occurred if a pod was deleted during a scheduling cycle.
Code Review
This pull request addresses a critical nil pointer panic that can occur when a pod is deleted during a scheduling cycle. The fix involves adding a defensive nil check in getPrefillPodsForDecodeGroup to handle cases where the PodInfo or its underlying Pod is nil. This is a good and necessary change to improve the router's stability. A new table-driven test has also been added to verify the fix, covering both nil PodInfo and PodInfo with a nil Pod.
My review includes one suggestion to check for other potentially vulnerable locations in the code that might suffer from the same race condition, to ensure the fix is comprehensive.
    if pod == nil || pod.Pod == nil {
        return nil
    }
This is a good defensive check to prevent a nil pointer panic. Based on the PR description, this race condition is a real possibility. If pod.Pod can indeed be nil while the PodInfo object still exists, there might be other areas in the code that are vulnerable to the same panic.
For instance, in pkg/kthena-router/datastore/store.go:
- DeletePod calls ms.removePodFromPDGroups(podName, pod.Pod.Labels), which would panic if pod.Pod is nil.
- updatePodMetrics and updatePodModels also access pod.Pod without a nil check.
To make the fix more comprehensive, consider adding similar guards to these other locations to protect against a nil pod.Pod.
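One way such guards could be shared across call sites is a small nil-safe accessor. The following is only a sketch under stated assumptions: `Pod`, `PodInfo`, and `podLabels` are simplified, hypothetical stand-ins and are not taken from the kthena-router code.

```go
package main

import "fmt"

// Pod and PodInfo are simplified, hypothetical stand-ins for the router's types.
type Pod struct {
	Name   string
	Labels map[string]string
}

type PodInfo struct {
	Pod *Pod
}

// podLabels (hypothetical helper) returns the labels of the wrapped Pod, or
// (nil, false) when either the PodInfo or its Pod is nil, so callers never
// dereference a nil pointer.
func podLabels(info *PodInfo) (map[string]string, bool) {
	if info == nil || info.Pod == nil {
		return nil, false
	}
	return info.Pod.Labels, true
}

func main() {
	entries := []*PodInfo{
		nil,
		{Pod: nil},
		{Pod: &Pod{Name: "decode-0", Labels: map[string]string{"role": "decode"}}},
	}
	for i, e := range entries {
		if labels, ok := podLabels(e); ok {
			fmt.Printf("entry %d: labels=%v\n", i, labels)
		} else {
			fmt.Printf("entry %d: skipped (no underlying Pod)\n", i)
		}
	}
}
```

Routing every label access through one helper like this would keep the nil handling in a single place, though whether that fits the real call sites is a judgment for the maintainers.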
Thanks for the suggestion.
I've kept this PR focused on the specific panic reported here, but I agree it's worth reviewing other paths separately.
/assign @hzxuzhonghu
[APPROVALNOTIFIER] This PR is NOT APPROVED
Signed-off-by: WHOIM1205 <[email protected]>
c75b8ba to e2b33b3
hzxuzhonghu left a comment
@WHOIM1205 From my perspective, podInfo.Pod cannot be nil even when the pod is deleted. It holds a pointer, so even if the object is deleted, it cannot be freed until the reference is released.
Can you reproduce a nil case?
The intent here isn't that the pod object itself is freed while referenced, but that podInfo can outlive the underlying Pod during delete/update handling. In that window, the pod field can be missing even though the podInfo is still being used by the scheduling path.
Summary
This PR fixes a nil pointer panic in the PD disaggregated scheduling path that can happen when a pod is deleted while scheduling is in progress.
The change is small and defensive, and it avoids crashing the router in a very realistic race scenario.
What I Fixed
While looking at the PDGroup scheduling flow, I noticed that getPrefillPodsForDecodeGroup assumes the underlying Pod inside PodInfo is always present.
In practice, this is not always true. If a pod gets deleted while a scheduling cycle is still running, the PodInfo can still exist but pod.Pod can be nil. Accessing labels in this case causes a panic.
This PR adds a simple nil check and safely returns instead of crashing.
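The shape of such a guard can be sketched as follows. This is an illustration only, not the actual body of getPrefillPodsForDecodeGroup: the types, the `groupLabel` key, and the `prefillPodsFor` function are hypothetical and chosen just to show an early return before any label access.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the router's types.
type Pod struct {
	Name   string
	Labels map[string]string
}

type PodInfo struct {
	Pod *Pod
}

// groupLabel is a made-up label key used only for this illustration.
const groupLabel = "example.com/pd-group"

// prefillPodsFor (hypothetical) returns the prefill pod names registered for
// the decode pod's group. It returns nil instead of panicking when the
// PodInfo or its underlying Pod is missing.
func prefillPodsFor(decode *PodInfo, prefillByGroup map[string][]string) []string {
	if decode == nil || decode.Pod == nil {
		return nil // stale entry: nothing to schedule against
	}
	group, ok := decode.Pod.Labels[groupLabel]
	if !ok {
		return nil
	}
	return prefillByGroup[group]
}

func main() {
	prefillByGroup := map[string][]string{"g1": {"prefill-0", "prefill-1"}}

	stale := &PodInfo{Pod: nil}
	live := &PodInfo{Pod: &Pod{Name: "decode-0", Labels: map[string]string{groupLabel: "g1"}}}

	fmt.Println(len(prefillPodsFor(stale, prefillByGroup)))
	fmt.Println(prefillPodsFor(live, prefillByGroup))
}
```

Returning nil lets a caller treat a stale entry the same way as a decode pod with no matching prefill pods, which is one reasonable reading of "safely returns" here.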
Why This Matters
Pod deletion during scheduling is normal in real clusters:
Before this fix, hitting this race would:
After this change, the scheduler just skips the stale pod entry and continues normally.
Code Changes
- pkg/kthena-router/datastore/model_server.go: Added a defensive nil guard in getPrefillPodsForDecodeGroup
- pkg/kthena-router/datastore/pdgroup_pods_test.go: Added a small table-driven test covering nil pod cases
Tests
TestGetPrefillPodsForDecodeGroupNilPodInfo covers:
- PodInfo
- PodInfo with a nil Pod
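A table-driven check over these nil cases might look roughly like the sketch below. It is not the test added in this PR: it is written as a standalone program rather than with the `testing` package so it runs on its own, and `groupOf`, `testCase`, `cases`, and `runCase` are hypothetical names.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the router's types.
type Pod struct {
	Labels map[string]string
}

type PodInfo struct {
	Pod *Pod
}

// groupOf (hypothetical) is the guarded function under test: it returns the
// "group" label, or "" when the PodInfo or its Pod is nil.
func groupOf(info *PodInfo) string {
	if info == nil || info.Pod == nil {
		return ""
	}
	return info.Pod.Labels["group"]
}

type testCase struct {
	name string
	info *PodInfo
	want string
}

// cases builds the table: one row per scenario.
func cases() []testCase {
	return []testCase{
		{name: "nil PodInfo", info: nil, want: ""},
		{name: "PodInfo with nil Pod", info: &PodInfo{Pod: nil}, want: ""},
		{name: "valid Pod", info: &PodInfo{Pod: &Pod{Labels: map[string]string{"group": "g1"}}}, want: "g1"},
	}
}

// runCase returns a failure message, or "" on success. A panic inside the
// function under test is recovered and reported as a failure.
func runCase(tc testCase) (failure string) {
	defer func() {
		if r := recover(); r != nil {
			failure = fmt.Sprintf("%s: panicked: %v", tc.name, r)
		}
	}()
	if got := groupOf(tc.info); got != tc.want {
		return fmt.Sprintf("%s: got %q, want %q", tc.name, got, tc.want)
	}
	return ""
}

func main() {
	for _, tc := range cases() {
		if f := runCase(tc); f != "" {
			fmt.Println("FAIL", f)
			continue
		}
		fmt.Println("ok  ", tc.name)
	}
}
```

In a real `_test.go` file, the same table would typically be iterated with `t.Run(tc.name, ...)` so each row is reported as its own subtest.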