Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions pkg/kthena-router/datastore/model_server.go
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,11 @@ func (m *modelServer) getPrefillPodsForDecodeGroup(pod *PodInfo) []types.Namespa
return nil
}

// Guard against nil Pod - can occur if pod is deleted mid-scheduling cycle
if pod == nil || pod.Pod == nil {
return nil
}
Comment on lines +159 to +161

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This is a good defensive check to prevent a nil pointer panic. Based on the PR description, this race condition is a real possibility. If pod.Pod can indeed be nil while the PodInfo object still exists, there might be other areas in the code that are vulnerable to the same panic.

For instance, in pkg/kthena-router/datastore/store.go:

  • DeletePod calls ms.removePodFromPDGroups(podName, pod.Pod.Labels) which would panic if pod.Pod is nil.
  • updatePodMetrics and updatePodModels also access pod.Pod without a nil check.

To make the fix more comprehensive, consider adding similar guards to these other locations to protect against a nil pod.Pod.

Copy link
Contributor Author

@WHOIM1205 WHOIM1205 Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion
I’ve kept this PR focused on the specific panic reported here but I agree it’s worth reviewing other paths separately


pdGroup := m.modelServer.Spec.WorkloadSelector.PDGroup
pdGroupValue, hasPDGroupKey := pod.Pod.Labels[pdGroup.GroupKey]
if !hasPDGroupKey {
Expand Down
54 changes: 54 additions & 0 deletions pkg/kthena-router/datastore/pdgroup_pods_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -284,3 +284,57 @@ func TestPDGroupPodRemoval(t *testing.T) {
t.Errorf("Expected 0 decode pods after deletion, got %d", len(decodePods))
}
}

// TestGetPrefillPodsForDecodeGroupNilPodInfo verifies that getPrefillPodsForDecodeGroup
// handles nil PodInfo or nil Pod gracefully without panicking.
// This can occur when a pod is deleted mid-scheduling cycle.
func TestGetPrefillPodsForDecodeGroupNilPodInfo(t *testing.T) {
tests := []struct {
name string
podInfo *PodInfo
expectResult []types.NamespacedName
}{
{
name: "nil PodInfo",
podInfo: nil,
expectResult: nil,
},
{
name: "PodInfo with nil Pod",
podInfo: &PodInfo{Pod: nil},
expectResult: nil,
},
}

for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
// Create a modelServer with PDGroup configuration
ms := newModelServer(&aiv1alpha1.ModelServer{
ObjectMeta: metav1.ObjectMeta{
Name: "test-model",
Namespace: "default",
},
Spec: aiv1alpha1.ModelServerSpec{
WorkloadSelector: &aiv1alpha1.WorkloadSelector{
PDGroup: &aiv1alpha1.PDGroup{
GroupKey: "pd-group",
DecodeLabels: map[string]string{
"role": "decode",
},
PrefillLabels: map[string]string{
"role": "prefill",
},
},
},
},
})

// This should not panic
result := ms.getPrefillPodsForDecodeGroup(tc.podInfo)

if len(result) != len(tc.expectResult) {
t.Errorf("Expected %d results, got %d", len(tc.expectResult), len(result))
}
})
}
}
Loading