Status: Open
Labels: c/autoscaling/scheduler (Component: autoscaling: k8s scheduler), migrated_to_jira, t/bug (Issue Type: Bug)
Description
Environment
Prod
Steps to reproduce
Unknown — this happened soon after scheduler startup, and after restart it was totally fine.
There was a panic in the Score plugin when it tried to call pkg/plugin/state.(*Node).AddPod(...); presumably the Score plugin was run after the pod had already been added?
Expected result
Probably we should just fail to score, rather than panicking?
Actual result
Scheduler plugin panicked, taking down the entire scheduler, with:
E0410 17:12:36.125777 1 node.go:300] "Observed a panic" panic="cannot add Pod that already exists" stacktrace=<
goroutine 892 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2c15fa0, 0x4317600}, {0x227ebe0, 0x2be3e20})
/go/pkg/mod/k8s.io/apimachinery@v0.31.7/pkg/util/runtime/runtime.go:107 +0xbc
k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x2c15fa0, 0x4317600}, {0x227ebe0, 0x2be3e20}, {0x4317600, 0x0, 0x10000000043aa45?})
/go/pkg/mod/k8s.io/apimachinery@v0.31.7/pkg/util/runtime/runtime.go:82 +0x5e
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc02a400a80?})
/go/pkg/mod/k8s.io/apimachinery@v0.31.7/pkg/util/runtime/runtime.go:59 +0x108
panic({0x227ebe0?, 0x2be3e20?})
/usr/local/go/src/runtime/panic.go:791 +0x132
github.com/neondatabase/autoscaling/pkg/plugin/state.(*Node).AddPod(0xc02a912bd0, {{{0xc02a7f6e40, 0x7}, {0xc02a82a020, 0x20}}, {0xc02a741830, 0x24}, {0x0, 0xedf89f302, 0x42f38e0}, ...})
/workspace/pkg/plugin/state/node.go:300 +0x2cf
github.com/neondatabase/autoscaling/pkg/plugin.(*AutoscaleEnforcer).Score.func2(0xc02a912bd0)
/workspace/pkg/plugin/framework_methods.go:327 +0x78
github.com/neondatabase/autoscaling/pkg/plugin/state.(*Node).Speculatively(0xc01b37d340, 0xc02aea9b30)
/workspace/pkg/plugin/state/node.go:199 +0x234
github.com/neondatabase/autoscaling/pkg/plugin.(*AutoscaleEnforcer).Score(0xc016092738, {0x2c16090?, 0xc02a402e60?}, 0xc029dc6740?, 0xc02a824908, {0xc01c2418c0, 0x2e})
/workspace/pkg/plugin/framework_methods.go:326 +0xb45
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*instrumentedScorePlugin).Score(0xc0157708e0, {0x2c16090, 0xc02a402e60}, 0xc029dc6740, 0xc02a824908, {0xc01c2418c0, 0x2e})
/go/pkg/mod/k8s.io/kubernetes@v1.31.7/pkg/scheduler/framework/runtime/instrumented_plugins.go:82 +0x75
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).runScorePlugin(0x23d1580?, {0x2c16090?, 0xc02a402e60?}, {0x2c054b0?, 0xc0157708e0?}, 0x656b616c2d646567?, 0x377634773635612d?, {0xc01c2418c0?, 0x657461766972702d?})
/go/pkg/mod/k8s.io/kubernetes@v1.31.7/pkg/scheduler/framework/runtime/framework.go:1211 +0x2ed
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunScorePlugins.func2(0x2)
/go/pkg/mod/k8s.io/kubernetes@v1.31.7/pkg/scheduler/framework/runtime/framework.go:1140 +0x3b4
k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until.func1(0x2)
/go/pkg/mod/k8s.io/kubernetes@v1.31.7/pkg/scheduler/framework/parallelize/parallelism.go:60 +0x46
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/go/pkg/mod/k8s.io/client-go@v0.31.7/util/workqueue/parallelizer.go:90 +0xf3
created by k8s.io/client-go/util/workqueue.ParallelizeUntil in goroutine 729
/go/pkg/mod/k8s.io/client-go@v0.31.7/util/workqueue/parallelizer.go:76 +0x1fb
>
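For reference, the change suggested under "Expected result" (fail to score instead of panicking) could look roughly like the sketch below. The types and names here are illustrative stand-ins, not the actual pkg/plugin/state API: the idea is that AddPod returns an error on a duplicate pod, and the Score path propagates that as a failed status instead of crashing the whole scheduler.

```go
package main

import "fmt"

// Node is a hypothetical simplification of pkg/plugin/state.Node: it tracks
// which pod UIDs have already been accounted for on the node.
type Node struct {
	pods map[string]struct{}
}

func NewNode() *Node {
	return &Node{pods: make(map[string]struct{})}
}

// AddPod returns an error instead of panicking when the pod is already
// present, so that speculative callers can surface a failure.
func (n *Node) AddPod(uid string) error {
	if _, ok := n.pods[uid]; ok {
		return fmt.Errorf("cannot add pod %s: already exists on node", uid)
	}
	n.pods[uid] = struct{}{}
	return nil
}

// score mimics the Score plugin's speculative AddPod call: on error it
// reports a failed score rather than taking down the scheduler process.
func score(n *Node, uid string) (int64, error) {
	if err := n.AddPod(uid); err != nil {
		return 0, fmt.Errorf("scoring failed: %w", err)
	}
	return 100, nil // placeholder score value
}

func main() {
	n := NewNode()
	s, err := score(n, "pod-a")
	fmt.Println(s, err) // first add succeeds
	s, err = score(n, "pod-a")
	fmt.Println(s, err) // duplicate add fails gracefully, no panic
}
```

In the real plugin, the error from the Score path would presumably be converted to a failed framework status for that node rather than bubbling up as a panic through Speculatively.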
Other logs, links
This is semi-new code from the scheduler rewrite in #1163, but this was also the first release with the latest Kubernetes upgrade, so it could potentially be #1322?
- Slack thread: https://neondb.slack.com/archives/C03TN5G758R/p1744300621048789
- Logs link (will expire 2025-05-10): https://neonprod.grafana.net/goto/otkmujANR?orgId=1