Reload defunct runners #68

p1-0tr · 2025-06-05T11:19:51Z

If a runner becomes defunct, e.g. as a result of a backend crash, it would be neat to be able to reload it. So, if the loader finds a runner, have it check whether the runner is still alive, and create a new one if the runner is defunct.

doringeman · 2025-06-05T13:30:03Z

pkg/inference/scheduling/loader.go

+			case <-l.slots[existing].done:
+				l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
+					l.slots[existing].err)
+				l.evictRunner(backendName, model)


Suggested change

-				l.evictRunner(backendName, model)
+				// Reset the reference count to zero so that we can evict the runner and then start a new one.
+				l.references[existing] = 0
+				l.evictRunner(backendName, model)

Makes sense. Though I wonder if it would not be safer to let the reference counting work normally, issue an idle check here, and expand the idle check logic to look for defunct or stale runners. WDYT?

expand the idle check logic to look for defunct or stale runners

I like this!

Although, in this specific case, the code that comes right after the code you're changing will evict all runners (1, currently, but still) if all the slots are full and the one being loaded is defunct and not cleaned up, right?

			// If there's not sufficient memory or all slots are full, then try
			// evicting unused runners.
			if memory > l.availableMemory || len(l.runners) == len(l.slots) {
				l.evict(false)
			}

I'm pretty sure forcing the refcount to 0 does put us at risk of panicking in loader.release. I've opted not to force the refcount to 0, and added logic in evict to remove defunct runners.

I agree that we can't force the refcount to 0 here.

The bigger issue I see with the new logic is that evictRunner in this case might not actually evict if there's a non-zero reference count for the defunct runner (e.g. a client that hasn't realized its backend is defunct yet). The problem is that this code would then continue and override the l.runners entry for runnerKey{backend, model, mode} with a newly created runner, so when that hypothetical outstanding defunct runner is finally released, it will decrement the reference count for the new runner in release (since it uses the same key to look up the slot).

I think what I would do is put a label (say WaitForChange:) just above the last block of code in this loop (grep for "Wait for something to change") and then in the case <-l.slots[existing].done: path, I would goto WaitForChange. Then, in release, add a check for <-runner.done and immediately evict if l.references[slot] == 0. Because realistically any client using a defunct runner will find out quite quickly once the socket connection closes, which means the runner will be release'd quickly, which will call broadcast and break the waiting load call out of its waiting loop.
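The approach described above can be sketched with a simplified model. This is not the code from loader.go: the `loader`, `runner`, `load`, and `release` definitions below are hypothetical stand-ins, a `sync.Cond` stands in for the loader's broadcast mechanism, and a plain retry loop stands in for the `goto WaitForChange` jump. The point it illustrates is that `load` never overwrites the map entry of a defunct runner that is still referenced; it waits, and `release` evicts the defunct runner once its reference count reaches zero.

```go
package main

import (
	"fmt"
	"sync"
)

// runner is a hypothetical stand-in for a backend runner. done is closed
// when the backend process exits.
type runner struct {
	done chan struct{}
}

// defunct reports whether the runner's done channel has been closed.
func (r *runner) defunct() bool {
	select {
	case <-r.done:
		return true
	default:
		return false
	}
}

// loader is a simplified, hypothetical model of the scheduler's loader.
type loader struct {
	mu         sync.Mutex
	cond       *sync.Cond     // stands in for the loader's broadcast mechanism
	runners    map[string]int // runner key -> slot index
	slots      []*runner
	references []int
}

func newLoader(nSlots int) *loader {
	l := &loader{
		runners:    make(map[string]int),
		slots:      make([]*runner, nSlots),
		references: make([]int, nSlots),
	}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// load returns a live runner for key, waiting while a defunct runner with
// outstanding references still occupies the key.
func (l *loader) load(key string) *runner {
	l.mu.Lock()
	defer l.mu.Unlock()
	for {
		if slot, ok := l.runners[key]; ok {
			if !l.slots[slot].defunct() {
				l.references[slot]++
				return l.slots[slot]
			}
			// Defunct: do not overwrite the entry, fall through and wait.
		} else {
			for i, r := range l.slots {
				if r == nil {
					l.slots[i] = &runner{done: make(chan struct{})}
					l.runners[key] = i
					l.references[i] = 1
					return l.slots[i]
				}
			}
		}
		// Wait for something to change (a release), then retry.
		l.cond.Wait()
	}
}

// release drops one reference and evicts the runner immediately if it is
// defunct and no longer referenced.
func (l *loader) release(r *runner) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for key, slot := range l.runners {
		if l.slots[slot] != r {
			continue
		}
		l.references[slot]--
		if l.references[slot] == 0 && r.defunct() {
			delete(l.runners, key)
			l.slots[slot] = nil
		}
		break
	}
	l.cond.Broadcast()
}

func main() {
	l := newLoader(1)
	old := l.load("backend/model")
	close(old.done) // simulate a backend crash while a client holds a reference

	loaded := make(chan *runner)
	go func() { loaded <- l.load("backend/model") }()

	l.release(old) // the client notices the failure and releases
	fresh := <-loaded
	fmt.Println(fresh != old, fresh.defunct()) // prints "true false"
}
```

Because the stale entry stays in the map until the last holder releases it, a late `release` can only ever decrement the defunct runner's own count, which is the hazard the comment above describes.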

Ok, it took me a while to convince myself that run() and load() will play nice with this approach. Should be all good, though :)

Ended up having to add an error handler on the proxy, where we can wait for the runner process to exit. Otherwise we race, and often end up releasing the runner in the gap between it closing its socket and the done channel being closed.

xenoscopic

I like the idea, but I think we'll need a slightly different approach.

xenoscopic · 2025-06-06T17:08:51Z

pkg/inference/scheduling/loader.go

+			case <-l.slots[existing].done:
+				l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
+					l.slots[existing].err)
+				l.evictRunner(backendName, model)


pkg/inference/scheduling/loader.go

-			return l.slots[existing], nil
+			select {
+			case <-l.slots[existing].done:
+				l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, model)


To fix the issue, the model variable should be sanitized before being used in the log entry. Since the logs appear to be plain text, we can remove potentially harmful characters such as newlines (\n, \r) using strings.ReplaceAll. This ensures that the log entry cannot be manipulated by malicious input. The sanitization should be applied directly before the log statement on line 389 in loader.go.
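A minimal sketch of that kind of sanitization, assuming the log output is plain text. The helper name `sanitizeForLog` and the sample model string are hypothetical, and `strings.NewReplacer` is used here as an equivalent of two chained `strings.ReplaceAll` calls.

```go
package main

import (
	"fmt"
	"strings"
)

// logSanitizer removes carriage returns and line feeds so that a
// user-supplied value cannot start a forged log line.
var logSanitizer = strings.NewReplacer("\n", "", "\r", "")

// sanitizeForLog is a hypothetical helper; it is not part of loader.go.
func sanitizeForLog(s string) string {
	return logSanitizer.Replace(s)
}

func main() {
	model := "example/model\n[WARN] forged entry\r" // hypothetical hostile input
	fmt.Printf("%q\n", sanitizeForLog(model))       // prints "example/model[WARN] forged entry"
}
```

A shared helper keeps the stripping rule in one place if more than one log statement prints user-controlled values.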

xenoscopic

Looks good, just two thoughts:

xenoscopic · 2025-06-09T22:59:30Z

pkg/inference/scheduling/loader.go

 		select {
-		case l.idleCheck <- struct{}{}:
+		case <-runner.done:
+			l.evictRunner(runner.backend.Name(), runner.model)


If we want to get REALLY pedantic, we could also add an optional mode specification to evictRunner so that eviction of a defunct runner in one mode doesn't force early eviction of an idle runner in another mode. Most (all?) models aren't used bi-modally though, so maybe it's a non-issue.
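One way to picture the optional mode is sketched below. Everything here is an assumption made for illustration: the real `evictRunner` signature is not shown in this thread, and the `mode` type, the `runnerKey` fields, and the mode names are hypothetical. A nil mode means "evict in every mode", while a non-nil mode restricts eviction to that mode.

```go
package main

import "fmt"

// mode and runnerKey are hypothetical stand-ins for the loader's types.
type mode string

type runnerKey struct {
	backend string
	model   string
	mode    mode
}

// evictRunner removes unreferenced runners for backend/model and returns how
// many were removed. If m is non-nil, only runners in that mode are considered.
func evictRunner(runners map[runnerKey]int, references []int, backend, model string, m *mode) int {
	evicted := 0
	for key, slot := range runners {
		if key.backend != backend || key.model != model {
			continue
		}
		if m != nil && key.mode != *m {
			continue
		}
		if references[slot] != 0 {
			continue // still in use, leave it alone
		}
		delete(runners, key) // deleting during range is safe in Go
		evicted++
	}
	return evicted
}

func main() {
	runners := map[runnerKey]int{
		{backend: "b", model: "m", mode: "completion"}: 0,
		{backend: "b", model: "m", mode: "embedding"}:  1,
	}
	references := []int{0, 0}

	completion := mode("completion")
	fmt.Println(evictRunner(runners, references, "b", "m", &completion)) // prints 1
	fmt.Println(len(runners))                                            // prints 1
}
```

With the mode supplied, the idle runner in the other mode survives the eviction of the defunct one.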

Makes the semantics clearer, I like it :) done.

doringeman

LGTM!
I wonder if it makes sense to also change idleCheckDuration to check immediately if a defunct runner exists.

If a runner becomes defunct, e.g. as a result of a backend crash, it would be neat to be able to reload it. So, if the loader finds a runner, have it check whether the runner is still alive, and create a new one if the runner is defunct.

Signed-off-by: Piotr Stankiewicz <[email protected]>

xenoscopic

Looks great to me!

xenoscopic · 2025-06-10T20:07:49Z

@doringeman I don't know if one defunct runner would indicate a high likelihood of other defunct runners.

doringeman · 2025-06-11T09:46:05Z

@xenoscopic I meant to modify idleCheckDuration so that when it's called randomly, it also verifies whether there's a defunct runner, and if so, returns a maximum value to trigger an immediate check.

xenoscopic · 2025-06-11T19:25:21Z

@doringeman Sorry, I misunderstood. If I'm following correctly now, then I'm definitely in agreement. Instead of "maximum value" though, just 0 (to check immediately), right?

doringeman · 2025-06-12T08:05:16Z

Yeah, 0 to check immediately, sorry.
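The agreed behavior could look roughly like the sketch below. It is not the loader's actual `idleCheckDuration`: the `slot` type, the `runnerIdleTimeout` value, the explicit `now` parameter, and the convention of returning -1 when there is nothing to check are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// runnerIdleTimeout is a hypothetical idle timeout.
const runnerIdleTimeout = 5 * time.Minute

// slot is a hypothetical stand-in for a loaded runner and its bookkeeping.
type slot struct {
	done     chan struct{} // closed when the backend process exits
	refs     int           // outstanding references
	lastUsed time.Time     // time of the last release
}

// idleCheckDuration returns how long to wait before the next idle check:
// 0 if any runner is defunct, otherwise the time until the first unused
// runner times out, or -1 if there is nothing to check.
func idleCheckDuration(slots []*slot, now time.Time) time.Duration {
	next := time.Duration(-1)
	for _, s := range slots {
		if s == nil {
			continue
		}
		select {
		case <-s.done:
			return 0 // defunct runner: check immediately
		default:
		}
		if s.refs != 0 {
			continue // in use, not a candidate for idle eviction
		}
		remaining := runnerIdleTimeout - now.Sub(s.lastUsed)
		if remaining < 0 {
			remaining = 0
		}
		if next < 0 || remaining < next {
			next = remaining
		}
	}
	return next
}

func main() {
	now := time.Now()
	idle := &slot{done: make(chan struct{}), lastUsed: now.Add(-time.Minute)}
	fmt.Println(idleCheckDuration([]*slot{idle}, now)) // prints 4m0s

	crashed := &slot{done: make(chan struct{}), refs: 1}
	close(crashed.done)
	fmt.Println(idleCheckDuration([]*slot{idle, crashed}, now)) // prints 0s
}
```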

doringeman · 2025-06-12T09:57:42Z

#77

p1-0tr requested review from doringeman, ilopezluna and xenoscopic June 5, 2025 11:19

github-advanced-security bot found potential problems Jun 5, 2025

pkg/inference/scheduling/loader.go Fixed

p1-0tr mentioned this pull request Jun 5, 2025

Return error in case of runner crash #69

Merged

doringeman reviewed Jun 5, 2025

pkg/inference/scheduling/loader.go (resolved)

p1-0tr force-pushed the ps-reload-defunct-runners branch from c4243a2 to 8d5a74a Compare June 5, 2025 13:03

doringeman reviewed Jun 5, 2025

p1-0tr force-pushed the ps-reload-defunct-runners branch 2 times, most recently from 869b389 to e69a618 Compare June 6, 2025 11:46

xenoscopic requested changes Jun 6, 2025

p1-0tr force-pushed the ps-reload-defunct-runners branch from e69a618 to e754b5f Compare June 9, 2025 14:06

github-advanced-security bot found potential problems Jun 9, 2025

xenoscopic approved these changes Jun 9, 2025

p1-0tr force-pushed the ps-reload-defunct-runners branch from e754b5f to 60d7e57 Compare June 10, 2025 07:44

doringeman approved these changes Jun 10, 2025

p1-0tr force-pushed the ps-reload-defunct-runners branch from 60d7e57 to e3f59dc Compare June 10, 2025 08:57

p1-0tr requested a review from xenoscopic June 10, 2025 14:06

xenoscopic approved these changes Jun 10, 2025

p1-0tr merged commit e3916bc into main Jun 11, 2025
3 of 4 checks passed

p1-0tr deleted the ps-reload-defunct-runners branch June 11, 2025 07:44

ericcurtin referenced this pull request in ericcurtin/model-runner Sep 21, 2025

Update README.md (#68)

6a458a8

doringeman pushed a commit to doringeman/model-runner that referenced this pull request Oct 2, 2025

Merge pull request docker#68 from crazy-max/ci-fix-missing-pr-event

50a3679

ci: fix missing pull_request event

@@ -12,2 +12,3 @@
 	"github.com/docker/model-runner/pkg/logging"
+	"strings"
 )
@@ -388,3 +389,5 @@
 			case <-l.slots[existing].done:
-				l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, model)
+				sanitizedModel := strings.ReplaceAll(model, "\n", "")
+				sanitizedModel = strings.ReplaceAll(sanitizedModel, "\r", "")
+				l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, sanitizedModel)
 				goto WaitForChange
