Skip to content

Conversation

@p1-0tr
Copy link

@p1-0tr p1-0tr commented Jun 5, 2025

In case a runner becomes defunct, e.g. as a result of a backend crash it would be neat to be able to reload it. So, if the loader finds runner, have it check if the runner is still alive, and create a new one if the runner is defunct.

@p1-0tr p1-0tr force-pushed the ps-reload-defunct-runners branch from c4243a2 to 8d5a74a Compare June 5, 2025 13:03
case <-l.slots[existing].done:
l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
l.slots[existing].err)
l.evictRunner(backendName, model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
l.evictRunner(backendName, model)
// Reset the reference count to zero so that we can evict the runner and then start a new one.
l.references[existing] = 0
l.evictRunner(backendName, model)

Copy link
Author

@p1-0tr p1-0tr Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Though I wonder if it would not be safer to let the reference counting work normally, issue and idle check here, and expand the idle check logic to look for defunct or stale runners. WDYT?

Copy link
Contributor

@doringeman doringeman Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expand the idle check logic to look for defunct or stale runners

I like this!

Although, in this specific case, this code which comes right after the code you're changing will evict all (1, currently, but still) runners if all the slots are full and the current one that's attempted to be loaded is defunct and not clean up, right?

// If there's not sufficient memory or all slots are full, then try
// evicting unused runners.
if memory > l.availableMemory || len(l.runners) == len(l.slots) {
	l.evict(false)
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure forcing the refcount to 0 does put us at a risk of panicing in loader.release. I've opted not to force the refcount to 0, and added logic in evict to remove defunct runners.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we can't force the refcount to 0 here.

The bigger issue I see with the new logic is that evictRunner in this case might not actually evict if there's a non-zero reference count for the defunct runner (e.g. a client that hasn't realized its backend is defunct yet). The problem is that this code would then continue and override the l.runners entry for runnerKey{backend, model, mode} with a newly created runner, so when that hypothetical outstanding defunct runner is finally released, it will decrement the reference count for the new runner in release (since it uses the same key to look up the slot).

I think what I would do is put a label (say WaitForChange:) just above the last block of code in this loop (grep for "Wait for something to change") and then in the case <-l.slots[existing].done: path, I would goto WaitForChange. Then, in release, add a check for <-runner.done and immediately evict if l.references[slot] == 0. Because realistically any client using a defunct runner will find out quite quickly once the socket connection closes, which means the runner will be release'd quickly, which will call broadcast and break the waiting load call out of its waiting loop.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it took me a while to convince myself that run() and load() will play nice with this approach. Should be all good, though :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up having to add an error handler on the proxy, where we can wait for the runner process to exit. Otherwise we race, and often end up releaseing the runner in the gap between it closing its socket and the done channel being closed.

@p1-0tr p1-0tr force-pushed the ps-reload-defunct-runners branch 2 times, most recently from 869b389 to e69a618 Compare June 6, 2025 11:46
Copy link
Contributor

@xenoscopic xenoscopic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, but I think we'll need a slightly different approach.

case <-l.slots[existing].done:
l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
l.slots[existing].err)
l.evictRunner(backendName, model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we can't force the refcount to 0 here.

The bigger issue I see with the new logic is that evictRunner in this case might not actually evict if there's a non-zero reference count for the defunct runner (e.g. a client that hasn't realized its backend is defunct yet). The problem is that this code would then continue and override the l.runners entry for runnerKey{backend, model, mode} with a newly created runner, so when that hypothetical outstanding defunct runner is finally released, it will decrement the reference count for the new runner in release (since it uses the same key to look up the slot).

I think what I would do is put a label (say WaitForChange:) just above the last block of code in this loop (grep for "Wait for something to change") and then in the case <-l.slots[existing].done: path, I would goto WaitForChange. Then, in release, add a check for <-runner.done and immediately evict if l.references[slot] == 0. Because realistically any client using a defunct runner will find out quite quickly once the socket connection closes, which means the runner will be release'd quickly, which will call broadcast and break the waiting load call out of its waiting loop.

@p1-0tr p1-0tr force-pushed the ps-reload-defunct-runners branch from e69a618 to e754b5f Compare June 9, 2025 14:06
return l.slots[existing], nil
select {
case <-l.slots[existing].done:
l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, model)

Check failure

Code scanning / CodeQL

Log entries created from user input High

This log entry depends on a
user-provided value
.

Copilot Autofix

AI 7 months ago

To fix the issue, the model variable should be sanitized before being used in the log entry. Since the logs appear to be plain text, we can remove potentially harmful characters such as newlines (\n, \r) using strings.ReplaceAll. This ensures that the log entry cannot be manipulated by malicious input. The sanitization should be applied directly before the log statement on line 389 in loader.go.

Suggested changeset 1
pkg/inference/scheduling/loader.go

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/pkg/inference/scheduling/loader.go b/pkg/inference/scheduling/loader.go
--- a/pkg/inference/scheduling/loader.go
+++ b/pkg/inference/scheduling/loader.go
@@ -12,2 +12,3 @@
 	"github.com/docker/model-runner/pkg/logging"
+	"strings"
 )
@@ -388,3 +389,5 @@
 			case <-l.slots[existing].done:
-				l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, model)
+				sanitizedModel := strings.ReplaceAll(model, "\n", "")
+				sanitizedModel = strings.ReplaceAll(sanitizedModel, "\r", "")
+				l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, sanitizedModel)
 				goto WaitForChange
EOF
@@ -12,2 +12,3 @@
"github.com/docker/model-runner/pkg/logging"
"strings"
)
@@ -388,3 +389,5 @@
case <-l.slots[existing].done:
l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, model)
sanitizedModel := strings.ReplaceAll(model, "\n", "")
sanitizedModel = strings.ReplaceAll(sanitizedModel, "\r", "")
l.log.Warnf("%s runner for %s is defunct. Waiting for it to be evicted.", backendName, sanitizedModel)
goto WaitForChange
Copilot is powered by AI and may make mistakes. Always verify output.
Copy link
Contributor

@xenoscopic xenoscopic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just two thoughts:

select {
case l.idleCheck <- struct{}{}:
case <-runner.done:
l.evictRunner(runner.backend.Name(), runner.model)
Copy link
Contributor

@xenoscopic xenoscopic Jun 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to get REALLY pedantic, we could also add an optional mode specification to evictRunner so that eviction of a defunct runner in one mode doesn't force early eviction of an idle runner in another mode. Most (all?) models aren't used bi-modally though, so maybe it's a non-issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes the semantics clearer, I like it :) done.

@p1-0tr p1-0tr force-pushed the ps-reload-defunct-runners branch from e754b5f to 60d7e57 Compare June 10, 2025 07:44
Copy link
Contributor

@doringeman doringeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
I wonder if it makes sense to also change idleCheckDuration to check immediately if a defunct runner exists.

In case a runner becomes defunct, e.g. as a result of a backend crash it
would be neat to be able to reload it. So, if the loader finds runner,
have it check if the runner is still alive, and create a new one if the
runner is defunct.

Signed-off-by: Piotr Stankiewicz <[email protected]>
@p1-0tr p1-0tr force-pushed the ps-reload-defunct-runners branch from 60d7e57 to e3f59dc Compare June 10, 2025 08:57
@p1-0tr p1-0tr requested a review from xenoscopic June 10, 2025 14:06
Copy link
Contributor

@xenoscopic xenoscopic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me!

@xenoscopic
Copy link
Contributor

@doringeman I don't know if one defunct runner would indicate a high likelihood of other defunct runners.

@p1-0tr p1-0tr merged commit e3916bc into main Jun 11, 2025
3 of 4 checks passed
@p1-0tr p1-0tr deleted the ps-reload-defunct-runners branch June 11, 2025 07:44
@doringeman
Copy link
Contributor

@xenoscopic I meant to modify idleCheckDuration so that when it's called randomly, it also verifies whether there's a defunct runner, and if so, returns a maximum value to trigger an immediate check.

@xenoscopic
Copy link
Contributor

@doringeman Sorry, I misunderstood. If I'm following correctly now, then I'm definitely in agreement. Instead of "maximum value" though, just 0 (to check immediately), right?

@doringeman
Copy link
Contributor

Yeah, 0 to check immediately, sorry.

@doringeman
Copy link
Contributor

#77

ericcurtin referenced this pull request in ericcurtin/model-runner Sep 21, 2025
doringeman pushed a commit to doringeman/model-runner that referenced this pull request Oct 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants