
Conversation


@p1-0tr p1-0tr commented Jul 11, 2025

First-pass implementation of memory estimation logic in the model scheduler. This change relies heavily on gguf-parser-go to calculate the estimated peak memory requirement for running inference with a given model. It adds GetRequiredMemoryForModel() to the Backend interface so that each backend can interpret its own configuration and calculate the required memory from it.

Before merging:

  • Add support for Windows (CUDA)
  • Add support for Linux (CUDA)
  • Estimate RAM usage, especially when GPU offload is not configured
  • Extend loader to keep track of RAM and VRAM (for now one VRAM space)
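
For context, a minimal sketch of what the GetRequiredMemoryForModel() addition could look like; the RequiredMemory and RunnerConfig names are illustrative only, and the actual types in the repository may differ:

// RequiredMemory is an illustrative container for the estimate; the real
// type in the repository may be shaped differently.
type RequiredMemory struct {
	RAM  uint64 // estimated peak bytes of system memory
	VRAM uint64 // estimated peak bytes of GPU memory
}

// RunnerConfig is a placeholder for backend-specific runner options.
type RunnerConfig struct{}

type Backend interface {
	// ... existing methods ...

	// GetRequiredMemoryForModel estimates the peak memory needed to run
	// inference with the given model under the given runner configuration,
	// typically by parsing the model's GGUF metadata via gguf-parser-go.
	GetRequiredMemoryForModel(modelID string, config *RunnerConfig) (*RequiredMemory, error)
}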

Contributor

@xenoscopic xenoscopic left a comment


Looking good so far!

@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch from fef72d8 to b841913 Compare July 14, 2025 13:16
@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch from 45235af to ebfa2c2 Compare July 15, 2025 14:31
@p1-0tr p1-0tr marked this pull request as ready for review July 15, 2025 14:40
@p1-0tr p1-0tr changed the title WiP: Implement basic memory estimation in scheduler Implement basic memory estimation in scheduler Jul 15, 2025
@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch 2 times, most recently from f7f6a00 to ea63d60 Compare July 17, 2025 11:43
@p1-0tr p1-0tr requested a review from xenoscopic July 17, 2025 11:52
Contributor

@xenoscopic xenoscopic left a comment


LGTM overall, I'd like to do a bit of manual testing tomorrow, but I don't see anything that's a major blocker.

if mdlConfig.Quantization == "Q4_0" {
// TODO(p1-0tr): For now on windows/arm64 stick to the old behaviour, of allowing
// one model at a time. This WA requires gpuinfo.GetVRAMSize to return 1.
vram = 1
Contributor


If there's a risk of other quantizations besides Q4_0 becoming supported on the GPU without our intervention, maybe we should set the VRAM requirement to 1 ubiquitously on windows/arm64 (in the same way that GetVRAMSize returns 1 ubiquitously), regardless of the quantization? Admittedly I need to think this logic through a bit and review a bit more.

Author


Yep, I think returning 1 as the estimated VRAM requirement on win/arm64 makes sense. It will still require patching if other quantisations become supported, though.

// TODO(p1-0tr): improve error handling
vramSize, err := gpuInfo.GetVRAMSize()
if err != nil {
log.Warnf("Could not read VRAM size: %s", err)
Contributor


In this case I think we should set vramSize = 1 and similarly override the memory estimation behavior below to match the previous behavior (i.e. 1 per model); otherwise I don't think we'll be able to load any models.
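
One way the suggested fallback could look, extending the quoted hunk; gpuInfo and log come from the existing code, and the exact wiring of the per-model override further down is assumed:

vramSize, err := gpuInfo.GetVRAMSize()
if err != nil {
	log.Warnf("Could not read VRAM size: %s", err)
	// Fall back to the pre-estimation behaviour so models can still load:
	// pretend total VRAM is 1 and, further down, charge 1 per model.
	vramSize = 1
}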

if rc, ok := l.runnerConfigs[runnerKey{backendName, modelID, mode}]; ok {
runnerConfig = &rc
}
memory, err := backend.GetRequiredMemoryForModel(modelID, runnerConfig)
Contributor


I'd like to store this so I can see it on docker model ps (GetRunningBackends).

Author


It does get stored in allocations. I can follow up with a PR to add a field in BackendStatus.
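
A sketch of that follow-up, assuming BackendStatus is the struct that docker model ps renders; the field name and type are illustrative only:

type BackendStatus struct {
	// ... existing fields ...

	// RequiredMemory would expose the estimate recorded in allocations
	// when the runner was scheduled, so it shows up in docker model ps.
	RequiredMemory RequiredMemory
}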

Piotr Stankiewicz added 13 commits July 22, 2025 11:26
First pass implementation of memory estimation logic in model scheduler.
This change heavily relies on gguf-parser-go to calculate estimated peak
memory requirement for running inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface to allow each
backend to deal with its config and calculate required memory usage
based on it.

Signed-off-by: Piotr Stankiewicz <[email protected]>
@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch from ea63d60 to b6d86e5 Compare July 22, 2025 10:02
@p1-0tr p1-0tr requested review from doringeman and xenoscopic July 22, 2025 10:07
Contributor

@doringeman doringeman left a comment


LGTM!

totalMemory := uint64(1)
if isGPUEnabledCloudEnvironment {
totalMemory = 2
// TODO(p1-0tr): improve error handling
Contributor


Suggested change (remove this line):
// TODO(p1-0tr): improve error handling

? Or do you plan something more?

Author


I'm on the fence. On the one hand, the current best-effort logic is neat in a way. But on the other, it would be much cleaner to just fail if we can't read the RAM and VRAM sizes (perhaps with a chicken-switch to disable the memory estimation logic altogether). Anyway, it's something for a follow-up PR :)
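
For illustration, the chicken-switch could be as simple as an environment-variable guard near the top of the estimation path; the variable name and return shape below are purely hypothetical:

// Hypothetical opt-out for the whole estimation path; the name is illustrative.
if os.Getenv("DMR_DISABLE_MEMORY_ESTIMATION") == "1" {
	// Revert to the old accounting: every model costs 1 against a total of 1,
	// i.e. one model at a time.
	return &RequiredMemory{RAM: 1, VRAM: 1}, nil
}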

Co-authored-by: Dorin-Andrei Geman <[email protected]>
Contributor

@xenoscopic xenoscopic left a comment


LGTM! I ran some basic smoke tests on my Mac. I think we should merge and get it into nightlies ASAP.

Comment on lines +267 to +271
if runtime.GOOS == "windows" && runtime.GOARCH == "arm64" {
// TODO(p1-0tr): For now on windows/arm64 stick to the old behaviour, of allowing
// one model at a time. This WA requires gpuinfo.GetVRAMSize to return 1.
vram = 1
}
Contributor


nit: could just move this (or an equivalent check) to the top of the method to save a bit of overhead.
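
A sketch of that nit: hoist the platform check to the top of the method as an early return so the gguf-parser-go work is skipped entirely on windows/arm64; the receiver, signature, and return shape are assumptions here, not the repository's actual code:

func (b *backend) GetRequiredMemoryForModel(modelID string, config *RunnerConfig) (*RequiredMemory, error) {
	if runtime.GOOS == "windows" && runtime.GOARCH == "arm64" {
		// Keep the old one-model-at-a-time behaviour: gpuinfo.GetVRAMSize
		// also returns 1 here, so a requirement of 1 always fills the "GPU".
		return &RequiredMemory{RAM: 1, VRAM: 1}, nil
	}
	// ... full estimation via gguf-parser-go ...
}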

@p1-0tr p1-0tr merged commit fc9b2a7 into main Jul 23, 2025
4 checks passed
@p1-0tr p1-0tr deleted the ps-memory-estimation branch July 23, 2025 11:50
doringeman pushed a commit to doringeman/model-runner that referenced this pull request Oct 2, 2025
Add configure command to support Compose models implementation
