
Conversation


@p1-0tr p1-0tr commented Jul 11, 2025

First-pass implementation of memory estimation logic in the model scheduler. This change relies heavily on gguf-parser-go to calculate the estimated peak memory requirement for running inference with a given model. It adds GetRequiredMemoryForModel() to the Backend interface so that each backend can interpret its own configuration and calculate the required memory from it.

Before merging:

  • Add support for Windows (CUDA)
  • Add support for Linux (CUDA)
  • Estimate RAM usage, especially when GPU offload is not configured
  • Extend loader to keep track of RAM and VRAM (for now one VRAM space)
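
For context, a minimal sketch of what the GetRequiredMemoryForModel() addition could look like; the RequiredMemory and RunnerConfig names are illustrative only, and the actual types in the repository may differ:

// RequiredMemory is an illustrative container for the estimate; the real
// type in the repository may be shaped differently.
type RequiredMemory struct {
	RAM  uint64 // estimated peak bytes of system memory
	VRAM uint64 // estimated peak bytes of GPU memory
}

// RunnerConfig is a placeholder for backend-specific runner options.
type RunnerConfig struct{}

type Backend interface {
	// ... existing methods ...

	// GetRequiredMemoryForModel estimates the peak memory needed to run
	// inference with the given model under the given runner configuration,
	// typically by parsing the model's GGUF metadata via gguf-parser-go.
	GetRequiredMemoryForModel(modelID string, config *RunnerConfig) (*RequiredMemory, error)
}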

Contributor

@xenoscopic xenoscopic left a comment


Looking good so far!

@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch from fef72d8 to b841913 Compare July 14, 2025 13:16
@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch from 45235af to ebfa2c2 Compare July 15, 2025 14:31
@p1-0tr p1-0tr marked this pull request as ready for review July 15, 2025 14:40
@p1-0tr p1-0tr changed the title WiP: Implement basic memory estimation in scheduler Implement basic memory estimation in scheduler Jul 15, 2025
@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch 2 times, most recently from f7f6a00 to ea63d60 Compare July 17, 2025 11:43
@p1-0tr p1-0tr requested a review from xenoscopic July 17, 2025 11:52
Contributor

@xenoscopic xenoscopic left a comment


LGTM overall, I'd like to do a bit of manual testing tomorrow, but I don't see anything that's a major blocker.

if mdlConfig.Quantization == "Q4_0" {
// TODO(p1-0tr): For now on windows/arm64 stick to the old behaviour, of allowing
// one model at a time. This WA requires gpuinfo.GetVRAMSize to return 1.
vram = 1
Contributor


If there's a risk of other quantizations besides Q4_0 becoming supported on the GPU without our intervention, maybe we should set the VRAM requirement to 1 ubiquitously on windows/arm64 (in the same way that GetVRAMSize returns 1 ubiquitously), regardless of the quantization? Admittedly I need to think this logic through a bit and review a bit more.

Author


Yep, I think returning 1 as the estimated VRAM requirement on win/arm64 makes sense. It will still require patching if other quantisations become supported, though.

// TODO(p1-0tr): improve error handling
vramSize, err := gpuInfo.GetVRAMSize()
if err != nil {
log.Warnf("Could not read VRAM size: %s", err)
Contributor


In this case I think we should set vramSize = 1 and similarly override the memory estimation behavior below to match the previous behavior (i.e. 1 per model); otherwise I don't think we'll be able to load any models.
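
One way the suggested fallback could look, extending the quoted hunk; gpuInfo and log come from the existing code, and the exact wiring of the per-model override further down is assumed:

vramSize, err := gpuInfo.GetVRAMSize()
if err != nil {
	log.Warnf("Could not read VRAM size: %s", err)
	// Fall back to the pre-estimation behaviour so models can still load:
	// pretend total VRAM is 1 and, further down, charge 1 per model.
	vramSize = 1
}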

if rc, ok := l.runnerConfigs[runnerKey{backendName, modelID, mode}]; ok {
runnerConfig = &rc
}
memory, err := backend.GetRequiredMemoryForModel(modelID, runnerConfig)
Contributor


I'd like to store this so I can see it on docker model ps (GetRunningBackends).

Author


It does get stored in allocations. I can follow up with a PR to add a field in BackendStatus.
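
A sketch of that follow-up, assuming BackendStatus is the struct that docker model ps renders; the field name and type are illustrative only:

type BackendStatus struct {
	// ... existing fields ...

	// RequiredMemory would expose the estimate recorded in allocations
	// when the runner was scheduled, so it shows up in docker model ps.
	RequiredMemory RequiredMemory
}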

Piotr Stankiewicz added 13 commits July 22, 2025 11:26
First pass implementation of memory estimation logic in model scheduler.
This change heavily relies on gguf-parser-go to calculate estimated peak
memory requirement for running inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface to allow each
backend to deal with its config and calculate required memory usage
based on it.

Signed-off-by: Piotr Stankiewicz <[email protected]>
@p1-0tr p1-0tr force-pushed the ps-memory-estimation branch from ea63d60 to b6d86e5 Compare July 22, 2025 10:02
@p1-0tr p1-0tr requested review from doringeman and xenoscopic July 22, 2025 10:07
Contributor

@doringeman doringeman left a comment


LGTM!

totalMemory := uint64(1)
if isGPUEnabledCloudEnvironment {
totalMemory = 2
// TODO(p1-0tr): improve error handling
Contributor


Suggested change (remove this line):
// TODO(p1-0tr): improve error handling

? Or do you plan something more?

Author


I'm on the fence. On the one hand, the current best-effort logic is neat in a way. But on the other, it would be much cleaner to just fail if we can't read the RAM and VRAM sizes (perhaps with a chicken-switch to disable the memory estimation logic altogether). Anyway, it's something for a follow-up PR :)
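
For illustration, the chicken-switch could be as simple as an environment-variable guard near the top of the estimation path; the variable name and return shape below are purely hypothetical:

// Hypothetical opt-out for the whole estimation path; the name is illustrative.
if os.Getenv("DMR_DISABLE_MEMORY_ESTIMATION") == "1" {
	// Revert to the old accounting: every model costs 1 against a total of 1,
	// i.e. one model at a time.
	return &RequiredMemory{RAM: 1, VRAM: 1}, nil
}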

Co-authored-by: Dorin-Andrei Geman <[email protected]>
Contributor

@xenoscopic xenoscopic left a comment


LGTM! I ran some basic smoke tests on my Mac. I think we should merge and get it into nightlies ASAP.

Comment on lines +267 to +271
if runtime.GOOS == "windows" && runtime.GOARCH == "arm64" {
// TODO(p1-0tr): For now on windows/arm64 stick to the old behaviour, of allowing
// one model at a time. This WA requires gpuinfo.GetVRAMSize to return 1.
vram = 1
}
Contributor


nit: could just move this (or an equivalent check) to the top of the method to save a bit of overhead.
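
A sketch of that nit: hoist the platform check to the top of the method as an early return so the gguf-parser-go work is skipped entirely on windows/arm64; the receiver, signature, and return shape are assumptions here, not the repository's actual code:

func (b *backend) GetRequiredMemoryForModel(modelID string, config *RunnerConfig) (*RequiredMemory, error) {
	if runtime.GOOS == "windows" && runtime.GOARCH == "arm64" {
		// Keep the old one-model-at-a-time behaviour: gpuinfo.GetVRAMSize
		// also returns 1 here, so a requirement of 1 always fills the "GPU".
		return &RequiredMemory{RAM: 1, VRAM: 1}, nil
	}
	// ... full estimation via gguf-parser-go ...
}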

@p1-0tr p1-0tr merged commit fc9b2a7 into main Jul 23, 2025
4 checks passed
@p1-0tr p1-0tr deleted the ps-memory-estimation branch July 23, 2025 11:50
doringeman pushed a commit to doringeman/model-runner that referenced this pull request Oct 2, 2025
Add configure command to support Compose models implementation
