Implement basic memory estimation in scheduler #106

Conversation
xenoscopic
left a comment
Looking good so far!
Force-pushed fef72d8 to b841913
Force-pushed 45235af to ebfa2c2
Force-pushed f7f6a00 to ea63d60
xenoscopic
left a comment
LGTM overall. I'd like to do a bit of manual testing tomorrow, but I don't see anything that's a major blocker.
if mdlConfig.Quantization == "Q4_0" {
	// TODO(p1-0tr): For now on windows/arm64 stick to the old behaviour, of allowing
	// one model at a time. This WA requires gpuinfo.GetVRAMSize to return 1.
	vram = 1
If there's a risk of other quantizations besides Q4_0 becoming supported on the GPU without our intervention, maybe we should set the VRAM requirement to 1 ubiquitously on windows/arm64 (in the same way that GetVRAMSize returns 1 ubiquitously), regardless of the quantization? Admittedly I need to think this logic through and review a bit more.
Yep, I think returning 1 as the estimated VRAM requirement on win/arm64 makes sense. It will still require patching if other quantisations become supported, though.
// TODO(p1-0tr): improve error handling
vramSize, err := gpuInfo.GetVRAMSize()
if err != nil {
	log.Warnf("Could not read VRAM size: %s", err)
In this case I think we should set vramSize = 1 and similarly override the memory estimation below to match the previous behavior (i.e. 1 per model); otherwise I don't think we'll be able to load any models.
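A rough sketch of that fallback, with an illustrative helper and interface (the real code uses gpuInfo and a different logger, and the return type of GetVRAMSize is an assumption):

package scheduler

import "log"

// vramReader mirrors the shape of gpuInfo.GetVRAMSize from the quoted diff
// (the exact return type is an assumption).
type vramReader interface {
	GetVRAMSize() (uint64, error)
}

// detectVRAM sketches the suggested fallback: if VRAM detection fails, report
// a size of 1 and signal that per-model estimates should also be forced to 1,
// reproducing the previous one-model-at-a-time behaviour.
func detectVRAM(gpuInfo vramReader) (uint64, bool) {
	vramSize, err := gpuInfo.GetVRAMSize()
	if err != nil {
		log.Printf("Could not read VRAM size: %s", err)
		// Fall back to one "slot" of VRAM; the caller would also force each
		// model's estimate to 1, so only one model loads at a time.
		return 1, true
	}
	return vramSize, false
}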
if rc, ok := l.runnerConfigs[runnerKey{backendName, modelID, mode}]; ok {
	runnerConfig = &rc
}
memory, err := backend.GetRequiredMemoryForModel(modelID, runnerConfig)
I'd like to store this so I can see it on docker model ps (GetRunningBackends).
It does get stored in allocations. I can follow up with a PR to add a field in BackendStatus.
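For illustration, that follow-up could look roughly like this; BackendStatus is named in the discussion above, but every field shown here, including the new one, is an assumption:

package inference

// BackendStatus with a hypothetical RequiredMemory field; the existing fields
// are placeholders, not the actual struct from this repository.
type BackendStatus struct {
	BackendName string `json:"backend_name"`
	ModelName   string `json:"model_name"`
	Mode        string `json:"mode"`
	// RequiredMemory would surface the estimate computed by
	// GetRequiredMemoryForModel and stored in the scheduler's allocations,
	// so docker model ps can display it per running backend.
	RequiredMemory uint64 `json:"required_memory,omitempty"`
}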
First-pass implementation of memory estimation logic in the model scheduler. This change relies heavily on gguf-parser-go to calculate the estimated peak memory requirement for running inference with a given model. It adds GetRequiredMemoryForModel() to the Backend interface so that each backend can take its own configuration into account when calculating the required memory.

Signed-off-by: Piotr Stankiewicz <[email protected]>
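A rough sketch of the interface addition described above; the method signature matches the call site in the quoted diff, but the RequiredMemory and RunnerConfig shapes are assumptions, not the merged code:

package inference

// Backend abstracts an inference backend managed by the scheduler; only the
// new method is shown here.
type Backend interface {
	// GetRequiredMemoryForModel returns the estimated peak memory needed to
	// run inference with the given model, taking the backend-specific runner
	// configuration into account. Implementations are expected to derive the
	// estimate from the model's GGUF metadata (e.g. via gguf-parser-go).
	GetRequiredMemoryForModel(modelID string, config *RunnerConfig) (RequiredMemory, error)
}

// RequiredMemory is an assumed shape for the estimate.
type RequiredMemory struct {
	RAM  uint64
	VRAM uint64
}

// RunnerConfig stands in for the per-runner configuration referenced in the
// quoted diff (l.runnerConfigs); its contents are not shown in this PR.
type RunnerConfig struct{}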
Force-pushed ea63d60 to b6d86e5
doringeman
left a comment
LGTM!
totalMemory := uint64(1)
if isGPUEnabledCloudEnvironment {
	totalMemory = 2
// TODO(p1-0tr): improve error handling
// TODO(p1-0tr): improve error handling
Or do you plan something more?
I'm on the fence. On the one hand, the current best-effort logic is neat in a way. But on the other, it would be much cleaner to just fail if we can't read the RAM and VRAM sizes (perhaps with a chicken-switch to disable the memory estimation logic altogether). Anyway, it's something for a follow-up PR :)
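A possible shape for that chicken-switch, sketched with a made-up environment variable name (the PR itself does not define one):

package scheduler

import (
	"fmt"
	"os"
)

// memoryEstimationEnabled is a hypothetical opt-out for the estimation logic;
// the variable name is invented for illustration.
func memoryEstimationEnabled() bool {
	return os.Getenv("MODEL_RUNNER_DISABLE_MEMORY_ESTIMATION") == ""
}

// getVRAMOrFail sketches the stricter alternative: fail fast when the VRAM
// size cannot be read, unless estimation has been switched off entirely.
func getVRAMOrFail(read func() (uint64, error)) (uint64, error) {
	vramSize, err := read()
	if err != nil {
		if memoryEstimationEnabled() {
			return 0, fmt.Errorf("reading VRAM size: %w", err)
		}
		vramSize = 1 // estimation disabled: previous one-model-at-a-time accounting
	}
	return vramSize, nil
}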
xenoscopic
left a comment
LGTM! I ran some basic smoke tests on my Mac. I think we should merge and get it into nightlies ASAP.
if runtime.GOOS == "windows" && runtime.GOARCH == "arm64" {
	// TODO(p1-0tr): For now on windows/arm64 stick to the old behaviour, of allowing
	// one model at a time. This WA requires gpuinfo.GetVRAMSize to return 1.
	vram = 1
}
nit: could just move this (or equivalent) to the top of the method to save a bit of overhead.
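Something like the following early return is probably what the nit amounts to; the package, receiver, types, and return values here are illustrative only:

package llamacpp

import "runtime"

// Placeholder types standing in for the ones discussed in this PR.
type RequiredMemory struct{ RAM, VRAM uint64 }
type RunnerConfig struct{}

type backend struct{}

// GetRequiredMemoryForModel short-circuits at the top of the method on
// windows/arm64, so the GGUF-based estimation below is skipped entirely and
// the old one-model-at-a-time accounting is preserved.
func (b *backend) GetRequiredMemoryForModel(modelID string, cfg *RunnerConfig) (RequiredMemory, error) {
	if runtime.GOOS == "windows" && runtime.GOARCH == "arm64" {
		return RequiredMemory{RAM: 1, VRAM: 1}, nil
	}
	// ... full gguf-parser-go based estimation would follow here ...
	return RequiredMemory{}, nil
}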