Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Unload the model from VRAM after it has been idle for a configurable timeout (e.g. --unload-timeout 300 seconds), and automatically reload it into VRAM when a new request arrives.
Motivation
Freeing up VRAM makes room for other workloads and lets the GPU drop into deeper power-saving states, conserving energy.
Possible Implementation
- It is implemented in ollama: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately
- Old issue closed by bot: Power save mode for server --unload-timeout 120 #4598
- PR with shutdown after timeout, but not restarting: server: Add timeout to stop the server automatically when idling for too long. #10742
- Workaround with proxy: https://github.com/mostlygeek/llama-swap
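
A minimal sketch of the idea, outside of any real server code: a wrapper that lazily loads the model on the first request, refreshes an idle timer on every use, and has a watchdog thread free it after the timeout. The `Model`, `load_model()` and `free_model()` names here are placeholders, not the actual llama.cpp API, and the timeout mirrors the proposed --unload-timeout flag.

```cpp
// Hypothetical idle-unload wrapper; load_model()/free_model() stand in for
// whatever the server actually uses to create and destroy the model/context.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

struct Model { /* placeholder for the real model/context handles */ };

static Model * load_model()          { std::puts("loading model into VRAM");   return new Model{}; }
static void    free_model(Model * m) { std::puts("unloading model from VRAM"); delete m; }

class idle_unloader {
public:
    explicit idle_unloader(std::chrono::seconds timeout) : timeout_(timeout) {
        watchdog_ = std::thread([this] { watch(); });
    }

    ~idle_unloader() {
        stop_ = true;
        watchdog_.join();
        std::lock_guard<std::mutex> lk(mtx_);
        if (model_) free_model(model_);
    }

    // Called on every incoming request: reload lazily and refresh the idle timer.
    Model * acquire() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (!model_) model_ = load_model();
        last_used_ = std::chrono::steady_clock::now();
        return model_;
    }

private:
    void watch() {
        while (!stop_) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            std::lock_guard<std::mutex> lk(mtx_);
            if (model_ && std::chrono::steady_clock::now() - last_used_ > timeout_) {
                free_model(model_);
                model_ = nullptr;
            }
        }
    }

    std::chrono::seconds                  timeout_;
    std::mutex                            mtx_;
    Model *                               model_     = nullptr;
    std::chrono::steady_clock::time_point last_used_ = std::chrono::steady_clock::now();
    std::atomic<bool>                     stop_{false};
    std::thread                           watchdog_;
};
```

A real implementation would additionally have to wait for in-flight requests before unloading and restore any slot/KV-cache state after reloading, which this sketch ignores.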