Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
It would be great to add a router-group parameter for models in router mode. When counting loaded models against models-max, models in the same router group should be treated as a single model.
Motivation
In the current router mode of llama.cpp, the models-max parameter limits how many models can be loaded at the same time. However, in a dual-GPU environment the system may be able to run either two smaller models simultaneously or one larger model. Here is an example configuration:
[gemma3-27b]
model = /path/to/model
tensor-split = 1,1
[qwen3-vl-8b]
model = /path/to/model
tensor-split = 1,0
[ministral3-8b]
model = /path/to/model
tensor-split = 0,1
In this hardware setup, the latter two models can run at the same time, but if models-max is set to 2, loading gemma3-27b prevents any other model from being loaded. If we could introduce a group ID and treat all models sharing the same group ID as a single model, this issue could be resolved.
Possible Implementation
For example, set models-max to 1 and configure the models as follows (any model without an explicit router-group would be assigned its own independent group ID at load time):
[gemma3-27b]
model = /path/to/model
tensor-split = 1,1
router-group = 0
[qwen3-vl-8b]
model = /path/to/model
tensor-split = 1,0
router-group = 1
[ministral3-8b]
model = /path/to/model
tensor-split = 0,1
router-group = 1
With this configuration, the latter two models could run concurrently, which would allow more flexible use of multiple GPUs. It would also be useful for running either several smaller models or one larger model on a single GPU.
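As a rough illustration of the grouping idea, the sketch below counts distinct router groups rather than individual models when checking the models-max limit. The names used here (model_entry, router_group, can_load_model) are hypothetical and do not correspond to existing llama.cpp code; this is only a sketch of the proposed behavior, not an actual implementation.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical per-model entry: a model's name and the router group it
// belongs to. Models without an explicit router-group in the config would
// get a unique group ID assigned at load time.
struct model_entry {
    std::string name;
    int router_group;
};

// Return true if loading `candidate` keeps the number of *distinct* router
// groups within `models_max`, instead of counting each model separately.
static bool can_load_model(const std::vector<model_entry> & loaded,
                           const model_entry & candidate,
                           size_t models_max) {
    std::set<int> groups;
    for (const auto & m : loaded) {
        groups.insert(m.router_group);
    }
    groups.insert(candidate.router_group);
    return groups.size() <= models_max;
}
```

Under this scheme, with models-max = 1, qwen3-vl-8b and ministral3-8b (both in group 1) could be loaded together because they count as one group, while gemma3-27b (group 0) would require the other group to be unloaded first.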