
Feature Request: Model group support (router mode / config.ini) #18312

@GitEventhandler

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

It would be great to add a router-group parameter for models in router mode. When the router counts loaded models against models-max, models that share the same router group should be treated as a single model.

Motivation

In the current router mode of llama.cpp, the models-max parameter limits how many models can be loaded at the same time. However, in a dual-GPU environment there is a scenario where the system can run either two smaller models simultaneously (one per GPU) or one larger model split across both GPUs. Here is an example configuration:

[gemma3-27b]
model = /path/to/model
tensor-split = 1,1

[qwen3-vl-8b]
model = /path/to/model
tensor-split = 1,0

[ministral3-8b]
model = /path/to/model
tensor-split = 0,1

In this hardware setup, the latter two models can run at the same time, but gemma3-27b needs both GPUs. If models-max is set to 2, loading gemma3-27b prevents any other model from being loaded, even though the limit would still allow one more. Introducing a group ID and counting models that share the same group ID as a single model would resolve this.

Possible Implementation

For example, we could set models-max to 1 and configure the models as follows (any model that does not have a router-group set would be assigned its own independent group ID at load time):

[gemma3-27b]
model = /path/to/model
tensor-split = 1,1
router-group = 0

[qwen3-vl-8b]
model = /path/to/model
tensor-split = 1,0
router-group = 1

[ministral3-8b]
model = /path/to/model
tensor-split = 0,1
router-group = 1

Then the latter two models, which share group 1, would be able to run concurrently, while gemma3-27b would occupy a slot of its own. This configuration would allow more flexible use of multiple GPUs, and it would also be useful for running several smaller models or one larger model on a single GPU.
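To illustrate the idea, below is a minimal C++ sketch of how the router could count "slots" per group rather than per model. The model_entry struct and the count_router_slots function are hypothetical names used only for illustration; they are not part of the existing llama.cpp code.

#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one [section] from config.ini.
struct model_entry {
    std::string name;
    int router_group; // -1 means no router-group was set in config.ini
};

// Count how many "slots" the loaded models occupy against models-max:
// models sharing a router-group count once, each ungrouped model counts on its own.
static size_t count_router_slots(const std::vector<model_entry> & loaded) {
    std::set<int> groups;
    size_t ungrouped = 0;
    for (const auto & m : loaded) {
        if (m.router_group < 0) {
            ungrouped++;                 // gets an implicit independent group
        } else {
            groups.insert(m.router_group);
        }
    }
    return groups.size() + ungrouped;
}

int main() {
    const size_t models_max = 1;

    // qwen3-vl-8b and ministral3-8b share router-group 1, so together they
    // occupy a single slot and can stay loaded at the same time.
    std::vector<model_entry> loaded = {
        { "qwen3-vl-8b",   1 },
        { "ministral3-8b", 1 },
    };

    printf("slots used: %zu / %zu\n", count_router_slots(loaded), models_max);
    // Loading gemma3-27b (group 0) would first require evicting group 1,
    // because that would bring the slot count to 2 > models_max.
    return 0;
}

With this counting rule, models-max keeps its current meaning for ungrouped models, so existing configurations without router-group would behave exactly as they do today.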
