The mistralrs-server supports loading and serving multiple models simultaneously, allowing you to switch between different models in the same server instance.
```bash
# Traditional usage - loads one model
mistralrs-server --port 1234 plain -m meta-llama/Llama-3.2-3B-Instruct
```

```bash
# Load multiple models from a configuration file
mistralrs-server --port 1234 multi-model --config config.json --default-model-id meta-llama/Llama-3.2-3B-Instruct
```

Create a JSON file with model configurations as object keys:
```json
{
  "llama3-3b": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "qwen3-4b": {
    "Plain": {
      "model_id": "Qwen/Qwen3-4B"
    },
    "in_situ_quant": "Q4K"
  }
}
```

- Object keys (e.g., `"llama3-3b"`, `"qwen3-4b"`): Organizational labels (for human readability)
- Actual API identifiers: Derived automatically from the model path (e.g., `"meta-llama/Llama-3.2-3B-Instruct"`)
- Model specification: The model type and configuration (same format as the CLI subcommands)
- Optional fields:
  - `chat_template`: Custom chat template
  - `jinja_explicit`: JINJA template file
  - `num_device_layers`: Device layer configuration
  - `in_situ_quant`: In-situ quantization setting
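For instance, a hypothetical entry combining a model spec with optional fields might look like the following; the `chat_template` path is a placeholder for illustration, not a file shipped with this project:

```json
{
  "llama3-3b-custom": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    },
    "chat_template": "templates/llama3.json",
    "in_situ_quant": "Q4K"
  }
}
```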
How API identifiers work:
- ✅ Object keys are organizational only (for config readability)
- ✅ Real API identifiers are derived from the `model_id` field inside the model spec
- ✅ Use the full model path in API requests (e.g., `"meta-llama/Llama-3.2-3B-Instruct"`)
- ✅ No naming conflicts - each model has its canonical identifier
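To make the key-versus-identifier distinction concrete, here is a small sketch (not server code) that derives API identifiers from a config dict the way described above, by reading the `model_id` inside each model spec:

```python
import json

# Example multi-model config, as shown above.
config_text = """
{
  "llama3-3b": {"Plain": {"model_id": "meta-llama/Llama-3.2-3B-Instruct"}},
  "qwen3-4b": {"Plain": {"model_id": "Qwen/Qwen3-4B"}, "in_situ_quant": "Q4K"}
}
"""

def api_identifiers(config: dict) -> dict:
    """Map each organizational key to the API identifier derived from model_id."""
    ids = {}
    for key, entry in config.items():
        # The model spec is the nested dict (e.g. under "Plain") that
        # carries model_id; sibling values like "in_situ_quant" are skipped.
        for spec in entry.values():
            if isinstance(spec, dict) and "model_id" in spec:
                ids[key] = spec["model_id"]
    return ids

print(api_identifiers(json.loads(config_text)))
# {'llama3-3b': 'meta-llama/Llama-3.2-3B-Instruct', 'qwen3-4b': 'Qwen/Qwen3-4B'}
```

Note that the keys `"llama3-3b"` and `"qwen3-4b"` never appear in the output: only the `model_id` values are usable in API requests.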
Use the `model` field in your requests to specify which model to use:
```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

- Explicit model: Use the full pipeline name (e.g., `"meta-llama/Llama-3.2-3B-Instruct"`)
- Default model: Use `"default"` to explicitly request the default model
- Auto-fallback: If the `model` field is omitted entirely, the default model is used
```bash
# Use default model explicitly
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

The default model is either:
- The model specified with `--default-model-id` when starting the server
- The first model loaded (if no default is explicitly set)
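The selection rules above can be summarized as a small resolution function; this is an illustrative sketch of the documented behavior, not the server's actual code:

```python
def resolve_model(requested, default_id, loaded_ids):
    """Resolve a request's `model` field to a loaded model identifier.

    requested:  value of the request's `model` field, or None if omitted.
    default_id: the server's default model (from --default-model-id,
                or the first model loaded if none was given).
    loaded_ids: identifiers of all loaded models.
    """
    if requested is None or requested == "default":
        return default_id  # auto-fallback / explicit "default"
    if requested in loaded_ids:
        return requested   # explicit full model path
    raise ValueError(f"unknown model: {requested}")

loaded = ["meta-llama/Llama-3.2-3B-Instruct", "Qwen/Qwen3-4B"]
default = loaded[0]  # first model loaded; no --default-model-id given

print(resolve_model(None, default, loaded))             # meta-llama/Llama-3.2-3B-Instruct
print(resolve_model("default", default, loaded))        # meta-llama/Llama-3.2-3B-Instruct
print(resolve_model("Qwen/Qwen3-4B", default, loaded))  # Qwen/Qwen3-4B
```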
```bash
curl http://localhost:1234/v1/models
```

Returns:

```json
{
  "object": "list",
  "data": [
    {
      "id": "default",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "Qwen/Qwen3-4B",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    }
  ]
}
```

Note: The `"default"` model is always listed first and represents the server's default model.
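A client can discover the usable identifiers by parsing that response. A minimal sketch, assuming the response shape shown above:

```python
import json

# The /v1/models response shown above, inlined for illustration.
response = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "default", "object": "model", "created": 1234567890, "owned_by": "local"},
    {"id": "meta-llama/Llama-3.2-3B-Instruct", "object": "model", "created": 1234567890, "owned_by": "local"},
    {"id": "Qwen/Qwen3-4B", "object": "model", "created": 1234567890, "owned_by": "local"}
  ]
}
""")

# "default" is always listed first; the rest are canonical model identifiers.
ids = [m["id"] for m in response["data"]]
assert ids[0] == "default"
print(ids[1:])  # ['meta-llama/Llama-3.2-3B-Instruct', 'Qwen/Qwen3-4B']
```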
Use the `multi-model` subcommand with these options:
- `--config <PATH>` (required): Path to the JSON configuration file
- `--default-model-id <ID>` (optional): Default model ID for requests that don't specify a model
New syntax:

```bash
mistralrs-server [GLOBAL_OPTIONS] multi-model --config <CONFIG> [--default-model-id <ID>]
```

Legacy syntax (deprecated):

```bash
mistralrs-server [OPTIONS] --multi-model --multi-model-config <CONFIG> [--default-model-id <ID>]
```

Two plain-text models, one with in-situ quantization:

```json
{
  "llama3-3b": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "qwen3-4b": {
    "Plain": {
      "model_id": "Qwen/Qwen3-4B"
    },
    "in_situ_quant": "Q4K"
  }
}
```

A text model alongside a vision model:

```json
{
  "text-model": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "vision-model": {
    "VisionPlain": {
      "model_id": "google/gemma-3-4b-it"
    }
  }
}
```

A GGUF quantized model:

```json
{
  "llama-gguf": {
    "GGUF": {
      "tok_model_id": "meta-llama/Llama-3.2-3B-Instruct",
      "quantized_model_id": "bartowski/Llama-3.2-3B-Instruct-GGUF",
      "quantized_filename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
    }
  }
}
```

- Each model runs in its own engine thread
- Models can have different configurations (quantization, device layers, etc.)
- Memory usage scales with the number of loaded models
- All models share the same server configuration (port, logging, etc.)
- Interactive mode uses the default model or the first model if no default is set
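As a closing illustration, a quick sanity check over a config file can catch entries that reference no model before the server is started. This is a hypothetical helper, assuming (per the examples above) that a valid spec carries either a `model_id` or the GGUF-style `quantized_model_id`; it is not part of the mistral.rs API:

```python
import json

def check_config(text: str) -> list[str]:
    """Return the organizational keys of a multi-model config, raising on
    entries that carry none of the model-reference fields shown above."""
    config = json.loads(text)
    for key, entry in config.items():
        specs = [v for v in entry.values() if isinstance(v, dict)]
        if not any("model_id" in s or "quantized_model_id" in s for s in specs):
            raise ValueError(f"{key}: no model reference found")
    return list(config)

# The GGUF example config from above.
gguf_config = """
{
  "llama-gguf": {
    "GGUF": {
      "tok_model_id": "meta-llama/Llama-3.2-3B-Instruct",
      "quantized_model_id": "bartowski/Llama-3.2-3B-Instruct-GGUF",
      "quantized_filename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
    }
  }
}
"""
print(check_config(gguf_config))  # ['llama-gguf']
```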