Summary
Consolidate GGUF and Transformers model configuration into a single model_config parameter in the chat completions API, replacing the scattered individual parameters and extra_body pattern.
Current State
The /chat/completions API currently has:
- `n_ctx` - Context window size (GGUF)
- `think`/`thinking_budget` - Reasoning model params
- `max_tokens` - Output limit
But many useful GGUF/llama.cpp parameters are not exposed:
- `n_batch` - Batch size for prompt processing
- `n_gpu_layers` - GPU layer offload (-1 = all)
- `flash_attn` - Flash attention
- `use_mmap` - Memory-map model file
- `use_mlock` - Lock model in RAM
- `cache_type_k`/`cache_type_v` - KV cache quantization
And Transformers models have their own parameters not exposed:
- `torch_dtype` - Model precision (float16, bfloat16, etc.)
- `device_map` - Device placement strategy
- `load_in_8bit`/`load_in_4bit` - Quantization
- `attn_implementation` - Attention implementation (flash_attention_2, sdpa)
Proposed Solution
Add a unified model_config object to the API that works for both GGUF and Transformers:
```jsonc
{
  "messages": [...],
  "model_config": {
    // Common
    "context_length": 2048,
    "max_tokens": 512,
    // GGUF/llama.cpp specific
    "n_batch": 512,
    "n_gpu_layers": -1,
    "flash_attn": true,
    "use_mmap": true,
    "use_mlock": false,
    "cache_type_k": "q4_0",
    "cache_type_v": "q4_0",
    // Transformers specific
    "torch_dtype": "bfloat16",
    "device_map": "auto",
    "load_in_4bit": false,
    "attn_implementation": "flash_attention_2",
    // Reasoning/thinking
    "think": true,
    "thinking_budget": 1024
  }
}
```
Implementation Plan
Phase 1: Add missing GGUF parameters
- Add new fields to the `ChatRequest` model in `server/api/routers/projects/projects.py`
- Pass them through to `extra_body` in `project_chat_service.py`
- Verify Universal Runtime accepts them
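A minimal sketch of the Phase 1 wiring, assuming a Pydantic `ChatRequest` and a helper that collects only the fields the caller actually set into `extra_body` (the helper name and exact model shape are illustrative, not the repo's actual code):

```python
from typing import Optional
from pydantic import BaseModel

class ChatRequest(BaseModel):
    # Existing param, kept as-is for backward compatibility
    n_ctx: Optional[int] = None
    # New GGUF/llama.cpp passthrough fields
    n_batch: Optional[int] = None        # prompt-processing batch size
    n_gpu_layers: Optional[int] = None   # -1 offloads all layers
    flash_attn: Optional[bool] = None
    use_mmap: Optional[bool] = None
    use_mlock: Optional[bool] = None
    cache_type_k: Optional[str] = None   # e.g. "q4_0"
    cache_type_v: Optional[str] = None

def build_extra_body(req: ChatRequest) -> dict:
    """Collect only the GGUF fields the caller actually set."""
    gguf_fields = ("n_ctx", "n_batch", "n_gpu_layers", "flash_attn",
                   "use_mmap", "use_mlock", "cache_type_k", "cache_type_v")
    return {f: getattr(req, f) for f in gguf_fields
            if getattr(req, f) is not None}
```

Sending only the explicitly-set fields keeps the runtime's own defaults in effect for everything else.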
Phase 2: Add Transformers parameters
- Identify commonly-used Transformers config options
- Add to the same unified structure
- Route appropriately based on model type
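The model-type routing could look roughly like this (the key sets and `route_config` helper are hypothetical; mapping common keys such as `context_length` onto each backend's equivalent is left out of the sketch):

```python
# Backend-specific key sets, taken from the parameter lists above
GGUF_KEYS = {"n_ctx", "n_batch", "n_gpu_layers", "flash_attn",
             "use_mmap", "use_mlock", "cache_type_k", "cache_type_v"}
TRANSFORMERS_KEYS = {"torch_dtype", "device_map", "load_in_8bit",
                     "load_in_4bit", "attn_implementation"}

def route_config(config: dict, model_type: str) -> dict:
    """Keep only the keys the target backend understands; drop the rest."""
    allowed = GGUF_KEYS if model_type == "gguf" else TRANSFORMERS_KEYS
    return {k: v for k, v in config.items() if k in allowed}
```

Silently dropping irrelevant keys lets one `model_config` payload work unchanged when a project switches model formats.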
Phase 3: Unify under model_config
- Create `ModelConfig` Pydantic model with all options
- Add `model_config` field to `ChatRequest`
- Deprecate individual top-level params (`n_ctx`, etc.)
- Update `project_chat_service.py` to merge configs
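A sketch of the unified `ModelConfig`, with one caveat worth noting: in Pydantic v2, `model_config` is a reserved class attribute and `model_`-prefixed fields sit in a protected namespace, so the request field likely needs an internal name plus a wire alias (the `model_settings` name here is a hypothetical choice):

```python
from typing import Optional
from pydantic import BaseModel, ConfigDict, Field

class ModelConfig(BaseModel):
    # Common
    context_length: Optional[int] = None
    max_tokens: Optional[int] = None
    # GGUF/llama.cpp specific
    n_batch: Optional[int] = None
    n_gpu_layers: Optional[int] = None
    flash_attn: Optional[bool] = None
    use_mmap: Optional[bool] = None
    use_mlock: Optional[bool] = None
    cache_type_k: Optional[str] = None
    cache_type_v: Optional[str] = None
    # Transformers specific
    torch_dtype: Optional[str] = None
    device_map: Optional[str] = None
    load_in_4bit: Optional[bool] = None
    attn_implementation: Optional[str] = None
    # Reasoning/thinking
    think: Optional[bool] = None
    thinking_budget: Optional[int] = None

class ChatRequest(BaseModel):
    # Allow a "model_"-prefixed field and populate it by alias
    model_config = ConfigDict(populate_by_name=True, protected_namespaces=())
    # Appears as "model_config" in the request body via the alias
    model_settings: Optional[ModelConfig] = Field(default=None,
                                                  alias="model_config")
```

The alias keeps the public API field named `model_config` as proposed while sidestepping the reserved attribute inside the server code.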
Phase 4: Designer UI
- Add "Model Configuration" panel in Designer
- Show relevant options based on model type (GGUF vs Transformers)
- Reference: feat(designer): add GGUF model extra_body configuration UI #732
API Examples
GGUF Model (memory-optimized)
```bash
curl -X POST "http://localhost:8000/v1/projects/ns/proj/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "model_config": {
      "context_length": 2048,
      "n_batch": 512,
      "n_gpu_layers": -1,
      "flash_attn": true,
      "cache_type_k": "q4_0",
      "cache_type_v": "q4_0"
    }
  }'
```
Transformers Model (performance-optimized)
```bash
curl -X POST "http://localhost:8000/v1/projects/ns/proj/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "model_config": {
      "torch_dtype": "bfloat16",
      "attn_implementation": "flash_attention_2",
      "device_map": "auto"
    }
  }'
```
References
- GGUF Model Configuration Docs
- llama-cpp-python parameters
- HuggingFace Transformers `from_pretrained` kwargs
- Related: feat(designer): add GGUF model extra_body configuration UI #732 (Designer UI for extra_body)
Tasks
- Add GGUF params to `ChatRequest` (`n_batch`, `n_gpu_layers`, `flash_attn`, `use_mmap`, `use_mlock`, `cache_type_k`, `cache_type_v`)
- Add Transformers params to `ChatRequest` (`torch_dtype`, `device_map`, `load_in_4bit`, `attn_implementation`)
- Create unified `ModelConfig` Pydantic model
- Update `project_chat_service.py` to handle new params
- Add backward compatibility for existing `n_ctx` param
- Update API documentation
- Add Designer UI panel (feat(designer): add GGUF model extra_body configuration UI #732)
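The `n_ctx` backward-compatibility task could merge the legacy top-level param into the unified config roughly like this (hypothetical helper; an explicit `model_config` value wins over the deprecated param):

```python
from typing import Optional

def merge_legacy_params(model_config: dict,
                        n_ctx: Optional[int] = None) -> dict:
    """Map the deprecated top-level n_ctx onto context_length,
    unless the unified config already sets it."""
    merged = dict(model_config)
    if n_ctx is not None and "context_length" not in merged:
        merged["context_length"] = n_ctx
    return merged
```

Letting the new field take precedence means existing clients keep working while migrated clients are never silently overridden.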