feat(api): unify model configuration with model_config parameter #733

@rgthelen

Description

Summary

Consolidate GGUF and Transformers model configuration into a single model_config parameter in the chat completions API, replacing the scattered individual parameters and the extra_body pattern.

Current State

The /chat/completions API currently exposes only a few model-level options:

  • n_ctx - Context window size (GGUF)
  • think / thinking_budget - Reasoning model params
  • max_tokens - Output limit

But many useful GGUF/llama.cpp parameters are not exposed:

  • n_batch - Batch size for prompt processing
  • n_gpu_layers - GPU layer offload (-1 = all)
  • flash_attn - Flash attention
  • use_mmap - Memory-map model file
  • use_mlock - Lock model in RAM
  • cache_type_k / cache_type_v - KV cache quantization
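
Most of these map almost one-to-one onto llama-cpp-python's Llama(...) constructor kwargs, so the pass-through could be a simple rename table. A sketch, assuming llama-cpp-python as the GGUF loader (the mapping dict and helper name are illustrative, not existing code):

```python
# Hypothetical mapping from proposed API names to llama-cpp-python
# Llama(...) kwargs. cache_type_k/cache_type_v are omitted here because
# llama-cpp-python expects numeric ggml type constants (type_k/type_v),
# which would need a separate string-to-enum translation.
LLAMA_KWARG_MAP = {
    "context_length": "n_ctx",
    "n_batch": "n_batch",
    "n_gpu_layers": "n_gpu_layers",
    "flash_attn": "flash_attn",
    "use_mmap": "use_mmap",
    "use_mlock": "use_mlock",
}

def to_llama_kwargs(cfg: dict) -> dict:
    """Translate the unified config into loader kwargs, dropping unknown keys."""
    return {LLAMA_KWARG_MAP[k]: v for k, v in cfg.items() if k in LLAMA_KWARG_MAP}
```

Silently dropping unknown keys (rather than erroring) keeps one model_config payload usable across both backends.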

Transformers models likewise have their own parameters that are not exposed:

  • torch_dtype - Model precision (float16, bfloat16, etc.)
  • device_map - Device placement strategy
  • load_in_8bit / load_in_4bit - Quantization
  • attn_implementation - Attention implementation (flash_attention_2, sdpa)

Proposed Solution

Add a unified model_config object to the API that works for both GGUF and Transformers:

{
  "messages": [...],
  "model_config": {
    // Common
    "context_length": 2048,
    "max_tokens": 512,
    
    // GGUF/llama.cpp specific
    "n_batch": 512,
    "n_gpu_layers": -1,
    "flash_attn": true,
    "use_mmap": true,
    "use_mlock": false,
    "cache_type_k": "q4_0",
    "cache_type_v": "q4_0",
    
    // Transformers specific  
    "torch_dtype": "bfloat16",
    "device_map": "auto",
    "load_in_4bit": false,
    "attn_implementation": "flash_attention_2",
    
    // Reasoning/thinking
    "think": true,
    "thinking_budget": 1024
  }
}
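
Because the two backends accept disjoint option sets, the service can filter model_config by model type and forward only the relevant keys. A routing sketch (key sets taken from the lists above; the function name is illustrative):

```python
# Key sets from the proposal above; the real runtime may accept more options.
COMMON_KEYS = {"context_length", "max_tokens", "think", "thinking_budget"}
GGUF_KEYS = {"n_batch", "n_gpu_layers", "flash_attn", "use_mmap",
             "use_mlock", "cache_type_k", "cache_type_v"}
TRANSFORMERS_KEYS = {"torch_dtype", "device_map", "load_in_8bit",
                     "load_in_4bit", "attn_implementation"}

def split_model_config(cfg: dict, model_type: str) -> dict:
    """Keep common keys plus the keys relevant to the given backend."""
    backend = GGUF_KEYS if model_type == "gguf" else TRANSFORMERS_KEYS
    return {k: v for k, v in cfg.items() if k in COMMON_KEYS | backend}
```

With this, a single payload carrying both GGUF and Transformers keys stays valid for either model type.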

Implementation Plan

Phase 1: Add missing GGUF parameters

  1. Add new fields to ChatRequest model in server/api/routers/projects/projects.py
  2. Pass through in project_chat_service.py to extra_body
  3. Verify Universal Runtime accepts them

Phase 2: Add Transformers parameters

  1. Identify commonly-used Transformers config options
  2. Add to the same unified structure
  3. Route appropriately based on model type

Phase 3: Unify under model_config

  1. Create ModelConfig Pydantic model with all options
  2. Add model_config field to ChatRequest
  3. Deprecate individual top-level params (n_ctx, etc.)
  4. Update project_chat_service.py to merge configs
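
One wrinkle for step 2: Pydantic v2 reserves the attribute name model_config for model settings, so the wire-level "model_config" key has to be mapped onto a differently named field via an alias. A minimal sketch of steps 1-4, assuming Pydantic v2 (the field subset and helper name are illustrative):

```python
from typing import Optional
from pydantic import BaseModel, ConfigDict, Field

class ModelConfig(BaseModel):
    # Field subset for illustration; the real model would carry all options.
    context_length: Optional[int] = None
    max_tokens: Optional[int] = None
    n_batch: Optional[int] = None
    torch_dtype: Optional[str] = None

class ChatRequest(BaseModel):
    # Pydantic v2 reserves `model_config` for model settings, so the JSON key
    # "model_config" is accepted through an alias on a different attribute.
    model_config = ConfigDict(populate_by_name=True)
    messages: list
    config: Optional[ModelConfig] = Field(default=None, alias="model_config")
    n_ctx: Optional[int] = None  # deprecated top-level param, kept for compat

def effective_config(req: ChatRequest) -> ModelConfig:
    """Merge deprecated top-level params into the unified config (new wins)."""
    cfg = req.config or ModelConfig()
    if cfg.context_length is None and req.n_ctx is not None:
        cfg = cfg.model_copy(update={"context_length": req.n_ctx})
    return cfg
```

This keeps existing n_ctx callers working while letting model_config.context_length take precedence when both are sent.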

Phase 4: Designer UI

  1. Add "Model Configuration" panel in Designer
  2. Show relevant options based on model type (GGUF vs Transformers)
  3. Reference: feat(designer): add GGUF model extra_body configuration UI #732

API Examples

GGUF Model (memory-optimized)

curl -X POST "http://localhost:8000/v1/projects/ns/proj/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "model_config": {
      "context_length": 2048,
      "n_batch": 512,
      "n_gpu_layers": -1,
      "flash_attn": true,
      "cache_type_k": "q4_0",
      "cache_type_v": "q4_0"
    }
  }'

Transformers Model (performance-optimized)

curl -X POST "http://localhost:8000/v1/projects/ns/proj/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "model_config": {
      "torch_dtype": "bfloat16",
      "attn_implementation": "flash_attention_2",
      "device_map": "auto"
    }
  }'

Tasks

  • Add GGUF params to ChatRequest (n_batch, n_gpu_layers, flash_attn, use_mmap, use_mlock, cache_type_k, cache_type_v)
  • Add Transformers params to ChatRequest (torch_dtype, device_map, load_in_4bit, attn_implementation)
  • Create unified ModelConfig Pydantic model
  • Update project_chat_service.py to handle new params
  • Add backward compatibility for existing n_ctx param
  • Update API documentation
  • Add Designer UI panel (feat(designer): add GGUF model extra_body configuration UI #732)
