Chat-RAG is a high-performance, enterprise-grade chat service that combines Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) capabilities. It provides intelligent context processing, tool integration, and streaming responses for modern AI applications.
- **Intelligent Context Processing**: advanced prompt engineering with context compression and filtering
- **Tool Integration**: seamless integration with semantic search, code definition lookup, and knowledge base queries
- **Streaming Support**: real-time streaming responses with Server-Sent Events (SSE)
- **Enterprise Security**: JWT-based authentication and request validation
- **Comprehensive Monitoring**: built-in metrics and logging with Prometheus support
- **Multi-Model Support**: support for various LLM models and function calling
- **High Performance**: optimized for low-latency responses and high throughput
- **Semantic Router** (migrated from ai-llm-router): optional automatic model selection via semantic classification; emits `x-select-llm` and `x-user-input` response headers
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   API Gateway   │────▶│  Chat Handler   │────▶│  Prompt Engine  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Authentication  │     │   LLM Client    │     │  Tool Executor  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Metrics     │     │   Redis Cache   │     │  Search Tools   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
- Go 1.24.2 or higher
- Redis 6.0+ (optional, for caching)
- Docker (optional, for containerized deployment)
```bash
# Clone the repository
git clone https://github.com/zgsm-ai/chat-rag.git
cd chat-rag

# Install dependencies
make deps

# Build the application
make build

# Run with default configuration
make run
```

```bash
# Build Docker image
make docker-build

# Run container
make docker-run
```

The service is configured via YAML files. See `etc/chat-api.yaml` for the default configuration:
```yaml
# Server
Host: 0.0.0.0
Port: 8080

# LLM upstream (single endpoint; model is specified in the request body)
LLM:
  Endpoint: "http://localhost:8000/v1/chat/completions"
  # Optional: models that support function calling
  FuncCallingModels: ["gpt-4o-mini", "o4-mini"]

# LLM timeout and retry configuration (for regular mode)
LLMTimeout:
  idleTimeoutMs: 180000       # Single idle timeout (ms); default 180000 (180s)
  totalIdleTimeoutMs: 180000  # Total idle timeout budget (ms); default 180000 (180s)
  maxRetryCount: 1            # Maximum retry count; default 1 (2 attempts in total)
  retryIntervalMs: 5000       # Retry interval (ms); default 5000 (5s)

# Context compression
ContextCompressConfig:
  EnableCompress: true
  TokenThreshold: 5000
  SummaryModel: "deepseek-v3"
  SummaryModelTokenThreshold: 4000
  RecentUserMsgUsedNums: 4

# Tool backends (RAG)
Tools:
  SemanticSearch:
    SearchEndpoint: "http://localhost:8002/codebase-indexer/api/v1/semantics"
    ApiReadyEndpoint: "http://localhost:8002/healthz"
    TopK: 5
    ScoreThreshold: 0.3
  DefinitionSearch:
    SearchEndpoint: "http://localhost:8002/codebase-indexer/api/v1/definitions"
    ApiReadyEndpoint: "http://localhost:8002/healthz"
  ReferenceSearch:
    SearchEndpoint: "http://localhost:8002/codebase-indexer/api/v1/references"
    ApiReadyEndpoint: "http://localhost:8002/healthz"
  KnowledgeSearch:
    SearchEndpoint: "http://localhost:8003/knowledge/api/v1/search"
    ApiReadyEndpoint: "http://localhost:8003/healthz"
    TopK: 5
    ScoreThreshold: 0.3

# Logging and classification
Log:
  LogFilePath: "logs/chat-rag.log"
  LokiEndpoint: "http://localhost:3100/loki/api/v1/push"
  LogScanIntervalSec: 60
  ClassifyModel: "deepseek-v3"
  EnableClassification: true

# Redis (optional)
Redis:
  Addr: "127.0.0.1:6379"
  Password: ""
  DB: 0

# Semantic Router (migrated from ai-llm-router). Triggered when the request body model == "auto".
router:
  enabled: true
  strategy: semantic
  semantic:
    analyzer:
      model: gpt-4o-mini
      timeoutMs: 3000
      # endpoint and apiToken can override the global LLM settings for the analyzer only
      # endpoint: "http://higress-gateway.costrict.svc.cluster.local/v1/chat/completions"
      # apiToken: "<your-token>"
      # Optional advanced fields:
      # totalTimeoutMs: 5000
      # maxInputBytes: 8192
      # promptTemplate: ""  # custom classification prompt; default is built in
      # analysisLabels: ["simple_request", "planning_request", "code_modification"]
      # dynamicMetrics:
      #   enabled: false
      #   redisPrefix: "ai_router:metrics:"
      #   metrics: ["error_rate", "p99", "circuit"]
    inputExtraction:
      protocol: openai
      userJoinSep: "\n\n"
      stripCodeFences: true
      codeFenceRegex: ""
      maxUserMessages: 100
      maxHistoryBytes: 4096
    routing:
      candidates:
        - modelName: "gpt-4o-mini"
          enabled: true
          scores:
            simple_request: 10
            planning_request: 5
            code_modification: 3
        - modelName: "o4-mini"
          enabled: true
          scores:
            simple_request: 4
            planning_request: 8
            code_modification: 6
      minScore: 0
      tieBreakOrder: ["o4-mini", "gpt-4o-mini"]
      fallbackModelName: "gpt-4o-mini"
      # Timeout configuration for model degradation scenarios
      idleTimeoutMs: 180000       # Single idle timeout; default 180000ms (180s)
      totalIdleTimeoutMs: 180000  # Total idle timeout budget; default 180000ms (180s)
      # Retry configuration for model degradation scenarios
      maxRetryCount: 1            # Maximum retry count; default 1
      retryIntervalMs: 5000       # Retry interval (ms); default 5000ms
    ruleEngine:
      enabled: false
      inlineRules: []
      bodyPrefix: "body."
      headerPrefix: "header."

  # Alternative: priority-based round-robin strategy.
  # Uncomment to use the priority strategy instead of semantic.
  # priority:
  #   candidates:
  #     - modelName: "gpt-4"
  #       enabled: true
  #       priority: 1   # Lower number = higher priority (0-999)
  #       weight: 5     # Weight for load balancing within the same priority (1-100)
  #     - modelName: "claude-3-opus"
  #       enabled: true
  #       priority: 1   # Same priority as gpt-4
  #       weight: 3     # Lower weight than gpt-4
  #     - modelName: "gpt-3.5-turbo"
  #       enabled: true
  #       priority: 2   # Lower priority; used when priority 1 fails
  #       weight: 10
  #   fallbackModelName: "gpt-3.5-turbo"
  #   # Timeout configuration (same as semantic routing)
  #   idleTimeoutMs: 180000
  #   totalIdleTimeoutMs: 180000
  #   # Retry configuration (same as semantic routing)
  #   maxRetryCount: 1
  #   retryIntervalMs: 5000
```

- LLM
  - `Endpoint`: single Chat Completions endpoint; the final model is carried by the request body `model` field.
  - `FuncCallingModels`: models that support function calling, used to enable tools.
- LLMTimeout (regular mode, i.e. when the router is not used or `model != "auto"`)
  - `idleTimeoutMs`: timeout for a single idle period (ms). Default 180000 (180s).
  - `totalIdleTimeoutMs`: total idle timeout budget across all retries (ms). Default 180000 (180s).
  - `maxRetryCount`: maximum number of retries on retryable errors (timeout, network). Default 1 (2 attempts in total).
  - `retryIntervalMs`: interval between retries (ms). Default 5000 (5s).
- ContextCompressConfig
  - `EnableCompress`: whether to compress long prompts.
  - `TokenThreshold`: trigger threshold for compression (input tokens).
  - `SummaryModel` / `SummaryModelTokenThreshold`: model and threshold used for summarization.
  - `RecentUserMsgUsedNums`: number of recent user messages considered for compression.
- Tools (RAG)
  - Each search block provides HTTP endpoints; `TopK` / `ScoreThreshold` control the recall count and quality.
- Log
  - `LogFilePath`: local log file persisted before background upload to Loki.
  - `LokiEndpoint`: Loki push endpoint.
  - `LogScanIntervalSec`: scan/upload interval in seconds.
  - `ClassifyModel` / `EnableClassification`: optional LLM-based log categorization.
- Redis: Optional; used by tools, router dynamic metrics, and transient statuses.
- router (model selection router)
  - `enabled` / `strategy`: enable the router and pick a strategy: `semantic` (semantic classification) or `priority` (priority-based round-robin).
  - semantic strategy configuration:
    - `analyzer`: classification model and timeouts; can override `endpoint` / `apiToken` for analyzer-only calls; uses a separate non-streaming client in auto mode; optional custom prompt/labels; optional dynamic metrics via Redis.
    - `inputExtraction`: controls extraction of the current user input and bounded history; supports stripping code fences.
    - `routing`: candidate model score table; ties are broken via `tieBreakOrder`; fallback via `fallbackModelName`. Supports independent timeout and retry configuration for model degradation scenarios:
      - `idleTimeoutMs`: single idle timeout for degradation retry (ms). Default 180000 (180s).
      - `totalIdleTimeoutMs`: total idle timeout budget for degradation retry (ms). Default 180000 (180s).
      - `maxRetryCount`: maximum retry count for degradation retry. Default 1.
      - `retryIntervalMs`: retry interval for degradation retry (ms). Default 5000 (5s).
    - `ruleEngine`: optional rule engine to pre-filter candidates (disabled by default).
  - priority strategy configuration (alternative to semantic):
    - A simple, cost-effective strategy without semantic analysis; models are selected by priority (lower number = higher priority, range 0-999).
    - Uses a smooth weighted round-robin algorithm for load balancing within the same priority group.
    - Configuration fields:
      - `candidates`: list of candidate models with `modelName`, `enabled`, `priority` (0-999), and `weight` (1-100).
      - `fallbackModelName`: fallback model when all candidates fail.
      - Timeout and retry settings (same as semantic routing): `idleTimeoutMs` (default 180000 ms), `totalIdleTimeoutMs` (default 180000 ms), `maxRetryCount` (default 1), `retryIntervalMs` (default 5000 ms).
    - Performance note: single-model priority groups use a fast path with zero lock overhead.
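The smooth weighted round-robin mentioned above can be sketched roughly as follows; the `Weighted` type and `pick` function are illustrative, not the service's actual code. Each round every candidate's running counter grows by its weight; the largest counter wins and is then decremented by the total weight, so pick frequency converges to the weight ratio without bursts:

```go
package main

import "fmt"

// Weighted is an illustrative stand-in for a routing candidate
// inside one priority group.
type Weighted struct {
	Name          string
	Weight        int // configured weight (1-100)
	currentWeight int // running counter used by the algorithm
}

// pick implements smooth weighted round-robin: grow each counter by
// its weight, take the largest, then subtract the total weight from
// the winner so it yields to the others on following rounds.
func pick(cands []*Weighted) *Weighted {
	total := 0
	var best *Weighted
	for _, c := range cands {
		c.currentWeight += c.Weight
		total += c.Weight
		if best == nil || c.currentWeight > best.currentWeight {
			best = c
		}
	}
	if best != nil {
		best.currentWeight -= total
	}
	return best
}

func main() {
	group := []*Weighted{{Name: "gpt-4", Weight: 5}, {Name: "claude-3-opus", Weight: 3}}
	// Over 8 rounds this yields gpt-4 five times and claude-3-opus
	// three times, interleaved rather than in bursts.
	for i := 0; i < 8; i++ {
		fmt.Print(pick(group).Name, " ")
	}
	fmt.Println()
}
```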
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the weather like today?"}
    ],
    "stream": false
  }'
```

Set the request body `model` to `auto` and enable `router.enabled: true` in the config:
```bash
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Give me a detailed refactor plan with code examples"}
    ],
    "stream": false
  }'
```

Response headers:

- `x-select-llm`: selected downstream model name
- `x-user-input`: extracted user input used for classification (sanitized and base64-encoded)
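Since `x-user-input` is base64-encoded, a client must decode it before inspecting the extracted text. A minimal Go sketch (the helper name is illustrative; the sample value is simply base64 of an example prompt):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeUserInput decodes the base64 payload carried by the
// x-user-input response header back into the extracted user text.
func decodeUserInput(header string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(header)
	if err != nil {
		return "", fmt.Errorf("invalid x-user-input header: %w", err)
	}
	return string(raw), nil
}

func main() {
	// Sample header value: base64 of "Give me a detailed refactor plan".
	h := "R2l2ZSBtZSBhIGRldGFpbGVkIHJlZmFjdG9yIHBsYW4="
	text, err := decodeUserInput(h)
	if err != nil {
		panic(err)
	}
	fmt.Println(text) // Give me a detailed refactor plan
}
```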
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Write a Python function"}
    ],
    "stream": true
  }'
```

Prometheus metrics are exposed at `/metrics`. See METRICS.md for full metric names and labels.
```
chat-rag/
├── internal/
│   ├── handler/     # HTTP handlers
│   ├── logic/       # Business logic
│   ├── client/      # External service clients
│   ├── router/      # Semantic router (strategy + factory)
│   ├── promptflow/  # Prompt processing pipeline
│   ├── functions/   # Tool execution engine
│   └── config/      # Configuration management
├── etc/             # Configuration files
├── test/            # Test files
└── deploy/          # Deployment configurations
```
```bash
make help          # Show available commands
make build         # Build the application
make test          # Run tests
make fmt           # Format code
make vet           # Vet code
make docker-build  # Build Docker image
make dev           # Run development server with auto-reload
```

```bash
# Run all tests
make test

# Run specific test
go test -v ./internal/logic/

# Run with coverage
go test -cover ./...
```

Intelligent context compression to handle long conversations:
```yaml
ContextCompressConfig:
  EnableCompress: true
  TokenThreshold: 5000
  SummaryModel: "deepseek-v3"
  SummaryModelTokenThreshold: 4000
  RecentUserMsgUsedNums: 4
```

Support for multiple search and analysis tools:
- Semantic Search: Vector-based code and document search
- Definition Search: Code definition lookup
- Reference Search: Code reference analysis
- Knowledge Search: Document knowledge base queries
When `router.enabled: true` and the request body `model` is `auto`, the service selects the best downstream model automatically:

- Input extraction: extract the current user input and a bounded history per `router.semantic.inputExtraction` (can strip code fences)
- Semantic classification: call `router.semantic.analyzer.model` to obtain a label (default labels: `simple_request` / `planning_request` / `code_modification`)
- Candidate scoring: score `routing.candidates` by label; supports `minScore` and optional dynamic metrics
- Tie-break & fallback: break ties via `tieBreakOrder`; fall back to `fallbackModelName` on errors or low scores
- Observability: write `x-select-llm` and `x-user-input` to the HTTP response headers
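The scoring, tie-break, and fallback steps above can be sketched as follows; the `Candidate` type and `selectModel` function are illustrative, not the service's actual code:

```go
package main

import "fmt"

// Candidate mirrors an entry in routing.candidates (illustrative type).
type Candidate struct {
	Name    string
	Enabled bool
	Scores  map[string]int // classification label -> score
}

// selectModel scores enabled candidates for the classifier's label,
// drops those below minScore, breaks ties via tieBreak order, and
// returns fallback when no candidate qualifies.
func selectModel(label string, cands []Candidate, minScore int, tieBreak []string, fallback string) string {
	best, bestScore := "", minScore-1
	tied := map[string]bool{}
	for _, c := range cands {
		if !c.Enabled {
			continue
		}
		s := c.Scores[label]
		if s < minScore {
			continue
		}
		switch {
		case s > bestScore:
			bestScore, best = s, c.Name
			tied = map[string]bool{c.Name: true}
		case s == bestScore:
			tied[c.Name] = true
		}
	}
	if best == "" {
		return fallback // nothing qualified
	}
	if len(tied) > 1 {
		for _, name := range tieBreak {
			if tied[name] {
				return name
			}
		}
	}
	return best
}

func main() {
	cands := []Candidate{
		{Name: "gpt-4o-mini", Enabled: true, Scores: map[string]int{"simple_request": 10, "planning_request": 5, "code_modification": 3}},
		{Name: "o4-mini", Enabled: true, Scores: map[string]int{"simple_request": 4, "planning_request": 8, "code_modification": 6}},
	}
	// With the scores from the sample config, a planning request routes to o4-mini.
	fmt.Println(selectModel("planning_request", cands, 0, []string{"o4-mini", "gpt-4o-mini"}, "gpt-4o-mini"))
}
```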
Configurable agent matching for specialized tasks:
```yaml
AgentsMatch:
  - AgentName: "strict"
    MatchKey: "a strict strategic workflow controller"
  - AgentName: "code"
    MatchKey: "a highly skilled software engineer"
```

The service exposes Prometheus metrics at the `/metrics` endpoint (see METRICS.md for full metric names and labels):
- Request count and latency
- Token usage statistics
- Tool execution metrics
- Error rates and types
Routing observability response headers:
- `x-select-llm`: selected model name
- `x-user-input`: base64 of the extracted user input used for classification
Structured logging with Zap logger:
- Request/response logging
- Error tracking
- Performance metrics
- Debug information
- JWT-based authentication
- Request validation and sanitization
- Rate limiting support
- Secure header handling
```bash
# Build for production
CGO_ENABLED=0 GOOS=linux go build -o chat-rag .

# Run with production config
./chat-rag -f etc/prod.yaml
```

See the deploy/ directory for Kubernetes manifests and Helm charts.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For support and questions:
- Create an issue in the GitHub repository
- Contact the maintainers