Real-time AI code completion and refactoring on a Clockwork Pi uConsole (Raspberry Pi CM4), powered by VPS inference and intelligent context management.
Streaming-First Design
Token-by-token Server-Sent Events (SSE) for real-time code generation feedback
Intelligent Context Management
Smart chunking reduces 8KB+ prompts by up to 62% while preserving refactoring instructions
Edge Orchestration
Raspberry Pi CM4 handles routing, chunking, and request coordination with <100ms overhead
Production-Grade Resilience
Circuit breakers, connection pooling, load shedding, and request correlation IDs throughout
Language-Aware Model Routing
Automatic selection between Qwen-7B (Python/JS) and CodeLLaMA-7B (Rust/C/C++) per request
Native Qt/QML Interface (Debian)
uConsole GUI with real-time streaming chat, live CPU monitoring, health indicators, and system metrics
Optional Lightweight REPL CLI
Streaming REPL with multiline support, command history, and live token display for headless environments
Cerebrum was designed to run alongside the uConsole Cyberdeck Router, with a single Raspberry Pi CM4 handling both VPN routing and AI orchestration simultaneously.
Note: Cerebrum does not require the Cyberdeck Router for the uConsole and can run standalone on any compatible edge device.
┌─────────────────────────────────────────────────────────┐
│ Raspberry Pi CM4 (Orchestrator + VPN Router) │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Cyberdeck Router (isolated) │ │
│ └───────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ FastAPI Server (Port 7000) │ │
│ │ • Instruction extraction & prompt assembly │ │
│ │ • Smart chunking (1000 char blocks, 150 overlap) │ │
│ │ • Deduplication (hash-based fingerprinting) │ │
│ │ • Load shedding (max 2 concurrent requests) │ │
│ │ • Request tracking (UUID correlation) │ │
│ │ • Zero impact on VPN throughput │ │
│ └───────────────────────────────────────────────────┘ │
└───────────────────┬─────────────────────────────────────┘
│
│ HTTP/Tailscale (Streaming SSE)
│ Chunked prompts → Token stream
│
┌───────────────────▼─────────────────────────────────────┐
│ VPS (Inference) Backend │
│ ┌───────────────────────────────────────────────────┐ │
│ │ llama.cpp Runtime (Port 9000) │ │
│ │ • Model: qwen-7b-q4.gguf / codellama-7b-q4.gguf │ │
│ │ • Inference: ~1.6 tok/s (CPU, single-threaded) │ │
│ │ • Connection pool: Persistent httpx client │ │
│ │ • Circuit breaker: 10s cooldown on failures │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Data Flow:
- User prompt → CM4 extracts instructions
- CM4 chunks large code (if >1500 chars)
- CM4 deduplicates repeated patterns
- CM4 selects top 3 relevant chunks
- CM4 assembles instruction-first prompt
- VPS streams tokens back via SSE
- CM4 proxies the stream to the client in real time (see the sketch after this list)
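The relay step can be pictured as a small FastAPI proxy on the CM4 that forwards the assembled prompt to the VPS and re-emits its token stream as Server-Sent Events. A minimal sketch, assuming an illustrative VPS endpoint and request schema rather than the project's actual API surface:

```python
# Hypothetical CM4-side proxy: forward an assembled prompt to the VPS and
# relay its token stream to the client as SSE. VPS_URL and the endpoint
# paths are assumptions for illustration only.
import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
VPS_URL = "http://vps.tailnet:9000/v1/stream"   # assumed backend endpoint
vps = httpx.AsyncClient(timeout=None)           # persistent connection pool

class CompletionRequest(BaseModel):
    prompt: str
    model: str = "qwen_7b"

@app.post("/v1/complete")
async def complete(req: CompletionRequest):
    async def relay():
        # Stream tokens from the VPS and re-emit them as SSE data lines.
        async with vps.stream("POST", VPS_URL, json=req.model_dump()) as resp:
            async for line in resp.aiter_lines():
                if line:
                    yield f"data: {line}\n\n"
    return StreamingResponse(relay(), media_type="text/event-stream")
```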
Intelligent Prompt Handling
- Instruction extraction (e.g. refactor / rewrite / TODO directives)
- Instruction-first prompt assembly for base code models (see the sketch after this list)
- Automatic fallback to raw prompts when transformation is not beneficial
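A minimal sketch of the instruction-first idea, with illustrative names only (the real logic lives in cerebrum/retrieval/instruction_parser.py and assembler.py):

```python
# Hypothetical sketch: pull refactor / rewrite / TODO directives out of the
# raw prompt and place them ahead of the code so base code models see the
# task first. Falls back to the raw prompt when nothing is extracted.
import re

DIRECTIVES = re.compile(r"^\s*(?:#\s*)?(refactor|rewrite|todo)\b.*$",
                        re.IGNORECASE | re.MULTILINE)

def extract_instructions(prompt: str) -> list[str]:
    """Collect directive lines such as '# refactor to async/await'."""
    return [m.group(0).strip() for m in DIRECTIVES.finditer(prompt)]

def assemble(prompt: str, chunks: list[str]) -> str:
    """Instruction-first assembly with a raw-prompt fallback."""
    instructions = extract_instructions(prompt)
    if not instructions:
        return prompt                      # transformation not beneficial
    header = "\n".join(instructions)
    body = "\n\n".join(chunks) if chunks else prompt
    return f"{header}\n\n{body}"
```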
Smart Chunking & Deduplication
- Chunks only when prompts exceed safe thresholds
- Deduplicates overlapping code blocks
- Uses task-aware ranking (instruction-driven, not naive similarity)
- Skips chunking entirely when the reduction is insignificant (see the sketch after this list)
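The chunk → deduplicate → rank pipeline can be sketched with the figures quoted elsewhere in this README (1000-char blocks, 150-char overlap, 1500-char threshold, top-3 selection); the function names are illustrative, not the project's module API:

```python
# Hypothetical sketch of smart chunking, hash-based deduplication, and
# task-aware (instruction-driven) ranking.
import hashlib

def chunk(text: str, size: int = 1000, overlap: int = 150) -> list[str]:
    if len(text) <= 1500:                  # below the safe threshold: skip chunking
        return [text]
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def deduplicate(chunks: list[str]) -> list[str]:
    """Drop repeated code blocks via hash-based fingerprinting."""
    seen, unique = set(), []
    for c in chunks:
        fp = hashlib.sha1(c.strip().encode()).hexdigest()
        if fp not in seen:
            seen.add(fp)
            unique.append(c)
    return unique

def top_chunks(chunks: list[str], instructions: list[str], k: int = 3) -> list[str]:
    """Rank by overlap with instruction keywords, not naive similarity."""
    keywords = {w.lower() for line in instructions for w in line.split()}
    score = lambda c: sum(w in keywords for w in c.lower().split())
    return sorted(chunks, key=score, reverse=True)[:k]
```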
Streaming Inference:
- Small prompts (<100 chars): ~17s for 33 tokens (1.9 tok/s)
- Large prompts (8KB): ~182s for 129 tokens (0.7 tok/s) after 62% chunking reduction
- CM4 overhead: <100ms for chunking + routing
Context Management:
- Input: 8,344 chars (repeated synchronous code)
- After chunking: 3,167 chars (62% reduction)
- Result: Actual async/await refactored code (not TODO lists!)
Resource-Aware Design:
- Max concurrent: 2 requests (load shedding)
- Circuit breaker: 10s cooldown after VPS failures (see the sketch after this list)
- Request timeout: Configurable per endpoint
- Connection pooling: Persistent HTTP client (no repeated initialization)
- Zero degradation in VPN connection quality or throughput (Cyberdeck Router / WireGuard)
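A minimal sketch of how the 2-request load shedder and 10-second circuit breaker can be combined around a pooled httpx client; the names and status codes are illustrative, not the actual middleware or vps_client implementation:

```python
# Hypothetical sketch of the resource controls listed above.
import asyncio
import time
import httpx
from fastapi import HTTPException

MAX_CONCURRENT = 2                              # load shedding limit
COOLDOWN_S = 10.0                               # circuit breaker cooldown
_slots = asyncio.Semaphore(MAX_CONCURRENT)
_tripped_until = 0.0
client = httpx.AsyncClient(timeout=30.0)        # persistent, pooled client

async def call_vps(url: str, payload: dict) -> dict:
    global _tripped_until
    if time.monotonic() < _tripped_until:
        raise HTTPException(503, "circuit open: VPS cooling down")
    if _slots.locked():
        raise HTTPException(429, "load shed: too many concurrent requests")
    async with _slots:
        try:
            resp = await client.post(url, json=payload)
            resp.raise_for_status()
            return resp.json()
        except httpx.HTTPError:
            _tripped_until = time.monotonic() + COOLDOWN_S
            raise HTTPException(502, "VPS failure: breaker tripped")
```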
Interactive REPL + API
- Bash-based interactive shell for fast iteration
- Full FastAPI surface for automation and tooling
Prerequisites
- Raspberry Pi CM4 (4GB RAM, Lite variant with no eMMC), or other compatible edge device
- VPS with 4GB+ RAM (8GB+ for multiple large models running simultaneously)
- Base OS: Debian 12 (Bookworm) installed on both CM4 and VPS
- Python 3.11+
- Deployment models pre-installed (see below)
# On VPS
cd ~/cerebrum-backend
./start.sh
# Verify health
curl http://localhost:9000/health
# On Raspberry Pi
cd /opt/cerebrum-pi
./start.sh
# Verify health
curl http://localhost:7000/health
cd /opt/cerebrum-pi/scripts
./cerebrum_repl.sh
REPL Commands:
>>> :help Show commands
>>> :model qwen_7b Switch model
>>> :lang python Set language
>>> :multi Toggle multiline mode
>>> def fibonacci(n): Generate code!
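The same streaming surface can be scripted from Python for automation. The endpoint path and payload fields below are assumptions; check docs/api/API.md for the actual request format:

```python
# Illustrative client for the CM4 API with live token display.
import httpx

payload = {"prompt": "def fibonacci(n):", "language": "python"}
with httpx.Client(timeout=None) as client:
    with client.stream("POST", "http://localhost:7000/v1/complete",
                       json=payload) as resp:
        for line in resp.iter_lines():
            if line.startswith("data: "):
                print(line[6:], end="", flush=True)   # stream tokens as they arrive
```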
Cerebrum is composed of two independently deployed systems.
The CM4 never runs large models. It decides what to send, how much to send, and how to stream results back efficiently.
- Runs continuously on the Pi
- Handles all user interaction
- Enforces safety and performance constraints
The VPS backend runs heavy LLM inference using llama.cpp with strict resource controls.
- The backend supports multiple GGUF models via llama.cpp-compatible runtimes
- Models are selected dynamically at request time (see the sketch after this list)
- Exposes inference and streaming endpoints
- Tuned for CPU/GPU efficiency
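Language-aware routing amounts to a small per-request lookup. A sketch using the model files named in the architecture diagram; the fallback default for unknown languages is an assumption:

```python
# Hypothetical sketch of language-aware model routing
# (Qwen-7B for Python/JS, CodeLLaMA-7B for Rust/C/C++).
MODEL_BY_LANGUAGE = {
    "python": "qwen-7b-q4.gguf",
    "javascript": "qwen-7b-q4.gguf",
    "rust": "codellama-7b-q4.gguf",
    "c": "codellama-7b-q4.gguf",
    "cpp": "codellama-7b-q4.gguf",
}

def select_model(language: str | None) -> str:
    """Pick a GGUF model per request; unknown languages fall back to Qwen."""
    return MODEL_BY_LANGUAGE.get((language or "").lower(), "qwen-7b-q4.gguf")
```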
Note:
The root of this repository is not directly executable.
All runtime instructions live in the component-specific READMEs above.
Cerebrum/ # 🎩 Root
│
├── cerebrum-pi/ # 🔹 CM4 Orchestrator (Raspberry Pi) - Debian 12
│ ├── cerebrum/
│ │ ├── api/ # 💫 FastAPI Application (Active)
│ │ │ ├── main.py # Application entry point
│ │ │ ├── middleware/ # Request processing
│ │ │ │ ├── request_id.py # UUID correlation
│ │ │ │ ├── log_context.py # Request logging
│ │ │ │ └── load_shed.py # Concurrency limiting
│ │ │ │
│ │ │ ├── routes/ # ✨ API endpoints
│ │ │ │ ├── inference.py # Streaming code completion
│ │ │ │ ├── _chunking_helper.py # Smart prompt processing
│ │ │ │ ├── health.py # Health checks
│ │ │ │ ├── models.py # Model listing
│ │ │ │ └── stats.py # System statistics
│ │ │ │
│ │ │ └── schemas/ # 🔮 API schemas / future Pydantic models
│ │ │
│ │ ├── core/ # 🪄 VPS Integration (Active)
│ │ │ └── vps_client.py # Connection pooling, circuit breaker
│ │ │
│ │ ├── retrieval/ # 🧬 Context Management (Active)
│ │ │ ├── chunker.py # Text chunking (1000 char blocks)
│ │ │ ├── ranker.py # Relevance ranking + deduplication
│ │ │ ├── assembler.py # Prompt assembly
│ │ │ └── instruction_parser.py # Instruction extraction
│ │ │
│ │ ├── orchestration/ # 🔮 Future: Multi-step task coordination
│ │ ├── reasoning/ # 🔮 Future: Symbolic / constraint-based reasoning
│ │ ├── tasks/ # 🔮 Future: Reusable task templates
│ │ └── utils/ # 🔮 Future: Shared helper functions
│ │
│ ├── scripts/
│ │ └── cerebrum_repl.sh # Interactive streaming CLI
│ │
│ ├── config/
│ │ └── cerebrum-tunnel.service # Tailscale VPN systemd service
│ │
│ ├── data/ # 📄 Runtime data
│ │ ├── cache/
│ │ ├── embeddings/
│ │ └── knowledge_base/
│ │
│ ├── tests/ # 🧪 Test suites
│ │ ├── test_api/
│ │ ├── test_core/
│ │ └── test_integration/
│ │
│ ├── start.sh # Start orchestrator
│ ├── stop.sh # Stop orchestrator
│ └── requirements.txt # Python dependencies
│
├── cerebrum-backend/ # 🔸 VPS Inference Backend - Debian 12
│ ├── vps_server/ # ⚙️ Inference Engine (Active)
│ │ └── main.py # FastAPI + llama.cpp streaming
│ │
│ ├── scripts/
│ │ ├── start.sh
│ │ ├── test.sh # Health check tests
│ │ └── generate_api_key.sh # API key generation
│ │
│ ├── config/ # Configuration files
│ ├── logs/ # Runtime logs
│ ├── cerebrum-backend.service # Systemd service
│ └── requirements.txt # Python dependencies
│
├── deployment/ # 🔮 Future: Deployment Automation
│ ├── scripts/
│ └── systemd/
│
├── docs/ # 📚 Documentation
│ ├── api/
│ │ └── API.md
│ ├── architecture/
│ │ └── ARCHITECTURE.md
│ ├── diagrams/
│ │ └── images/
│ ├── guides/
│ │ └── DEVELOPMENT.md
│ └── optimization/
│ └── PERFORMANCE.md
│
├── scripts/ # 🔧 Development Tools
│ ├── sync_to_cm4.sh # Rsync to Raspberry Pi
│ └── sync_to_vps.sh # Rsync to VPS
│
└── shared/ # 🧺 Shared Resources
├── embeddings/ # Vector embeddings cache
├── knowledge_base/ # Curated reference material
│ ├── code_snippets/ # Reusable code examples
│ ├── documentation/ # External reference materials
│ │ └── vendor_docs/ # Third-party API docs, language specs
│ └── examples/ # Sample projects
│
└── models/
├── download_scripts/ # Model acquisition utilities
│ └── download_models.sh
└── lists/ # Model manifests / allowlists
- Edit on macOS (VS Code + VS Code Insiders)
- Sync to CM4 (rsync)
- Sync to VPS (rsync)
- Test locally via REPL or API
- Iterate without redeploying the full system
See the docs/ directory for detailed information:
- API - Available endpoints, request formats, and streaming behavior
- Architecture - System design, data flow, and component boundaries
- Development - Local workflow, testing, and contribution notes
- Optimization - Performance characteristics and tuning considerations
Cerebrum™ © 2025 Robert Hall. All rights reserved.
This project is licensed under the MIT License.
This project uses Qt, which is licensed under LGPL v3.
See Qt's Open Source Licensing for details.
The Cerebrum project is authored by a sole developer and maintainer.
Bug reports, documentation fixes, and design suggestions are always welcome and appreciated. If you encounter an issue or have an idea to share, please open an issue in Cerebrum/issues.
At this time, direct write access and unsolicited feature pull requests are not accepted. All code changes are curated by the maintainer to ensure architectural consistency and system stability.
Built with:
- FastAPI - High-performance async web framework
- llama.cpp - Efficient LLM inference
- httpx - Modern HTTP client with connection pooling
- Qwen - Alibaba's excellent code model
- Qt - GUI framework with Qt Design Studio for native GUI development
- Debian Project - Bookworm base system foundations
- Raspberry Pi is a trademark of Raspberry Pi Ltd
Inspired by the challenge of running production AI on a Raspberry Pi.
