I spent the last few months obsessing over one question: if someone asked me to build them an AI agent today, what would I actually recommend? Not the theoretical answer. Not the "it depends" consultant-speak. The specific tools, the exact architecture, the stack I'd bet my reputation on for their situation. So I pulled apart every framework, memory system, LLM routing strategy, and data layer I could find. I talked to builders. I tested combinations. I read documentation until my eyes glazed over. What came out is this guide. It's opinionated. It's specific. And it maps real client profiles to real stacks - with setup checklists you can actually follow. Whether you're building agents for clients or trying to understand what your AI consultant should be recommending, this is the architecture playbook.
Here's the mental model that makes everything click. Every custom AI agent - no matter how simple or complex - breaks down into six layers. Think of it like building a house: you don't pick the paint color before you pour the foundation.
┌──────────────────┐
│ AGENT HARNESS │ ← Orchestration & glue
├──────────────────┤
│ LLM │ ← The reasoning engine
├──────────────────┤
│ SKILLS │ ← Domain expertise modules
├──────────────────┤
│ SESSION & MEMORY │ ← Context persistence & recall
├──────────────────┤
│ TOOLS │ ← Single-purpose integrations
├──────────────────┤
│ DATA │ ← Your personalized foundation
└──────────────────┘
Each layer depends on the one beneath it. Get the foundation wrong and nothing above it works right. Most people start at the top -"which AI model should I use?" - and work their way down. That's backwards. You start with data. Always.
The data layer is the bedrock. Every decision above it depends on what data you have, where it lives, how sensitive it is, and how it needs to be accessed. Skip this layer or rush through it, and you'll rebuild later. I've seen it happen too many times. The Three Types of Data Storage You Need to Know Vector Databases are what make "find me something similar" work. They store embeddings - mathematical representations of your content - and retrieve them based on meaning, not just keywords. This is the engine behind RAG (Retrieval Augmented Generation), which is how agents pull in relevant context from your data.
Database Deployment Best For Compliance
─────────── ────────────────────── ─────────────────────────────────── ──────────────────────────
Pinecone Managed cloud Enterprise RAG, sub-10ms latency SOC 2, HIPAA, ISO 27001
Qdrant Self-hosted or cloud Privacy-first, advanced filtering Self-hosted = full control
Weaviate Self-hosted or cloud Hybrid search + knowledge graphs Self-hosted option
ChromaDB Local/embedded Prototyping, developer speed Local only
pgvector PostgreSQL extension Teams already on Postgres Inherits Postgres compliance
Milvus Distributed Billions of vectors, GPU accel. Enterprise options
Document Stores hold the raw source material and structured metadata alongside it.
Store Best For Agent Integration
─────────────── ─────────────────────────────────────── ──────────────────────────────────────────
MongoDB Converged data (vectors + docs + mem) Native vector search, flexible schema
Elasticsearch Hybrid BM25 + vector search, analytics Agent Builder framework, real-time indexing
PostgreSQL Relational data + vector via pgvector Familiar, ACID-compliant, battle-tested
Graph Databases capture relationships between entities. When your agent needs to answer "how are these two things connected?" or traverse complex webs of information, that's graph territory.
Database Best For Query Language
──────────────── ───────────────────────────────────── ─────────────────
Neo4j Knowledge graphs, 75% of Fortune 100 Cypher
Amazon Neptune AWS-native, serverless scaling Gremlin + SPARQL
FalkorDB Lightweight, fast graph operations Cypher-compatible
The first question with any client - before tools, before models, before anything: where does the data live and who can see it?
This single question determines half your architecture.
Public/Non-Sensitive Data - Cloud-managed services are fine. Pinecone, managed Qdrant Cloud, or Weaviate Cloud. Lower operational burden, faster time-to-value. Most solopreneurs and small businesses live here.
Proprietary but Non-Regulated - Hybrid approach. Self-host the vector DB (Qdrant or Milvus) while using managed LLM APIs. Your data stays on your infrastructure; only queries hit the cloud.
Regulated (HIPAA, SOC 2, GDPR) - Full self-hosting or private cloud. Self-hosted Qdrant + on-premise Neo4j + customer-managed encryption keys. LLM inference via Azure Private Endpoints or self-hosted open-source models. Budget $8K-25K additional for compliance implementation. This is where things get real.
Air-Gapped / Government - Everything on-premise. Self-hosted open-source LLMs (Kim K2.5 Thinking, DeepSeek, MiniMax, Mistral), local vector DB, local graph DB. Zero external API calls. If you're in this world, you already know it.
The Data Pipeline: How Raw Stuff Becomes Agent-Ready
Raw data must be parsed, chunked, embedded, and indexed before agents can use it. Here's the flow:
Raw Sources → Parsing → Chunking → Enrichment → Embedding → Storage
(PDF, DOCX, (Extract (Semantic (Metadata, (Model- (Vector DB
HTML, CSV) text) segments) cleaning) specific) + Doc Store)
Here's the thing nobody tells you: chunking strategy matters more than most people realize. Smaller chunks give focused retrieval but lose context. Larger chunks preserve context but may include noise. Semantic chunking - splitting by meaning rather than character count - is the current best practice.
Keep tables together. Preserve code blocks intact. Use overlap between chunks to maintain continuity.
Tool to try: Unstructured.io - open-source platform that handles multi-format parsing, intelligent chunking, and metadata preservation.
Data Layer Setup Checklist
- [ ] Audit client data sources (types, volume, sensitivity classification)
- [ ] Determine data residency requirements (region, on-premise, cloud)
- [ ] Select compliance framework (SOC 2, HIPAA, GDPR, or none)
- [ ] Choose vector database based on scale and deployment model
- [ ] Choose document store for raw content and metadata
- [ ] Evaluate need for knowledge graph (if relationship queries are critical)
- [ ] Design chunking strategy (semantic preferred, with overlap)
- [ ] Set up embedding pipeline (model selection: OpenAI ada-002, Cohere, or self-hosted)
- [ ] Implement encryption at rest (AES-256) and in transit (TLS 1.2+)
- [ ] Define data retention and deletion policies
- [ ] Set up incremental update pipeline for ongoing ingestion
Tools are how agents interact with the outside world. Each tool does exactly one thing, takes defined inputs, and produces predictable outputs.
The Single Responsibility Principle is non-negotiable here. A tool that does two things is two tools that should be separated. Period.
A well-designed tool is four things: atomic (one operation), deterministic (predictable output), typed (JSON schema inputs/outputs), and least-privileged (minimum necessary permissions).
Vague parameter descriptions cause agents to make wrong choices. Strong typing and clear documentation are the difference between a reliable agent and one that hallucinates its way through your workflow.
Python Tools via Astral UV - This is my favorite for speed-to-build. UV enables truly self-contained Python scripts using PEP 723 inline metadata. Dependencies are declared inside the script itself, and uv run installs them in an ephemeral virtual environment.
No virtualenv management. No requirements.txt. No dependency conflicts between tools.
# /// script
# requires-python = ">=3.12"
# dependencies = ["httpx", "beautifulsoup4"]
# ///
import httpx
from bs4 import BeautifulSoup
def scrape_title(url: str) -> str:
resp = httpx.get(url)
return BeautifulSoup(resp.text, "html.parser").title.stringUV is specifically designed for agent skill distribution: write once, run identically everywhere. Use uv lock for deterministic reproducibility.
Go CLI Tools - Go compiles to a single binary with zero runtime dependencies. No "did you install Python 3.12?" conversations. The Cobra framework (used by Docker, Kubernetes, Hugo) provides battle-tested CLI scaffolding. Go's performance characteristics make tight agent feedback loops viable at scale.
MCP (Model Context Protocol) Servers - MCP is the open standard for connecting LLM clients to external tools and data sources. An MCP server exposes tools via tools/list (discovery) and tools/call (execution) endpoints with JSON Schema contracts. Pre-built MCP servers exist for Google Drive, Slack, GitHub, Postgres, and dozens of other services.
Approach When to Use Trade-off
────────────────── ────────────────────────────────────────────────── ──────────────────────────────
UV Python scripts Rapid development, data processing, API integrations Slower execution than compiled
Go CLI binaries Performance-critical, cross-platform distribution Longer development cycle
MCP servers Standardized integrations, multi-agent ecosystems Protocol overhead
HTTP callables Existing REST APIs, microservices Network latency
- [ ] Inventory required integrations (APIs, databases, file systems, services)
- [ ] Classify each integration as a single-purpose tool
- [ ] Choose implementation language per tool (UV Python for speed, Go for performance)
- [ ] Define JSON schemas for all tool inputs and outputs
- [ ] Implement least-privilege credentials (scoped API keys per tool)
- [ ] Set up MCP server for tools that need standardized discovery
- [ ] Write tool documentation (clear parameter descriptions reduce agent errors)
- [ ] Create error handling patterns (tools should return structured errors, not crash)
- [ ] Test each tool in isolation before agent integration
- [ ] Set up tool versioning strategy
Skills represent a genuine paradigm shift. Released as an open standard by Anthropic in December 2025, skills are now adopted by Microsoft (VS Code, GitHub Copilot), OpenAI (Codex), Google (Gemini CLI), Cursor, Goose, and dozens of other platforms.
This is the layer most people don't know about yet. And it's a game-changer for how agents actually perform.
Tools execute operations and return results. Skills inject domain expertise and behavioral guidance into the agent's reasoning.
A tool is "call this API." A skill is "here's how an expert approaches this entire domain - including which tools to use, in what order, and what to watch out for."
Aspect Tools Skills
───────────────── ──────────────────────────── ──────────────────────────────────────────────────
Context loading All schemas loaded upfront Progressive disclosure (~100 bytes metadata upfront)
Token cost High schema overhead 500x more efficient than MCP for large libraries
Function Execute specific actions Guide reasoning and approach
Extensibility Requires code changes Drop a SKILL.md file, no code
Portability Framework-specific Works across all compatible agents
Here's the key insight: enterprises need both. Tools provide action capability. Skills provide reasoning strategy. An agent with only tools lacks domain expertise. An agent with only skills can't execute.
.agent/skills/
├── contract-review/
│ ├── SKILL.md # Entry point: YAML frontmatter + markdown
│ ├── scripts/ # Executable code (Python, Bash)
│ ├── references/ # Supporting documentation
│ ├── assets/ # Templates, fonts, images
│ └── templates/ # Structured prompts or forms
The SKILL.md file has two parts: YAML frontmatter (name, description, compatibility) and a markdown body with full instructions. At startup, only the frontmatter metadata loads - about 100 bytes per skill. When the agent determines a skill is relevant, it loads the full SKILL.md. Referenced resources load on-demand.
This is the core innovation that most people miss.
Traditional MCP servers with 90+ tools consume 50,000+ tokens of JSON schema before the agent even starts reasoning. That's expensive, slow, and wasteful.
Skills use a three-tier loading model:
- Metadata - Name + description only (~100 bytes). Always in context.
- Full instructions - Complete SKILL.md body. Loaded when activated.
- Resources - Scripts, templates, references. Loaded on-demand.
An agent can have access to thousands of skills without context window bloat. That's 500x more efficient. Not an exaggeration actual measured difference.
- [ ] Identify domain expertise areas the agent needs
- [ ] Audit existing 40,000+ community skills at agentskills.io
- [ ] Write custom SKILL.md files for client-specific domains
- [ ] Keep SKILL.md body under 500 lines (best practice)
- [ ] Structure skills with scripts/, references/, and templates/ subdirectories
- [ ] Implement progressive disclosure in the agent framework
- [ ] Test skill activation triggers (does the agent load the right skill?)
- [ ] Set up cross-platform distribution if needed
- [ ] Create skill evaluation pipeline (test prompts, grading criteria)
- [ ] Version control all skills alongside the codebase
Memory is the line between "helpful tool" and "genuinely useful agent." The session layer persists context across interactions. The memory layer enables the agent to recall, learn, and evolve.
Without memory, every conversation starts from zero. With it, your agent gets smarter over time.
This maps to cognitive science, and understanding it will change how you think about agent design:
Type Purpose Storage Lifetime
───────────────────── ───────────────────────────────────────── ────────────────────────── ──────────────
Working Memory Active context for current decision In-context window Single turn
Short-Term (Episodic) Full conversation history, tool call logs JSON session files One session
Medium-Term Compressed summaries, extracted facts Vector DB + session store Days to weeks
Long-Term (Semantic) User preferences, relationships, patterns Vector DB + knowledge graph Persistent
Procedural Agent behavior rules and strategies Skills + system prompts Permanent
JSON File-Based Sessions - The simplest approach. Each session is a directory with metadata, individual turn files, and accumulated agent state. Git-friendly, zero infrastructure, easy to debug. Not scalable for concurrent users, but perfect for prototyping.
Database-Backed Sessions - For production. Options include SQLite (development), PostgreSQL (production ACID), Redis (sub-1ms latency), MongoDB (flexible schema), and DynamoDB (AWS serverless).
Vector Search (Semantic Recovery) - Store conversation turns and facts as embeddings. Retrieve by semantic similarity. Best for "find me something related to X" queries. This is your broad recall system.
Knowledge Graphs (Structured Recovery) - Store entities and relationships with temporal metadata. Best for "how are X and Y connected?" and "what changed about X over time?" queries. Graphiti (by Zep) is the leading engine here — it combines semantic embeddings with keyword search and graph traversal.
The production recommendation: use both. Hybrid memory combines vector search for broad semantic recall with knowledge graphs for structured relationship queries and session files for conversation continuity.
┌─────────────────────────────────────┐
│ WORKING MEMORY (Context Window) │
│ Current conversation + task state │
│ Lifetime: Single interaction │
└────────────┬────────────────────────┘
↓
┌─────────────────────────────────────┐
│ SEMANTIC LAYER (Vector Search) │
│ Embeddings of past interactions │
│ Metadata filtering + similarity │
└────────────┬────────────────────────┘
↓
┌─────────────────────────────────────┐
│ KNOWLEDGE LAYER (Graph Database) │
│ Entity relationships + temporal │
│ validity + logical inference │
└─────────────────────────────────────┘
Mem0 - Universal memory layer. Works with OpenAI, LangGraph, CrewAI. SOC 2 and HIPAA compliant. Self-improving through priority scoring. 26% improvement over OpenAI baseline, 91% lower p95 latency, 90% token cost savings. This is the easiest on-ramp.
Letta (formerly MemGPT) - OS-inspired memory management where agents actively edit their own memory. Core Memory (always visible) + Archival Memory (unlimited, search-on-demand). Enables personality evolution and user learning over time. Fascinating technology.
Zep / Graphiti - Temporal knowledge graph platform. Real-time graph updates without batch recomputation. Outperforms MemGPT by 18.5% accuracy in LongMemEval benchmarks.
Supermemory - Visualizes memories as interconnected knowledge graphs with backend API for storage, processing, and retrieval.
- [ ] Define memory requirements (how far back does the agent need to remember?)
- [ ] Choose session persistence (JSON files for dev, database for production)
- [ ] Select vector database for semantic memory
- [ ] Evaluate need for knowledge graph (Graphiti/Neo4j for relationship-heavy domains)
- [ ] Choose memory framework (Mem0 for general use, Zep for temporal, Letta for self-editing)
- [ ] Implement hierarchical summarization (recent turns verbatim, older ones compressed)
- [ ] Set up memory consolidation pipeline
- [ ] Define memory cleanup policies (TTL, relevance decay, conflict resolution)
- [ ] Implement multi-agent memory sharing if applicable
- [ ] Test memory recovery accuracy
- [ ] Monitor memory growth and token costs over time
The LLM layer is where reasoning happens. But the choice here isn't just "which model." It's "which models, when, and how do they work together."
This is where most people overspend or underperform. Getting the routing right is the difference between a $200/month agent and a $2,000/month one that produces the same results.
Model In / Out (/M tokens) Speed Reasoning Context Best For
──────────────────────── ───────────────────── ───────────── ─────────────────────── ──────────── ──────────────────────────────────
Claude Opus 4.6 $5 / $25 71 tok/s Elite (1,606 Elo) 200K (1M β) Autonomous agents, complex reasoning
Claude Sonnet 4.5 $3 / $15 ~95 tok/s Excellent (77.2% SWE) 200K (1M β) Fast iteration, dev tools, code
GPT-5.2 $2.50 / $10 Good Elite (80% SWE) 400K Enterprise code, multi-doc reasoning
GPT-o1 $15 / $60 Slow Expert-level reasoning 128K Deep reasoning (27x cost of alternatives)
Gemini 3 Pro $2 / $12 ~130 tok/s Very High (90.1% MMLU) 1M+ Production agents, knowledge tasks
Gemini 3 Flash $0.075 / $0.40 Very fast High (88.6% MMLU) 200K+ High-volume, budget agents
Llama 4 Maverick (OSS) Self-hosted Variable Very Strong (400B) 1M Privacy-first, on-premise agents
Llama 4 Scout (OSS) Self-hosted Variable Strong 10M Extreme context (entire codebases)
DeepSeek R1 (OSS) $0.55 / $2.19 Moderate Expert reasoning 128K Budget reasoning (27x cheaper than o1)
Qwen3-Coder-Next (OSS) Self-hosted Very fast Strong (70.6% SWE) 262K Coding agents, edge deployment
Single Vendor (e.g., Claude-only via Claude Agent SDK) - Simpler implementation, deterministic performance, easier compliance. Risk: vendor lock-in. Choose this when you're starting out, trust the vendor's roadmap, and want to move fast.
Multi-Model Orchestration - Route requests to the best model for each task. Up to 85% cost savings documented. Higher availability through fallback. This is the production play.
Task-Type Routing - Route by what the agent is doing:
- Classification/summarization → Gemini Flash (cheap, fast)
- Code generation → OpenAI Codex (best-in-class)
- Complex reasoning → Claude Opus (best-in-class)
- Multimodal → Gemini Pro (1M context, vision)
- Real-time chat → fastest available model (xAI for socials)
Cascade Routing - Start cheap, escalate on uncertainty:
- Run query through small model with confidence threshold (e.g., 0.7)
- If confident, accept the response
- If not, escalate to mid-tier
- If still uncertain, escalate to frontier
- Result: 85% cost reduction with 95% quality maintenance
Fallback Routing - Primary → secondary → tertiary. Monitor provider health every 30 seconds. Automatic circuit breaking on failures. This keeps your agent running when providers go down.
OpenRouter - Fully managed, zero infrastructure. Monitors provider health, automatically routes away from degraded providers. Best for teams that don't want to manage routing.
LiteLLM - Open-source proxy you self-host. Redis-based usage tracking, multiple routing strategies. Best for teams wanting control. Can use OpenRouter as a backend provider for a hybrid approach.
Helicone - Observability layer. 8ms P50 latency overhead. Native cost tracking, Redis caching (up to 95% cost reduction on repeated queries). Essential for production multi-model deployments.
Self-host when:
- Processing millions of tokens/day (cost savings exceed hardware costs)
- Data cannot leave your infrastructure
- Operating in regulated industries
Don't self-host for: prototyping, low volume (under 100K tokens/day), limited MLOps expertise, or when you need capabilities only proprietary models offer.
- [ ] Profile workload types (reasoning, classification, generation, multimodal)
- [ ] Decide single-vendor vs multi-model based on volume and complexity
- [ ] If multi-model: set up routing infrastructure (OpenRouter or LiteLLM)
- [ ] Define model assignments per task type
- [ ] Implement cascade routing with confidence thresholds (if cost-sensitive)
- [ ] Set up fallback chains with health monitoring
- [ ] Implement rate limiting per provider
- [ ] Set up observability (Helicone or equivalent)
- [ ] If self-hosting: provision GPU infrastructure, set up vLLM
- [ ] Benchmark routing strategy on held-out traffic before production
The agent harness is the orchestration layer that binds everything below it into a functioning system. It manages the agent loop (prompt → reason → act → observe → repeat), coordinates tools and skills, and handles state management.
This is where your philosophy about building shows up.
The Frameworks Worth Knowing
Pi-Mono - Minimal, opinionated coding agent toolkit. Four core tools only (read, write, edit, bash). System prompt under 1,000 tokens. Supports 20+ LLM providers. Open source.
Pi-Mono
─────────────── ────────────────────────────────────────────────────────────
Best For Developers who want minimal foundations to extend
LLM Support 20+ providers (OpenAI, Anthropic, Google, Mistral, Bedrock)
Limitation Smaller ecosystem, requires CLI comfort
The philosophy: "What you leave out matters more than what you put in." If you believe in stack simplicity, this one's for you.
Claude Agent SDK - Anthropic's official SDK. Same agent loop and tools system that powers Claude Code, packaged as a programmable library. Python and TypeScript. Automatic context management with message summarization. Native MCP support.
Claude Agent SDK
─────────────── ────────────────────────────────────────────────────────────
Best For Production Claude deployments with security requirements
LLM Support Claude models only
Limitation Vendor lock-in to Anthropic
MindStudio - Visual, no-code agent builder. 200+ model access, 1,000+ pre-built integrations. Average build time 15 minutes to 1 hour. If you're building for non-technical clients, this is your answer.
MindStudio
─────────────── ────────────────────────────────────────────────────────────
Best For Business users, rapid prototyping, non-technical teams
LLM Support 200+ models from multiple providers
Limitation Proprietary SaaS, limited customization, no self-hosting
OpenClaw - Free, open-source (MIT), self-hosted agent that connects to messaging platforms (WhatsApp, Telegram, Discord, Slack). 100+ pre-configured skills. Semantic Snapshots for web automation that use 100x fewer tokens than screenshots.
OpenClaw
─────────────── ────────────────────────────────────────────────────────────
Best For multi-channel agents, self-hosting
LLM Support Any API-compatible LLM
Limitation Requires infrastructure knowledge, steeper learning curve
CrewAI - Multi-agent orchestration with role-based teams. 100+ built-in tools, 5.76x faster than LangGraph for certain tasks.
LangGraph - Low-level DAG-based orchestration from LangChain. Durable execution with failure recovery. Human-in-the-loop with state inspection.
Dify - Open-source platform combining visual workflow builder, RAG engine, agent capabilities, and model management.
Single Agent - One agent, all tools. Works well with up to 10-15 tools. Simpler to build and debug. Sufficient for well-bounded tasks.
Multi-Agent with Subagents - Primary agent delegates to specialized subagents. Each gets its own context window and tools. Anthropic research shows multi-agent Claude Opus 4 + Sonnet subagents outperformed single-agent Opus 4 by 90.2%. That's not a typo.
Router Pattern — Lightweight agent classifies incoming requests and dispatches to specialists. Each specialist handles one domain. Clean separation of concerns.
Pipeline Pattern -Sequential agents, each performing one step. Output of one becomes input to the next. Good for workflows with clear stages.
Human-in-the-Loop - Agent proposes actions at critical checkpoints. Human approves, modifies, or rejects. Essential for high-stakes decisions. Non-negotiable for anything touching money, legal, or customer-facing communication.
- [ ] Assess team technical capability (no-code, low-code, full-code)
- [ ] Evaluate single-agent vs multi-agent requirement
- [ ] Select framework based on LLM strategy (vendor-specific vs multi-model)
- [ ] Implement agent loop with proper error handling and retry logic
- [ ] Configure tool and skill discovery and loading
- [ ] Set up session management and memory integration
- [ ] Implement human-in-the-loop checkpoints for high-stakes actions
- [ ] Set up logging and observability for agent decisions
- [ ] Define agent permission boundaries
- [ ] Test end-to-end: user request → reasoning → tool use → response
- [ ] Set up CI/CD for agent deployment and skill updates
This is where the guide gets practical. Instead of wading through options, find the pattern that matches your client and follow the prescription.
Who this is for: Early-stage company, small team, limited budget, needs to move fast, non-sensitive data.
- Data: ChromaDB (embedded) + PostgreSQL
- Tools: UV Python scripts + MCP servers for SaaS integrations
- Skills: Community skills from agentskills.io + 2-3 custom SKILL.md files
- Memory: JSON file sessions + ChromaDB for semantic search
- LLM: Single vendor (Claude via API or GPT-5.3)
- Harness: Pi-Mono or Claude Agent SDK
- Why this works: Minimal infrastructure, fast iteration, low cost. ChromaDB embeds directly in the application. UV scripts deploy instantly. Single-vendor = - LLM avoids routing complexity. Everything runs on a single server.
- Monthly cost: $50-200 (LLM API costs only)
- [ ] Set up PostgreSQL for application data
- [ ] Initialize ChromaDB for vector storage (embedded mode)
- [ ] Create UV Python tools for core integrations
- [ ] Write 2-3 custom SKILL.md files for domain expertise
- [ ] Configure Pi-Mono or Claude Agent SDK with single model
- [ ] Implement JSON session files for conversation persistence
- [ ] Deploy to single server or container
- [ ] Set up basic logging
- [ ] Test end-to-end with representative user scenarios
- [ ] Monitor API costs weekly
Who this is for: Mid-to-large company, proprietary data, compliance requirements (SOC 2 minimum), needs multi-department access, internal knowledge base.
- Data: Qdrant (self-hosted) + MongoDB + Neo4j
- Tools: Go CLI binaries + MCP servers + internal REST APIs
- Skills: Custom enterprise skills + curated community skills
- Memory: Mem0 (SOC 2 compliant) + Graphiti knowledge graph + PostgreSQL sessions
- LLM: Multi-model via LiteLLM (Claude for reasoning, Gemini for multimodal, GPT for chat)
- Harness: LangGraph or CrewAI (multi-agent for departmental specialization)
Why this works: Self-hosted vector DB keeps data on-premise. Knowledge graph captures organizational relationships. Multi-model routing optimizes cost and capability. Multi-agent architecture lets departments have specialized agents that share institutional memory.
Monthly cost: $2,000-10,000 (infrastructure + API costs)
- [ ] Deploy Qdrant cluster on internal infrastructure
- [ ] Set up MongoDB for document storage with vector search
- [ ] Deploy Neo4j for organizational knowledge graph
- [ ] Build data ingestion pipeline (Unstructured.io + embedding pipeline)
- [ ] Develop Go CLI tools for internal system integrations
- [ ] Set up MCP servers for approved SaaS tools
- [ ] Write enterprise SKILL.md files per department
- [ ] Deploy Mem0 with SOC 2 configuration
- [ ] Set up Graphiti for temporal knowledge graph
- [ ] Configure LiteLLM proxy with model routing rules
- [ ] Implement role-based agent access controls
- [ ] Set up Helicone for cost and usage monitoring
- [ ] Deploy multi-agent system
- [ ] Implement human-in-the-loop for sensitive operations
- [ ] Set up audit logging for compliance
- [ ] Run security review and penetration testing
- [ ] Create disaster recovery and failover procedures
Who this is for: Regulated industry (healthcare, finance, government), strict data residency, cannot use cloud LLM APIs, needs autonomous operation.
- Data: Qdrant (self-hosted) + PostgreSQL + Neo4j (all on-premise)
- Tools: Go CLI binaries (zero external dependencies) + local MCP servers
- Skills: Fully custom skills (no external marketplace dependencies)
- Memory: Letta (MemGPT) for self-editing memory + local Graphiti + SQLite sessions
- LLM: Self-hosted Llama 4 Maverick or DeepSeek-V3 via vLLM
- Harness: OpenClaw (self-hosted, MIT licensed) or Pi-Mono
Why this works: Zero data leaves the network. Open-source everything, auditable top to bottom. Self-hosted LLMs via vLLM for inference. Letta's OS-inspired memory management enables long-running autonomous agents.
Monthly cost: $5,000-20,000 (GPU infrastructure + engineering time)
- [ ] Provision GPU infrastructure (NVIDIA A100/H100 or equivalent)
- [ ] Deploy vLLM for model inference
- [ ] Set up Qdrant on dedicated servers with encryption at rest
- [ ] Deploy PostgreSQL with row-level security
- [ ] Set up Neo4j Community Edition for knowledge graph
- [ ] Build all tools as Go CLI binaries (no external API calls)
- [ ] Write comprehensive custom skills for all agent capabilities
- [ ] Deploy Letta for memory management
- [ ] Set up Graphiti with local graph backend
- [ ] Configure OpenClaw or Pi-Mono with local model endpoints
- [ ] Implement network isolation (no egress to public internet)
- [ ] Set up comprehensive audit logging
- [ ] Conduct security audit and compliance review
- [ ] Create operational runbooks for model updates and maintenance
- [ ] Set up monitoring and alerting (Prometheus + Grafana)
- [ ] Document data flow diagrams for compliance auditors
Who this is for: Non-technical team, needs workflow automation (customer support, lead qualification, data processing), speed-to-value is the priority.
- Data: Managed cloud (Pinecone or Weaviate Cloud)
- Tools: Pre-built integrations (1,000+ via MindStudio or Dify)
- Skills: Template-based workflows
- Memory: Platform-managed sessions
- LLM: Multi-model via platform (200+ models in MindStudio)
- Harness: MindStudio or Dify
Why this works: 15 minutes to first working agent. No infrastructure to manage. Pre-built integrations with existing business software. Visual workflow builder makes iteration accessible to non-developers.
Monthly cost: $100-500 (platform subscription + per-use API costs)
- [ ] Sign up for MindStudio or deploy Dify Cloud
- [ ] Identify top 3 automation workflows to build
- [ ] Connect existing business tools (CRM, email, Slack, etc.)
- [ ] Select LLM model per workflow (cost vs quality trade-off)
- [ ] Build first agent using template or visual builder
- [ ] Test with representative scenarios
- [ ] Set up webhook triggers for automated activation
- [ ] Train team on workflow editing and monitoring
- [ ] Set up usage alerts and cost monitoring
- [ ] Iterate based on production feedback
### Pattern 5: The Multi-Channel Personal Assistant
Who this is for: Power user or small team wanting a personal AI assistant across messaging platforms, with local control and privacy.
### The Stack:
- Data: SQLite + ChromaDB (local) (also can use QMD)
- Tools: OpenClaw's 100+ pre-configured agent skills
- Skills: Mix of built-in and custom SKILL.md files
- Memory: OpenClaw persistent memory + local ChromaDB for semantic search
- LLM: Any API (Claude, GPT, or local via Ollama)
- Harness: OpenClaw
Why this works: OpenClaw connects to WhatsApp, Telegram, Discord, Slack, and more from a single self-hosted instance. 100+ pre-built skills cover common automation tasks. Semantic Snapshots enable web automation at 100x fewer tokens than screenshot-based approaches.
Monthly cost: $10-70 (API costs only, runs on existing hardware)
### Setup Checklist:
- Set up server or repurpose existing hardware
- Deploy OpenClaw
- Configure messaging channels
- Set up LLM API keys (or local Ollama for privacy)
- Enable desired pre-built skills
- Write custom SKILL.md files for personal workflows
- Set up scheduled automations
- Test cross-channel conversations
- Set up automatic updates and backup
### Pattern 6: The High-Performance Code Agent
Who this is for: Engineering team needing an AI coding assistant that integrates with development workflows, handles large codebases, and operates with minimal latency.
### The Stack:
- Data: pgvector (code embeddings in existing Postgres) + Git repositories
- Tools: UV Python scripts + Go CLI tools + MCP servers (GitHub, Jira, CI/CD)
- Skills: Code-specific skills (refactoring, review, testing, documentation)
- Memory: JSON sessions + vector search over codebase + AGENTS.md for project context
- LLM: Claude Sonnet (primary, highest SWE-bench) + Gemini Flash (fast iteration)
- Harness: Claude Agent SDK or Pi-Mono
Why this works: Claude Opus 4.6 and Codex 5.3. AGENTS.md files provide persistent project knowledge without retrieval overhead. pgvector leverages existing Postgres infrastructure. Multi-model routing uses Claude for complex refactoring and Gemini Flash for quick completions.
Monthly cost: $200-2,000 (API costs scaled by team size)
### Setup Checklist:
- Set up pgvector extension in existing PostgreSQL
- Index codebase with embedding pipeline (function-level chunking)
- Create AGENTS.md files for key repositories
- Write SKILL.md files for code review, refactoring, test generation, documentation
- Set up MCP servers for GitHub, Jira, and CI/CD
- Configure dual-model routing
- Implement codebase-aware context loading (git blame, file history)
- Set up IDE integration (VS Code, Cursor)
- Create evaluation pipeline (run agent on known code tasks, measure quality)
- Monitor token usage and optimize context loading
Don't overthink it. Find your row:
Criteria Pattern 1: Pattern 2: Pattern 3: Pattern 4: Pattern 5: Pattern 6:
MVP Enterprise Privacy No-Code Personal Code
─────────────────── ─────────── ───────────── ─────────── ─────────── ─────────── ───────────
Team size 1-5 10-1000+ 5-100 1-50 1-5 5-100
Technical depth Developers Mixed Deep eng. Non-tech Power users Developers
Data sensitivity Low Medium-High Critical Low-Medium Low Medium
Budget (monthly) $50-200 $2K-10K $5K-20K $100-500 $10-70 $200-2K
Time to 1st agent Days Weeks Months Hours Days Days
LLM approach Single Multi-model Self-hosted Platform Flexible Dual-model
Every production agent needs three types of visibility: what the agent decided (reasoning traces), what it cost (token and API usage), and how long it took (latency per step).
Helicone provides all three with 8ms overhead. For self-hosted stacks, use Prometheus + Grafana with custom agent metrics.
If you can't see what your agent is doing, you can't improve it. It's that simple.
Implement least-privilege at every layer. Tools get scoped API keys. Agents get permission boundaries. Human-in-the-loop for irreversible actions. Audit logging for compliance. Credential rotation on schedule.
Never, ever store secrets in skills or session files.
Test each layer independently, then test integration. Tools should have unit tests. Skills should have evaluation prompts with expected outcomes. Memory should have recall accuracy tests. End-to-end tests should cover representative user journeys.
The pattern: write skill → create test prompts → run agent → evaluate → iterate. Boring? Yes. Essential? Also yes.
LLM costs dominate most agent budgets. Three strategies that compound: Cache repeated queries - up to 95% reduction
- Cascade to cheaper models for simple tasks - up to 85% reduction
- Optimize context loading - skills architecture is 500x more efficient than loading all tool schemas
- Monitor costs weekly. Set alerts for anomalies. A runaway agent loop at $15/M tokens can ruin your weekend.
The AI agent space is moving fast. New frameworks, new models, new patterns every week. But the six-layer mental model holds steady. Data, tools, skills, memory, LLM, harness - in that order, from the ground up.
Don't get caught up chasing shiny objects. Pick the pattern that matches your situation. Follow the checklist. Ship something real.
The best architecture is the one that actually gets built.