A production system in which four AI agents collaborate on software development without direct communication, coordinating through stigmergy (indirect communication via a shared environment).
This system enables multiple AI agents to work autonomously on a shared codebase. Instead of complex message-passing protocols, agents coordinate through stigmergy, the same principle that allows ant colonies to build complex structures without central coordination.
Key Achievement: 80% reduction in token usage through context optimization and knowledge caching.

                    ┌─────────────────────────────────────┐
                    │         SHARED ENVIRONMENT          │
                    │             (Git Repo)              │
                    └─────────────────────────────────────┘
                                       │
          ┌────────────────────────────┼────────────────────────────┐
          │                            │                            │
          ▼                            ▼                            ▼
     ┌──────────┐               ┌──────────────┐              ┌───────────┐
     │ THINKER  │               │   BUILDERS   │              │ GUARDIAN  │
     │ Architect│               │   UI & DDD   │              │ Reviewer  │
     └──────────┘               └──────────────┘              └───────────┘
          │                            │                            │
          │ Creates                    │ Claims &                   │ Reviews &
          │ Tasks                      │ Implements                 │ Approves
          ▼                            ▼                            ▼
     ┌──────────┐               ┌──────────────┐              ┌───────────┐
     │  queue/  │  ──────────▶  │   active/    │  ──────────▶ │ pending/  │
     │  tasks   │               │     work     │              │  reviews  │
     └──────────┘               └──────────────┘              └───────────┘

| Agent | Role | Responsibility |
|---|---|---|
| THINKER | Architect | Creates tasks, analyzes stuck items, runs self-improvement cycles |
| BUILDER-UI | Frontend | React components, TypeScript, styling, browser testing |
| BUILDER-DDD | Backend | Domain logic, API routes, database services |
| GUARDIAN | Reviewer | Code review, quality gates, security checks |
No direct agent-to-agent communication. All coordination happens through file changes:
- Tasks are created in `queue.json` and claimed by moving them to `active.json`
- Reviews are submitted to `pending/` and approved by moving them to `approved/`
- Knowledge accumulates in the shared `patterns.jsonl` and `lessons.jsonl`
- Events propagate through timestamped log files
1. Agent reads tasks/queue.json
2. Finds unclaimed task matching their skills
3. Moves task to tasks/active.json with their ID
4. Commits and pushes immediately
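5. If the push fails (conflict), another agent claimed the task first: pull the latest state and retry

Git's built-in conflict detection acts as a distributed mutex: the remote rejects any push that is not based on its latest commit, so only one agent's claim can land.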
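
The claim steps above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the task fields (`id`, `skill`, `claimed_by`), the `claim_task` and `claim_with_retry` helpers, and the injected `sync` and `push` callables (standing in for `git pull` and for `git commit` followed by `git push`) are all assumptions made for the example.

```python
from typing import Callable, Optional


def claim_task(state: dict, agent_id: str, skills: set) -> Optional[dict]:
    """Move the first queued task matching `skills` into `active`.

    `state` is assumed to look like {"queue": [...], "active": [...]}.
    Mutates `state` in place and returns the claimed task, or None if
    nothing in the queue matches. Field names are hypothetical.
    """
    for i, task in enumerate(state["queue"]):
        if task["skill"] in skills:
            claimed = dict(state["queue"].pop(i), claimed_by=agent_id)
            state["active"].append(claimed)
            return claimed
    return None


def claim_with_retry(
    sync: Callable[[], dict],
    push: Callable[[dict], bool],
    agent_id: str,
    skills: set,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Try to claim a task; a rejected push means another agent won the race.

    `sync()` returns the latest shared state (stand-in for `git pull`).
    `push(state)` returns True on success and False when the remote
    rejects the push (stand-in for `git commit` + `git push`).
    """
    for _ in range(max_attempts):
        state = sync()
        task = claim_task(state, agent_id, skills)
        if task is None:
            return None  # nothing suitable in the queue
        if push(state):
            return task  # the claim is now visible to every agent
        # Push rejected: drop the local change, re-read state, try again.
    return None
```

Keeping the state transition (`claim_task`) separate from the Git interaction makes the race handling easy to exercise with fake `sync` and `push` functions before wiring in real Git commands.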
| Scenario | Recovery |
|---|---|
| Agent crash | 4-hour task timeout, auto-release |
| Stale lock | 2-hour expiry, can override |
| Stuck task (3+ rejections) | THINKER analyzes, decomposes or escalates |
| Review overflow (>10 pending) | Alert triggered |
| Git conflict | Auto-rebase with exponential backoff |
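
Two of the rows above, the 4-hour task timeout and the exponential backoff on Git conflicts, can be sketched as below. Only the 4-hour figure comes from the table; the `claimed_at` field, the helper names, and the 1-second base delay are assumptions for illustration.

```python
from datetime import datetime, timedelta

TASK_TIMEOUT = timedelta(hours=4)  # from the recovery table


def release_stale_tasks(state: dict, now: datetime) -> list:
    """Move tasks whose claim is older than TASK_TIMEOUT back to the queue.

    Assumes each active task carries an ISO-8601 `claimed_at` timestamp
    with a UTC offset (a hypothetical field). Returns released task IDs.
    """
    still_active, released = [], []
    for task in state["active"]:
        claimed_at = datetime.fromisoformat(task["claimed_at"])
        if now - claimed_at > TASK_TIMEOUT:
            # Strip claim metadata so another agent can pick the task up.
            freed = {k: v for k, v in task.items()
                     if k not in ("claimed_by", "claimed_at")}
            state["queue"].append(freed)
            released.append(task["id"])
        else:
            still_active.append(task)
    state["active"] = still_active
    return released


def backoff_delays(base_seconds: float = 1.0, attempts: int = 5) -> list:
    """Delays (in seconds) before each rebase retry, doubling every time."""
    return [base_seconds * 2 ** i for i in range(attempts)]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the timeout check deterministic and easy to test.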
Every 24 hours:
- Collect all rejections
- Group by pattern
- If pattern occurs 3+ times → draft prompt improvement
- Apply to agent, track metrics
- Evaluate after 24h
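
The grouping and threshold steps of this cycle might look like the sketch below. The rejection record shape (`agent`, `pattern`) and the `find_recurring_patterns` name are assumptions; drafting, applying, and evaluating the prompt change are left out.

```python
from collections import Counter

PATTERN_THRESHOLD = 3  # "3+ times" from the cycle above


def find_recurring_patterns(rejections: list) -> dict:
    """Return {(agent, pattern): count} for patterns seen 3+ times.

    Each rejection is assumed to be a dict with hypothetical `agent`
    and `pattern` keys, e.g. collected from the last 24 hours of reviews.
    """
    counts = Counter((r["agent"], r["pattern"]) for r in rejections)
    return {key: n for key, n in counts.items() if n >= PATTERN_THRESHOLD}
```

Keying the count on both agent and pattern means a prompt improvement is only drafted for the agent that keeps making the mistake.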

Repository layout:

    ├── config/
    │   └── system.json        # Agent definitions, safety rules, timing
    ├── docs/
    │   ├── COORDINATION.md    # Full protocol specification
    │   └── SELF-HEALING.md    # Recovery mechanisms
    ├── examples/
    │   ├── task.json          # Task structure
    │   ├── lock.json          # Resource locking
    │   └── event.jsonl        # Event propagation
    ├── knowledge/
    │   ├── patterns.jsonl     # Validated code patterns
    │   └── lessons.jsonl      # Recorded mistakes and fixes
    └── prompts/
        └── thinker.md         # Agent work instructions

The system maintains a self-improving knowledge base:

Patterns - Reusable code solutions:

    {
      "id": "P-005",
      "name": "Tenant Isolation Pattern",
      "description": "Always include tenant_id in WHERE clauses",
      "code": "SELECT * FROM jobs WHERE id = $1 AND tenant_id = $2",
      "confidence": 1.0,
      "uses": 25
    }

Lessons - Recorded mistakes:

    {
      "id": "L-002",
      "trigger": "database query, multi-tenant lookup",
      "mistake": "Missing tenant_id in WHERE clause",
      "fix": "Always add AND tenant_id = $N to every query",
      "success_rate": 1.0
    }

Results:

- 80% token reduction through incremental context loading and knowledge caching
- Zero conflicts with proper locking protocol
- Autonomous operation for extended periods
- Self-improving through pattern recognition
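
One way the knowledge base could feed the token reduction listed above is to load only the lessons whose `trigger` matches the task at hand, rather than the whole file. The sketch below assumes the lesson schema from the example record and treats `trigger` as a comma-separated keyword list; the matching rule and the `relevant_lessons` name are assumptions, not the system's documented behavior.

```python
import json


def relevant_lessons(jsonl_text: str, task_description: str) -> list:
    """Return lessons with at least one trigger keyword in the task text.

    `jsonl_text` is the content of a lessons.jsonl file: one JSON object
    per line. Blank lines are skipped. Matching is a case-insensitive
    substring check, which is an assumption made for this sketch.
    """
    description = task_description.lower()
    matches = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        lesson = json.loads(line)
        keywords = [k.strip().lower() for k in lesson["trigger"].split(",")]
        if any(k and k in description for k in keywords):
            matches.append(lesson)
    return matches
```

Only the matching records would then be added to the agent's prompt, so context grows with the relevance of the knowledge base rather than its size.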

Why stigmergy:

- Scalability - Adding agents doesn't require protocol changes
- Resilience - Agent crashes don't break coordination
- Simplicity - No complex message routing
- Auditability - All state lives in files, and Git history tracks every change

Why Git as the shared environment:

- Built-in conflict detection via push failures
- Full history of all state changes
- Distributed - agents can work offline
- Human-readable state files

Tech stack:

- Coordination: Git + JSON files
- Agent Runtime: Claude API (Anthropic)
- Target Project: TypeScript, React, Node.js, PostgreSQL

This approach draws from:
- Swarm intelligence (ant colony optimization)
- Distributed systems consensus algorithms
- Event sourcing patterns
- GitOps principles
Designed and implemented by Vladyslav Shapovalov as part of building an AI-powered field service platform. The multi-agent approach enables continuous development with minimal human oversight.
MIT License - see LICENSE for details.
For questions and support, please open an issue on GitHub.