CUGAR Agent is a production-grade, modular agent stack that embraces 2025’s best practices for LangGraph/LangChain orchestration, LlamaIndex-powered RAG, CrewAI/AutoGen-style multi-agent patterns, and modern observability (Langfuse/OpenInference/Traceloop). The repository is optimized for rapid setup, reproducible demos, and safe extension into enterprise environments. Policy and change-management guardrails are maintained in AGENTS.md and must be reviewed before modifying agents or tools.
- Composable agent graph: Planner → Tool/User executor → Memory+Observability hooks, wired for LangGraph.
- RAG-ready: LlamaIndex loader/retriever scaffolding with pluggable vector stores (Chroma, Qdrant, Weaviate, Milvus).
- Multi-agent: CrewAI/AutoGen-compatible patterns and coordination helpers.
- Observability-first: Langfuse/OpenInference emitters, structured audit logs, profile-aware sandboxing.
- Developer experience: Typer CLI, Makefile tasks, uv-based env management, Ruff/Black/isort + mypy, pytest+coverage, pre-commit.
- Deployment: Dockerfile, GitHub Actions CI/CD, sample configs and .env.example for cloud/on-prem setups.
- Added Watsonx Granite provider stub with deterministic defaults and JSONL audit trail to simplify enterprise alignment.
- Added Langflow component placeholders (planner, executor, guard, Granite LLM) to prep for flow export/import commands.
- Added registry validation, sandbox profile starter, and documentation shells for security and guardrail mapping.
```
                       ┌──────────────────────────┐
                       │        Controller        │
                       │ (policy + correlation ID)│
                       └────────────┬─────────────┘
                                    │
                           plan(goal, registry)
                                    │
┌──────────────┐          ┌─────────▼─────────┐          ┌────────────────────┐
│ Registry/CFG │──sandbox▶│      Planner      │──steps──▶│   Executor/Tools   │
│ (Hydra/Dyn)  │          │ (ReAct/Plan&Exec) │          │ (LCEL, MCP, HTTP)  │
└──────────────┘          └─────────┬─────────┘          └─────────┬──────────┘
                                    │                              │
                        traces + memory writes          Langfuse/OpenInference
                                    │                              │
                            ┌───────▼────────┐                ┌─▼────────┐
                            │  Memory / RAG  │◀────context────│ Clients  │
                            │  (LlamaIndex)  │                │ (CLI/API)│
                            └────────────────┘                └──────────┘
```
For a role-by-role, mode-aware walkthrough of how the controller, planners, executors, and MCP tool packs fit together (plus configuration keys), see docs/agents/architecture.md. For an MCP + LangChain web stack overview that covers the FastAPI backend, Vue 3 frontend, streaming flows, and configuration surfaces, see docs/MCP_LANGCHAIN_OVERVIEW.md. A step-by-step stable local launch checklist (registry + sandbox + Langflow readiness) lives in docs/local_stable_launch.md.
📘 System Execution Narrative - Complete request → response flow for contributor onboarding (3 entry points: CLI/FastAPI/MCP, 8 execution phases with security boundaries, observability integration, debugging tips, testing guidance)
🔧 FastAPI Role Clarification - Defines FastAPI as transport layer only (HTTP/SSE, auth, budget enforcement) vs orchestration (planning, coordination, execution) to prevent mixing concerns
⚙️ Orchestrator Interface and Semantics - Formal specification for orchestrator API with lifecycle callbacks, failure taxonomy, retry semantics, execution context, routing authority, and implementation patterns
🏢 Enterprise Workflow Examples - End-to-end workflows for typical enterprise use cases (customer onboarding, incident response, data pipelines) with planning, error recovery, HITL gates, and external API automation
📊 Observability and Debugging Guide - Instrumentation guide covering structured logging, distributed tracing (OpenTelemetry/LangFuse/LangSmith), metrics collection, error introspection, replayable traces, dashboards, and troubleshooting playbooks
🧪 Test Coverage Map - Coverage map aligned with architectural components, showing what is tested (orchestrator 80%, routing 85%, failures 90%), where the critical gaps are (tools 30%, memory 20%, config 0%, observability 0%), and priorities for additional testing
👋 Developer Onboarding Guide - Step-by-step walkthrough for newcomers: environment setup (15 min), first agent interaction (10 min), create custom tool (20 min), build custom agent (30 min), wire components together (15 min) with full working examples (calculator tool, math tutor agent, tutoring workflow)
```bash
# 1) Install (Python >=3.10)
uv sync --all-extras --dev
uv run playwright install --with-deps chromium

# 2) Configure environment
cp .env.example .env
# set OPENAI_API_KEY / LANGFUSE_SECRET / etc inside .env

# 3) Run demo agent locally
uv run cuga start demo

# 4) Try modular stack example
uv run python examples/run_langgraph_demo.py --goal "triage a support ticket"
```

- Dependencies: `uv` (or `pip`), optional browsers for Playwright, optional vector DB service (Chroma/Weaviate/Qdrant/Milvus).
- Development: `uv sync --all-extras --dev` installs dev + optional extras (`memory`, `sandbox`, `groq`, etc.).
- Pre-commit: `uv run pre-commit install` then `uv run pre-commit run --all-files`.
- `.env.example` lists required variables for LLMs, tracing, and storage.
- `configs/` holds YAML/TOML profiles for agents, LangGraph graphs, memory backends, and observability.
- `registry.yaml` and `config/` house MCP/registry defaults; use `scripts/verify_guardrails.py` before shipping changes.
- Review AGENTS.md before altering planners, tools, or registry entries; it is the single source of truth for allowlists, sandbox expectations, budgets, and redaction.
- Guardrail and registry changes are enforced by CI: `scripts/verify_guardrails.py --base <branch>` collects diffs and fails if `README.md`, `PRODUCTION_READINESS.md`, `CHANGELOG.md`, or `todo1.md` are not updated alongside guardrail changes, or if `## vNext` lacks a guardrail note.
- Keep production checklists (`PRODUCTION_READINESS.md`) and security docs in sync with guardrail adjustments so downstream users understand the default policies and where to override them.
- Developer checklist: ensure registry entries declare sandboxes + `/workdir` pinning for exec scopes; budget/observability env keys (`AGENT_*`, `OTEL_*`, LangFuse/LangSmith, Traceloop) are wired; `docs/mcp/tiers.md` is regenerated from `docs/mcp/registry.yaml`; and new/updated tests exercise planner ranking, import guardrails, and registry hot-swap determinism.
- Planner: ReAct or Plan-and-Execute; emits steps with policy-aware cost/latency hints.
- Tool Executor: LCEL/LangChain tools, MCP adapters, HTTP/OpenAPI runners with sandboxed registry resolution.
- RAG/Data Agent: LlamaIndex loader+retriever (docs in `rag/`), vector memory connectors in `memory/`.
- Coordinator: CrewAI/AutoGen-like orchestrator for multi-agent hand-offs.
- Observer: Langfuse/OpenInference emitters with correlation IDs and redaction hooks.
See AGENTS.md for role details and USAGE.md for end-to-end flows.
- Drop documents into `rag/sources/` or configure a remote store.
- Choose a backend in `configs/memory.yaml` (chroma|qdrant|weaviate|milvus|local).
- Run `uv run python scripts/load_corpus.py --source rag/sources --backend chroma`.
- Query via `uv run python examples/rag_query.py --query "How do I add a new MCP tool?"`.
- `memory/` exposes `VectorMemory` (in-memory fallback), summarization hooks, and profile-scoped stores.
- State keys are namespaced by profile to preserve sandbox isolation.
- Persistence is opt-in; see `configs/memory.yaml` and `TESTING.md` for guidance.
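As a rough illustration of the in-memory fallback and the profile-scoped isolation described above, here is a toy vector store; the class name, method signatures, and cosine scoring are assumptions for illustration, not the actual `VectorMemory` API in `memory/`.

```python
import math
from collections import defaultdict

class InMemoryVectorStore:
    """Toy in-memory vector store, namespaced by profile (illustrative only)."""

    def __init__(self):
        # profile -> list of (vector, payload) pairs
        self._stores = defaultdict(list)

    def add(self, profile: str, vector: list[float], payload: str) -> None:
        self._stores[profile].append((vector, payload))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, profile: str, vector: list[float], top_k: int = 1) -> list[str]:
        # Only this profile's entries are searched -- no cross-profile leakage
        scored = [(self._cosine(vector, v), p) for v, p in self._stores[profile]]
        scored.sort(key=lambda sp: sp[0], reverse=True)
        return [payload for _, payload in scored[:top_k]]

store = InMemoryVectorStore()
store.add("demo", [1.0, 0.0], "MCP tool registration guide")
store.add("demo", [0.0, 1.0], "budget policy notes")
store.add("other", [1.0, 0.0], "should never surface for 'demo'")
print(store.query("demo", [0.9, 0.1]))  # best match within the 'demo' profile
```

Because lookups are keyed by profile, a query against `"demo"` can never return the `"other"` profile's payloads, which is the isolation property the bullet above describes.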
- Langfuse client is wired via `observability/langfuse.py` with sampling + PII redaction hooks.
- OpenInference/Traceloop emitters are optional and can be toggled per profile.
- Structured audit logs live under `logs/` when enabled; avoid committing artifacts.
- Watsonx Granite calls validate credentials up front and append JSONL audit rows with timestamp, actor, parameters, and outcome for offline review.
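A hedged sketch of what appending such a JSONL audit row could look like; the function name and the temp-file path are illustrative, not the repository's actual Watsonx audit hook.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def append_audit_row(path: str, actor: str, parameters: dict, outcome: str) -> dict:
    """Append one JSONL audit row: timestamp, actor, parameters, outcome."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "parameters": parameters,
        "outcome": outcome,
    }
    # One JSON object per line, append-only, so rows are replayable offline
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row

audit_path = os.path.join(tempfile.gettempdir(), "granite_audit.jsonl")
row = append_audit_row(
    audit_path,
    actor="planner",
    parameters={"model": "granite-4-h-small", "temperature": 0.0},
    outcome="success",
)
print(row["outcome"])
```

Append-only JSONL keeps each call auditable as a standalone record and is trivially greppable during offline review.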
- The FastAPI orchestrator exposes a Prometheus-compatible metrics endpoint at `/metrics` (default port 8000). It exports golden-signal metrics such as `cuga_requests_total`, `cuga_success_rate`, `cuga_latency_ms{percentile="p50|p95|p99"}`, `cuga_tool_error_rate`, `cuga_budget_warnings_total`, and `cuga_budget_exceeded_total`.
- Configure OpenTelemetry (OTLP) or console exporters via environment variables. Common envs:
  - `OTEL_EXPORTER_OTLP_ENDPOINT` — OTLP HTTP/gRPC endpoint for traces/metrics (optional; when unset the console exporter is used).
  - `OTEL_SERVICE_NAME` — service name to appear in traces (default: `cuga-orchestrator`).
  - `OTEL_TRACES_EXPORTER` / `OTEL_METRICS_EXPORTER` — exporter type (`otlp`, `logging`, `none`).
Example: curl the metrics endpoint locally
```bash
# If running the orchestrator locally on port 8000
curl -sS http://localhost:8000/metrics | head -n 80

# Expected sample lines (Prometheus format):
# cuga_requests_total 42
# cuga_success_rate 0.95
# cuga_latency_ms{percentile="p50"} 150.0
# cuga_latency_ms{percentile="p95"} 450.0
# cuga_tool_error_rate 0.02
# cuga_budget_warnings_total 3
# cuga_budget_exceeded_total 0
```

- `agents/` outlines planner/worker/tool-user patterns and how to register them with CrewAI/AutoGen.
- `examples/multi_agent_dispatch.py` demonstrates round-robin delegation with shared vector context.
- Hand-offs carry correlation IDs and redacted summaries, not raw prompts.
- Run `make lint test typecheck` locally.
- Pytest with coverage is configured (see `TESTING.md`).
- CI (GitHub Actions) runs lint, type-check, tests, and guardrail verification on pushes/PRs.
CUGAR Agent enforces security-first design with deny-by-default policies per AGENTS.md:
- Allowlist-First Tool Selection: Only explicitly allowed tools from `cuga.modular.tools.*` can execute
- Deny-by-Default Network: Network egress restricted to a domain allowlist; localhost/private networks blocked by default
- Sandbox Isolation: All tool execution runs in isolated sandboxes (py/node slim|full, orchestrator profiles) with read-only mounts
- Budget Enforcement: Cost ceilings (default: 100 units/task) with `warn` or `block` policies
- Human-in-the-Loop Approval: High-risk operations (DELETE, FINANCIAL) require explicit approval before execution
```
Request → Budget Guard → Tool Allowlist → Parameter Validation → Network Policy → Sandbox Execution
               ↓               ↓                    ↓                   ↓                 ↓
         (ceiling=100)   (cuga.modular     (type/range/pattern)  (domain allowlist)  (read-only)
                          .tools.* only)                         (no localhost)
```
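The pipeline above can be sketched as a chain of deny-by-default guards; the function names and request shape below are assumptions for illustration (parameter validation is elided for brevity), not the repository's actual API.

```python
# Hypothetical guard chain mirroring the validation pipeline; not the real API.
def budget_guard(req: dict) -> None:
    if req["cost"] > 100:  # default ceiling: 100 units/task
        raise PermissionError("budget exceeded")

def allowlist_guard(req: dict) -> None:
    if not req["tool"].startswith("cuga.modular.tools."):
        raise PermissionError("tool not on allowlist")

def network_guard(req: dict) -> None:
    if any(host in req.get("url", "") for host in ("localhost", "127.0.0.1")):
        raise PermissionError("localhost/private network blocked by default")

def run_pipeline(req: dict) -> str:
    # Deny-by-default: any guard may raise and abort the request
    for guard in (budget_guard, allowlist_guard, network_guard):
        guard(req)
    return "sandboxed-execution"  # read-only sandbox would execute here

result = run_pipeline({
    "cost": 5,
    "tool": "cuga.modular.tools.search",
    "url": "https://example.com/api",
})
print(result)
```

Ordering matters: the cheap budget check runs before allowlist and network checks, so over-budget requests are rejected without touching policy state.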
Approval Flow (HITL):
- Low-risk (READ): Auto-approved, logged
- Medium-risk (WRITE): Auto-approved with audit trail
- High-risk (DELETE, FINANCIAL): Requires human approval (5min timeout, reject on timeout)
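A minimal sketch of the risk-tier routing described above; the function name and return values are hypothetical, and the real HITL flow additionally enforces the 5-minute timeout and audit logging.

```python
def route_approval(action: str) -> str:
    """Map an action's risk tier to an approval decision (illustrative)."""
    high_risk = {"DELETE", "FINANCIAL"}
    medium_risk = {"WRITE"}
    if action in high_risk:
        return "await-human-approval"   # 5 min timeout; reject on timeout
    if action in medium_risk:
        return "auto-approved+audit"    # proceeds, but leaves an audit trail
    return "auto-approved"              # READ and other low-risk actions

print(route_approval("READ"), route_approval("WRITE"), route_approval("DELETE"))
```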
Budget Policy:
- `AGENT_BUDGET_CEILING=100` (default): Max cost units per task
- `AGENT_BUDGET_POLICY=warn|block`: Warn and continue, or block execution
- `AGENT_ESCALATION_MAX=2`: Max approval escalations before admin approval required
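A rough Python sketch of how these environment variables could drive budget decisions; `check_budget` is a hypothetical helper for illustration, not the repository's enforcement code.

```python
import os

def check_budget(cost_units: float) -> str:
    """Apply AGENT_BUDGET_CEILING / AGENT_BUDGET_POLICY semantics (sketch)."""
    ceiling = float(os.environ.get("AGENT_BUDGET_CEILING", "100"))
    policy = os.environ.get("AGENT_BUDGET_POLICY", "warn")
    if cost_units <= ceiling:
        return "ok"
    if policy == "block":
        raise RuntimeError(f"budget exceeded: {cost_units} > {ceiling}")
    return "warn"  # warn policy: emit a warning and continue

print(check_budget(50))   # within ceiling
print(check_budget(150))  # over ceiling with the default 'warn' policy
```

With `AGENT_BUDGET_POLICY=block`, the same over-ceiling call would raise instead of warning, which matches the warn/block distinction in the list above.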
See SECURITY.md for complete security controls and docs/security/GOVERNANCE.md for governance architecture.
Governance builds on the same deny-by-default posture, per AGENTS.md § 4 Sandbox Expectations:
- Policy Gates: HITL approval points for WRITE/DELETE/FINANCIAL actions (Slack send, file delete, stock orders)
- Per-Tenant Capability Maps: 8 organizational roles (marketing/trading/engineering/support) with tool allowlists/denylists
- Runtime Health Checks: Tool discovery ping, schema drift detection, cache TTLs to prevent huge cold-start lists
- Layered Access Control: Tool registration → Tenant map → Tool-level restrictions → Rate limits
See docs/security/GOVERNANCE.md for complete governance architecture, configuration files, and integration patterns.
- No eval/exec: All `eval()` and `exec()` calls eliminated from production code paths
- AST-based expression evaluation: Use `safe_eval_expression()` from `cuga.backend.tools_env.code_sandbox.safe_eval` for mathematical expressions
  - Allowlisted operators: Add/Sub/Mul/Div/FloorDiv/Mod/Pow
  - Allowlisted functions: math.sin/cos/tan/sqrt/log/exp, abs/round/min/max/sum
  - Denies: assignments, imports, attribute access, eval/exec/import
- SafeCodeExecutor: All code execution routed through `SafeCodeExecutor` or `safe_execute_code()` from `cuga.backend.tools_env.code_sandbox.safe_exec`
  - Import allowlist: Only `cuga.modular.tools.*` permitted
  - Import denylist: os/sys/subprocess/socket/pickle/eval/exec/compile
  - Restricted builtins: Safe operations (math/types/iteration) allowed; eval/exec/open/import denied
  - Filesystem deny-default: No file operations unless explicitly allowed
  - Timeout enforcement: 30s default, configurable
  - Audit trail: All imports/executions logged with trace_id
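A minimal AST-walking evaluator in the spirit of the rules above; this is a sketch under stated assumptions, not the actual `safe_eval_expression()` implementation, and only a subset of the allowlisted functions is wired up.

```python
import ast
import math
import operator

# Allowlisted binary/unary operators (sketch)
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod, ast.Pow: operator.pow,
    ast.USub: operator.neg, ast.UAdd: operator.pos,
}
# Allowlisted functions (subset for illustration)
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "abs": abs, "round": round,
          "min": min, "max": max}

def safe_eval(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in _FUNCS:
            return _FUNCS[node.func.id](*[walk(a) for a in node.args])
        # Anything else (names, attributes, assignments, imports) is denied
        raise ValueError(f"disallowed expression node: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("2 ** 10 + sqrt(16)"))  # 1028.0
```

Because the walker only recognizes constants, allowlisted operators, and allowlisted calls, attempts like `__import__('os')` fail at the final `raise` rather than executing.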
- SafeClient wrapper: All HTTP requests MUST use `SafeClient` from `cuga.security.http_client`
  - Enforced timeouts: 10.0s read, 5.0s connect, 10.0s write, 10.0s total
  - Automatic retry: Exponential backoff (4 attempts max, 8s max wait)
  - URL redaction: Query params and credentials stripped from logs
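Two of these behaviors, URL redaction and the exponential backoff schedule, can be sketched with the standard library; the helpers below are illustrative, not the `SafeClient` implementation.

```python
from urllib.parse import urlsplit, urlunsplit

def redact_url(url: str) -> str:
    """Strip query params and userinfo credentials before logging (sketch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    # Rebuild without userinfo, query string, or fragment
    return urlunsplit((parts.scheme, host, parts.path, "", ""))

def backoff_waits(attempts: int = 4, max_wait: float = 8.0) -> list[float]:
    """Exponential backoff schedule: 1s, 2s, 4s, ... capped at max_wait."""
    return [min(2.0 ** n, max_wait) for n in range(attempts)]

print(redact_url("https://user:hunter2@api.example.com/v1/run?api_key=abc123"))
print(backoff_waits())
```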
- Env-only secrets: Credentials MUST be loaded from environment variables
  - CI enforces `.env.example` parity validation (no missing keys)
  - Secret scanning: trufflehog + gitleaks on every push/PR
  - Hardcoded API keys/tokens trigger CI failure
- Import restrictions: Dynamic imports limited to the `cuga.modular.tools.*` namespace only
- Profile isolation: Memory and tool access namespaced per profile; no cross-profile leakage
- Sandbox profiles: All registry entries declare a sandbox profile (py/node slim|full, orchestrator)
- Read-only defaults: Mounts are read-only by default; `/workdir` pinning for exec scopes
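An import gate following these rules might look like the following sketch; this is a hypothetical helper, not the repository's actual import hook.

```python
ALLOWED_PREFIX = "cuga.modular.tools."
DENYLIST = {"os", "sys", "subprocess", "socket", "pickle"}

def check_import(module: str) -> bool:
    """Deny-by-default import gate (sketch of the guardrail, not the real hook)."""
    root = module.split(".")[0]
    if root in DENYLIST:            # explicit denylist wins outright
        return False
    return module.startswith(ALLOWED_PREFIX)  # everything else must match the allowlist

print(check_import("cuga.modular.tools.search"))  # True
print(check_import("subprocess"))                 # False
```

Note the deny-by-default shape: a module that is neither denylisted nor under the allowed prefix is still rejected.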
See AGENTS.md for complete guardrail specifications and docs/security/ for detailed security controls.
CUGAR Agent provides production-grade observability with structured events, golden signals, and multi-backend export:
- Structured Events: `plan_created`, `route_decision`, `tool_call_start/complete/error`, `budget_warning/exceeded`, `approval_requested/received/timeout`
- Golden Signals: Success rate (%), latency (P50/P95/P99), tool error rate (%), mean steps/task, approval wait time, budget utilization
- Trace Propagation: `trace_id` flows through CLI → planner → worker → coordinator → tools with parent-child relationships
- PII Redaction: Auto-redact sensitive keys (`secret`, `token`, `password`, `api_key`, `credential`, `auth`) before emission
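A minimal sketch of key-based PII redaction as described above; the helper below uses naive substring matching on key names and is illustrative, not the repository's redaction hook.

```python
SENSITIVE = ("secret", "token", "password", "api_key", "credential", "auth")

def redact_event(event: dict) -> dict:
    """Replace values of sensitive keys before an event is emitted (sketch)."""
    out = {}
    for key, value in event.items():
        if isinstance(value, dict):
            out[key] = redact_event(value)  # recurse into nested payloads
        elif any(marker in key.lower() for marker in SENSITIVE):
            out[key] = "[REDACTED]"
        else:
            out[key] = value
    return out

event = {"event": "tool_call_start", "trace_id": "abc123",
         "params": {"url": "https://example.com", "api_key": "sk-live-xyz"}}
print(redact_event(event))
```

Substring matching is deliberately aggressive (it would also catch `authorization`), which errs on the side of over-redaction before emission.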
```bash
# Prometheus metrics endpoint (scrape target)
curl http://localhost:8000/metrics

# Expected metrics:
cuga_requests_total                    # Total requests handled
cuga_success_rate                      # % successful requests
cuga_latency_ms{percentile}            # P50/P95/P99 latency
cuga_tool_error_rate                   # % failed tool calls
cuga_steps_per_task                    # Mean planning steps
cuga_budget_warnings_total             # Budget warnings emitted
cuga_budget_exceeded_total             # Budget hard blocks
cuga_approval_requests_total{status}   # Approval flow tracking
```

- OpenTelemetry (OTLP): Set `OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318` for Jaeger/Zipkin/Tempo
- LangFuse: Set `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` for LLM tracing
- LangSmith: Set `LANGCHAIN_API_KEY`, `LANGCHAIN_PROJECT`, `LANGCHAIN_ENDPOINT`
- Console (Default): Offline-first JSON logs to stdout (no network required)
Import the pre-built dashboard from `observability/grafana_dashboard.json`:
- Request rate & success rate panels
- Latency percentile charts (P50/P95/P99)
- Tool error breakdown by tool/type
- Budget utilization gauge
- Approval queue depth
- Event timeline with filtering
Configuration:
```bash
# Enable OTEL export
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="cuga-orchestrator"

# Start with observability
uv run cuga start demo
```

See `docs/observability/OBSERVABILITY_GUIDE.md` for the detailed instrumentation guide and `PRODUCTION_READINESS.md` for metrics scraping setup.
- Which LLMs are supported?
- OpenAI (GPT-4o, GPT-4 Turbo)
- Azure OpenAI
- Anthropic (Claude 3.5 Sonnet, Opus, Haiku)
- IBM Watsonx / Granite 4.0 (granite-4-h-small, granite-4-h-micro, granite-4-h-tiny) — Default provider with deterministic temperature=0.0
- Groq (Mixtral)
- Google GenAI
- Any LangChain-compatible model via adapters
- Do I need a vector DB? Not for quickstarts; an in-memory store is bundled. For production, use Chroma/Qdrant/Weaviate/Milvus.
- How do I add a new tool? Implement a `ToolSpec` in `tools/registry.py` or wrap an MCP server; see `USAGE.md`.
- Is this production-ready? The core stack follows a sandboxed, profile-scoped design with observability; harden configs before internet-facing use.
- How do I configure Watsonx/Granite? Set the environment variables `WATSONX_API_KEY`, `WATSONX_PROJECT_ID`, and optionally `WATSONX_URL`. See `docs/configuration/ENVIRONMENT_MODES.md` for details.
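To make the tool-registration answer concrete, here is a hypothetical sketch of what a `ToolSpec` and registry entry could look like; the real fields are defined in `tools/registry.py` and may differ.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape of a ToolSpec; the actual fields live in tools/registry.py.
@dataclass
class ToolSpec:
    name: str                      # must sit under the cuga.modular.tools.* allowlist
    description: str
    handler: Callable[..., str]
    sandbox_profile: str = "py-slim"  # assumed default sandbox profile
    cost_units: int = 1               # feeds the budget guard

REGISTRY: dict[str, ToolSpec] = {}

def register(spec: ToolSpec) -> None:
    REGISTRY[spec.name] = spec

register(ToolSpec(
    name="cuga.modular.tools.echo",
    description="Return its input unchanged (demo tool)",
    handler=lambda text: text,
))
print(REGISTRY["cuga.modular.tools.echo"].handler("hello"))
```

Keeping the sandbox profile and cost on the spec itself is what lets the registry validation and budget guard reason about a tool before it ever runs.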
For a complete understanding of system execution flow:
- 📘 System Execution Narrative - Complete request → response flow for contributor onboarding (CLI/FastAPI/MCP modes, routing, agents, memory, tools)
- 🏗️ Architecture - High-level design overview
- 🚀 Quick Start - Get up and running quickly
- 🤝 Contributing - How to contribute to the project
- 🔒 Production Readiness - Deployment considerations
- Streaming-first ReAct policies with beta support for Strands/semantic state machines.
- Built-in eval harness for self-play and regression suites.
- Optional LangServe or FastAPI hosting for SaaS-style deployments (see `ROADMAP.md`).
Apache 2.0. See LICENSE.
