CUGAR Agent is a production-grade, modular agent stack that embraces 2025’s best practices for LangGraph/LangChain orchestration, LlamaIndex-powered RAG, CrewAI/AutoGen-style multi-agent patterns, and modern observability (Langfuse/OpenInference/Traceloop). The repository is optimized for rapid setup, reproducible demos, and safe extension into enterprise environments. Policy and change-management guardrails are maintained in AGENTS.md and must be reviewed before modifying agents or tools.
- Composable agent graph: Planner → Tool/User executor → Memory+Observability hooks, wired for LangGraph.
- RAG-ready: LlamaIndex loader/retriever scaffolding with pluggable vector stores (Chroma, Qdrant, Weaviate, Milvus).
- Multi-agent: CrewAI/AutoGen-compatible patterns and coordination helpers.
- Observability-first: Langfuse/OpenInference emitters, structured audit logs, profile-aware sandboxing.
- Developer experience: Typer CLI, Makefile tasks, uv-based env management, Ruff/Black/isort + mypy, pytest+coverage, pre-commit.
- Deployment: Dockerfile, GitHub Actions CI/CD, sample configs and .env.example for cloud/on-prem setups.
- Added Watsonx Granite provider stub with deterministic defaults and JSONL audit trail to simplify enterprise alignment.
- Added Langflow component placeholders (planner, executor, guard, Granite LLM) to prep for flow export/import commands.
- Added registry validation, sandbox profile starter, and documentation shells for security and guardrail mapping.
```
                       ┌──────────────────────────┐
                       │        Controller        │
                       │ (policy + correlation ID)│
                       └────────────┬─────────────┘
                                    │
                           plan(goal, registry)
                                    │
┌──────────────┐          ┌─────────▼─────────┐          ┌────────────────────┐
│ Registry/CFG │──sandbox▶│      Planner      │──steps──▶│   Executor/Tools   │
│ (Hydra/Dyn)  │          │ (ReAct/Plan&Exec) │          │ (LCEL, MCP, HTTP)  │
└──────────────┘          └─────────┬─────────┘          └─────────┬──────────┘
                                    │                              │
                        traces + memory writes          Langfuse/OpenInference
                                    │                              │
                            ┌───────▼────────┐                ┌─▼────────┐
                            │  Memory / RAG  │◀────context────│ Clients  │
                            │  (LlamaIndex)  │                │ (CLI/API)│
                            └────────────────┘                └──────────┘
```
For a role-by-role, mode-aware walkthrough of how the controller, planners, executors, and MCP tool packs fit together (plus configuration keys), see docs/agents/architecture.md. For an MCP + LangChain web stack overview that covers the FastAPI backend, Vue 3 frontend, streaming flows, and configuration surfaces, see docs/MCP_LANGCHAIN_OVERVIEW.md. A step-by-step stable local launch checklist (registry + sandbox + Langflow readiness) lives in docs/local_stable_launch.md.
📘 System Execution Narrative - Complete request → response flow for contributor onboarding (3 entry points: CLI/FastAPI/MCP, 8 execution phases with security boundaries, observability integration, debugging tips, testing guidance)
🔧 FastAPI Role Clarification - Defines FastAPI as transport layer only (HTTP/SSE, auth, budget enforcement) vs orchestration (planning, coordination, execution) to prevent mixing concerns
⚙️ Orchestrator Interface and Semantics - Formal specification for orchestrator API with lifecycle callbacks, failure taxonomy, retry semantics, execution context, routing authority, and implementation patterns
🏢 Enterprise Workflow Examples - End-to-end workflows for typical enterprise use cases (customer onboarding, incident response, data pipelines) with planning, error recovery, HITL gates, and external API automation
📊 Observability and Debugging Guide - Instrumentation guide covering structured logging, distributed tracing (OpenTelemetry/LangFuse/LangSmith), metrics collection, error introspection, replayable traces, dashboards, and troubleshooting playbooks
🧪 Test Coverage Map - Coverage map aligned with architectural components, showing what is tested (orchestrator 80%, routing 85%, failures 90%), where the critical gaps are (tools 30%, memory 20%, config 0%, observability 0%), and priorities for additional testing
👋 Developer Onboarding Guide - Step-by-step walkthrough for newcomers: environment setup (15 min), first agent interaction (10 min), create custom tool (20 min), build custom agent (30 min), wire components together (15 min) with full working examples (calculator tool, math tutor agent, tutoring workflow)
```bash
# 1) Install (Python >=3.10)
uv sync --all-extras --dev
uv run playwright install --with-deps chromium

# 2) Configure environment
cp .env.example .env
# set OPENAI_API_KEY / LANGFUSE_SECRET / etc inside .env

# 3) Run demo agent locally
uv run cuga start demo

# 4) Try modular stack example
uv run python examples/run_langgraph_demo.py --goal "triage a support ticket"
```

- Dependencies: `uv` (or `pip`), optional browsers for Playwright, optional vector DB service (Chroma/Weaviate/Qdrant/Milvus).
- Development: `uv sync --all-extras --dev` installs dev + optional extras (`memory`, `sandbox`, `groq`, etc.).
- Pre-commit: `uv run pre-commit install` then `uv run pre-commit run --all-files`.
- `.env.example` lists required variables for LLMs, tracing, and storage.
- `configs/` holds YAML/TOML profiles for agents, LangGraph graphs, memory backends, and observability.
- `registry.yaml` and `config/` house MCP/registry defaults; use `scripts/verify_guardrails.py` before shipping changes.
- Review AGENTS.md before altering planners, tools, or registry entries; it is the single source of truth for allowlists, sandbox expectations, budgets, and redaction.
- Guardrail and registry changes are enforced by CI: `scripts/verify_guardrails.py --base <branch>` collects diffs and fails if `README.md`, `PRODUCTION_READINESS.md`, `CHANGELOG.md`, or `todo1.md` are not updated alongside guardrail changes, or if `## vNext` lacks a guardrail note.
- Keep production checklists (`PRODUCTION_READINESS.md`) and security docs in sync with guardrail adjustments so downstream users understand the default policies and where to override them.
- Developer checklist: ensure registry entries declare sandboxes + `/workdir` pinning for exec scopes; budget/observability env keys (`AGENT_*`, `OTEL_*`, LangFuse/LangSmith, Traceloop) are wired; `docs/mcp/tiers.md` is regenerated from `docs/mcp/registry.yaml`; and new/updated tests exercise planner ranking, import guardrails, and registry hot-swap determinism.
- Planner: ReAct or Plan-and-Execute; emits steps with policy-aware cost/latency hints.
- Tool Executor: LCEL/LangChain tools, MCP adapters, HTTP/OpenAPI runners with sandboxed registry resolution.
- RAG/Data Agent: LlamaIndex loader+retriever (docs in `rag/`), vector memory connectors in `memory/`.
- Coordinator: CrewAI/AutoGen-like orchestrator for multi-agent hand-offs.
- Observer: Langfuse/OpenInference emitters with correlation IDs and redaction hooks.
See AGENTS.md for role details and USAGE.md for end-to-end flows.
- Drop documents into `rag/sources/` or configure a remote store.
- Choose a backend in `configs/memory.yaml` (chroma|qdrant|weaviate|milvus|local).
- Run `uv run python scripts/load_corpus.py --source rag/sources --backend chroma`.
- Query via `uv run python examples/rag_query.py --query "How do I add a new MCP tool?"`.
- `memory/` exposes `VectorMemory` (in-memory fallback), summarization hooks, and profile-scoped stores.
- State keys are namespaced by profile to preserve sandbox isolation.
- Persistence is opt-in; see `configs/memory.yaml` and `TESTING.md` for guidance.
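As a rough illustration of the in-memory fallback and the profile-scoped isolation described above, here is a toy vector store; the class name, method signatures, and cosine scoring are assumptions for illustration, not the actual `VectorMemory` API in `memory/`.

```python
import math
from collections import defaultdict

class InMemoryVectorStore:
    """Toy in-memory vector store, namespaced by profile (illustrative only)."""

    def __init__(self):
        # profile -> list of (vector, payload) pairs
        self._stores = defaultdict(list)

    def add(self, profile: str, vector: list[float], payload: str) -> None:
        self._stores[profile].append((vector, payload))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, profile: str, vector: list[float], top_k: int = 1) -> list[str]:
        # Only this profile's entries are searched -- no cross-profile leakage
        scored = [(self._cosine(vector, v), p) for v, p in self._stores[profile]]
        scored.sort(key=lambda sp: sp[0], reverse=True)
        return [payload for _, payload in scored[:top_k]]

store = InMemoryVectorStore()
store.add("demo", [1.0, 0.0], "MCP tool registration guide")
store.add("demo", [0.0, 1.0], "budget policy notes")
store.add("other", [1.0, 0.0], "should never surface for 'demo'")
print(store.query("demo", [0.9, 0.1]))  # best match within the 'demo' profile
```

Because lookups are keyed by profile, a query against `"demo"` can never return the `"other"` profile's payloads, which is the isolation property the bullet above describes.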
- Langfuse client is wired via `observability/langfuse.py` with sampling + PII redaction hooks.
- OpenInference/Traceloop emitters are optional and can be toggled per profile.
- Structured audit logs live under `logs/` when enabled; avoid committing artifacts.
- Watsonx Granite calls validate credentials up front and append JSONL audit rows with timestamp, actor, parameters, and outcome for offline review.
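A hedged sketch of what appending such a JSONL audit row could look like; the function name and the temp-file path are illustrative, not the repository's actual Watsonx audit hook.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def append_audit_row(path: str, actor: str, parameters: dict, outcome: str) -> dict:
    """Append one JSONL audit row: timestamp, actor, parameters, outcome."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "parameters": parameters,
        "outcome": outcome,
    }
    # One JSON object per line, append-only, so rows are replayable offline
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row

audit_path = os.path.join(tempfile.gettempdir(), "granite_audit.jsonl")
row = append_audit_row(
    audit_path,
    actor="planner",
    parameters={"model": "granite-4-h-small", "temperature": 0.0},
    outcome="success",
)
print(row["outcome"])
```

Append-only JSONL keeps each call auditable as a standalone record and is trivially greppable during offline review.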
- The FastAPI orchestrator exposes a Prometheus-compatible metrics endpoint at `/metrics` (default port 8000). It exports golden-signal metrics such as `cuga_requests_total`, `cuga_success_rate`, `cuga_latency_ms{percentile="p50|p95|p99"}`, `cuga_tool_error_rate`, `cuga_budget_warnings_total`, and `cuga_budget_exceeded_total`.
- Configure OpenTelemetry (OTLP) or console exporters via environment variables. Common envs:
  - `OTEL_EXPORTER_OTLP_ENDPOINT` — OTLP HTTP/gRPC endpoint for traces/metrics (optional; when unset the console exporter is used).
  - `OTEL_SERVICE_NAME` — service name to appear in traces (default: `cuga-orchestrator`).
  - `OTEL_TRACES_EXPORTER` / `OTEL_METRICS_EXPORTER` — exporter type (`otlp`, `logging`, `none`).
Example: curl the metrics endpoint locally
```bash
# If running the orchestrator locally on port 8000
curl -sS http://localhost:8000/metrics | head -n 80

# Expected sample lines (Prometheus format):
# cuga_requests_total 42
# cuga_success_rate 0.95
# cuga_latency_ms{percentile="p50"} 150.0
# cuga_latency_ms{percentile="p95"} 450.0
# cuga_tool_error_rate 0.02
# cuga_budget_warnings_total 3
# cuga_budget_exceeded_total 0
```

- `agents/` outlines planner/worker/tool-user patterns and how to register them with CrewAI/AutoGen.
- `examples/multi_agent_dispatch.py` demonstrates round-robin delegation with shared vector context.
- Hand-offs carry correlation IDs and redacted summaries, not raw prompts.
- Run `make lint test typecheck` locally.
- Pytest with coverage is configured (see `TESTING.md`).
- CI (GitHub Actions) runs lint, type-check, tests, and guardrail verification on pushes/PRs.
CUGAR Agent enforces security-first design with deny-by-default policies per AGENTS.md:
- Allowlist-First Tool Selection: Only explicitly allowed tools from `cuga.modular.tools.*` can execute
- Deny-by-Default Network: Network egress restricted to a domain allowlist; localhost/private networks blocked by default
- Sandbox Isolation: All tool execution runs in isolated sandboxes (py/node slim|full, orchestrator profiles) with read-only mounts
- Budget Enforcement: Cost ceilings (default: 100 units/task) with `warn` or `block` policies
- Human-in-the-Loop Approval: High-risk operations (DELETE, FINANCIAL) require explicit approval before execution
```
Request → Budget Guard → Tool Allowlist → Parameter Validation → Network Policy → Sandbox Execution
               ↓               ↓                    ↓                   ↓                 ↓
         (ceiling=100)   (cuga.modular     (type/range/pattern)  (domain allowlist)  (read-only)
                          .tools.* only)                         (no localhost)
```
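The pipeline above can be sketched as a chain of deny-by-default guards; the function names and request shape below are assumptions for illustration (parameter validation is elided for brevity), not the repository's actual API.

```python
# Hypothetical guard chain mirroring the validation pipeline; not the real API.
def budget_guard(req: dict) -> None:
    if req["cost"] > 100:  # default ceiling: 100 units/task
        raise PermissionError("budget exceeded")

def allowlist_guard(req: dict) -> None:
    if not req["tool"].startswith("cuga.modular.tools."):
        raise PermissionError("tool not on allowlist")

def network_guard(req: dict) -> None:
    if any(host in req.get("url", "") for host in ("localhost", "127.0.0.1")):
        raise PermissionError("localhost/private network blocked by default")

def run_pipeline(req: dict) -> str:
    # Deny-by-default: any guard may raise and abort the request
    for guard in (budget_guard, allowlist_guard, network_guard):
        guard(req)
    return "sandboxed-execution"  # read-only sandbox would execute here

result = run_pipeline({
    "cost": 5,
    "tool": "cuga.modular.tools.search",
    "url": "https://example.com/api",
})
print(result)
```

Ordering matters: the cheap budget check runs before allowlist and network checks, so over-budget requests are rejected without touching policy state.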
Approval Flow (HITL):
- Low-risk (READ): Auto-approved, logged
- Medium-risk (WRITE): Auto-approved with audit trail
- High-risk (DELETE, FINANCIAL): Requires human approval (5min timeout, reject on timeout)
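A minimal sketch of the risk-tier routing described above; the function name and return values are hypothetical, and the real HITL flow additionally enforces the 5-minute timeout and audit logging.

```python
def route_approval(action: str) -> str:
    """Map an action's risk tier to an approval decision (illustrative)."""
    high_risk = {"DELETE", "FINANCIAL"}
    medium_risk = {"WRITE"}
    if action in high_risk:
        return "await-human-approval"   # 5 min timeout; reject on timeout
    if action in medium_risk:
        return "auto-approved+audit"    # proceeds, but leaves an audit trail
    return "auto-approved"              # READ and other low-risk actions

print(route_approval("READ"), route_approval("WRITE"), route_approval("DELETE"))
```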
Budget Policy:
- `AGENT_BUDGET_CEILING=100` (default): Max cost units per task
- `AGENT_BUDGET_POLICY=warn|block`: Warn and continue, or block execution
- `AGENT_ESCALATION_MAX=2`: Max approval escalations before admin approval required
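A rough Python sketch of how these environment variables could drive budget decisions; `check_budget` is a hypothetical helper for illustration, not the repository's enforcement code.

```python
import os

def check_budget(cost_units: float) -> str:
    """Apply AGENT_BUDGET_CEILING / AGENT_BUDGET_POLICY semantics (sketch)."""
    ceiling = float(os.environ.get("AGENT_BUDGET_CEILING", "100"))
    policy = os.environ.get("AGENT_BUDGET_POLICY", "warn")
    if cost_units <= ceiling:
        return "ok"
    if policy == "block":
        raise RuntimeError(f"budget exceeded: {cost_units} > {ceiling}")
    return "warn"  # warn policy: emit a warning and continue

print(check_budget(50))   # within ceiling
print(check_budget(150))  # over ceiling with the default 'warn' policy
```

With `AGENT_BUDGET_POLICY=block`, the same over-ceiling call would raise instead of warning, which matches the warn/block distinction in the list above.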
See SECURITY.md for complete security controls and docs/security/GOVERNANCE.md for governance architecture.
Governance builds on the same deny-by-default posture, per AGENTS.md § 4 Sandbox Expectations:
- Policy Gates: HITL approval points for WRITE/DELETE/FINANCIAL actions (Slack send, file delete, stock orders)
- Per-Tenant Capability Maps: 8 organizational roles (marketing/trading/engineering/support) with tool allowlists/denylists
- Runtime Health Checks: Tool discovery ping, schema drift detection, cache TTLs to prevent huge cold-start lists
- Layered Access Control: Tool registration → Tenant map → Tool-level restrictions → Rate limits
See docs/security/GOVERNANCE.md for complete governance architecture, configuration files, and integration patterns.
- No eval/exec: All `eval()` and `exec()` calls eliminated from production code paths
- AST-based expression evaluation: Use `safe_eval_expression()` from `cuga.backend.tools_env.code_sandbox.safe_eval` for mathematical expressions
  - Allowlisted operators: Add/Sub/Mul/Div/FloorDiv/Mod/Pow
  - Allowlisted functions: math.sin/cos/tan/sqrt/log/exp, abs/round/min/max/sum
  - Denies: assignments, imports, attribute access, eval/exec/import
- SafeCodeExecutor: All code execution routed through `SafeCodeExecutor` or `safe_execute_code()` from `cuga.backend.tools_env.code_sandbox.safe_exec`
  - Import allowlist: Only `cuga.modular.tools.*` permitted
  - Import denylist: os/sys/subprocess/socket/pickle/eval/exec/compile
  - Restricted builtins: Safe operations (math/types/iteration) allowed; eval/exec/open/import denied
  - Filesystem deny-default: No file operations unless explicitly allowed
  - Timeout enforcement: 30s default, configurable
  - Audit trail: All imports/executions logged with trace_id
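A minimal AST-walking evaluator in the spirit of the rules above; this is a sketch under stated assumptions, not the actual `safe_eval_expression()` implementation, and only a subset of the allowlisted functions is wired up.

```python
import ast
import math
import operator

# Allowlisted binary/unary operators (sketch)
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod, ast.Pow: operator.pow,
    ast.USub: operator.neg, ast.UAdd: operator.pos,
}
# Allowlisted functions (subset for illustration)
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "abs": abs, "round": round,
          "min": min, "max": max}

def safe_eval(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in _FUNCS:
            return _FUNCS[node.func.id](*[walk(a) for a in node.args])
        # Anything else (names, attributes, assignments, imports) is denied
        raise ValueError(f"disallowed expression node: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("2 ** 10 + sqrt(16)"))  # 1028.0
```

Because the walker only recognizes constants, allowlisted operators, and allowlisted calls, attempts like `__import__('os')` fail at the final `raise` rather than executing.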
- SafeClient wrapper: All HTTP requests MUST use `SafeClient` from `cuga.security.http_client`
  - Enforced timeouts: 10.0s read, 5.0s connect, 10.0s write, 10.0s total
  - Automatic retry: Exponential backoff (4 attempts max, 8s max wait)
  - URL redaction: Query params and credentials stripped from logs
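Two of these behaviors, URL redaction and the exponential backoff schedule, can be sketched with the standard library; the helpers below are illustrative, not the `SafeClient` implementation.

```python
from urllib.parse import urlsplit, urlunsplit

def redact_url(url: str) -> str:
    """Strip query params and userinfo credentials before logging (sketch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    # Rebuild without userinfo, query string, or fragment
    return urlunsplit((parts.scheme, host, parts.path, "", ""))

def backoff_waits(attempts: int = 4, max_wait: float = 8.0) -> list[float]:
    """Exponential backoff schedule: 1s, 2s, 4s, ... capped at max_wait."""
    return [min(2.0 ** n, max_wait) for n in range(attempts)]

print(redact_url("https://user:hunter2@api.example.com/v1/run?api_key=abc123"))
print(backoff_waits())
```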
- Env-only secrets: Credentials MUST be loaded from environment variables
  - CI enforces `.env.example` parity validation (no missing keys)
  - Secret scanning: trufflehog + gitleaks on every push/PR
  - Hardcoded API keys/tokens trigger CI failure
- Import restrictions: Dynamic imports limited to the `cuga.modular.tools.*` namespace only
- Profile isolation: Memory and tool access namespaced per profile; no cross-profile leakage
- Sandbox profiles: All registry entries declare a sandbox profile (py/node slim|full, orchestrator)
- Read-only defaults: Mounts are read-only by default; `/workdir` pinning for exec scopes
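An import gate following these rules might look like the following sketch; this is a hypothetical helper, not the repository's actual import hook.

```python
ALLOWED_PREFIX = "cuga.modular.tools."
DENYLIST = {"os", "sys", "subprocess", "socket", "pickle"}

def check_import(module: str) -> bool:
    """Deny-by-default import gate (sketch of the guardrail, not the real hook)."""
    root = module.split(".")[0]
    if root in DENYLIST:            # explicit denylist wins outright
        return False
    return module.startswith(ALLOWED_PREFIX)  # everything else must match the allowlist

print(check_import("cuga.modular.tools.search"))  # True
print(check_import("subprocess"))                 # False
```

Note the deny-by-default shape: a module that is neither denylisted nor under the allowed prefix is still rejected.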
See AGENTS.md for complete guardrail specifications and docs/security/ for detailed security controls.
CUGAR Agent provides production-grade observability with structured events, golden signals, and multi-backend export:
- Structured Events: `plan_created`, `route_decision`, `tool_call_start/complete/error`, `budget_warning/exceeded`, `approval_requested/received/timeout`
- Golden Signals: Success rate (%), latency (P50/P95/P99), tool error rate (%), mean steps/task, approval wait time, budget utilization
- Trace Propagation: `trace_id` flows through CLI → planner → worker → coordinator → tools with parent-child relationships
- PII Redaction: Auto-redact sensitive keys (`secret`, `token`, `password`, `api_key`, `credential`, `auth`) before emission
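A minimal sketch of key-based PII redaction as described above; the helper below uses naive substring matching on key names and is illustrative, not the repository's redaction hook.

```python
SENSITIVE = ("secret", "token", "password", "api_key", "credential", "auth")

def redact_event(event: dict) -> dict:
    """Replace values of sensitive keys before an event is emitted (sketch)."""
    out = {}
    for key, value in event.items():
        if isinstance(value, dict):
            out[key] = redact_event(value)  # recurse into nested payloads
        elif any(marker in key.lower() for marker in SENSITIVE):
            out[key] = "[REDACTED]"
        else:
            out[key] = value
    return out

event = {"event": "tool_call_start", "trace_id": "abc123",
         "params": {"url": "https://example.com", "api_key": "sk-live-xyz"}}
print(redact_event(event))
```

Substring matching is deliberately aggressive (it would also catch `authorization`), which errs on the side of over-redaction before emission.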
```bash
# Prometheus metrics endpoint (scrape target)
curl http://localhost:8000/metrics

# Expected metrics:
cuga_requests_total                    # Total requests handled
cuga_success_rate                      # % successful requests
cuga_latency_ms{percentile}            # P50/P95/P99 latency
cuga_tool_error_rate                   # % failed tool calls
cuga_steps_per_task                    # Mean planning steps
cuga_budget_warnings_total             # Budget warnings emitted
cuga_budget_exceeded_total             # Budget hard blocks
cuga_approval_requests_total{status}   # Approval flow tracking
```

- OpenTelemetry (OTLP): Set `OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318` for Jaeger/Zipkin/Tempo
- LangFuse: Set `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` for LLM tracing
- LangSmith: Set `LANGCHAIN_API_KEY`, `LANGCHAIN_PROJECT`, `LANGCHAIN_ENDPOINT`
- Console (Default): Offline-first JSON logs to stdout (no network required)
Import the pre-built dashboard from `observability/grafana_dashboard.json`:
- Request rate & success rate panels
- Latency percentile charts (P50/P95/P99)
- Tool error breakdown by tool/type
- Budget utilization gauge
- Approval queue depth
- Event timeline with filtering
Configuration:
```bash
# Enable OTEL export
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="cuga-orchestrator"

# Start with observability
uv run cuga start demo
```

See `docs/observability/OBSERVABILITY_GUIDE.md` for the detailed instrumentation guide and `PRODUCTION_READINESS.md` for metrics scraping setup.
- Which LLMs are supported?
- OpenAI (GPT-4o, GPT-4 Turbo)
- Azure OpenAI
- Anthropic (Claude 3.5 Sonnet, Opus, Haiku)
- IBM Watsonx / Granite 4.0 (granite-4-h-small, granite-4-h-micro, granite-4-h-tiny) — Default provider with deterministic temperature=0.0
- Groq (Mixtral)
- Google GenAI
- Any LangChain-compatible model via adapters
- Do I need a vector DB? Not for quickstarts; an in-memory store is bundled. For production, use Chroma/Qdrant/Weaviate/Milvus.
- How do I add a new tool? Implement a `ToolSpec` in `tools/registry.py` or wrap an MCP server; see `USAGE.md`.
- Is this production-ready? The core stack follows a sandboxed, profile-scoped design with observability; harden configs before internet-facing use.
- How do I configure Watsonx/Granite? Set the environment variables `WATSONX_API_KEY`, `WATSONX_PROJECT_ID`, and optionally `WATSONX_URL`. See `docs/configuration/ENVIRONMENT_MODES.md` for details.
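To make the tool-registration answer concrete, here is a hypothetical sketch of what a `ToolSpec` and registry entry could look like; the real fields are defined in `tools/registry.py` and may differ.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape of a ToolSpec; the actual fields live in tools/registry.py.
@dataclass
class ToolSpec:
    name: str                      # must sit under the cuga.modular.tools.* allowlist
    description: str
    handler: Callable[..., str]
    sandbox_profile: str = "py-slim"  # assumed default sandbox profile
    cost_units: int = 1               # feeds the budget guard

REGISTRY: dict[str, ToolSpec] = {}

def register(spec: ToolSpec) -> None:
    REGISTRY[spec.name] = spec

register(ToolSpec(
    name="cuga.modular.tools.echo",
    description="Return its input unchanged (demo tool)",
    handler=lambda text: text,
))
print(REGISTRY["cuga.modular.tools.echo"].handler("hello"))
```

Keeping the sandbox profile and cost on the spec itself is what lets the registry validation and budget guard reason about a tool before it ever runs.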
For a complete understanding of system execution flow:
- 📘 System Execution Narrative - Complete request → response flow for contributor onboarding (CLI/FastAPI/MCP modes, routing, agents, memory, tools)
- 🏗️ Architecture - High-level design overview
- 🚀 Quick Start - Get up and running quickly
- 🤝 Contributing - How to contribute to the project
- 🔒 Production Readiness - Deployment considerations
- Streaming-first ReAct policies with beta support for Strands/semantic state machines.
- Built-in eval harness for self-play and regression suites.
- Optional LangServe or FastAPI hosting for SaaS-style deployments (see `ROADMAP.md`).
Apache 2.0. See LICENSE.
