This checklist ensures the CUGAR Agent system is hardened, documented, and version-controlled for production release.
- Supported Python Versions: `>=3.10, <3.14`
- Tested Versions: 3.10, 3.11, 3.12, 3.13
- Recommended for Production: Python 3.11 or 3.12
- Version Enforcement: Defined in the `requires-python` field of `pyproject.toml`
- CI Matrix: GitHub Actions tests across Python 3.10–3.12 on every push/PR
- Python 3.10: Minimum supported version (union types, match statements)
- Python 3.11: Recommended (performance improvements, better error messages)
- Python 3.12: Fully supported (improved typing, faster execution)
- Python 3.13: Supported with latest dependencies
- Python 3.14+: Not yet supported (dependency constraints)
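The supported window above can be enforced at startup as well as in packaging metadata. A minimal sketch, assuming a hypothetical `check_python_version` helper that mirrors the `requires-python` constraint (`>=3.10, <3.14`):

```python
import sys

# Supported window mirrors requires-python in pyproject.toml: >=3.10, <3.14
MIN_VERSION = (3, 10)
MAX_EXCLUSIVE = (3, 14)

def check_python_version(version_info=sys.version_info):
    """Return True if the interpreter falls inside the supported window."""
    current = (version_info[0], version_info[1])
    return MIN_VERSION <= current < MAX_EXCLUSIVE
```

Calling this at process start (and failing fast with a clear message) catches unsupported interpreters before any dependency import errors surface.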
- Kubernetes: v1.24+ (required for production deployments)
- Minimum 3 worker nodes
- Resource quotas per namespace
- Pod security policies enforced
- Network policies for service isolation
- Persistent volume claims for stateful components
- PostgreSQL: v14+ for persistent storage
- Minimum 2 cores, 4GB RAM
- Replication enabled (primary + 1 replica minimum)
- Regular backups configured (see Disaster Recovery)
- Redis: v7.0+ for caching and session management
- Minimum 1 core, 2GB RAM
- Persistence enabled (AOF + RDB)
- Replication for high availability
- Message Queue (optional but recommended): RabbitMQ or Kafka
- For async task processing
- Minimum 2 cores, 4GB RAM
- Headless Chromium: Required for web automation tasks
  - Installed via Playwright: `uv run playwright install --with-deps chromium`
  - Container image: `mcr.microsoft.com/playwright:v1.49.0-jammy`
  - Minimum 2 cores, 4GB RAM per browser instance
  - Sandboxed execution environment
- Playwright: v1.49.0 (pinned in `pyproject.toml`)
  - System dependencies automatically installed
  - Supports headless and headed modes
  - Screenshot and PDF generation capabilities
  - Network interception and request mocking
- CPU: Minimum 2 cores, Recommended 4 cores
- Memory: Minimum 4GB, Recommended 8GB
- Disk: Minimum 20GB, Recommended 50GB (includes model cache, logs, temp files)
- Network: Low latency to model APIs (<100ms recommended)
- Horizontal scaling supported via replica count
- Load balancer required for multi-instance deployments
- Shared storage for tool registry cache and embeddings
- Session affinity recommended for stateful workflows
- `AGENT_BUDGET_CEILING`: Default `100` cost units per task
  - Tracks cumulative tool execution costs
  - Enforced before tool execution begins
  - Configurable per profile/tenant in `configurations/profiles/`
- `AGENT_ESCALATION_MAX`: Default `2` escalations allowed
  - Escalations triggered when budget warnings occur
  - Each escalation requires approval (if HITL enabled)
  - After max escalations, tasks are blocked or require admin approval
- `AGENT_BUDGET_POLICY`: Default `warn` (options: `warn`, `block`)
  - `warn`: Log budget exceedance and emit observability event, allow continuation
  - `block`: Hard stop execution when budget ceiling is reached
| Tool Category | Cost per Call | Latency Weight |
|---|---|---|
| Filesystem Read | 1.0 | 1.0 |
| Web Search | 5.0 | 2.5 |
| Code Execution | 3.0 | 1.5 |
| LLM Call (small) | 10.0 | 3.0 |
| LLM Call (large) | 50.0 | 5.0 |
| Database Query | 2.0 | 1.2 |
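The ceiling and policy semantics above can be sketched against the cost table. This is an illustrative model only (the `BudgetTracker` class and `TOOL_COSTS` mapping are hypothetical, not the actual CUGA implementation); the costs mirror the table in this document:

```python
# Illustrative budget enforcement sketch: costs come from the table above,
# policies match AGENT_BUDGET_POLICY (warn = record and continue,
# block = hard stop before the call executes).

TOOL_COSTS = {
    "filesystem_read": 1.0,
    "web_search": 5.0,
    "code_execution": 3.0,
    "llm_call_small": 10.0,
    "llm_call_large": 50.0,
    "database_query": 2.0,
}

class BudgetExceeded(Exception):
    pass

class BudgetTracker:
    def __init__(self, ceiling=100.0, policy="warn"):
        self.ceiling = ceiling
        self.policy = policy   # "warn" or "block"
        self.spent = 0.0
        self.warnings = 0

    def charge(self, tool_category):
        """Check the budget before a tool call; charge its cost if allowed."""
        cost = TOOL_COSTS[tool_category]
        if self.spent + cost > self.ceiling:
            if self.policy == "block":
                raise BudgetExceeded(f"ceiling {self.ceiling} would be exceeded")
            self.warnings += 1   # warn policy: record and continue
        self.spent += cost
        return self.spent
```

Under `block`, the check happens before execution, so the costly call never runs; under `warn`, the same exceedance only increments a warning counter that feeds the observability metrics described later.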
- Trigger Conditions:
  - High-impact operations (write, delete, modify)
  - Budget escalations beyond threshold
  - Access to sensitive scopes (exec, db, finance)
  - Tool combinations flagged as risky
- Approval Flow:
  - Agent emits `approval_requested` event with operation details
  - Notification sent to configured approvers (Slack, email, webhook)
  - Approver reviews operation context and trace_id
  - Approval decision recorded in audit trail
  - Agent proceeds or rejects based on decision
- Approval Timeout:
  - Default: 5 minutes for operations, 10 minutes for budget escalations
  - Fallback action: `reject` (configurable to `allow` in dev profiles)
- Approver Roles (defined in `routing/guards.yaml`):
  - `operator`: Day-to-day operations
  - `admin`: System-level changes
  - `finance_admin`: Budget escalations
  - `platform_admin`: Infrastructure changes
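The approval flow and timeout fallback above can be sketched as a small gate. The `ApprovalGate` class and its polling callback are illustrative assumptions, not the actual CUGA API:

```python
# Hedged sketch of the HITL approval gate: request approval, wait up to a
# timeout, then apply the configured fallback ("reject" by default,
# "allow" in dev profiles). Every decision lands in an audit trail.
import time

class ApprovalGate:
    def __init__(self, timeout_s=300, fallback="reject"):
        self.timeout_s = timeout_s
        self.fallback = fallback     # "reject" (default) or "allow" in dev
        self.audit_trail = []

    def request(self, operation, poll_decision, poll_interval_s=0.01):
        """poll_decision() returns "approved", "rejected", or None (pending)."""
        deadline = time.monotonic() + self.timeout_s
        decision = None
        while time.monotonic() < deadline:
            decision = poll_decision()
            if decision is not None:
                break
            time.sleep(poll_interval_s)
        if decision is None:   # timed out: apply the fallback action
            decision = "approved" if self.fallback == "allow" else "rejected"
        self.audit_trail.append((operation, decision))
        return decision == "approved"
```

The key property to preserve in any real implementation is that a timeout is recorded in the audit trail just like an explicit decision, so unanswered requests are visible after the fact.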
- Purpose: Automatic delay before high-risk operations execute
- Default Delay: 30 seconds for budget warnings
- Use Case: Allows monitoring systems to raise alerts before action completes
- Auto-Approval: Configurable (default: `false`, requires explicit approval)
- Request Rate Limiting: 100 requests/minute per user (default)
- Concurrent Executions: Max 5 concurrent tasks per agent instance
- Token Limits: Model-specific (e.g., GPT-4: 128k context, output capped at 4k)
- Disk Quota: 1GB per user workspace
- Memory Quota: 512MB per tool execution sandbox
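The per-user request limit above (100 requests/minute) can be sketched as a sliding-window counter. This is illustrative only; the production limiter presumably lives at the gateway, and the class name is hypothetical:

```python
# Sliding-window rate limiter sketch: allow at most max_requests per user
# within any trailing window_s seconds.
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, max_requests=100, window_s=60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self.events = {}   # user_id -> deque of request timestamps

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.events.setdefault(user_id, deque())
        while q and now - q[0] >= self.window_s:   # evict expired entries
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```

A sliding window avoids the burst-at-boundary artifact of fixed windows (e.g., 100 requests at 0:59 and 100 more at 1:01 both passing).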
- Golden Signals:
  - `budget_utilization`: % of ceiling used per task
  - `budget_warnings_total`: Count of warnings emitted
  - `budget_exceeded_total`: Count of hard blocks
  - `approval_wait_time`: P50/P95/P99 latency for human approvals
- Prometheus Metrics:

  ```text
  cuga_budget_warnings_total{profile="production",policy="allowlist_hitl"}
  cuga_budget_exceeded_total{profile="production",policy="allowlist_only"}
  cuga_approval_requests_total{approver_role="operator",status="approved"}
  cuga_approval_wait_ms{percentile="p95"}
  ```

- Alerts (example Prometheus rules):

  ```yaml
  - alert: HighBudgetUtilization
    expr: cuga_budget_utilization > 0.9
    for: 5m
    annotations:
      summary: "Budget utilization above 90% for {{ $labels.profile }}"
  - alert: ApprovalQueueBacklog
    expr: cuga_approval_requests_total{status="pending"} > 10
    for: 2m
    annotations:
      summary: "Approval queue has {{ $value }} pending requests"
  ```
- What to Back Up:
  - User conversation history
  - Memory embeddings metadata
  - Audit trail logs
  - Agent configuration snapshots
- Backup Frequency:
  - Full backup: Daily at 02:00 UTC
  - Incremental backup: Every 6 hours
  - Transaction logs: Continuous (WAL archiving)
- Backup Procedure:

  ```bash
  # Full backup with pg_dump
  pg_dump -h $POSTGRES_HOST -U $POSTGRES_USER -F c -f backup_$(date +%Y%m%d).dump cuga_db

  # Incremental with WAL archiving; configure in postgresql.conf:
  #   wal_level = replica
  #   archive_mode = on
  #   archive_command = 'cp %p /backup/wal_archive/%f'
  ```
- Retention Policy:
  - Daily backups: 30 days
  - Weekly backups: 90 days
  - Monthly backups: 1 year
- What to Back Up:
  - Session tokens and state
  - Tool registry cache
  - Rate limiting counters
  - Temporary computation results
- Backup Frequency:
  - RDB snapshots: Every hour
  - AOF persistence: Enabled with `everysec` fsync policy
- Backup Procedure:

  ```bash
  # Trigger manual RDB snapshot
  redis-cli -h $REDIS_HOST BGSAVE

  # Copy RDB and AOF files
  cp /var/lib/redis/dump.rdb /backup/redis/dump_$(date +%Y%m%d_%H%M).rdb
  cp /var/lib/redis/appendonly.aof /backup/redis/appendonly_$(date +%Y%m%d_%H%M).aof
  ```
- Retention Policy:
  - Hourly snapshots: 48 hours
  - Daily snapshots: 7 days
  - No long-term retention (cache is ephemeral)
- What to Back Up:
  - `docs/mcp/registry.yaml` (canonical source)
  - Compiled registry indexes
  - Tool capability metadata
  - Sandbox policy configurations
- Backup Frequency:
  - On every registry update (version controlled)
  - Automated snapshots: Daily
- Backup Procedure:

  ```bash
  # Version-controlled backup (primary method)
  git commit -m "registry: update tool definitions" docs/mcp/registry.yaml
  git push origin main

  # Filesystem backup (secondary)
  tar -czf registry_backup_$(date +%Y%m%d).tar.gz docs/mcp/registry.yaml config/ configurations/
  ```
- What to Back Up:
  - Memory embeddings (Milvus/FAISS/Chroma collections)
  - User conversation vectors
  - RAG knowledge base indexes
- Backup Frequency:
  - Incremental: Every 12 hours
  - Full snapshot: Weekly
- Backup Procedure (example for Milvus):

  ```bash
  # Milvus backup via API (double-quoted payload so $(date ...) expands;
  # it would be passed literally inside single quotes)
  curl -X POST "http://milvus:9091/api/v1/backup/create" \
    -d "{\"collection_name\":\"user_memory\",\"backup_name\":\"mem_$(date +%Y%m%d)\"}"
  ```
PostgreSQL Restore:

```bash
# Stop dependent services
kubectl scale deployment cuga-agent --replicas=0

# Restore from backup
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d cuga_db -c backup_20260101.dump

# Verify data integrity
psql -h $POSTGRES_HOST -U $POSTGRES_USER -d cuga_db -c "SELECT COUNT(*) FROM conversations;"

# Restart services
kubectl scale deployment cuga-agent --replicas=3
```

Redis Restore:

```bash
# Stop Redis
kubectl scale statefulset redis --replicas=0

# Replace RDB/AOF files
kubectl cp dump_20260101_0200.rdb redis-0:/data/dump.rdb
kubectl cp appendonly_20260101_0200.aof redis-0:/data/appendonly.aof

# Restart Redis
kubectl scale statefulset redis --replicas=1

# Verify cache
redis-cli -h redis PING
```

Tool Registry Restore:

```bash
# Restore from git (preferred)
git checkout v1.0.0  # or specific commit
git pull origin main

# Or restore from filesystem backup
tar -xzf registry_backup_20260101.tar.gz -C /

# Reload registry in running agents (if hot-reload supported)
curl -X POST http://cuga-agent:8000/api/v1/registry/reload
```

| Component | RTO | RPO | Priority |
|---|---|---|---|
| PostgreSQL | < 1 hour | < 6 hours | Critical |
| Redis | < 15 minutes | < 1 hour | High |
| Tool Registry | < 5 minutes | 0 (version controlled) | Critical |
| Embeddings | < 2 hours | < 12 hours | Medium |
| Application Pods | < 10 minutes | 0 (stateless) | High |
- Stop all agent instances
- Restore PostgreSQL from most recent clean backup
- Replay WAL logs up to corruption point
- Run integrity checks (`pg_verify`)
- Restart agents and verify operation
- Identify discrepancy via `scripts/verify_guardrails.py`
- Restore canonical registry from git history
- Regenerate `docs/mcp/tiers.md` via `build/gen_tiers_table.py`
- Hot-reload registry in agents (or rolling restart)
- Verify tool resolution via test queries
- Provision new K8s cluster
- Restore PostgreSQL from off-site backup
- Restore Redis snapshots (optional, can rebuild cache)
- Deploy agent manifests via Helm/Kustomize
- Restore tool registry from git
- Verify all health checks pass
- Resume traffic via DNS/load balancer update
- Backup Health Checks:
  - `backup_success_total` metric per component
  - Alert if no successful backup in 25 hours (daily + margin)
- Replication Lag:
  - PostgreSQL: `pg_stat_replication.replay_lag` < 1 minute
  - Redis: `info replication | grep lag` < 5 seconds
- Backup Size Anomalies:
  - Alert if backup size deviates >20% from 7-day average
  - May indicate data corruption or schema changes
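The size-anomaly rule above amounts to a single comparison against a trailing average. A minimal sketch (the helper name is illustrative, not the production monitoring code):

```python
# Flag a backup whose size deviates more than 20% from the trailing
# 7-day average, per the anomaly rule above.
def backup_size_anomalous(sizes_last_7_days, new_size, threshold=0.20):
    """sizes_last_7_days: list of recent backup sizes in bytes."""
    if not sizes_last_7_days:
        return False                     # no baseline yet
    avg = sum(sizes_last_7_days) / len(sizes_last_7_days)
    return abs(new_size - avg) / avg > threshold
```

In practice this would feed the alerting pipeline rather than run standalone, but the deviation check itself is this simple.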
- `/src/` has modular structure (`agents`, `mcp`, `tools`, `config`)
- `/docs/` contains architecture, security, integration, and tooling references
- `/config/` stores Hydra-composed registry defaults and fragment overrides with inheritance markers
- `/tests/` directory exists with core coverage
- `/examples/` directory demonstrates agent usage
- `.env.example` included and redacted
- `USE_EMBEDDED_ASSETS` feature flag documented
- Watsonx Granite provider validates env-based credentials (`WATSONX_API_KEY`, `WATSONX_PROJECT_ID`, `WATSONX_URL`) with deterministic defaults and audit logging
- No hardcoded keys or tokens
- Secrets validated before use (`assert key`)
- `detect-secrets` baseline committed
- `SECURITY.md` defines CI & runtime rules
- HTTP Client Hardening: All HTTP requests use the `SafeClient` wrapper with enforced timeouts (10.0s), automatic retry (4 attempts, exponential backoff), and URL redaction in logs. No raw httpx/requests/urllib usage.
- Secrets Management: Env-only credential enforcement via the `cuga.security.secrets` module. `.env.example` parity validated in CI (no missing keys). `SECRET_SCANNER=on` runs trufflehog + gitleaks on every push/PR. Hardcoded API keys/tokens/passwords trigger CI failure.
- Mode-Specific Validation: Startup validation enforces required env vars per mode: LOCAL (model API key), SERVICE (AGENT_TOKEN + budget + model key), MCP (servers file + profile + model key), TEST (no requirements).
- `Makefile` and `Dockerfile` tested
- `uv` workflows for stability & asset builds
- Embedded asset pipeline (`build_embedded.py`) verified
- Compression ratios documented
- `AGENTS.md` – entrypoint for all contributors
- `AGENT-CORE.md` – agent lifecycle, pipeline
- `TOOLS.md` – structure, schema, usage
- `MCP_INTEGRATION.md` – tool bus and lifecycle
- `REGISTRY_MERGE.md` – Hydra-based registry fragment handling and enablement rules
- `SECURITY.md` – production secret handling
- `EMBEDDED_ASSETS.md` – compression and distribution
- Registry entries declare sandbox profile (`py`/`node` `slim|full`, `orchestrator`) with `/workdir` pinning for exec scopes and read-only defaults
- Budget and observability env keys (`AGENT_*`, `OTEL_*`, `LANGFUSE_*`, `OPENINFERENCE_*`, `TRACELOOP_*`) wired with default `warn` budget policy and ceiling/escalation caps
- `docs/mcp/registry.yaml` kept in sync with generated `docs/mcp/tiers.md`; hot-swap reload path tested and deterministic ordering verified
- Guardrail updates accompanied by README/CHANGELOG/todo1 updates and `scripts/verify_guardrails.py --base <ref>` runs
- Core modules tested (`controller`, `planner`, `executor`)
- `run_stability_tests.py` executed cleanly
- Functional test coverage ≥ 80% orchestrator & routing, ≥ 60% tools/memory (📍 target, see TESTING.md)
- Lint passes (`ruff`, `pre-commit`, CI)
- Legacy agent versions isolated or removed
- `VERSION.txt` present → `1.0.0`
- `CHANGELOG.md` documents all v1.0.0 features
- Git tag proposed:

  ```bash
  git tag -a v1.0.0 -m "Initial production release"
  git push origin v1.0.0
  ```
/src/has modular structure (agents,mcp,tools,config) -
/docs/contains architecture, security, integration, and tooling references -
/config/stores Hydra-composed registry defaults and fragment overrides with inheritance markers -
/tests/directory exists with core coverage -
/examples/directory demonstrates agent usage
Per AGENTS.md § Registry Hygiene: No :latest tags in production.
- Enforcement: CI checks block `:latest` tags in `ops/docker-compose.proposed.yaml`
- Pinning Strategy:
  - Use semantic versioning tags (e.g., `v0.2.0`, `v1.0.0`)
  - Pin to specific digests after validation: `image@sha256:abc123...`
- Current Pinned Versions:
  - `cuga/orchestrator:v0.2.0`
  - `fastmcp/filesystem:v1.0.0`
  - `fastmcp/git:v2.0.0`
  - `fastmcp/browser:v1.0.0`
  - `otel/opentelemetry-collector:0.91.0`
  - `qdrant/qdrant:v1.7.0`
  - `ollama/ollama:0.1.17`
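The CI check described above can be sketched as a simple scan of a compose file for untagged or `:latest` image references. This is a text-based sketch rather than a full YAML parse, and the function name is illustrative:

```python
# Scan a docker-compose file's image references and report any that are
# untagged or pinned to :latest. Digest-pinned refs (image@sha256:...)
# are considered safe.
import re

IMAGE_RE = re.compile(r"^\s*image:\s*(\S+)", re.MULTILINE)

def find_unpinned_images(compose_text):
    """Return image refs that are untagged or use the :latest tag."""
    bad = []
    for ref in IMAGE_RE.findall(compose_text):
        if "@" in ref:
            continue                          # digest-pinned: fine
        tag = ref.rsplit(":", 1)[1] if ":" in ref else None
        if tag is None or tag == "latest":
            bad.append(ref)
    return bad
```

A CI step would run this against `ops/docker-compose.proposed.yaml` and fail the build if the returned list is non-empty.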
All K8s manifests are located in `ops/k8s/`:

- `namespace.yaml`: Namespace, quotas, resource limits, PVCs
- `orchestrator-deployment.yaml`: Orchestrator with health checks, HPA, resource limits
- `mcp-services-deployment.yaml`: MCP Tier 1 & Tier 2 services
- `configmaps.yaml`: Config data (settings, registry, OTEL collector config)
- `secrets.yaml`: Secret templates (DO NOT commit actual secrets)
- `README.md`: Complete deployment guide
Quick Deploy:

```bash
cd ops/k8s/
kubectl apply -f namespace.yaml
kubectl apply -f configmaps.yaml
kubectl create secret generic cuga-orchestrator-secrets \
  --from-env-file=../env/orchestrator.env --namespace=cugar
kubectl apply -f orchestrator-deployment.yaml
kubectl apply -f mcp-services-deployment.yaml
```

Orchestrator deployment uses a RollingUpdate strategy with `maxUnavailable: 0`:
```bash
# Update to new version
kubectl set image deployment/cuga-orchestrator -n cugar \
  orchestrator=cuga/orchestrator:v0.3.0

# Watch rollout progress
kubectl rollout status deployment/cuga-orchestrator -n cugar

# Monitor health during rollout
watch kubectl get pods -n cugar -l app=cuga-orchestrator
```

Health Check Gates:

- Startup Probe: 60s grace period for initialization
- Readiness Probe: Pod must pass `/health` check before receiving traffic
- Liveness Probe: Pod restarted if `/health` fails after 3 attempts (30s interval)
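The probe gates above could look roughly like the following pod-spec fragment. This is a sketch under assumptions: the `/health` path and port 8000 come from elsewhere in this document, and the startup probe's period is an illustrative choice that yields the 60s grace window.

```yaml
# Illustrative probe block for the orchestrator container; thresholds mirror
# the gates above (60s startup grace; liveness: 3 failures at 30s intervals).
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 12     # 12 x 5s = 60s grace period
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  failureThreshold: 3
```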
Fast Rollback (Automated):

```bash
# Rollback to previous version (takes ~30s)
kubectl rollout undo deployment/cuga-orchestrator -n cugar

# Verify rollback
kubectl rollout status deployment/cuga-orchestrator -n cugar
kubectl get pods -n cugar -l app=cuga-orchestrator -o wide
```

Rollback to Specific Revision:

```bash
# View deployment history
kubectl rollout history deployment/cuga-orchestrator -n cugar

# Rollback to revision N
kubectl rollout undo deployment/cuga-orchestrator -n cugar --to-revision=3
```

Rollback Verification:

```bash
# Check logs for errors
kubectl logs -n cugar deploy/cuga-orchestrator --tail=50

# Test health endpoint
kubectl exec -n cugar deploy/cuga-orchestrator -- curl http://localhost:8000/health

# Check metrics
kubectl exec -n cugar deploy/cuga-orchestrator -- curl http://localhost:8000/metrics | grep cuga_requests_total
```

Fast Rollback:
```bash
cd ops/

# Revert docker-compose.proposed.yaml to previous version
git checkout HEAD~1 -- docker-compose.proposed.yaml

# Restart services
docker-compose -f docker-compose.proposed.yaml down
docker-compose -f docker-compose.proposed.yaml up -d

# Verify health
docker-compose -f docker-compose.proposed.yaml ps
curl -f http://localhost:8000/health
```

Registry Changes (tool additions/removals):
```bash
# Rollback registry.yaml
git checkout HEAD~1 -- docs/mcp/registry.yaml

# Reload registry without restart (if hot-reload enabled)
curl -X POST http://localhost:8000/admin/registry/reload \
  -H "Authorization: Bearer ${AGENT_TOKEN}"

# Or restart orchestrator to pick up changes
kubectl rollout restart deployment/cuga-orchestrator -n cugar
```

ConfigMap Changes:
```bash
# Rollback ConfigMap (ConfigMaps have no rollout history; revert the
# manifest and re-apply it)
git checkout HEAD~1 -- ops/k8s/configmaps.yaml
kubectl apply -f ops/k8s/configmaps.yaml

# Restart pods to pick up new config
kubectl rollout restart deployment/cuga-orchestrator -n cugar
```

Multi-Region Failover (if secondary region deployed):
```bash
# Switch DNS/load balancer to secondary region
# (Implementation depends on infrastructure: Route53, CloudFlare, etc.)

# Verify secondary is healthy
kubectl config use-context prod-us-east-1
kubectl get pods -n cugar -l app=cuga-orchestrator
kubectl logs -n cugar deploy/cuga-orchestrator --tail=20

# Scale up secondary (if in standby mode)
kubectl scale deployment cuga-orchestrator -n cugar --replicas=2
```

Database Failover (PostgreSQL):

```bash
# Promote replica to primary
pg_ctl promote -D /var/lib/postgresql/data

# Update app config to point to new primary
kubectl set env deployment/cuga-orchestrator -n cugar \
  POSTGRES_HOST=new-primary-host

# Restart pods
kubectl rollout restart deployment/cuga-orchestrator -n cugar
```

| Failure Type | Rollback Method | Time to Recover | Data Loss Risk |
|---|---|---|---|
| Bad container image | K8s rollback | ~30s | None (stateless) |
| Registry config error | Git revert + reload | ~10s | None |
| Database migration failure | Restore from backup | ~5min | Depends on backup age |
| Network policy change | Revert YAML + kubectl apply | ~15s | None |
| Secret rotation issue | Restore old secret | ~30s | None |
| Multi-service failure | Full environment restore | ~10min | Depends on backup |
Pre-Deploy Validation:
```bash
# Test in staging environment first
kubectl config use-context staging
kubectl apply -f ops/k8s/

# Run smoke tests
./tests/smoke/test_deployment.sh

# If passing, deploy to production
kubectl config use-context production
kubectl apply -f ops/k8s/
```

Rollback Drills (monthly):
```bash
# Deploy canary version
kubectl apply -f ops/k8s/orchestrator-canary.yaml

# Simulate failure and rollback
kubectl rollout undo deployment/cuga-orchestrator-canary -n cugar

# Document rollback time and issues
```

- Root Cause Analysis: Document what went wrong in incident report
- Audit Trail Review: Check observability events for failure patterns
- Update Tests: Add regression tests to prevent recurrence
- Update Runbook: Document new rollback scenarios
- Notify Stakeholders: Send incident summary with resolution
Golden Signals to Monitor:
- `cuga_requests_total`: Should remain steady during rollout
- `cuga_success_rate`: Should not drop below 99% during rollout
- `cuga_latency_ms`: P95 should not spike >10% during rollout
- `cuga_tool_error_rate`: Should remain <1% during rollout
Prometheus Alerts (example):
```yaml
- alert: RolloutDegradedPerformance
  expr: cuga_success_rate < 0.99
  for: 2m
  annotations:
    summary: "Rollout causing degraded performance (success rate {{ $value }})"
    action: "Consider rollback: kubectl rollout undo deployment/cuga-orchestrator -n cugar"
- alert: RolloutLatencySpike
  expr: cuga_latency_ms{percentile="p95"} > 1.1 * cuga_latency_ms{percentile="p95"} offset 10m
  for: 3m
  annotations:
    summary: "Rollout causing P95 latency spike"
```

Grafana Dashboard: Import `observability/grafana_dashboard.json` for real-time rollout monitoring.
