Production-grade multi-agent AI system for infrastructure reliability analysis and execution intelligence, with governed execution available in Enterprise deployments.
"ARF: advisory AI for reliability, Enterprise execution for operational outcomes."
Battle-tested architecture for autonomous incident detection and advisory remediation intelligence.
⚠️ IMPORTANT OSS DISCLAIMER: The OSS edition does NOT execute actions, does NOT auto-heal, and does NOT perform remediation.
All execution, automation, persistence, and learning loops are Enterprise-only features.
Modern systems do not fail because metrics are missing.
They fail because decisions arrive too late.
ARF is a graph-native, agentic reliability platform that treats incidents as memory and reasoning problems, not alerting problems. It captures operational experience, reasons over it using AI agents, and enforces stable, production-grade execution boundaries for autonomous healing.
This is not another monitoring tool.
This is operational intelligence.
A dual-architecture reliability framework where OSS analyzes and creates intent, and Enterprise safely executes intent.
This repository contains the Apache 2.0 OSS edition (v3.3.9). Enterprise components are distributed separately under a commercial license.
v3.3.9 Production Stability Release
This release finalizes import compatibility, eliminates circular dependencies, and enforces clean OSS/Enterprise boundaries.
All public imports are now guaranteed stable for production use.
The Problem
- AI Agents Fail in Production: 73% of AI agent projects fail due to unpredictability, lack of memory, and unsafe execution
- MTTR is Too High: Average incident resolution takes 14+ minutes in traditional systems. (Measured MTTR reductions are Enterprise-only and require execution and learning loops.)
- Alert Fatigue: Teams ignore 40%+ of alerts due to false positives and lack of context
- No Learning: Systems repeat the same failures because they don't remember past incidents
Traditional reliability stacks optimize for:
- Detection latency
- Alert volume
- Dashboard density
But the real business loss happens between:
“Something is wrong” → “We know what to do.”
ARF collapses that gap by providing a hybrid intelligence system that advises safely in OSS and executes deterministically in Enterprise.
Why This Matters:
Unlike commercial solutions:
• Apache 2.0 licensed - Free, forever
• Self-hostable - Your infrastructure, your data
• Open source - See the code, trust the logic
• Purpose-built - Not a bolt-on feature
Unlike other open-source frameworks:
• Production-ready NOW - Clone → Run → Works (5 minutes)
• End-to-end autonomy - Full detect-diagnose-predict-heal cycle (Enterprise only)
• Business impact tracking - Revenue loss calculations, ROI metrics
The Vision: AI infrastructure should operate like critical infrastructure: predictable, observable, and reliably helpful. Not reacting to failures. Preventing them.
OSS Edition (Apache 2.0)
The open-source edition of the Agentic Reliability Framework is designed for advisory intelligence, incident understanding, and safe decision support, not autonomous execution.
- Reliability Event Intake: Accepts structured reliability and operational events with configurable thresholds and metadata for downstream analysis.
- Multi-Stage Analysis Pipeline: Performs detection, diagnosis, and predictive reasoning to assess incident context, potential causes, and downstream risk.
- Historical Pattern Recall: Identifies similar past incidents using graph-based similarity techniques to surface precedent and comparative outcomes.
- Advisory Planning Output: Produces structured, immutable remediation plans that describe what could be done, why, and with what expected impact, without taking action.
- Deterministic Safety Guardrails: Applies explicit, configuration-driven policies to constrain recommendations (e.g., scope limits, restricted actions, compliance boundaries).
- Business Impact Estimation: Estimates user, revenue, or operational impact based on event metadata and configurable models.
- In-Memory Operation Only: Operates entirely in memory with bounded retention suitable for development, research, and evaluation use cases.
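As a rough illustration of the business impact estimation described above, the sketch below derives a revenue-loss figure from event metadata. The `ImpactModel` fields, the `estimate_revenue_loss` function, and the linear formula are all assumptions made for this example; they are not ARF's actual model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactModel:
    """Hypothetical configurable impact model (not ARF's real one)."""
    revenue_per_request: float   # average revenue attributed to one request
    requests_per_minute: float   # baseline traffic of the affected service


def estimate_revenue_loss(model: ImpactModel, error_rate: float,
                          duration_minutes: float) -> float:
    """Estimate lost revenue as failed requests times revenue per request."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    failed_requests = model.requests_per_minute * duration_minutes * error_rate
    return failed_requests * model.revenue_per_request


model = ImpactModel(revenue_per_request=0.5, requests_per_minute=1200)
loss = estimate_revenue_loss(model, error_rate=0.25, duration_minutes=10)
print(f"Estimated revenue loss: ${loss:,.2f}")  # Estimated revenue loss: $1,500.00
```

A real model would likely weight traffic by time of day or customer tier; the point here is only that the estimate is a pure function of metadata and configuration.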
OSS Advisory Flow (Advisory-Only)
🟢 Green = OSS Advisory Capabilities 🔵 Blue = Enterprise Execution (not included in OSS)
- Detection – Identify anomalies and operational events in real time 🟢
- Recall – Retrieve historical incidents and context for informed reasoning 🟢
- Decision – Apply deterministic, explainable rules for advisory guidance 🟢
- HealingIntent – Generate structured, safe remediation recommendations 🟢
- Execution – Enterprise-only; OSS stops before this step 🔵
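The advisory flow above can be sketched in a few lines of plain Python. Every name here (`detect`, `recall`, `decide`, `advise`, and the fields of `HealingIntent`) is hypothetical and chosen for illustration; the real ARF interfaces may look quite different. The sketch only shows the shape of the flow: it ends by returning an intent and never performs an action.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealingIntent:
    """Immutable advisory recommendation; nothing here executes anything."""
    service: str
    action: str
    justification: str


def detect(event: dict) -> bool:
    """Detection: flag the event when latency exceeds its threshold."""
    return event["latency_ms"] > event["threshold_ms"]


def recall(event: dict, history: list) -> list:
    """Recall: return past incidents recorded for the same service."""
    return [h for h in history if h["service"] == event["service"]]


def decide(similar: list) -> str:
    """Decision: deterministic rule, reuse the most recent action that worked."""
    for incident in reversed(similar):
        if incident["resolved"]:
            return incident["action"]
    return "escalate_to_human"


def advise(event: dict, history: list) -> Optional[HealingIntent]:
    """Run detection, recall, and decision, then stop at advisory intent."""
    if not detect(event):
        return None
    similar = recall(event, history)
    return HealingIntent(
        service=event["service"],
        action=decide(similar),
        justification=f"{len(similar)} similar past incident(s) considered",
    )


history = [
    {"service": "api", "action": "restart_pod", "resolved": True},
    {"service": "db", "action": "failover", "resolved": True},
]
event = {"service": "api", "latency_ms": 900, "threshold_ms": 500}
intent = advise(event, history)
print(intent)
```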
| Feature | OSS 🟢 | Enterprise 🔵 |
|---|---|---|
| Detection & Anomaly Monitoring | ✅ | ✅ |
| Historical Recall & RAG Context | ✅ | ✅ |
| Deterministic Decision Policies | ✅ | ✅ |
| Advisory Remediation Plans | ✅ | ✅ |
| Autonomous Execution | ❌ | ✅ |
| Learning & Self-Optimization Loops | ❌ | ✅ |
| Persistent Storage & Memory | ❌ | ✅ |
| Compliance & Audit Workflows | ❌ | ✅ |
| Multi-Tenant Control / Scoped Operations | ❌ | ✅ |
| Business Impact Measurement & Analytics | ❌ | ✅ |
Quick visual reference for OSS vs Enterprise capabilities. OSS delivers full intelligence, stopping safely at advisory intent, while Enterprise extends that intelligence to execution and outcome optimization.
OSS limitations:
- Advisory-Only: The OSS edition never executes changes, deploys fixes, or mutates production systems.
- No Autonomous Learning: Historical data is used for recall and comparison only; the system does not self-train or update models over time.
- No Persistent Storage: Incident context and memory are ephemeral and capped to prevent long-term retention.
- Single-Context Operation: No multi-tenant isolation, enterprise policy layering, or cross-environment orchestration.
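To make "ephemeral and capped" concrete, here is a minimal sketch of a bounded in-memory incident buffer built on `collections.deque`. The class name and the `max_incidents` parameter are assumptions for this example and are not taken from ARF's code.

```python
from collections import deque


class EphemeralIncidentMemory:
    """In-memory incident buffer that discards the oldest entries when full."""

    def __init__(self, max_incidents: int = 1000) -> None:
        # deque(maxlen=...) silently drops the oldest item once the cap is reached
        self._incidents = deque(maxlen=max_incidents)

    def remember(self, incident: dict) -> None:
        self._incidents.append(incident)

    def recent(self, n: int) -> list:
        """Return up to n most recent incidents, newest last."""
        if n <= 0:
            return []
        return list(self._incidents)[-n:]

    def __len__(self) -> int:
        return len(self._incidents)


memory = EphemeralIncidentMemory(max_incidents=3)
for i in range(5):
    memory.remember({"id": i})

print(len(memory), memory.recent(2))  # 3 [{'id': 3}, {'id': 4}]
```

Nothing is written to disk, so the buffer disappears when the process exits.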
Intended use cases:
- Reliability experimentation and research
- Incident postmortems and what-if analysis
- Agentic system prototyping
- Safety-constrained AI planning demonstrations
- Evaluation of agent reasoning quality without execution risk
Design properties:
- Advisory-Only by Design: no hidden execution paths
- Deterministic & Explainable: no silent learning loops
- Thread-Safe & Production-Ready
- Configuration-Driven Behavior via OSSConfig
- Type-Safe APIs using Pydantic v2 and Python 3.10+
- Extensible Agent Architecture with explicit interfaces
Execution, persistence, and autonomous actions are exclusive to Enterprise.
Enterprise Edition (Commercial)
The Enterprise Edition of ARF transforms advisory intelligence into governed, auditable execution at scale.
Why Choose ARF Over Alternatives
| Solution | Intelligence | Safety | Determinism | Execution |
|---|---|---|---|---|
| 🟢 ARF (OSS) | Context-aware analysis | High (advisory-only) | High | ❌ |
| 🔵 ARF (Enterprise) | Advanced reliability intelligence | High (governed execution) | High | ✅ |
| Traditional Monitoring | Alert-based | High | High | ❌ |
| LLM-Only Agents | Heuristic | Low | Low | |
**Governed execution modes (Enterprise-only)**
Enterprise deployments support multiple permissioned execution configurations with varying levels of human oversight. Specific modes, controls, and workflows are not part of the OSS distribution.
Migration Paths
| Current Solution | Migration Strategy | Expected Benefit | Applies To |
|---|---|---|---|
| Traditional Monitoring | Layer ARF on top for predictive insights | Shift from reactive to proactive with 6x faster detection | 🔵 |
| Manual Operations | Start with ARF in Advisory mode | Reduce toil while maintaining control during transition | 🟢 |
Decision Framework
Choose ARF if you need:
- ✅ Autonomous operation with safety guarantees
- ✅ Continuous improvement through learning
- ✅ Quantifiable business impact measurement
- ✅ Hybrid intelligence (AI + rules)
- ✅ Production-grade reliability (circuit breakers, thread safety, graceful degradation)
Consider alternatives if you:
- ❌ Only need basic alerting (use traditional monitoring)
- ❌ Require simple, static automation (use scripts)
- ❌ Are experimenting with AI agents (use LLM frameworks)
- ❌ Have regulatory requirements prohibiting any autonomous action
ARF provides the intelligence of AI agents with the reliability of traditional automation, creating a new category of "Reliable AI Systems."
```mermaid
flowchart LR
    A[Detection 🟢 OSS] --> B[Recall 🟢 OSS]
    B --> C[Decision 🟢 OSS]
    C --> D[HealingIntent 🟢 OSS]
    D --> E[Execution 🔵 Enterprise Only]
```
Key insight: Reliability improves when systems remember.
Architecture Philosophy: Each layer addresses a critical failure mode of current AI systems:
- Cognitive Layer prevents "reasoning from scratch" for each incident
- Memory Layer prevents "forgetting past learnings"
- Execution Layer prevents "unsafe, unconstrained actions"
OSS creates intent.
Enterprise executes intent. The framework separates intent creation from execution.
```
+----------------+        +---------------------+
|   OSS Layer    |        |  Enterprise Layer   |
| (Analysis Only)|        |  (Execution & GNN)  |
+----------------+        +---------------------+
        |                            ^
        |       HealingIntent        |
        +--------------------------->|
```
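One way to picture the handoff in the diagram is as an immutable payload that the OSS side serializes and a separate executor later deserializes. The sketch below assumes a JSON wire format and invents the field names (`incident_id`, `action`, `target`, `confidence`) purely for illustration; the actual HealingIntent schema is not documented here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HealingIntent:
    """Hypothetical intent payload; field names are illustrative only."""
    incident_id: str
    action: str
    target: str
    confidence: float


def to_wire(intent: HealingIntent) -> str:
    """Serialize the intent so a separate executor could consume it."""
    return json.dumps(asdict(intent), sort_keys=True)


def from_wire(payload: str) -> HealingIntent:
    """Rebuild the intent on the receiving side."""
    return HealingIntent(**json.loads(payload))


intent = HealingIntent("inc-42", "scale_out", "checkout-service", 0.8)
payload = to_wire(intent)
restored = from_wire(payload)
print(payload)
```

Because the dataclass is frozen, the analysis side cannot mutate an intent after creating it, which keeps the boundary between "describe" and "do" explicit.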
Key Orchestration Steps:
- Event Ingestion & Validation: Accepts telemetry, validates with Pydantic models
- Multi-Agent Analysis: Parallel execution of specialized agents
- RAG Context Retrieval: Semantic search for similar historical incidents
- Policy Evaluation: Deterministic rule-based action determination
- Action Enhancement: Historical effectiveness data informs priority

Later execution, outcome evaluation, and learning stages exist exclusively in Enterprise deployments and are intentionally omitted from OSS documentation.
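The multi-agent analysis step can be approximated with `asyncio.gather`, which runs several coroutines concurrently and returns their results in call order. The three agent functions and their thresholds below are stand-ins invented for this sketch, not ARF's agents.

```python
import asyncio


async def detection_agent(event: dict) -> dict:
    """Stand-in detection agent: flags elevated error rates."""
    await asyncio.sleep(0)  # placeholder for real asynchronous work
    return {"agent": "detection", "anomaly": event["error_rate"] > 0.05}


async def diagnosis_agent(event: dict) -> dict:
    """Stand-in diagnosis agent: guesses a cause from latency."""
    await asyncio.sleep(0)
    cause = "dependency_timeout" if event["latency_ms"] > 1000 else "unknown"
    return {"agent": "diagnosis", "probable_cause": cause}


async def prediction_agent(event: dict) -> dict:
    """Stand-in prediction agent: coarse downstream risk level."""
    await asyncio.sleep(0)
    risk = "high" if event["error_rate"] > 0.2 else "low"
    return {"agent": "prediction", "risk": risk}


async def analyze(event: dict) -> list:
    """Run all agents concurrently; results come back in call order."""
    return await asyncio.gather(
        detection_agent(event),
        diagnosis_agent(event),
        prediction_agent(event),
    )


event = {"error_rate": 0.12, "latency_ms": 1500}
results = asyncio.run(analyze(event))
for result in results:
    print(result)
```

Note that `asyncio` gives concurrency on a single thread; agents doing CPU-heavy work would need a process pool or similar to run truly in parallel.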
- Detection, Recall, Decision → present in both OSS and Enterprise
| Agent | Responsibility | 🟢 OSS | 🔵 Enterprise |
|---|---|---|---|
| Detection Agent | Detect anomalies, monitor telemetry, perform time-series forecasting | ✅ | ✅ |
| Recall Agent | Retrieve similar incidents/actions/outcomes from RAG graph + FAISS | ✅ | ✅ |
| Decision Agent | Apply deterministic policies, reasoning over historical outcomes | ✅ | ✅ |
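The Recall Agent's job of retrieving similar incidents can be illustrated with plain cosine similarity over small vectors. This is a deliberately simplified stand-in: it does not use FAISS or a graph store, and the incident names and embedding values are made up for the example.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_similar(query: list, incidents: dict, top_k: int = 2) -> list:
    """Return the names of the top_k incidents most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in incidents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy embeddings; a real system would derive these from incident data.
incidents = {
    "db-connection-exhaustion": [0.9, 0.1, 0.0],
    "cache-stampede": [0.1, 0.9, 0.2],
    "disk-full": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]
print(recall_similar(query, incidents))
```

An approximate-nearest-neighbor index serves the same purpose at scale; the brute-force loop here is only practical for small in-memory histories.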
- Safety, Execution, Learning → Enterprise only
OSS:
- Full intelligence
- Advisory-only execution
- Hard safety limits
- Perfect for trust-building

Enterprise:
- Autonomous healing
- Learning loops
- Compliance (SOC2, HIPAA, GDPR)
- Audit trails
- Multi-tenant control
OSS proves value.
Enterprise captures it.
Quantitative performance metrics, benchmarks, and productivity, ROI, and MTTR analyses are derived exclusively from Enterprise deployments. They are not disclosed in the OSS distribution and are shared privately during evaluations.
ARF v3.3.9 introduces hard stability guarantees for OSS users:
- ✅ No circular imports
- ✅ Direct, absolute imports for all public APIs
- ✅ Pydantic v2 ↔ Dataclass compatibility wrapper
- ✅ Graceful fallback behavior (no runtime crashes)
- ✅ Advisory-only execution enforced at runtime
If you can import it, it is safe to use in production.
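"Graceful fallback behavior" is commonly implemented with a guarded import that degrades to advisory behavior when an optional component is missing. The sketch below shows that general pattern only; the helper `optional_import` and the module name `hypothetical_enterprise_executor_xyz` are invented for illustration and do not correspond to real ARF modules.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it can be imported, otherwise None (no crash)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Made-up module name, used only to demonstrate the fallback path.
executor = optional_import("hypothetical_enterprise_executor_xyz")


def handle(intent: dict) -> str:
    """Stay advisory unless an executor module is actually present."""
    if executor is None:
        return f"ADVISORY ONLY: would recommend {intent['action']}"
    return "executor available"


print(handle({"action": "restart_service"}))
```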
- Fewer pages
- Better decisions
- Confidence in automation
- Reliability without headcount
- Faster scaling
- Reduced churn
- Predictable uptime
- Quantified risk
- Board-ready narratives
- Defensible IP
- Enterprise expansion path
- OSS → Paid flywheel
```mermaid
graph LR
    ARF["ARF v3.0"] --> Finance
    ARF --> Healthcare
    ARF --> SaaS
    ARF --> Media
    ARF --> Logistics
    Finance --> |Real-time monitoring| F1[HFT Systems]
    Finance --> |Compliance| F2[Risk Management]
    Healthcare --> |Patient safety| H1[Medical Devices]
    Healthcare --> |HIPAA compliance| H2[Health IT]
    SaaS --> |Uptime SLA| S1[Cloud Services]
    SaaS --> |Multi-tenant| S2[Enterprise SaaS]
    Media --> |Content delivery| M1[Streaming]
    Media --> |Ad tech| M2[Real-time bidding]
    Logistics --> |Supply chain| L1[Inventory]
    Logistics --> |Delivery| L2[Tracking]
    style ARF fill:#7c3aed
    style Finance fill:#3b82f6
    style Healthcare fill:#10b981
    style SaaS fill:#f59e0b
    style Media fill:#ef4444
    style Logistics fill:#8b5cf6
```
Layer Breakdown:
- Action Blacklisting – Prevent dangerous operations
- Blast Radius Limiting – Limit impact scope (max: 3 services)
- Human Approval Workflows – Manual review for sensitive changes
- Business Hour Restrictions – Control deployment windows
- Circuit Breakers & Cooldowns – Automatic rate limiting
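A few of the layers above lend themselves to a simple, deterministic policy check. The sketch below covers blacklisting, the blast radius cap, and a change window; the `GuardrailPolicy` class, the example blacklist entries, and the 09:00-17:00 window are assumptions for illustration, while the limit of 3 services comes from the list above. It is not the Enterprise implementation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GuardrailPolicy:
    """Hypothetical policy object; defaults are examples, not ARF settings."""
    blacklisted_actions: frozenset = frozenset({"drop_database", "delete_volume"})
    max_blast_radius: int = 3      # "max: 3 services" from the list above
    change_window: tuple = (9, 17)  # assumed allowed hours, 24h clock


def check(policy: GuardrailPolicy, action: str, services: list,
          now: datetime) -> list:
    """Return a list of violations; an empty list means the proposal passes."""
    violations = []
    if action in policy.blacklisted_actions:
        violations.append(f"action '{action}' is blacklisted")
    if len(services) > policy.max_blast_radius:
        violations.append(
            f"blast radius {len(services)} exceeds limit {policy.max_blast_radius}"
        )
    start, end = policy.change_window
    if not start <= now.hour < end:
        violations.append("outside the allowed change window")
    return violations


policy = GuardrailPolicy()
ok = check(policy, "restart_service", ["api", "worker"], datetime(2026, 1, 5, 10, 0))
bad = check(policy, "drop_database", ["a", "b", "c", "d"], datetime(2026, 1, 5, 22, 0))
print(ok)
print(bad)
```

Returning every violation rather than failing on the first one makes the result easier to explain to a human reviewer.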
- Audit Trail: Every MCP request/response logged with justification
- Approval Workflows: Human review for sensitive actions
- Data Retention: Configurable retention policies (default: 30 days)
- Access Control: Tool-level permission requirements
- Change Management: Business hour restrictions for production changes
- Start in Advisory Mode: Begin with analysis-only mode to understand potential actions without execution risks.
- Gradual Rollout: Use the rollout_percentage parameter to enable features incrementally across your systems.
- Regular Audits:
  - Review learned patterns and outcomes monthly
  - Adjust safety parameters based on historical data
  - Validate compliance with organizational policies
- Environment Segregation: Configure different MCP modes per environment:
  - Development: autonomous or advisory
  - Staging: approval
  - Production: advisory or approval
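How a `rollout_percentage` setting might translate into per-service decisions is not specified above. A common approach, shown here as an assumption and not as ARF's mechanism, is to hash each service name into a stable bucket from 0 to 99 and enable the feature when the bucket falls below the percentage. The function name `in_rollout` is hypothetical.

```python
import hashlib


def in_rollout(service_name: str, rollout_percentage: int) -> bool:
    """Deterministically place a service in a 0-99 bucket and compare."""
    if not 0 <= rollout_percentage <= 100:
        raise ValueError("rollout_percentage must be between 0 and 100")
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percentage


services = ["api", "checkout", "search", "billing", "notifications"]
enabled = [name for name in services if in_rollout(name, 40)]
print(enabled)
```

Because the bucket depends only on the name, a service that is enabled at 40% stays enabled as the percentage increases, which avoids flapping during a staged rollout.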
**Enterprise Safety Model (High-Level)**
Enterprise deployments apply multiple layers of safety controls, including permission boundaries, scope constraints, approval workflows, and rate-limiting mechanisms. These controls are configurable per organization and environment and are intentionally not exposed in the OSS edition.
- Initial Setup: Configure action blacklists and blast radius limits
- Testing Phase: Run in advisory mode to analyze behavior
- Gradual Enablement: Move to approval mode with human oversight
- Production: Maintain approval workflows for critical systems
- Optimization: Adjust parameters based on audit findings
```bash
pip install agentic-reliability-framework==3.3.9
```
Run locally or deploy as a service.
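To confirm which version ended up in your environment without assuming anything about ARF's own API, you can query the installed distribution metadata with the standard library. The helper name `installed_version` is just for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("agentic-reliability-framework")
print(found or "not installed")
```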
If you use the Agentic Reliability Framework in production or research, please cite:
BibTeX:
```bibtex
@software{ARF2026,
  title = {Agentic Reliability Framework: Production-Grade Multi-Agent AI for autonomous system reliability intelligence},
  author = {Juan Petter and Contributors},
  year = {2026},
  version = {3.3.9},
  url = {https://github.com/petterjuan/agentic-reliability-framework}
}
```
- Live Demo: Try ARF on Hugging Face
- Full Documentation: ARF Docs
- PyPI Package: agentic-reliability-framework
Additional Resources:
- GitHub Issues: For bug reports and technical issues
- Documentation: Check the docs for common questions
Response Time: Typically within 24-48 hours
Agentic Reliability Framework is developed as sustainable open-source software.
Ways to support the project:
- ⭐ Star the repository - Helps with visibility
- 🐛 Report issues - Improve stability for everyone
- 📣 Share with colleagues - Spread the word
- 🔧 Contribute code - PRs welcome for OSS features
For production deployments with execution, learning loops, and business analytics:
- Explore Enterprise Edition
- Email: [email protected] for commercial inquiries
- LinkedIn: petterjuan
- GitHub Sponsors - Support ongoing OSS development
- One-time donations - Contact for invoice-based support
- OSS Issues: GitHub Issues
- Commercial: [email protected]
- Professional: LinkedIn
Sustainability Model: OSS edition remains free forever. Enterprise edition funds ongoing development, security updates, and new features that eventually trickle down to OSS.
