ARF is an agentic reliability intelligence platform that separates decision intelligence (OSS) from governed execution (Enterprise), enabling autonomous operations with deterministic safety guarantees.

petterjuan/agentic-reliability-framework

AGENTIC RELIABILITY FRAMEWORK

Production-grade multi-agent AI system for infrastructure reliability analysis and execution intelligence, with governed execution available in Enterprise deployments.

"ARF: advisory AI for reliability, Enterprise execution for operational outcomes."

Battle-tested architecture for autonomous incident detection and advisory remediation intelligence.


Agentic Reliability Framework (ARF) v3.3.9 — Production Stability Release

⚠️ IMPORTANT OSS DISCLAIMER

The OSS edition does NOT execute actions, does NOT auto-heal, and does NOT perform remediation.

All execution, automation, persistence, and learning loops are Enterprise-only features.

Executive Summary

Modern systems do not fail because metrics are missing.

They fail because decisions arrive too late.

ARF is a graph-native, agentic reliability platform that treats incidents as memory and reasoning problems, not alerting problems. It captures operational experience, reasons over it using AI agents, and enforces stable, production-grade execution boundaries for autonomous healing.

This is not another monitoring tool.

This is operational intelligence.

A dual-architecture reliability framework where OSS analyzes and creates intent, and Enterprise safely executes intent.

This repository contains the Apache 2.0 OSS edition (v3.3.9). Enterprise components are distributed separately under a commercial license.

v3.3.9 Production Stability Release

This release finalizes import compatibility, eliminates circular dependencies, and enforces clean OSS/Enterprise boundaries.
All public imports are now guaranteed stable for production use.


Why ARF Exists

The Problem

  • AI Agents Fail in Production: 73% of AI agent projects fail due to unpredictability, lack of memory, and unsafe execution
  • MTTR is Too High: Average incident resolution takes 14+ minutes in traditional systems. (Measured MTTR reductions are Enterprise-only and require execution and learning loops.)
  • Alert Fatigue: Teams ignore 40%+ of alerts due to false positives and lack of context
  • No Learning: Systems repeat the same failures because they don't remember past incidents

Traditional reliability stacks optimize for:

  • Detection latency
  • Alert volume
  • Dashboard density

But the real business loss happens between:

“Something is wrong” → “We know what to do.”

ARF collapses that gap by providing a hybrid intelligence system that advises safely in OSS and executes deterministically in Enterprise.

Why This Matters:

Unlike commercial solutions:

• Apache 2.0 licensed - Free, forever

• Self-hostable - Your infrastructure, your data

• Open source - See the code, trust the logic

• Purpose-built - Not a bolt-on feature

Unlike other open-source frameworks:

• Production-ready NOW - Clone → Run → Works (5 minutes)

• End-to-end autonomy - Full detect-diagnose-predict-heal cycle (Enterprise only)

• Business impact tracking - Revenue loss calculations, ROI metrics

The Vision: AI infrastructure should operate like critical infrastructure: predictable, observable, and reliably helpful. Not reacting to failures. Preventing them.


🎯 What This Actually Does

OSS Edition (Apache 2.0)

The open-source edition of the Agentic Reliability Framework is designed for advisory intelligence, incident understanding, and safe decision support, not autonomous execution.

Core Capabilities

  • Reliability Event Intake: Accepts structured reliability and operational events with configurable thresholds and metadata for downstream analysis.

  • Multi-Stage Analysis Pipeline: Performs detection, diagnosis, and predictive reasoning to assess incident context, potential causes, and downstream risk.

  • Historical Pattern Recall: Identifies similar past incidents using graph-based similarity techniques to surface precedent and comparative outcomes.

  • Advisory Planning Output: Produces structured, immutable remediation plans that describe what could be done, why, and with what expected impact, without taking action.

  • Deterministic Safety Guardrails: Applies explicit, configuration-driven policies to constrain recommendations (e.g., scope limits, restricted actions, compliance boundaries).

  • Business Impact Estimation: Estimates user, revenue, or operational impact based on event metadata and configurable models.

  • In-Memory Operation Only: Operates entirely in memory with bounded retention suitable for development, research, and evaluation use cases.
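
As an illustration of the Business Impact Estimation capability above, here is a minimal sketch of a configurable impact model. It is not ARF's actual API: `ImpactModel`, `estimate_impact`, the field names, and the linear scaling rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactModel:
    """Hypothetical configurable model; not ARF's actual schema."""
    revenue_per_minute: float  # revenue at risk per minute of full outage
    active_users: int          # users normally served by the component


def estimate_impact(error_rate: float, duration_minutes: float, model: ImpactModel) -> dict:
    """Assumed linear model: impact scales with error rate and duration."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    return {
        "affected_users": round(model.active_users * error_rate),
        "revenue_at_risk": round(
            model.revenue_per_minute * duration_minutes * error_rate, 2
        ),
    }


model = ImpactModel(revenue_per_minute=100.0, active_users=2000)
impact = estimate_impact(error_rate=0.25, duration_minutes=10, model=model)
print(impact)
```

A real model would likely be non-linear and calibrated per service; the point here is only that the estimate is a pure function of event metadata and configuration.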

💡 How ARF OSS Adds Value Today

OSS Advisory Flow (Advisory-Only)

🟢 Green = OSS Advisory Capabilities 🔵 Blue = Enterprise Execution (not included in OSS)

  • Detection – Identify anomalies and operational events in real time 🟢

  • Recall – Retrieve historical incidents and context for informed reasoning 🟢

  • Decision – Apply deterministic, explainable rules for advisory guidance 🟢

  • HealingIntent – Generate structured, safe remediation recommendations 🟢

  • Execution – Enterprise-only; OSS stops before this step 🔵

| Feature | OSS 🟢 | Enterprise 🔵 |
|---|---|---|
| Detection & Anomaly Monitoring | ✅ | ✅ |
| Historical Recall & RAG Context | ✅ | ✅ |
| Deterministic Decision Policies | ✅ | ✅ |
| Advisory Remediation Plans | ✅ | ✅ |
| Autonomous Execution | ❌ | ✅ |
| Learning & Self-Optimization Loops | ❌ | ✅ |
| Persistent Storage & Memory | ❌ | ✅ |
| Compliance & Audit Workflows | ❌ | ✅ |
| Multi-Tenant Control / Scoped Operations | ❌ | ✅ |
| Business Impact Measurement & Analytics | ❌ | ✅ |

Quick visual reference for OSS vs Enterprise capabilities. OSS delivers full intelligence, stopping safely at advisory intent, while Enterprise extends that intelligence to execution and outcome optimization.

Explicit OSS Constraints (By Design)

  • Advisory-Only: The OSS edition never executes changes, deploys fixes, or mutates production systems.

  • No Autonomous Learning: Historical data is used for recall and comparison only; the system does not self-train or update models over time.

  • No Persistent Storage: Incident context and memory are ephemeral and capped to prevent long-term retention.

  • Single-Context Operation: No multi-tenant isolation, enterprise policy layering, or cross-environment orchestration.
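
The "ephemeral and capped" memory constraint can be pictured with a bounded in-memory buffer. The sketch below is illustrative only: the class name `EphemeralIncidentMemory` and the default cap of 1000 are assumptions, not names or values taken from ARF.

```python
import threading
from collections import deque
from typing import List


class EphemeralIncidentMemory:
    """Hypothetical bounded store: the oldest incidents are evicted first."""

    def __init__(self, max_incidents: int = 1000) -> None:  # cap is an assumed default
        if max_incidents < 1:
            raise ValueError("max_incidents must be at least 1")
        self._incidents = deque(maxlen=max_incidents)
        self._lock = threading.Lock()

    def remember(self, incident: dict) -> None:
        # Store a copy so later caller-side mutation cannot alter memory.
        with self._lock:
            self._incidents.append(dict(incident))

    def recent(self, n: int) -> List[dict]:
        # Return up to n of the most recently stored incidents, oldest first.
        with self._lock:
            return list(self._incidents)[-n:] if n > 0 else []

    def __len__(self) -> int:
        with self._lock:
            return len(self._incidents)


memory = EphemeralIncidentMemory(max_incidents=3)
for i in range(5):
    memory.remember({"id": i})
print(len(memory), [inc["id"] for inc in memory.recent(2)])
```

Nothing is written to disk, so the contents disappear when the process exits, which matches the no-persistence constraint.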

Intended Use Cases

  • Reliability experimentation and research

  • Incident postmortems and what-if analysis

  • Agentic system prototyping

  • Safety-constrained AI planning demonstrations

  • Evaluation of agent reasoning quality without execution risk

Architectural Guarantees

  • Advisory-Only by Design — no hidden execution paths

  • Deterministic & Explainable — no silent learning loops

  • Thread-Safe & Production-Ready

  • Configuration-Driven Behavior via OSSConfig

  • Type-Safe APIs using Pydantic v2 and Python 3.10+

  • Extensible Agent Architecture with explicit interfaces

Execution, persistence, and autonomous actions are exclusive to Enterprise.


Enterprise Edition (Commercial)

The Enterprise Edition of ARF transforms advisory intelligence into governed, auditable execution at scale.

Why Choose ARF Over Alternatives

| Solution | Intelligence | Safety | Determinism | Execution |
|---|---|---|---|---|
| 🟢 ARF (OSS) | Context-aware analysis | High (advisory-only) | High | |
| 🔵 ARF (Enterprise) | Advanced reliability intelligence | High (governed execution) | High | |
| Traditional Monitoring | Alert-based | High | High | |
| LLM-Only Agents | Heuristic | Low | Low | ⚠️ |

**Governed execution modes (Enterprise-only)** Enterprise deployments support multiple permissioned execution configurations with varying levels of human oversight. Specific modes, controls, and workflows are not part of the OSS distribution.

Migration Paths

| Current Solution | Migration Strategy | Expected Benefit | Applies To |
|---|---|---|---|
| Traditional Monitoring | Layer ARF on top for predictive insights | Shift from reactive to proactive with 6x faster detection | 🔵 |
| Manual Operations | Start with ARF in Advisory mode | Reduce toil while maintaining control during transition | 🟢 |

Decision Framework 

Choose ARF if you need: 

  • ✅ Autonomous operation with safety guarantees 

  • ✅ Continuous improvement through learning 

  • ✅ Quantifiable business impact measurement  

  • ✅ Hybrid intelligence (AI + rules) 

  • ✅ Production-grade reliability (circuit breakers, thread safety, graceful degradation) 

Consider alternatives if you: 

  • ❌ Only need basic alerting (use traditional monitoring) 

  • ❌ Require simple, static automation (use scripts) 

  • ❌ Are experimenting with AI agents (use LLM frameworks) 

  • ❌ Have regulatory requirements prohibiting any autonomous action 

ARF provides the intelligence of AI agents with the reliability of traditional automation, creating a new category of "Reliable AI Systems."


🔧 Architecture

Conceptual Architecture (Mental Model)

flowchart LR
    A[Detection 🟢 OSS] --> B[Recall 🟢 OSS]
    B --> C[Decision 🟢 OSS]
    C --> D[HealingIntent 🟢 OSS]
    D --> E[Execution 🔵 Enterprise Only]

Key insight: Reliability improves when systems remember.

Architecture Philosophy: Each layer addresses a critical failure mode of current AI systems: 

  1. Cognitive Layer prevents "reasoning from scratch" for each incident 

  2. Memory Layer prevents "forgetting past learnings" 

  3. Execution Layer prevents "unsafe, unconstrained actions"

Healing Intent Boundary

OSS creates intent.
Enterprise executes intent. The framework separates intent creation from execution.

+----------------+         +---------------------+
|   OSS Layer    |         |  Enterprise Layer   |
| (Analysis Only)|         |  (Execution & GNN)  |
+----------------+         +---------------------+
          |                           ^
          |       HealingIntent       |
          +-------------------------->|
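
One way to picture the boundary in the diagram above is an immutable intent object that the OSS side can create and serialize but never act on. This is a sketch under assumptions: the `HealingIntent` fields shown here are hypothetical and are not the schema ARF ships.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class HealingIntent:
    """Hypothetical shape of an advisory plan; field names are illustrative."""
    incident_id: str
    action: str
    target: str
    justification: str
    expected_impact: str


intent = HealingIntent(
    incident_id="inc-001",
    action="restart_service",
    target="checkout-api",
    justification="Similar past incidents were resolved by a restart",
    expected_impact="Brief connection reset",
)

# A frozen dataclass rejects attribute assignment after construction.
try:
    intent.action = "delete_database"
    mutated = True
except FrozenInstanceError:
    mutated = False

# The serialized payload is all that would cross the boundary in this sketch.
payload = json.dumps(asdict(intent), sort_keys=True)
print(mutated, payload)
```

Keeping the intent as plain data means the advisory side needs no execution code at all; whatever consumes the payload decides whether and how to act.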

Key Orchestration Steps: 

  1. Event Ingestion & Validation - Accepts telemetry, validates with Pydantic models 

  2. Multi-Agent Analysis - Parallel execution of specialized agents 

  3. RAG Context Retrieval - Semantic search for similar historical incidents 

  4. Policy Evaluation - Deterministic rule-based action determination 

  5. Action Enhancement - Historical effectiveness data informs priority

  6. Later execution, outcome evaluation, and learning stages exist exclusively in Enterprise deployments and are intentionally omitted from OSS documentation.
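
Steps 1, 2, and 4 above can be sketched with the standard library alone. Everything here is an assumption for illustration: the agent functions, event fields, thresholds, and rule text are hypothetical, and a plain dictionary check stands in for the Pydantic validation the framework uses.

```python
from concurrent.futures import ThreadPoolExecutor

REQUIRED_FIELDS = {"service", "latency_ms", "latency_threshold_ms", "error_rate"}


def validate(event: dict) -> dict:
    # Stand-in for model validation: reject events with missing fields.
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return event


def detection_agent(event: dict) -> dict:
    return {"anomaly": event["latency_ms"] > event["latency_threshold_ms"]}


def diagnosis_agent(event: dict) -> dict:
    cause = "dependency_timeout" if event["error_rate"] > 0.05 else "unknown"
    return {"probable_cause": cause}


AGENTS = {"detection": detection_agent, "diagnosis": diagnosis_agent}


def analyze(event: dict) -> dict:
    event = validate(event)
    # Run the specialized agents in parallel and collect their findings.
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {name: pool.submit(agent, event) for name, agent in AGENTS.items()}
        findings = {name: future.result() for name, future in futures.items()}
    # Deterministic policy: the same findings always yield the same advice.
    anomaly = findings["detection"]["anomaly"]
    cause = findings["diagnosis"]["probable_cause"]
    if anomaly and cause == "dependency_timeout":
        recommendation = "advise: restart dependency client"
    elif anomaly:
        recommendation = "advise: investigate"
    else:
        recommendation = "advise: no action"
    return {"findings": findings, "recommendation": recommendation}


result = analyze({"service": "checkout-api", "latency_ms": 900,
                  "latency_threshold_ms": 500, "error_rate": 0.10})
print(result["recommendation"])
```

The function returns advice only; nothing in it touches the target system, which mirrors where the OSS pipeline stops.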


Multi-Agent Design (ARF v3.0) – Coverage Overview

  • Detection, Recall, Decision → present in both OSS and Enterprise

| Agent | Responsibility | 🟢 OSS | 🔵 Enterprise |
|---|---|---|---|
| Detection Agent | Detect anomalies, monitor telemetry, perform time-series forecasting | ✅ | ✅ |
| Recall Agent | Retrieve similar incidents/actions/outcomes from RAG graph + FAISS | ✅ | ✅ |
| Decision Agent | Apply deterministic policies, reasoning over historical outcomes | ✅ | ✅ |

  • Safety, Execution, Learning → Enterprise only
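
To make the Recall Agent's job concrete, here is a minimal similarity-recall sketch. It deliberately swaps in a brute-force cosine similarity over tiny hand-written vectors in place of the RAG graph and FAISS index named above; the incident IDs, vectors, and function names are hypothetical.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_similar(query: List[float], history: List[dict], top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k most similar past incidents as (id, score) pairs."""
    scored = [(cosine_similarity(query, inc["vector"]), inc["id"]) for inc in history]
    scored.sort(reverse=True)  # highest similarity first
    return [(inc_id, score) for score, inc_id in scored[:top_k]]


# Toy feature vectors: (cpu pressure, memory pressure, network errors).
history = [
    {"id": "inc-cpu", "vector": [0.9, 0.1, 0.0]},
    {"id": "inc-mem", "vector": [0.1, 0.9, 0.0]},
    {"id": "inc-net", "vector": [0.0, 0.1, 0.9]},
]
matches = recall_similar([1.0, 0.0, 0.0], history, top_k=2)
print([inc_id for inc_id, _ in matches])
```

A vector index changes how the nearest neighbours are found, not what is returned: a ranked list of precedents that the Decision Agent can reason over.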

OSS vs Enterprise Philosophy

OSS (Apache 2.0)

  • Full intelligence
  • Advisory-only output (no execution)
  • Hard safety limits
  • Perfect for trust-building

Enterprise

  • Autonomous healing
  • Learning loops
  • Compliance (SOC2, HIPAA, GDPR)
  • Audit trails
  • Multi-tenant control

OSS proves value.
Enterprise captures it.


💰 Business Value and ROI

Quantitative performance metrics, benchmarks, and ROI and MTTR analyses are derived exclusively from Enterprise deployments. They are not disclosed in the OSS distribution and are shared privately during evaluations.

🔒 Stability Guarantees (v3.3.9+)

ARF v3.3.9 introduces hard stability guarantees for OSS users:

  • ✅ No circular imports
  • ✅ Direct, absolute imports for all public APIs
  • ✅ Pydantic v2 ↔ Dataclass compatibility wrapper
  • ✅ Graceful fallback behavior (no runtime crashes)
  • ✅ Advisory-only operation enforced at runtime

If you can import it, it is safe to use in production.
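
"Graceful fallback behavior" is commonly achieved with guarded optional imports. The sketch below shows that general pattern, not ARF's internals: `optional_import` and `search_backend` are hypothetical helpers, and treating `faiss` as optional is an assumption for the example.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the module if it can be imported, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Assumed optional dependency; the code keeps working when it is absent.
faiss = optional_import("faiss")


def search_backend() -> str:
    # Pick a degraded but functional path instead of failing at import time.
    return "faiss" if faiss is not None else "pure-python"


print(search_backend())
```

The caller sees a slower backend rather than an `ImportError`, which is the behavior the fallback guarantee describes.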


Who Uses ARF

Engineers

  • Fewer pages
  • Better decisions
  • Confidence in automation

Founders

  • Reliability without headcount
  • Faster scaling
  • Reduced churn

Executives

  • Predictable uptime
  • Quantified risk
  • Board-ready narratives

Investors

  • Defensible IP
  • Enterprise expansion path
  • OSS → Paid flywheel
graph LR 
   ARF["ARF v3.0"] --> Finance 
   ARF --> Healthcare 
   ARF --> SaaS 
   ARF --> Media 
   ARF --> Logistics 
    
   Finance --> |Real-time monitoring| F1[HFT Systems] 
   Finance --> |Compliance| F2[Risk Management] 
    
   Healthcare --> |Patient safety| H1[Medical Devices] 
   Healthcare --> |HIPAA compliance| H2[Health IT] 
    
   SaaS --> |Uptime SLA| S1[Cloud Services] 
   SaaS --> |Multi-tenant| S2[Enterprise SaaS] 
    
   Media --> |Content delivery| M1[Streaming] 
   Media --> |Ad tech| M2[Real-time bidding] 
    
   Logistics --> |Supply chain| L1[Inventory] 
   Logistics --> |Delivery| L2[Tracking] 
    
   style ARF fill:#7c3aed 
   style Finance fill:#3b82f6 
   style Healthcare fill:#10b981 
   style SaaS fill:#f59e0b 
   style Media fill:#ef4444 
   style Logistics fill:#8b5cf6

🔒 Security & Compliance

Layer Breakdown:

  • Action Blacklisting – Prevent dangerous operations

  • Blast Radius Limiting – Limit impact scope (max: 3 services)

  • Human Approval Workflows – Manual review for sensitive changes

  • Business Hour Restrictions – Control deployment windows

  • Circuit Breakers & Cooldowns – Automatic rate limiting
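
Four of the layers above reduce to deterministic checks that can be sketched in a few lines (human approval is a workflow and is left out). This is an illustration, not the Enterprise implementation: `SafetyPolicy`, the blacklisted action names, the 09:00 to 17:00 window, and the 300-second cooldown are assumptions. Only the limit of 3 services comes from the list above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SafetyPolicy:
    """Hypothetical policy object; defaults other than the radius are assumed."""
    blacklisted_actions: frozenset = frozenset({"drop_database", "delete_volume"})
    max_blast_radius: int = 3                 # max services touched by one action
    business_hours: Tuple[int, int] = (9, 17)  # assumed allowed change window
    cooldown_seconds: int = 300               # assumed minimum gap between runs


def evaluate(action: str, services: List[str], now: datetime,
             last_run: Optional[datetime],
             policy: SafetyPolicy = SafetyPolicy()) -> List[str]:
    """Return the list of violated guardrails; an empty list means allowed."""
    violations = []
    if action in policy.blacklisted_actions:
        violations.append("action_blacklisted")
    if len(services) > policy.max_blast_radius:
        violations.append("blast_radius_exceeded")
    start, end = policy.business_hours
    if not start <= now.hour < end:
        violations.append("outside_business_hours")
    if last_run is not None and (now - last_run).total_seconds() < policy.cooldown_seconds:
        violations.append("cooldown_active")
    return violations


ok = evaluate("restart_service", ["api", "worker"], datetime(2026, 1, 5, 10, 0), None)
blocked = evaluate("drop_database", ["a", "b", "c", "d"],
                   datetime(2026, 1, 5, 22, 0), datetime(2026, 1, 5, 21, 58))
print(ok, blocked)
```

Returning every violated rule, rather than stopping at the first, gives an audit trail a complete justification for each rejection.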

Compliance Features

  • Audit Trail: Every MCP request/response logged with justification

  • Approval Workflows: Human review for sensitive actions

  • Data Retention: Configurable retention policies (default: 30 days)

  • Access Control: Tool-level permission requirements

  • Change Management: Business hour restrictions for production changes

Security Best Practices

  1. Start in Advisory Mode

    • Begin with analysis-only mode to understand potential actions without execution risks.

  2. Gradual Rollout

    • Use the rollout_percentage parameter to enable features incrementally across your systems.

  3. Regular Audits

    • Review learned patterns and outcomes monthly

    • Adjust safety parameters based on historical data

    • Validate compliance with organizational policies

  4. Environment Segregation

    • Configure different MCP modes per environment:

      • Development: autonomous or advisory

      • Staging: approval

      • Production: advisory or approval
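
A percentage rollout is usually made deterministic by hashing a stable identifier into a bucket. The sketch below shows that general technique; only the `rollout_percentage` name and the mode names come from this document, while `in_rollout`, the SHA-256 bucketing, and the per-environment choices are assumptions.

```python
import hashlib


def in_rollout(service_name: str, rollout_percentage: int) -> bool:
    """Deterministically place a service in or out of the rollout cohort."""
    if not 0 <= rollout_percentage <= 100:
        raise ValueError("rollout_percentage must be between 0 and 100")
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percentage


# One permitted choice per environment, picked from the options listed above.
MODE_BY_ENVIRONMENT = {
    "development": "advisory",
    "staging": "approval",
    "production": "approval",
}

services = ["checkout-api", "search", "billing", "auth"]
enabled = [name for name in services if in_rollout(name, 50)]
print(enabled, MODE_BY_ENVIRONMENT["production"])
```

Because the bucket depends only on the service name, raising the percentage only ever adds services to the cohort; none that were already enabled drop out.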

**Enterprise Safety Model (High-Level)** Enterprise deployments apply multiple layers of safety controls, including permission boundaries, scope constraints, approval workflows, and rate-limiting mechanisms. These controls are configurable per organization and environment and are intentionally not exposed in the OSS edition.

Recommended Implementation Order

  1. Initial Setup: Configure action blacklists and blast radius limits
  2. Testing Phase: Run in advisory mode to analyze behavior
  3. Gradual Enablement: Move to approval mode with human oversight
  4. Production: Maintain approval workflows for critical systems
  5. Optimization: Adjust parameters based on audit findings

🚀 Quick Start

OSS (≈5 minutes)

pip install agentic-reliability-framework==3.3.9

Run locally or deploy as a service.
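
After installing, one way to confirm which version is present is the standard library's `importlib.metadata`. The helper name `installed_version` is made up for this sketch; the distribution name is the one used in the `pip install` command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("agentic-reliability-framework")
print(version or "not installed")
```

This reads package metadata only and does not import the framework, so it is safe to run in a deployment health check.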

License

Apache 2.0 (OSS). A commercial license is required for Enterprise features.


Citing ARF

If you use the Agentic Reliability Framework in production or research, please cite:

BibTeX:

@software{ARF2026,
  title = {Agentic Reliability Framework: Production-Grade Multi-Agent AI for autonomous system reliability intelligence},
  author = {Juan Petter and Contributors},
  year = {2026},
  version = {3.3.9},
  url = {https://github.com/petterjuan/agentic-reliability-framework}
}

Quick Links

Additional Resources: 

  • GitHub Issues: For bug reports and technical issues 

  • Documentation: Check the docs for common questions 

Response Time: Typically within 24-48 hours

🤝 Support & Sponsorship

Agentic Reliability Framework is developed as sustainable open-source software.

Ways to support the project:

🆓 Open Source Community

  • Star the repository - Helps with visibility
  • 🐛 Report issues - Improve stability for everyone
  • 📣 Share with colleagues - Spread the word
  • 🔧 Contribute code - PRs welcome for OSS features

💼 Enterprise Edition

For production deployments with execution, learning loops, and business analytics:

❤️ Financial Support

  • GitHub Sponsors - Support ongoing OSS development
  • One-time donations - Contact for invoice-based support

📞 Contact


Sustainability Model: OSS edition remains free forever. Enterprise edition funds ongoing development, security updates, and new features that eventually trickle down to OSS.
