A benchmark and development framework for case-based agents.
Replay historical decision points, compare agent policies against baselines, and evaluate operational outcomes before production deployment.
```shell
pip install system-arena

# With process mining integrations
pip install "system-arena[integrations]"
```

- Event Log — Canonical schema for process events (`entity.created`, `state.changed`, `interaction.occurred`, etc.)
- Decision Point — A moment where a policy could intervene
- Policy — Interface for agent decision-making (threshold, rule-based, or custom)
- Replay — Offline evaluation with no-future-leakage guarantee
- Metrics — Precision@K, conversion, policy violations
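The no-future-leakage guarantee means that at a decision point, a policy can only observe events whose timestamps are at or before the decision time. As a standalone illustration of the idea (not the library's implementation — the `Event` dataclass here is a stand-in):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    case_id: str
    event_type: str
    timestamp: datetime


def visible_events(events: list[Event], decision_time: datetime) -> list[Event]:
    """Events a policy may observe at a decision point: nothing after decision_time."""
    return [e for e in events if e.timestamp <= decision_time]


log = [
    Event("lead_1", "entity.created", datetime(2024, 1, 1)),
    Event("lead_1", "interaction.occurred", datetime(2024, 1, 3)),
    Event("lead_1", "state.changed", datetime(2024, 1, 9)),  # after the decision point
]

# Replaying a decision made on Jan 5 hides the Jan 9 event from the policy.
snapshot = visible_events(log, datetime(2024, 1, 5))
print([e.event_type for e in snapshot])  # ['entity.created', 'interaction.occurred']
```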
Each workflow/use-case should be a separate repository that depends on system-arena.
```
my-workflow/
├── manifest.yaml          # Workflow definition
├── data/
│   └── events.parquet     # Historical event log
├── policies/
│   └── my_policy.py       # Custom policies
└── analysis/
    └── notebooks/         # Analysis notebooks
```
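A custom policy in `policies/my_policy.py` implements the policy interface from `arena.policy`. The exact base class and method signature may differ from this sketch, which assumes a `decide` method that receives a snapshot of case feature values and returns one of the allowed actions:

```python
# policies/my_policy.py — a hypothetical custom policy.
# NOTE: the real interface lives in arena.policy; the `decide` name and
# dict-shaped snapshot here are assumptions for illustration.


class MyFollowupPolicy:
    """Escalate stale leads, remind recent ones, otherwise wait."""

    def decide(self, snapshot: dict) -> str:
        days = snapshot.get("days_since_last_contact", 0)
        if days > 7:
            return "escalate"
        if days > 3:
            return "send_reminder"
        return "wait"


policy = MyFollowupPolicy()
print(policy.decide({"days_since_last_contact": 5}))  # send_reminder
```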
```yaml
name: my_workflow
version: "1.0"

case:
  id_field: lead_id
  timestamp_field: event_timestamp

decision_points:
  - name: followup_due
    trigger: "timer.elapsed"
    condition: "days_since_last_contact > 3"
    allowed_actions:
      - action: wait
      - action: send_reminder
      - action: escalate
    outcomes:
      positive: [converted, meeting_booked]
      negative: [went_cold, lost]
```

```python
from arena.datasets import load_event_log, load_manifest
from arena.replay import BenchmarkRunner, BenchmarkConfig
from arena.policy import ThresholdPolicy, ThresholdConfig

# Load data
manifest = load_manifest("manifest.yaml")
event_log = load_event_log("data/events.parquet", manifest)

# Configure policy
policy = ThresholdPolicy(
    thresholds=[
        ThresholdConfig(
            field="days_since_last_contact",
            threshold=3.0,
            action_above="send_reminder",
            action_below="wait",
        )
    ]
)

# Run benchmark
config = BenchmarkConfig(
    decision_rules=manifest.get_decision_rules(),
    constraints=manifest.get_constraints(),
)
runner = BenchmarkRunner(config)
result = runner.run(event_log, policy)

# View results
print(f"Decisions: {result.total_decisions}")
print(f"Violations: {result.total_violations}")
```

Events can reference multiple objects (not just a single `case_id`):
```python
from arena.core.types import Event, EventType, ObjectRef

event = Event(
    case_id=lead_id,
    event_type=EventType.INTERACTION_OCCURRED,
    object_refs=(
        ObjectRef("lead", "lead_123"),
        ObjectRef("clinic", "clinic_456", role="target"),
        ObjectRef("assistant", "asst_789", role="owner"),
    ),
)

# Filter by object
clinic_events = event_log.filter_by_object("clinic", "clinic_456")
```

```python
from arena.integrations import export_to_pm4py, discover_process

# Export for PM4Py analysis
pm4py_log = export_to_pm4py(event_log)

# Discover a process model
net, im, fm = discover_process(event_log, algorithm="inductive")
```

```python
from arena.integrations import compare_variants

# Compare successful vs. unsuccessful trajectories
comparison = compare_variants(event_log)
print(comparison.summary())
```

```python
from arena.integrations import PerformanceAnalyzer

analyzer = PerformanceAnalyzer()
report = analyzer.analyze(event_log)
print(report.summary())
```

| Module | Purpose |
|---|---|
| `arena.core` | `Event`, `EventLog`, `CaseSnapshot`, `ObjectRef` |
| `arena.decision` | `DecisionPoint` extraction, labeling |
| `arena.policy` | `Policy` interface, baselines (Random, Rule, Threshold) |
| `arena.replay` | `BenchmarkRunner`, no-future-leakage timeline |
| `arena.eval` | Metrics, comparison, reports |
| `arena.datasets` | Parquet/JSONL loading, manifest parsing |
| `arena.integrations` | PM4Py, variant analysis, performance overlays |
| `arena.cli` | Command-line interface |
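Among the metrics in `arena.eval`, precision@K ranks decisions by score and asks what fraction of the top K led to a positive outcome. The metric itself is standard; a self-contained version (not the library's code, and the pair-based input shape is an assumption):

```python
def precision_at_k(scored_outcomes: list[tuple[float, bool]], k: int) -> float:
    """Fraction of the k highest-scored decisions that had a positive outcome.

    scored_outcomes: (score, is_positive) pairs, one per decision.
    """
    top_k = sorted(scored_outcomes, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(1 for _, positive in top_k if positive) / k


decisions = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.2, True)]
print(precision_at_k(decisions, 3))  # 2 of the top 3 are positive
```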
System Arena builds on ideas from these open-source projects:
| Project | Purpose |
|---|---|
| Temporal | Workflow orchestration and durable execution |
| PM4Py | Process mining algorithms and analysis |
| Retentioneering | User behavior and clickstream analysis |
| Langfuse | LLM observability and tracing |
| LangGraph | Agent orchestration with state machines |
| SimPy | Discrete-event process simulation |
| Camunda/Zeebe | BPMN workflow engine |
Recommended reading for understanding the architectural patterns used:
- awesome-software-architecture — Comprehensive architecture patterns
- architecture-decision-record — ADR templates and examples
- domain-driven-design-roadmap — DDD learning path
- awesome-cqrs-event-sourcing — CQRS and event sourcing resources
Contributions are welcome! Here's how to get started:
- Fork the repository
- Clone your fork: `git clone https://github.com/YOUR_USERNAME/system-arena.git`
- Install dev dependencies: `pip install -e ".[dev]"`
- Create a branch: `git checkout -b feature/your-feature`
- Make changes and add tests
- Run checks: `ruff check . && mypy src/`
- Submit a pull request
Please ensure your PR:
- Follows existing code style
- Includes tests for new functionality
- Updates documentation if needed
MIT