Commit 86a0dfe

Browse files
Authored by simonrosenberg, claude, and openhands-agent
Add event-sourcing system benchmarks (#2032)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: openhands <openhands@all-hands.dev>
1 parent 21d1b67 · 5 files changed: +825 -0
Lines changed: 132 additions & 0 deletions
# Event-Sourced State: Systems Metrics

We report four SDK-attributable systems metrics for the event-sourced state management design described in Section 4.2, including its persistence and crash recovery paths. We extract real event payloads from 433 SWE-Bench Verified evaluation conversations (39,870 total events) and replay them through the SDK's production I/O code path on a local machine. The SDK does not instrument persist or replay timing internally, so storage metrics are measured directly from the traces while latency metrics are obtained by re-executing the same `LocalFileStore` lock-and-write path with the original payloads under a fixed deployment configuration.

## Metrics

1. **Persist latency per event / action cycle.** The wall-clock time to durably append a single event to the log. Each append acquires a file lock, serializes the event to JSON, and writes a new file. An action cycle comprises one ActionEvent write followed by one ObservationEvent write — the two persists that bracket every tool invocation.

2. **Replay time vs. log size.** The time to reconstruct in-memory state from the on-disk event log. This has two phases: index rebuild (listing the events directory and parsing filenames via regex) and full replay (reading and deserializing every event file). This cost is paid once on process startup or after a crash.

3. **Storage growth.** The cumulative on-disk footprint of the event log as a function of conversation length, broken down by event type. Since each event is an independent JSON file, total storage grows linearly with event count.

4. **Time-to-recover via replay after failures.** The end-to-end latency of the crash recovery path: load all persisted events, then scan in reverse for actions that lack a matching observation (unmatched-action detection, as implemented in `ConversationState.get_unmatched_actions()`). An unmatched action indicates the agent crashed mid-execution and must re-dispatch.
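The per-event persist path in metric 1 can be sketched with the standard library alone. This is an illustrative stand-in for the SDK's `LocalFileStore` lock-and-write path, not its implementation: the file lock is elided, and the `NNNNN.json` filename scheme is assumed for the sketch.

```python
import json
import os
import tempfile
import time


def persist_event(events_dir: str, seq: int, event: dict) -> float:
    """Append one event as its own JSON file; return wall-clock latency in ms.

    Stand-in for LocalFileStore.lock() + LocalFileStore.write(): lock
    acquisition is elided here, so this times serialize + write only.
    """
    t0 = time.perf_counter()
    payload = json.dumps(event)  # serialize the event to JSON
    with open(os.path.join(events_dir, f"{seq:05d}.json"), "w") as f:
        f.write(payload)  # each event is an independent file on disk
    return (time.perf_counter() - t0) * 1000


events_dir = tempfile.mkdtemp(prefix="eventlog_")
# One action cycle = one ActionEvent write + one ObservationEvent write.
action_ms = persist_event(events_dir, 0, {"kind": "ActionEvent", "tool": "bash"})
obs_ms = persist_event(events_dir, 1, {"kind": "ObservationEvent", "output": "ok"})
cycle_ms = action_ms + obs_ms
```

The benchmark harness below times the same bracketed region, but through the SDK's real lock-and-write calls with the original payloads.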
## Setup

**Workload:** Event payloads extracted from a full SWE-Bench Verified evaluation run (433 instances, `litellm_proxy` backend, max 500 iterations). Events range from 190B to 260KB, with a median of 1.5KB.

**I/O path:** All persist measurements exercise the production code path — `LocalFileStore.lock()` followed by `LocalFileStore.write()` — with the original JSON payloads from the evaluation traces.

## Data

The evaluation traces used for these benchmarks are from a SWE-Bench Verified run (433 instances, SDK commit `cfe52af`, GitHub Actions run `21870831025`). To download:

```bash
curl -L -o results.tar.gz \
  https://results.eval.all-hands.dev/swtbench/litellm_proxy-jade-spark-2862/21870831025/results.tar.gz
tar xzf results.tar.gz
```

After extraction, pass the inner run directory as `--eval-dir`. It should contain `conversations/` (with `.tar.gz` traces) and `output.jsonl`.
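A quick sanity check of the extracted layout can be sketched as follows. The helper name `check_eval_dir` is hypothetical; the benchmark scripts perform the equivalent lookup of `output.jsonl` and `conversations/` themselves.

```python
import os


def check_eval_dir(eval_dir: str) -> list[str]:
    """Return the sorted conversation tarballs under an extracted run dir.

    Raises if the layout the benchmark scripts expect is missing:
    output.jsonl at the top level and .tar.gz traces under conversations/.
    """
    if not os.path.isfile(os.path.join(eval_dir, "output.jsonl")):
        raise FileNotFoundError(f"{eval_dir}: missing output.jsonl")
    conv_dir = os.path.join(eval_dir, "conversations")
    tarballs = sorted(
        name for name in os.listdir(conv_dir) if name.endswith(".tar.gz")
    )
    if not tarballs:
        raise FileNotFoundError(f"{conv_dir}: no .tar.gz conversation traces")
    return tarballs
```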
## Scripts

All scripts accept `--eval-dir <path>` pointing to the extracted evaluation run directory.

| Script | Metrics | Usage |
|---|---|---|
| `bench_persist_latency.py` | Persist latency per event / action cycle | `python bench_persist_latency.py --eval-dir <path>` |
| `bench_replay_and_recovery.py` | Replay time vs. log size, time-to-recover | `python bench_replay_and_recovery.py --eval-dir <path>` |
| `bench_storage_growth.py` | Storage growth and composition | `python bench_storage_growth.py --eval-dir <path>` |

---
## Results

### 1. Persist Latency per Event / Action Cycle

**Method:** Extract persisted event files from 29 sampled SWE-Bench conversations. Replay each through the `LocalFileStore.lock()` + `LocalFileStore.write()` path with the original JSON payloads.

#### Per-Event Persist Latency

| Event Type | N | Median | Mean | P95 | Median Size |
|---|---|---|---|---|---|
| SystemPromptEvent | 29 | 0.351ms | 0.374ms | 0.582ms | 24,500B |
| MessageEvent | 29 | 0.201ms | 0.206ms | 0.261ms | 3,239B |
| ActionEvent | 1,264 | 0.163ms | 0.175ms | 0.244ms | 1,071B |
| ObservationEvent | 1,264 | 0.167ms | 0.180ms | 0.255ms | 2,254B |
| ConversationStateUpdateEvent | 58 | 0.168ms | 0.172ms | 0.218ms | 191B |
| **All Events** | **2,644** | **0.166ms** | **0.180ms** | **0.267ms** | **1,395B** |

#### Per Action Cycle (Action + Observation)

| Metric | Value |
|---|---|
| Median | 0.36ms |
| Mean | 0.37ms |

---

### 2. Replay Time vs. Log Size

**Method:** Build event logs of increasing size from real payloads. Measure index rebuild (directory listing + filename regex parse) and full replay (read + JSON parse all events).

| Events | Storage | Index Rebuild | Full Replay |
|---|---|---|---|
| 10 | 36.4KB | 0.02ms | 0.30ms |
| 25 | 57.5KB | 0.03ms | 0.58ms |
| 50 | 122.1KB | 0.05ms | 1.21ms |
| 100 | 227.0KB | 0.08ms | 2.28ms |
| 200 | 576.2KB | 0.17ms | 4.89ms |
| 500 | 2.0MB | 0.37ms | 14.26ms |
| 1,000 | 4.3MB | 0.75ms | 29.49ms |
| 1,500 | 8.2MB | 1.09ms | 48.06ms |

Replay scales linearly with event count. At the maximum observed conversation size in the evaluation (358 events), full replay completes in under 10ms.
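The two replay phases can be sketched as follows. This is a minimal stand-in: the `NNNNN-Kind.json` filename pattern is assumed for illustration and may differ from the SDK's actual scheme, and plain `json.load` stands in for Pydantic deserialization.

```python
import json
import os
import re

# Hypothetical filename pattern for illustration, e.g. "00042-ActionEvent.json".
EVENT_FILE_RE = re.compile(r"^(\d+)-(\w+)\.json$")


def rebuild_index(events_dir: str) -> list[str]:
    """Phase 1: list the directory and order files by parsed sequence number."""
    entries = []
    for name in os.listdir(events_dir):
        m = EVENT_FILE_RE.match(name)
        if m:  # skip anything that is not an event file
            entries.append((int(m.group(1)), name))
    return [name for _, name in sorted(entries)]


def full_replay(events_dir: str) -> list[dict]:
    """Phase 2: read and deserialize every event file, in index order."""
    events = []
    for name in rebuild_index(events_dir):
        with open(os.path.join(events_dir, name)) as f:
            events.append(json.load(f))
    return events
```

Both phases are O(n) in event count, which matches the linear scaling in the table above.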
---

### 3. Storage Growth

**Method:** Analyze all 433 SWE-Bench conversations. Measure per-conversation storage and breakdown by event type.

#### Conversation Size Distribution

| Metric | Min | P25 | Median | P75 | Max |
|---|---|---|---|---|---|
| Events | 22 | 64 | 82 | 108 | 358 |
| Storage | 109.6KB | | 380.0KB | 634.3KB | 3,357.0KB |

Mean events per conversation: 92.1 (stdev 39.9). Average event size: ~5.8KB (227.1MB across 39,870 events). Storage grows linearly with event count.
#### Storage Composition by Event Type

| Event Type | Count | % Events | Total | % Storage | Avg Size |
|---|---|---|---|---|---|
| ObservationEvent | 19,065 | 47.8% | 177.1MB | 78.0% | 9.51KB |
| ActionEvent | 19,069 | 47.8% | 38.3MB | 16.9% | 2.05KB |
| SystemPromptEvent | 433 | 1.1% | 10.1MB | 4.5% | 23.93KB |
| MessageEvent | 433 | 1.1% | 1.4MB | 0.6% | 3.29KB |
| ConversationStateUpdateEvent | 866 | 2.2% | 0.2MB | 0.1% | 0.19KB |
| **Total** | **39,870** | | **227.1MB** | | |

ObservationEvents (tool outputs) account for 78% of storage despite being only 48% of events by count.
---

### 4. Time-to-Recover via Replay After Failures

**Method:** Build event logs from real payloads, then measure the full recovery path: read all events + reverse scan for actions without matching observations (unmatched-action detection, as implemented in `ConversationState.get_unmatched_actions()`).

| Events | Storage | Time-to-Recover |
|---|---|---|
| 10 | 36.4KB | 0.64ms |
| 25 | 57.5KB | 1.45ms |
| 50 | 122.1KB | 2.71ms |
| 100 | 227.0KB | 5.35ms |
| 200 | 576.2KB | 10.70ms |
| 500 | 2.0MB | 27.92ms |
| 1,000 | 4.3MB | 57.50ms |
| 1,500 | 8.2MB | 90.26ms |

Recovery includes full Pydantic deserialization of all events via `Event.model_validate_json()` and scanning in reverse for actions that lack a corresponding observation (indicating a crash mid-execution) via `ConversationState.get_unmatched_actions()`. At the median conversation size (82 events), recovery completes in ~5ms. At the largest observed conversation (358 events), recovery completes in under 20ms.
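The unmatched-action scan can be sketched in a few lines. This is an illustrative version of what `ConversationState.get_unmatched_actions()` does, not the SDK code; the `id`/`action_id` pairing fields are an assumption about the event schema.

```python
def get_unmatched_actions(events: list[dict]) -> list[dict]:
    """Reverse-scan an event list for actions with no matching observation.

    Sketch only: assumes each ObservationEvent names the action it answers
    via an `action_id` field, which may differ from the SDK schema.
    """
    observed: set[str] = set()
    unmatched: list[dict] = []
    for event in reversed(events):
        kind = event.get("kind")
        if kind == "ObservationEvent":
            observed.add(event["action_id"])
        elif kind == "ActionEvent" and event["id"] not in observed:
            # No observation ever referenced this action: the agent likely
            # crashed mid-execution and the action must be re-dispatched.
            unmatched.append(event)
    unmatched.reverse()  # restore log order
    return unmatched
```

Scanning in reverse means observations, which always follow their actions in the log, are seen before the actions they match.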
Lines changed: 227 additions & 0 deletions
#!/usr/bin/env python3
"""
Benchmark: Persist latency per event and per action cycle.

Extracts real event payloads from SWE-Bench evaluation conversation traces
and replays them through the SDK's LocalFileStore lock-and-write path to
measure per-event and per-cycle persist latency.

Usage:
    python bench_persist_latency.py --eval-dir <path-to-eval-run>
"""

import argparse
import gc
import json
import os
import shutil
import statistics
import tempfile
import time

from benchmark_utils import extract_conversation, read_event_files

from openhands.sdk.io import LocalFileStore


EVENTS_DIR_NAME = "events"
LOCK_FILE = "events/.eventlog.lock"


def measure_persist_latencies(event_files: list[dict]) -> list[dict]:
    """Replay the persist path EventLog.append() uses:
    lock -> write JSON file -> release lock

    Uses LocalFileStore directly with real event payloads.
    """
    tmpdir = tempfile.mkdtemp(prefix="bench_persist_")
    try:
        fs = LocalFileStore(tmpdir, cache_limit_size=len(event_files) + 100)

        results = []
        for i, ef in enumerate(event_files):
            target_path = f"{EVENTS_DIR_NAME}/{ef['filename']}"

            gc.disable()
            t0 = time.perf_counter()
            with fs.lock(LOCK_FILE, timeout=30.0):
                fs.write(target_path, ef["json_str"])
            t1 = time.perf_counter()
            gc.enable()

            results.append(
                {
                    "kind": ef["kind"],
                    "size_bytes": ef["size_bytes"],
                    "persist_ms": (t1 - t0) * 1000,
                    "event_idx": i,
                }
            )
        return results
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)


def main():
    import logging

    logging.getLogger("openhands").setLevel(logging.ERROR)

    parser = argparse.ArgumentParser(
        description="Benchmark persist latency per event/action cycle"
    )
    parser.add_argument(
        "--eval-dir",
        required=True,
        help="Path to evaluation run directory",
    )
    parser.add_argument(
        "--output",
        default="bench_persist_latency_results.json",
        help="Output JSON file path",
    )
    parser.add_argument(
        "--sample-step",
        type=int,
        default=15,
        help="Sample every Nth conversation (default: 15)",
    )
    args = parser.parse_args()

    # Load instance metadata
    instances = {}
    with open(os.path.join(args.eval_dir, "output.jsonl")) as f:
        for line in f:
            d = json.loads(line)
            instances[d["instance_id"]] = d

    conv_dir = os.path.join(args.eval_dir, "conversations")
    tarballs = sorted(os.listdir(conv_dir))
    sample_tarballs = tarballs[:: args.sample_step]
    print(f"Sampling {len(sample_tarballs)} of {len(tarballs)} conversations\n")

    all_persist: list[dict] = []
    conv_summaries: list[dict] = []

    for tarname in sample_tarballs:
        instance_id = tarname.replace(".tar.gz", "")
        instance_data = instances.get(instance_id)
        if not instance_data:
            continue

        tarpath = os.path.join(conv_dir, tarname)
        tmpdir = tempfile.mkdtemp(prefix="bench_persist_")
        try:
            events_dir = extract_conversation(tarpath, tmpdir)
            if not events_dir:
                continue
            event_files = read_event_files(events_dir)
            if not event_files:
                continue

            persist_results = measure_persist_latencies(event_files)
            all_persist.extend(persist_results)

            # Per-cycle persist time (action + observation pairs)
            action_p = [r for r in persist_results if r["kind"] == "ActionEvent"]
            obs_p = [r for r in persist_results if r["kind"] == "ObservationEvent"]
            n_cycles = min(len(action_p), len(obs_p))
            cycle_persist = [
                action_p[i]["persist_ms"] + obs_p[i]["persist_ms"]
                for i in range(n_cycles)
            ]

            total_persist_ms = sum(r["persist_ms"] for r in persist_results)

            conv_summaries.append(
                {
                    "instance_id": instance_id,
                    "n_events": len(event_files),
                    "n_cycles": n_cycles,
                    "total_persist_ms": total_persist_ms,
                    "mean_cycle_persist_ms": (
                        statistics.mean(cycle_persist) if cycle_persist else 0
                    ),
                }
            )
            n_ev = len(event_files)
            print(
                f" {instance_id[:50]:50s} events={n_ev:>4}"
                f" persist={total_persist_ms:>7.1f}ms"
            )

        finally:
            shutil.rmtree(tmpdir, ignore_errors=True)

    # --- Analysis ---
    print(f"\n{'=' * 70}")
    print("RESULTS: Persist Latency per Event / Action Cycle")
    print(f"{'=' * 70}")

    by_kind: dict[str, list[dict]] = {}
    for r in all_persist:
        by_kind.setdefault(r["kind"], []).append(r)

    print("\n--- Per-Event Persist Latency ---")
    header = (
        f" {'Event Type':<35} {'N':>5} {'Median':>10}"
        f" {'Mean':>10} {'P95':>10} {'MedSize':>10}"
    )
    print(header)
    print(f" {'-' * 80}")
    for kind in [
        "SystemPromptEvent",
        "MessageEvent",
        "ActionEvent",
        "ObservationEvent",
        "ConversationStateUpdateEvent",
        "AgentErrorEvent",
    ]:
        if kind not in by_kind:
            continue
        entries = by_kind[kind]
        lats = sorted([e["persist_ms"] for e in entries])
        sizes = sorted([e["size_bytes"] for e in entries])
        n = len(lats)
        print(
            f" {kind:<35} {n:>5}"
            f" {lats[n // 2]:>9.3f}ms"
            f" {statistics.mean(lats):>9.3f}ms"
            f" {lats[int(n * 0.95)]:>9.3f}ms"
            f" {sizes[n // 2]:>8,}B"
        )

    all_lats = sorted([r["persist_ms"] for r in all_persist])
    all_sizes = sorted([r["size_bytes"] for r in all_persist])
    n = len(all_lats)
    print(f" {'-' * 80}")
    print(
        f" {'ALL EVENTS':<35} {n:>5}"
        f" {all_lats[n // 2]:>9.3f}ms"
        f" {statistics.mean(all_lats):>9.3f}ms"
        f" {all_lats[int(n * 0.95)]:>9.3f}ms"
        f" {all_sizes[n // 2]:>8,}B"
    )

    # Per action cycle
    print("\n--- Per Action Cycle (Action + Observation) ---")
    cycle_persists = [
        s["mean_cycle_persist_ms"] for s in conv_summaries if s["n_cycles"] > 0
    ]
    med = statistics.median(cycle_persists)
    mean = statistics.mean(cycle_persists)
    print(f" Median per-cycle persist time: {med:.2f}ms")
    print(f" Mean per-cycle persist time: {mean:.2f}ms")

    # Save
    with open(args.output, "w") as f:
        json.dump(
            {"per_event": all_persist, "conversations": conv_summaries},
            f,
            indent=2,
        )
    print(f"\nRaw data saved to {args.output}")


if __name__ == "__main__":
    main()
