
Add deterministic replay gate for agent tests#673

Open
ZackMitchell910 wants to merge 1 commit into daydreamsai:main from ZackMitchell910:runledger/replay-gate

Conversation

@ZackMitchell910 commented Dec 19, 2025

Summary

  • add a replay-only RunLedger eval suite (suite/case/schema/cassette + stub agent)
  • add a baseline file for regression gating
  • add a GitHub Actions workflow using runledger/Runledger@v0.1
  • add a small README note + ignore runledger_out/

How to run locally

runledger run evals/runledger --mode replay --baseline baselines/runledger-demo.json

Notes

  • no external calls; replay-only cassette
  • feel free to remove the suite/workflow if it is not desired
  • GitHub Actions note: workflows from first-time contributors/forks may require a maintainer to click “Approve and run” before checks will execute.

Summary by CodeRabbit

  • New Features

    • Added automated evaluation pipeline with CI gate for tool-using agents in replay mode.
    • Introduced baseline benchmarking system for performance tracking.
  • Documentation

    • Updated readme with information about the new CI evaluation gate.
  • Chores

    • Updated .gitignore to exclude evaluation artifacts.


vercel bot commented Dec 19, 2025

@ZackMitchell910 is attempting to deploy a commit to the ponderingdemocritus' projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai bot commented Dec 19, 2025

Walkthrough

Introduces a RunLedger evaluation system for tool-using agents, including a GitHub Actions CI workflow, a minimal Python agent implementation with JSON IPC, test cases, cassette-based tool mocking, schema validation, and baseline benchmarks for deterministic evaluation and replay-based testing.
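The JSON IPC runs as newline-delimited messages over stdin/stdout. A minimal sketch of the four message shapes involved (the `tool_call` and `final_output` shapes appear verbatim in the agent diff below; the `task_start` and `tool_result` shapes are inferred from how the agent reads them, so any extra fields are assumptions):

```python
import json

# One JSON object per line in each direction.
task_start = {"type": "task_start",
              "input": {"ticket": "I can't log in, please reset my password"}}
tool_call = {"type": "tool_call", "name": "search_docs", "call_id": "c1",
             "args": {"q": task_start["input"]["ticket"]}}
tool_result = {"type": "tool_result", "call_id": "c1",
               "result": {"hits": ["Password reset guide"]}}  # payload shape assumed
final_output = {"type": "final_output",
                "output": {"category": "account",
                           "reply": "Reset password instructions sent."}}

# Serialized, each message is a single line on the pipe.
for msg in (task_start, tool_call, tool_result, final_output):
    print(json.dumps(msg))
```

The harness sends `task_start` and `tool_result` to the agent's stdin and reads `tool_call` and `final_output` from its stdout.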

Changes

Cohort / File(s) — Summary

  • CI/Workflow Setup (.github/workflows/runledger.yml, .gitignore): Adds a new GitHub Actions workflow, runledger-gate, triggered on pull requests to run deterministic evals in replay mode; ignores the runledger_out/ directory in git.
  • Evaluation Suite Configuration (evals/runledger/suite.yaml, evals/runledger/schema.json): Defines RunLedger suite orchestration (executor, tool registry, assertions), resource budgets, and regression thresholds; adds a JSON schema validating the required category and reply fields.
  • Agent Implementation (evals/runledger/agent/agent.py): Implements a minimal JSON-over-stdin/stdout agent that listens for task_start messages, calls the search_docs tool, processes tool_result, and emits final_output.
  • Test Cases & Fixtures (evals/runledger/cases/t1.yaml, evals/runledger/cassettes/t1.jsonl): Adds evaluation case t1 (a login ticket triage scenario) and a corresponding cassette fixture with a mocked search_docs tool result.
  • Baselines & Documentation (baselines/runledger-demo.json, readme.md): Provides a baseline benchmark result for replay-based runs; documents the new RunLedger CI gate.

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions
    participant Agent as Python Agent
    participant Tools as Tool Registry
    participant Cassette as Cassette (Replay)
    participant Assertions as Assertions
    
    GHA->>Agent: Execute agent.py
    GHA->>Agent: Provide stdin (task_start with ticket)
    
    Agent->>Agent: Parse JSON task_start message
    Agent->>Agent: Extract ticket → search query
    
    Agent->>Tools: Emit tool_call (search_docs, query)
    Tools->>Cassette: Lookup mocked tool result
    Cassette-->>Tools: Return cached result
    Tools-->>Agent: Provide tool_result via stdin
    
    Agent->>Agent: Parse JSON tool_result
    Agent->>Agent: Emit final_output (category, reply)
    
    GHA->>Assertions: Validate final_output against schema
    Assertions->>Assertions: Check required fields
    Assertions-->>GHA: Pass/Fail result
    
    GHA->>GHA: Compare against baseline
    GHA->>GHA: Upload artifacts & report

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Agent logic (agent.py): Review message handling, IPC protocol, tool invocation semantics
  • Schema validation: Ensure JSON schema correctly enforces required fields
  • Configuration correctness: Verify suite.yaml paths, executor, tool definitions, and resource budgets align with agent implementation
  • Test data integrity: Confirm cassette format matches expected tool result structure and case assertions

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed: The title "Add deterministic replay gate for agent tests" accurately and concisely summarizes the main objective of the pull request, which introduces a deterministic replay-based evaluation gate for testing agents.
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings


@ZackMitchell910 (Author) commented:

Thanks for taking a look! This PR adds a replay-only RunLedger gate. The workflow runs are currently waiting on fork approval (action_required) or have not started yet for forks. If you are open to it, please approve/authorize the workflow run so CI can complete. Happy to adjust anything.

coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
readme.md (1)

534-543: Clear documentation of the replay gate.

The section effectively documents the deterministic CI gate and provides the local run command. Consider adding a note that this is a demonstration setup with a stub agent, to set appropriate expectations for users exploring the feature.

.github/workflows/runledger.yml (2)

13-16: Consider pinning the RunLedger version for reproducibility.

The workflow installs runledger without specifying a version, which could lead to non-deterministic CI behavior if the package is updated. For a deterministic replay gate, consider pinning to a specific version.

🔎 Proposed fix
       - name: Install RunLedger
         run: |
           python -m pip install --upgrade pip
-          python -m pip install runledger
+          python -m pip install runledger==0.1.0

1-24: Verify the workflow setup and consider additional checks.

The workflow structure is sound and will run on every pull request. Consider the following:

  1. The workflow will run on all PRs, including those that don't modify eval-related files. You might want to add path filters if performance becomes a concern.
  2. The workflow doesn't document explicit failure conditions; it relies on the runledger command's exit code.
evals/runledger/agent/agent.py (1)

8-19: Add error handling for JSON parsing and consider validating message structure.

The agent reads and parses JSON from stdin without error handling. If malformed JSON is received, json.loads() will raise an exception and crash the agent. For a more robust implementation, consider wrapping the parsing in a try-except block.

Additionally, the agent uses hardcoded output values and doesn't actually process the tool_result data. While this is acceptable for a stub agent, it's worth documenting this limitation.

🔎 Proposed improvement
 def main():
     for line in sys.stdin:
         line = line.strip()
         if not line:
             continue
-        msg = json.loads(line)
+        try:
+            msg = json.loads(line)
+        except json.JSONDecodeError as e:
+            send({"type": "error", "message": f"Invalid JSON: {e}"})
+            continue
         if msg.get("type") == "task_start":
             ticket = msg.get("input", {}).get("ticket", "")
             send({"type": "tool_call", "name": "search_docs", "call_id": "c1", "args": {"q": ticket}})
         elif msg.get("type") == "tool_result":
+            # Note: This is a stub agent with hardcoded output
             send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
             break
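The stub agent (with or without the proposed guard) can be smoke-tested end to end by piping the two inbound messages through a subprocess. A minimal sketch using an inline copy of the stub logic in place of evals/runledger/agent/agent.py, so it runs anywhere:

```python
import json
import os
import subprocess
import sys
import tempfile
import textwrap

# Inline stand-in for the PR's stub agent; same protocol, same hardcoded output.
agent_src = textwrap.dedent("""
    import sys, json
    def send(obj):
        sys.stdout.write(json.dumps(obj) + "\\n")
        sys.stdout.flush()
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        if msg.get("type") == "task_start":
            send({"type": "tool_call", "name": "search_docs", "call_id": "c1",
                  "args": {"q": msg.get("input", {}).get("ticket", "")}})
        elif msg.get("type") == "tool_result":
            send({"type": "final_output",
                  "output": {"category": "account",
                             "reply": "Reset password instructions sent."}})
            break
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(agent_src)
    agent_path = f.name

# Feed the two inbound messages, capture the two outbound ones.
stdin = (json.dumps({"type": "task_start", "input": {"ticket": "reset password"}}) + "\n"
         + json.dumps({"type": "tool_result", "call_id": "c1", "result": {}}) + "\n")
out = subprocess.run([sys.executable, agent_path], input=stdin,
                     capture_output=True, text=True)
os.unlink(agent_path)

for line in out.stdout.splitlines():
    print(json.loads(line)["type"])  # → tool_call, then final_output
```

Swapping the temp-file path for the real evals/runledger/agent/agent.py gives a quick local check of the IPC protocol without invoking RunLedger.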
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25e199d and f99d5a8.

📒 Files selected for processing (9)
  • .github/workflows/runledger.yml (1 hunks)
  • .gitignore (1 hunks)
  • baselines/runledger-demo.json (1 hunks)
  • evals/runledger/agent/agent.py (1 hunks)
  • evals/runledger/cases/t1.yaml (1 hunks)
  • evals/runledger/cassettes/t1.jsonl (1 hunks)
  • evals/runledger/schema.json (1 hunks)
  • evals/runledger/suite.yaml (1 hunks)
  • readme.md (1 hunks)
🔇 Additional comments (9)
.gitignore (1)

208-209: LGTM!

The addition of runledger_out/ to the ignore list aligns with the CI workflow that generates and uploads artifacts from this directory.

evals/runledger/cassettes/t1.jsonl (1)

1-1: LGTM!

The cassette data structure is well-formed and aligns with the test case input ("reset password") and the search_docs tool definition in the suite configuration.

evals/runledger/schema.json (1)

1-15: LGTM!

The schema correctly defines the expected output structure with required string fields category and reply, matching the assertions in the test case and the agent's final output format.
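The required-fields check the schema enforces can be approximated without any dependencies. A sketch of what a validator does with this schema (the schema body below reproduces the shape described in the review, not the exact contents of evals/runledger/schema.json; a real harness would use a full JSON Schema library):

```python
# Shape described in the review: required string fields "category" and "reply".
schema = {
    "type": "object",
    "required": ["category", "reply"],
    "properties": {"category": {"type": "string"}, "reply": {"type": "string"}},
}

def check_output(instance, schema):
    """Return True iff all required fields are present with string values."""
    missing = [k for k in schema["required"] if k not in instance]
    wrong_type = [k for k in schema["properties"]
                  if k in instance and not isinstance(instance[k], str)]
    return not missing and not wrong_type

good = {"category": "account", "reply": "Reset password instructions sent."}
bad = {"category": "account"}  # missing "reply"
print(check_output(good, schema), check_output(bad, schema))  # → True False
```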

evals/runledger/cases/t1.yaml (1)

1-10: LGTM!

The test case is well-structured with clear input, cassette reference, and assertions that align with the JSON schema. The required_fields assertion correctly validates the expected output structure.

evals/runledger/agent/agent.py (1)

4-6: LGTM!

The send function correctly writes JSON to stdout and flushes immediately, ensuring proper IPC communication.

evals/runledger/suite.yaml (1)

1-18: Configuration looks good for a deterministic replay demo.

The suite configuration is well-structured with appropriate settings for replay mode. The strict thresholds (min_pass_rate: 1.0 and max_tool_errors: 0) are acceptable for a single deterministic test case, though they mean any deviation will fail the CI gate.

Note that the baseline file at baselines/runledger-demo.json is referenced but not shown in the provided files. Ensure this file exists and contains the expected baseline data for the replay validation.

baselines/runledger-demo.json (3)

109-109: Clarify the intent of the null suite_config_hash.

Line 109 shows suite_config_hash as null. Confirm whether this is intentional (e.g., not yet implemented in RunLedger) or if it should be populated to detect config drift in future runs.


1-113: Data consistency within the baseline is sound.

The aggregates correctly summarize the single passing test case: counts align (1 pass, 0 fail/error), metric aggregations match the case data (tool_calls=1, tool_errors=0, wall_ms=58), pass_rate is 1.0, and assertion counts are consistent (2 total, 0 failed). The structure follows the expected RunLedger schema.
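The consistency checks described above can be recomputed mechanically from the per-case records. A sketch over the aggregate shape (field names are inferred from this review, not the exact baseline schema):

```python
# Single passing case, mirroring the figures quoted above.
cases = [{"status": "pass", "tool_calls": 1, "tool_errors": 0, "wall_ms": 58}]

passed = sum(c["status"] == "pass" for c in cases)
aggregates = {
    "pass": passed,
    "fail": sum(c["status"] == "fail" for c in cases),
    "pass_rate": passed / len(cases),
    "tool_errors": sum(c["tool_errors"] for c in cases),
}
print(aggregates["pass_rate"], aggregates["tool_errors"])  # → 1.0 0
```

Recomputing aggregates like this when regenerating a baseline catches hand-edited or stale summary fields before they gate CI.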


71-72: Verify the cassette file exists and the SHA256 hash matches.

The baseline references a cassette at evals/runledger/cassettes/t1.jsonl with a SHA256 hash. Ensure the file exists in the repository and the hash is accurate; if the cassette is missing or the hash mismatches, evaluation replay will fail.
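The hash can be verified locally before pushing. A sketch (the bytes below are a placeholder; in the repo you would open the real cassette file, and the baseline's exact field name for the hash is an assumption):

```python
import hashlib
import io

def sha256_of(stream):
    """Stream a file in chunks and return its hex SHA256 digest."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(8192), b""):
        h.update(chunk)
    return h.hexdigest()

# In the repo this would be: open("evals/runledger/cassettes/t1.jsonl", "rb")
data = b'{"tool": "search_docs", "result": {"hits": ["Password reset guide"]}}\n'
digest = sha256_of(io.BytesIO(data))
print(len(digest))  # → 64 (hex characters)
```

Comparing this digest against the value stored in baselines/runledger-demo.json confirms the fixture and baseline are in sync.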
