
Add deterministic replay gate for agent tests#673

Open
ZackMitchell910 wants to merge 1 commit into daydreamsai:main from ZackMitchell910:runledger/replay-gate

Conversation

@ZackMitchell910 commented Dec 19, 2025

Summary

  • add a replay-only RunLedger eval suite (suite/case/schema/cassette + stub agent)
  • add a baseline file for regression gating
  • add a GitHub Actions workflow using runledger/Runledger@v0.1
  • add a small README note + ignore runledger_out/

How to run locally

runledger run evals/runledger --mode replay --baseline baselines/runledger-demo.json

Notes

  • no external calls; replay-only cassette
  • feel free to remove the suite/workflow if it is not desired
  • GitHub Actions note: workflows from first-time contributors/forks may require a maintainer to click “Approve and run” before checks will execute.

Summary by CodeRabbit

  • New Features

    • Added automated evaluation pipeline with CI gate for tool-using agents in replay mode.
    • Introduced baseline benchmarking system for performance tracking.
  • Documentation

    • Updated readme with information about the new CI evaluation gate.
  • Chores

    • Updated .gitignore to exclude evaluation artifacts.


vercel bot commented Dec 19, 2025

@ZackMitchell910 is attempting to deploy a commit to the ponderingdemocritus' projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai bot commented Dec 19, 2025

Walkthrough

Introduces a RunLedger evaluation system for tool-using agents, including a GitHub Actions CI workflow, a minimal Python agent implementation with JSON IPC, test cases, cassette-based tool mocking, schema validation, and baseline benchmarks for deterministic evaluation and replay-based testing.
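The JSON IPC runs as newline-delimited messages over stdin/stdout. A minimal sketch of the four message shapes involved (the `tool_call` and `final_output` shapes appear verbatim in the agent diff below; the `task_start` and `tool_result` shapes are inferred from how the agent reads them, so any extra fields are assumptions):

```python
import json

# One JSON object per line in each direction.
task_start = {"type": "task_start",
              "input": {"ticket": "I can't log in, please reset my password"}}
tool_call = {"type": "tool_call", "name": "search_docs", "call_id": "c1",
             "args": {"q": task_start["input"]["ticket"]}}
tool_result = {"type": "tool_result", "call_id": "c1",
               "result": {"hits": ["Password reset guide"]}}  # payload shape assumed
final_output = {"type": "final_output",
                "output": {"category": "account",
                           "reply": "Reset password instructions sent."}}

# Serialized, each message is a single line on the pipe.
for msg in (task_start, tool_call, tool_result, final_output):
    print(json.dumps(msg))
```

The harness sends `task_start` and `tool_result` to the agent's stdin and reads `tool_call` and `final_output` from its stdout.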

Changes

Cohort / File(s) — Summary

  • CI/Workflow Setup (.github/workflows/runledger.yml, .gitignore): Adds a new GitHub Actions workflow, runledger-gate, triggered on pull requests to run deterministic evals in replay mode; ignores the runledger_out/ directory in git.
  • Evaluation Suite Configuration (evals/runledger/suite.yaml, evals/runledger/schema.json): Defines RunLedger suite orchestration (executor, tool registry, assertions), resource budgets, and regression thresholds; adds a JSON schema validating the required category and reply fields.
  • Agent Implementation (evals/runledger/agent/agent.py): Implements a minimal JSON-over-stdin/stdout agent that listens for task_start messages, calls the search_docs tool, processes tool_result, and emits final_output.
  • Test Cases & Fixtures (evals/runledger/cases/t1.yaml, evals/runledger/cassettes/t1.jsonl): Adds evaluation case t1 (a login ticket triage scenario) and a corresponding cassette fixture with a mocked search_docs tool result.
  • Baselines & Documentation (baselines/runledger-demo.json, readme.md): Provides a baseline benchmark result for replay-based runs; documents the new RunLedger CI gate.

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions
    participant Agent as Python Agent
    participant Tools as Tool Registry
    participant Cassette as Cassette (Replay)
    participant Assertions as Assertions
    
    GHA->>Agent: Execute agent.py
    GHA->>Agent: Provide stdin (task_start with ticket)
    
    Agent->>Agent: Parse JSON task_start message
    Agent->>Agent: Extract ticket → search query
    
    Agent->>Tools: Emit tool_call (search_docs, query)
    Tools->>Cassette: Lookup mocked tool result
    Cassette-->>Tools: Return cached result
    Tools-->>Agent: Provide tool_result via stdin
    
    Agent->>Agent: Parse JSON tool_result
    Agent->>Agent: Emit final_output (category, reply)
    
    GHA->>Assertions: Validate final_output against schema
    Assertions->>Assertions: Check required fields
    Assertions-->>GHA: Pass/Fail result
    
    GHA->>GHA: Compare against baseline
    GHA->>GHA: Upload artifacts & report

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Agent logic (agent.py): Review message handling, IPC protocol, tool invocation semantics
  • Schema validation: Ensure JSON schema correctly enforces required fields
  • Configuration correctness: Verify suite.yaml paths, executor, tool definitions, and resource budgets align with agent implementation
  • Test data integrity: Confirm cassette format matches expected tool result structure and case assertions

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed: The title "Add deterministic replay gate for agent tests" accurately and concisely summarizes the main objective of the pull request, which introduces a deterministic replay-based evaluation gate for testing agents.
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings


@ZackMitchell910 (Author) commented:

Thanks for taking a look! This PR adds a replay-only RunLedger gate. The workflow runs are currently waiting on fork approval (action_required) or have not started yet for forks. If you are open to it, please approve/authorize the workflow run so CI can complete. Happy to adjust anything.

coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
readme.md (1)

534-543: Clear documentation of the replay gate.

The section effectively documents the deterministic CI gate and provides the local run command. Consider adding a note that this is a demonstration setup with a stub agent, to set appropriate expectations for users exploring the feature.

.github/workflows/runledger.yml (2)

13-16: Consider pinning the RunLedger version for reproducibility.

The workflow installs runledger without specifying a version, which could lead to non-deterministic CI behavior if the package is updated. For a deterministic replay gate, consider pinning to a specific version.

🔎 Proposed fix
       - name: Install RunLedger
         run: |
           python -m pip install --upgrade pip
-          python -m pip install runledger
+          python -m pip install runledger==0.1.0

1-24: Verify the workflow setup and consider additional checks.

The workflow structure is sound and will run on every pull request. Consider the following:

  1. The workflow will run on all PRs, including those that don't modify eval-related files. You might want to add path filters if performance becomes a concern.
  2. The workflow doesn't document explicit failure conditions; it relies on the runledger command's exit code.
evals/runledger/agent/agent.py (1)

8-19: Add error handling for JSON parsing and consider validating message structure.

The agent reads and parses JSON from stdin without error handling. If malformed JSON is received, json.loads() will raise an exception and crash the agent. For a more robust implementation, consider wrapping the parsing in a try-except block.

Additionally, the agent uses hardcoded output values and doesn't actually process the tool_result data. While this is acceptable for a stub agent, it's worth documenting this limitation.

🔎 Proposed improvement
 def main():
     for line in sys.stdin:
         line = line.strip()
         if not line:
             continue
-        msg = json.loads(line)
+        try:
+            msg = json.loads(line)
+        except json.JSONDecodeError as e:
+            send({"type": "error", "message": f"Invalid JSON: {e}"})
+            continue
         if msg.get("type") == "task_start":
             ticket = msg.get("input", {}).get("ticket", "")
             send({"type": "tool_call", "name": "search_docs", "call_id": "c1", "args": {"q": ticket}})
         elif msg.get("type") == "tool_result":
+            # Note: This is a stub agent with hardcoded output
             send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
             break
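The stub agent (with or without the proposed guard) can be smoke-tested end to end by piping the two inbound messages through a subprocess. A minimal sketch using an inline copy of the stub logic in place of evals/runledger/agent/agent.py, so it runs anywhere:

```python
import json
import os
import subprocess
import sys
import tempfile
import textwrap

# Inline stand-in for the PR's stub agent; same protocol, same hardcoded output.
agent_src = textwrap.dedent("""
    import sys, json
    def send(obj):
        sys.stdout.write(json.dumps(obj) + "\\n")
        sys.stdout.flush()
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        if msg.get("type") == "task_start":
            send({"type": "tool_call", "name": "search_docs", "call_id": "c1",
                  "args": {"q": msg.get("input", {}).get("ticket", "")}})
        elif msg.get("type") == "tool_result":
            send({"type": "final_output",
                  "output": {"category": "account",
                             "reply": "Reset password instructions sent."}})
            break
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(agent_src)
    agent_path = f.name

# Feed the two inbound messages, capture the two outbound ones.
stdin = (json.dumps({"type": "task_start", "input": {"ticket": "reset password"}}) + "\n"
         + json.dumps({"type": "tool_result", "call_id": "c1", "result": {}}) + "\n")
out = subprocess.run([sys.executable, agent_path], input=stdin,
                     capture_output=True, text=True)
os.unlink(agent_path)

for line in out.stdout.splitlines():
    print(json.loads(line)["type"])  # → tool_call, then final_output
```

Swapping the temp-file path for the real evals/runledger/agent/agent.py gives a quick local check of the IPC protocol without invoking RunLedger.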
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25e199d and f99d5a8.

📒 Files selected for processing (9)
  • .github/workflows/runledger.yml (1 hunks)
  • .gitignore (1 hunks)
  • baselines/runledger-demo.json (1 hunks)
  • evals/runledger/agent/agent.py (1 hunks)
  • evals/runledger/cases/t1.yaml (1 hunks)
  • evals/runledger/cassettes/t1.jsonl (1 hunks)
  • evals/runledger/schema.json (1 hunks)
  • evals/runledger/suite.yaml (1 hunks)
  • readme.md (1 hunks)
🔇 Additional comments (9)
.gitignore (1)

208-209: LGTM!

The addition of runledger_out/ to the ignore list aligns with the CI workflow that generates and uploads artifacts from this directory.

evals/runledger/cassettes/t1.jsonl (1)

1-1: LGTM!

The cassette data structure is well-formed and aligns with the test case input ("reset password") and the search_docs tool definition in the suite configuration.

evals/runledger/schema.json (1)

1-15: LGTM!

The schema correctly defines the expected output structure with required string fields category and reply, matching the assertions in the test case and the agent's final output format.
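The required-fields check the schema enforces can be approximated without any dependencies. A sketch of what a validator does with this schema (the schema body below reproduces the shape described in the review, not the exact contents of evals/runledger/schema.json; a real harness would use a full JSON Schema library):

```python
# Shape described in the review: required string fields "category" and "reply".
schema = {
    "type": "object",
    "required": ["category", "reply"],
    "properties": {"category": {"type": "string"}, "reply": {"type": "string"}},
}

def check_output(instance, schema):
    """Return True iff all required fields are present with string values."""
    missing = [k for k in schema["required"] if k not in instance]
    wrong_type = [k for k in schema["properties"]
                  if k in instance and not isinstance(instance[k], str)]
    return not missing and not wrong_type

good = {"category": "account", "reply": "Reset password instructions sent."}
bad = {"category": "account"}  # missing "reply"
print(check_output(good, schema), check_output(bad, schema))  # → True False
```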

evals/runledger/cases/t1.yaml (1)

1-10: LGTM!

The test case is well-structured with clear input, cassette reference, and assertions that align with the JSON schema. The required_fields assertion correctly validates the expected output structure.

evals/runledger/agent/agent.py (1)

4-6: LGTM!

The send function correctly writes JSON to stdout and flushes immediately, ensuring proper IPC communication.

evals/runledger/suite.yaml (1)

1-18: Configuration looks good for a deterministic replay demo.

The suite configuration is well-structured with appropriate settings for replay mode. The strict thresholds (min_pass_rate: 1.0 and max_tool_errors: 0) are acceptable for a single deterministic test case, though they mean any deviation will fail the CI gate.

Note that the baseline file at baselines/runledger-demo.json is referenced but not shown in the provided files. Ensure this file exists and contains the expected baseline data for the replay validation.

baselines/runledger-demo.json (3)

109-109: Clarify the intent of the null suite_config_hash.

Line 109 shows suite_config_hash as null. Confirm whether this is intentional (e.g., not yet implemented in RunLedger) or if it should be populated to detect config drift in future runs.


1-113: Data consistency within the baseline is sound.

The aggregates correctly summarize the single passing test case: counts align (1 pass, 0 fail/error), metric aggregations match the case data (tool_calls=1, tool_errors=0, wall_ms=58), pass_rate is 1.0, and assertion counts are consistent (2 total, 0 failed). The structure follows the expected RunLedger schema.
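The consistency checks described above can be recomputed mechanically from the per-case records. A sketch over the aggregate shape (field names are inferred from this review, not the exact baseline schema):

```python
# Single passing case, mirroring the figures quoted above.
cases = [{"status": "pass", "tool_calls": 1, "tool_errors": 0, "wall_ms": 58}]

passed = sum(c["status"] == "pass" for c in cases)
aggregates = {
    "pass": passed,
    "fail": sum(c["status"] == "fail" for c in cases),
    "pass_rate": passed / len(cases),
    "tool_errors": sum(c["tool_errors"] for c in cases),
}
print(aggregates["pass_rate"], aggregates["tool_errors"])  # → 1.0 0
```

Recomputing aggregates like this when regenerating a baseline catches hand-edited or stale summary fields before they gate CI.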


71-72: Verify the cassette file exists and the SHA256 hash matches.

The baseline references a cassette at evals/runledger/cassettes/t1.jsonl with a SHA256 hash. Ensure the file exists in the repository and the hash is accurate; if the cassette is missing or the hash mismatches, evaluation replay will fail.
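The hash can be verified locally before pushing. A sketch (the bytes below are a placeholder; in the repo you would open the real cassette file, and the baseline's exact field name for the hash is an assumption):

```python
import hashlib
import io

def sha256_of(stream):
    """Stream a file in chunks and return its hex SHA256 digest."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(8192), b""):
        h.update(chunk)
    return h.hexdigest()

# In the repo this would be: open("evals/runledger/cassettes/t1.jsonl", "rb")
data = b'{"tool": "search_docs", "result": {"hits": ["Password reset guide"]}}\n'
digest = sha256_of(io.BytesIO(data))
print(len(digest))  # → 64 (hex characters)
```

Comparing this digest against the value stored in baselines/runledger-demo.json confirms the fixture and baseline are in sync.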
