Add deterministic replay gate for agent tests #673

ZackMitchell910 wants to merge 1 commit into daydreamsai:main from
Conversation
@ZackMitchell910 is attempting to deploy a commit to the ponderingdemocritus' projects Team on Vercel. A member of the Team first needs to authorize it.
Walkthrough

Introduces a RunLedger evaluation system for tool-using agents, including a GitHub Actions CI workflow, a minimal Python agent implementation with JSON IPC, test cases, cassette-based tool mocking, schema validation, and baseline benchmarks for deterministic evaluation and replay-based testing.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GHA as GitHub Actions
    participant Agent as Python Agent
    participant Tools as Tool Registry
    participant Cassette as Cassette (Replay)
    participant Assertions as Assertions
    GHA->>Agent: Execute agent.py
    GHA->>Agent: Provide stdin (task_start with ticket)
    Agent->>Agent: Parse JSON task_start message
    Agent->>Agent: Extract ticket → search query
    Agent->>Tools: Emit tool_call (search_docs, query)
    Tools->>Cassette: Lookup mocked tool result
    Cassette-->>Tools: Return cached result
    Tools-->>Agent: Provide tool_result via stdin
    Agent->>Agent: Parse JSON tool_result
    Agent->>Agent: Emit final_output (category, reply)
    GHA->>Assertions: Validate final_output against schema
    Assertions->>Assertions: Check required fields
    Assertions-->>GHA: Pass/Fail result
    GHA->>GHA: Compare against baseline
    GHA->>GHA: Upload artifacts & report
```
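The message flow above can be modeled without a subprocess by treating the agent as a generator over parsed JSON messages. This is a minimal sketch: the message shapes follow the diagram, but the exact RunLedger wire format is an assumption.

```python
import json

def stub_agent(messages):
    """Model of the agent's JSON-IPC loop from the diagram:
    task_start -> emit tool_call; tool_result -> emit final_output."""
    for msg in messages:
        if msg.get("type") == "task_start":
            ticket = msg.get("input", {}).get("ticket", "")
            yield {"type": "tool_call", "name": "search_docs",
                   "call_id": "c1", "args": {"q": ticket}}
        elif msg.get("type") == "tool_result":
            yield {"type": "final_output",
                   "output": {"category": "account",
                              "reply": "Reset password instructions sent."}}
            return

# Drive one replay turn, as the harness would over stdin/stdout.
inbox = [
    {"type": "task_start", "input": {"ticket": "reset password"}},
    {"type": "tool_result", "call_id": "c1", "result": {"docs": []}},
]
for out in stub_agent(inbox):
    print(json.dumps(out))
```

In the real setup the same exchange happens over newline-delimited JSON on stdin/stdout, with the cassette supplying the `tool_result`.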
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Thanks for taking a look! This PR adds a replay-only RunLedger gate. The workflow runs are currently waiting on fork approval (action_required) or have not yet started for this fork. If you are open to it, please approve/authorize the workflow run so CI can complete. Happy to adjust anything.
Actionable comments posted: 0
🧹 Nitpick comments (4)
readme.md (1)
534-543: Clear documentation of the replay gate. The section effectively documents the deterministic CI gate and provides the local run command. Consider adding a note that this is a demonstration setup with a stub agent, to set appropriate expectations for users exploring the feature.
.github/workflows/runledger.yml (2)
13-16: Consider pinning the RunLedger version for reproducibility. The workflow installs `runledger` without specifying a version, which could lead to non-deterministic CI behavior if the package is updated. For a deterministic replay gate, consider pinning to a specific version.

🔎 Proposed fix

```diff
 - name: Install RunLedger
   run: |
     python -m pip install --upgrade pip
-    python -m pip install runledger
+    python -m pip install runledger==0.1.0
```
1-24: Verify the workflow setup and consider additional checks. The workflow structure is sound and will run on every pull request. Consider the following:
- The workflow will run on all PRs, including those that don't modify eval-related files. You might want to add path filters if performance becomes a concern.
- The workflow doesn't have explicit failure conditions documented - it relies on the runledger command's exit code.
evals/runledger/agent/agent.py (1)
8-19: Add error handling for JSON parsing and consider validating message structure. The agent reads and parses JSON from stdin without error handling. If malformed JSON is received, `json.loads()` will raise an exception and crash the agent. For a more robust implementation, consider wrapping the parsing in a try-except block.

Additionally, the agent uses hardcoded output values and doesn't actually process the `tool_result` data. While this is acceptable for a stub agent, it's worth documenting this limitation.

🔎 Proposed improvement

```diff
 def main():
     for line in sys.stdin:
         line = line.strip()
         if not line:
             continue
-        msg = json.loads(line)
+        try:
+            msg = json.loads(line)
+        except json.JSONDecodeError as e:
+            send({"type": "error", "message": f"Invalid JSON: {e}"})
+            continue
         if msg.get("type") == "task_start":
             ticket = msg.get("input", {}).get("ticket", "")
             send({"type": "tool_call", "name": "search_docs", "call_id": "c1", "args": {"q": ticket}})
         elif msg.get("type") == "tool_result":
+            # Note: This is a stub agent with hardcoded output
             send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
             break
```
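The `final_output` payload is what the gate checks against `evals/runledger/schema.json`. A tiny required-string-fields check in that spirit might look like the sketch below; this illustrates the assertion, not RunLedger's actual validator, and the inlined `schema` dict mirrors the schema as described in this review.

```python
# Mirrors evals/runledger/schema.json as described in the review (assumed shape).
schema = {
    "type": "object",
    "properties": {"category": {"type": "string"}, "reply": {"type": "string"}},
    "required": ["category", "reply"],
}

def required_string_failures(output, schema):
    """Check that required keys exist and are strings; return failure messages."""
    failures = []
    for key in schema["required"]:
        if key not in output:
            failures.append(f"missing field: {key}")
        elif not isinstance(output[key], str):
            failures.append(f"not a string: {key}")
    return failures

good = {"category": "account", "reply": "Reset password instructions sent."}
print(required_string_failures(good, schema))           # empty list: passes
print(required_string_failures({"category": 1}, schema))  # two failures
```

A full implementation would use a JSON Schema validator library, but for two required string fields the check reduces to exactly this.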
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- .github/workflows/runledger.yml (1 hunks)
- .gitignore (1 hunks)
- baselines/runledger-demo.json (1 hunks)
- evals/runledger/agent/agent.py (1 hunks)
- evals/runledger/cases/t1.yaml (1 hunks)
- evals/runledger/cassettes/t1.jsonl (1 hunks)
- evals/runledger/schema.json (1 hunks)
- evals/runledger/suite.yaml (1 hunks)
- readme.md (1 hunks)
🔇 Additional comments (9)
.gitignore (1)

208-209: LGTM! The addition of `runledger_out/` to the ignore list aligns with the CI workflow that generates and uploads artifacts from this directory.

evals/runledger/cassettes/t1.jsonl (1)

1-1: LGTM! The cassette data structure is well-formed and aligns with the test case input ("reset password") and the `search_docs` tool definition in the suite configuration.

evals/runledger/schema.json (1)
1-15: LGTM! The schema correctly defines the expected output structure with required string fields `category` and `reply`, matching the assertions in the test case and the agent's final output format.

evals/runledger/cases/t1.yaml (1)

1-10: LGTM! The test case is well-structured with clear input, cassette reference, and assertions that align with the JSON schema. The `required_fields` assertion correctly validates the expected output structure.

evals/runledger/agent/agent.py (1)

4-6: LGTM! The `send` function correctly writes JSON to stdout and flushes immediately, ensuring proper IPC communication.

evals/runledger/suite.yaml (1)
1-18: Configuration looks good for a deterministic replay demo. The suite configuration is well-structured with appropriate settings for replay mode. The strict thresholds (`min_pass_rate: 1.0` and `max_tool_errors: 0`) are acceptable for a single deterministic test case, though they mean any deviation will fail the CI gate.

Note that the baseline file at `baselines/runledger-demo.json` is referenced but not shown in the provided files. Ensure this file exists and contains the expected baseline data for the replay validation.

baselines/runledger-demo.json (3)

109-109: Clarify the intent of the null `suite_config_hash`. Line 109 shows `suite_config_hash` as null. Confirm whether this is intentional (e.g., not yet implemented in RunLedger) or if it should be populated to detect config drift in future runs.
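If the hash is meant to detect config drift, one plausible scheme is to digest a canonicalized dump of the suite config. This is an assumption for illustration — RunLedger's actual hashing procedure is not shown in the diff:

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Hash a canonical JSON form of the config so key order and
    whitespace don't change the digest (hypothetical drift-detection)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical subset of suite.yaml settings.
suite = {"mode": "replay", "min_pass_rate": 1.0, "max_tool_errors": 0}
print(config_hash(suite))
```

With a scheme like this, a populated `suite_config_hash` would let a later run fail fast when the suite configuration silently changes between baseline and comparison.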
1-113: Data consistency within the baseline is sound. The aggregates correctly summarize the single passing test case: counts align (1 pass, 0 fail/error), metric aggregations match the case data (tool_calls=1, tool_errors=0, wall_ms=58), pass_rate is 1.0, and assertion counts are consistent (2 total, 0 failed). The structure follows the expected RunLedger schema.
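That consistency check can be recomputed mechanically from the per-case records. A sketch, with field names assumed to match the figures quoted above rather than taken from the actual baseline file:

```python
# The single case, with the metrics quoted in the review (assumed field names).
cases = [
    {"status": "pass", "tool_calls": 1, "tool_errors": 0, "wall_ms": 58},
]

def aggregate(cases):
    """Recompute baseline aggregates from per-case records."""
    passed = sum(1 for c in cases if c["status"] == "pass")
    return {
        "passed": passed,
        "failed": sum(1 for c in cases if c["status"] == "fail"),
        "pass_rate": passed / len(cases),
        "tool_calls": sum(c["tool_calls"] for c in cases),
        "tool_errors": sum(c["tool_errors"] for c in cases),
    }

agg = aggregate(cases)
print(agg)
```

Recomputing aggregates like this (rather than trusting the stored numbers) is how a reviewer or a CI step can catch a baseline whose summary has drifted from its case data.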
71-72: Verify the cassette file exists and the SHA256 hash matches. The baseline references a cassette at `evals/runledger/cassettes/t1.jsonl` with a SHA256 hash. Ensure the file exists in the repository and the hash is accurate; if the cassette is missing or the hash mismatches, evaluation replay will fail.
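Checking the hash locally is straightforward with a streaming SHA-256 over the file. This assumes the baseline's hash is a plain digest of the cassette file bytes; the demo below hashes a throwaway file rather than the real t1.jsonl:

```python
import hashlib
import os
import tempfile

def file_sha256(path: str) -> str:
    """Stream a file through SHA-256 in chunks (handles large cassettes)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a temporary file standing in for the cassette.
with tempfile.NamedTemporaryFile(delete=False, suffix=".jsonl") as f:
    f.write(b'{"tool": "search_docs"}\n')
    path = f.name
print(file_sha256(path))
os.unlink(path)
```

Comparing this digest against the value stored in `baselines/runledger-demo.json` is enough to catch a modified or missing cassette before a replay run.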
Summary

- runledger/
- Runledger@v0.1
- runledger_out/

How to run locally
Notes
Summary by CodeRabbit
New Features
Documentation
Chores