add eval for diagnosing concurrency lease renewal failures #98
📊 Observability
View eval run traces in Logfire: prefect-mcp-server-evals @ 611decf
Evaluation Results
17 tests (+1), 15 ✅ (-1), 2m 4s ⏱️ (-1s)
For more details on these failures, see this check.
Results for commit 611decf. ± Comparison against base commit 5496283.
♻️ This comment has been updated with latest results.
17c3933 to 676acd5
676acd5 to ba0d8a8
This feels a little contrived to me. The setup makes this an eval that checks whether an agent can correctly retrieve and read a state message, which I think is less interesting than evaluating it against the real scenario. IMO, making this look more like our concurrency lease integration tests would make this eval more interesting.
I still think this eval should be closer to simulating a realistic situation. Actually crashing a run due to a failed lease renewal will help ensure that agents with the MCP server can debug the failure even with newer versions of Prefect.
e566fa0 to 2fc3b9f
tests that an agent can diagnose flow runs that crashed due to concurrency lease renewal failures - a common production issue.

based on real user issues:
- PrefectHQ/prefect#19068
- PrefectHQ/prefect#18839

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
d1f1f44 to 8bac451
Instead of artificially setting Crashed state, this eval now actually triggers a lease renewal failure by:

1. Creating a concurrency limit and flow that acquires a slot
2. Patching renew_concurrency_lease to fail on second attempt
3. Waiting for renewal at t=45s (0.75 * 60s lease duration)
4. Flow crashes with real "Concurrency lease renewal failed" message

This ensures agents with the MCP server can diagnose the failure even with newer versions of Prefect, as requested by reviewer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
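The patching approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `FakeClient`, `hold_slot`, and `run_with_failing_renewal` are hypothetical stand-ins, no Prefect import is used, and the sleep is faked so the sketch runs instantly. The commit does not say which call counts as the "second attempt", so here the first renewal succeeds and the second raises.

```python
import asyncio
from unittest.mock import AsyncMock, patch

LEASE_DURATION = 60.0    # seconds, as stated in the commit message
RENEWAL_FRACTION = 0.75  # renewal is attempted at 0.75 * 60s = 45s


class FakeClient:
    """Hypothetical stand-in for a client exposing renew_concurrency_lease."""

    async def renew_concurrency_lease(self, lease_id, lease_duration):
        return None


async def hold_slot(client, renewals, sleep):
    """Pretend to hold a concurrency slot, renewing the lease periodically."""
    for _ in range(renewals):
        await sleep(LEASE_DURATION * RENEWAL_FRACTION)
        await client.renew_concurrency_lease("lease-1", LEASE_DURATION)
    return "Completed"


async def run_with_failing_renewal():
    slept = []

    async def fake_sleep(seconds):
        # Record the requested delay instead of actually waiting.
        slept.append(seconds)

    # First call returns None (success), second call raises.
    failing = AsyncMock(
        side_effect=[None, RuntimeError("Concurrency lease renewal failed")]
    )
    with patch.object(FakeClient, "renew_concurrency_lease", failing):
        try:
            state = await hold_slot(FakeClient(), 3, fake_sleep)
        except RuntimeError as exc:
            state = f"Crashed: {exc}"
    return state, slept, failing.await_count


if __name__ == "__main__":
    print(asyncio.run(run_with_failing_renewal()))
```

Passing a list as `side_effect` lets the mock succeed once and then raise, which avoids hand-written call counters. In the real eval the wait is genuine, which is why the run takes on the order of the lease duration.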
- replace complex patching with direct simulation of network failure
- reduce test time from ~60s to ~1s by failing the renewal loop immediately
- simulates httpx.ConnectError which is a realistic failure mode
- maintains the same evaluation criteria

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
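The idea of failing the renewal loop immediately, rather than waiting for the first scheduled renewal, might look something like the sketch below. Everything here is an assumption for illustration: `ConnectError` is a local stand-in for `httpx.ConnectError` (so the sketch has no third-party dependency), and `renew_lease`, `renewal_loop`, and `flow_body_with_lease` are hypothetical names rather than Prefect internals. The state message format is also illustrative.

```python
import asyncio


class ConnectError(Exception):
    """Local stand-in for httpx.ConnectError (assumption, not the real class)."""


async def renew_lease(lease_id):
    # Simulate the API being unreachable on the very first renewal attempt.
    raise ConnectError("All connection attempts failed")


async def renewal_loop(lease_id, interval):
    """Renew the lease forever; any exception ends the loop."""
    while True:
        await renew_lease(lease_id)
        await asyncio.sleep(interval)


async def flow_body_with_lease(lease_id):
    """Run long 'work' alongside the renewal loop; crash if renewal dies."""
    renewal = asyncio.create_task(renewal_loop(lease_id, 45.0))
    work = asyncio.create_task(asyncio.sleep(3600))  # placeholder for flow work
    done, pending = await asyncio.wait(
        {renewal, work}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    if renewal in done and renewal.exception() is not None:
        name = type(renewal.exception()).__name__
        return f"Crashed: Concurrency lease renewal failed ({name})"
    return "Completed"


if __name__ == "__main__":
    print(asyncio.run(flow_body_with_lease("lease-1")))
```

Because the renewal raises before any sleep, the sketch finishes almost instantly, which mirrors the drop from ~60s to ~1s described in the commit.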
closes #97
summary
adds a new eval (test_lease_renewal_crash) for issue #97 - tests that an agent can diagnose flow runs that crashed due to concurrency lease renewal failures.

based on real user issues:
- PrefectHQ/prefect#19068
- PrefectHQ/prefect#18839

changes
- evals/test_lease_renewal_crash.py - eval for concurrency lease renewal crash diagnosis

test plan
- test_lease_renewal_crash passes locally

🤖 Generated with Claude Code