add eval for diagnosing concurrency lease renewal failures #98

Open
zzstoatzz wants to merge 3 commits into main from claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78

zzstoatzz (Collaborator) commented Oct 20, 2025

closes #97

summary

adds a new eval (test_lease_renewal_crash) for issue #97 - tests that an agent can diagnose flow runs that crashed due to concurrency lease renewal failures.

based on real user issues:

  • PrefectHQ/prefect#19068
  • PrefectHQ/prefect#18839

changes

  • new: evals/test_lease_renewal_crash.py - eval for concurrency lease renewal crash diagnosis
  • updated: README with new eval

test plan

  • test_lease_renewal_crash passes locally
  • CI passes

🤖 Generated with Claude Code

github-actions bot commented Oct 20, 2025

📊 Observability

View eval run traces in Logfire: prefect-mcp-server-evals @ 611decf

github-actions bot commented Oct 20, 2025

Evaluation Results

17 tests   +1    15 ✅ passed   -1    2m 4s ⏱️ -1s
 1 suites  ±0     0 💤 skipped  ±0
 1 files   ±0     2 ❌ failed   +2

For more details on these failures, see this check.

Results for commit 611decf. ± Comparison against base commit 5496283.

♻️ This comment has been updated with latest results.

@zzstoatzz zzstoatzz marked this pull request as draft October 29, 2025 06:15
@zzstoatzz zzstoatzz changed the title from "Add eval for debugging concurrency lease renewal failures" to "Add eval for debugging automation action validation failures" Oct 29, 2025
@zzstoatzz zzstoatzz marked this pull request as ready for review October 29, 2025 18:11
@zzstoatzz zzstoatzz requested a review from desertaxle November 3, 2025 04:30
@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 17c3933 to 676acd5 November 4, 2025 00:43
@zzstoatzz zzstoatzz marked this pull request as draft December 31, 2025 03:37
@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 676acd5 to ba0d8a8 December 31, 2025 03:38
@zzstoatzz zzstoatzz changed the title from "Add eval for debugging automation action validation failures" to "add eval for debugging concurrency lease renewal failures" Dec 31, 2025
@zzstoatzz zzstoatzz changed the title from "add eval for debugging concurrency lease renewal failures" to "add eval for diagnosing concurrency lease renewal failures" Dec 31, 2025
@zzstoatzz zzstoatzz marked this pull request as ready for review December 31, 2025 16:35
Member

This feels a little contrived to me. The setup makes this an eval that checks whether an agent can correctly retrieve and read a state message, which I think is less interesting than evaluating it against the real scenario. IMO, making this look more like our concurrency lease integration tests would make this eval more interesting.

Member

I still think this eval should be closer to simulating a realistic situation. Actually crashing a run due to a failed lease renewal will help ensure that agents with the MCP server can debug the failure even with newer versions of Prefect.

@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch 3 times, most recently from e566fa0 to 2fc3b9f December 31, 2025 16:59
tests that an agent can diagnose flow runs that crashed due to
concurrency lease renewal failures - a common production issue.

based on real user issues:
- PrefectHQ/prefect#19068
- PrefectHQ/prefect#18839

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from d1f1f44 to 8bac451 December 31, 2025 21:21
@zzstoatzz zzstoatzz requested a review from desertaxle December 31, 2025 21:28
zzstoatzz and others added 2 commits December 31, 2025 16:20
Instead of artificially setting Crashed state, this eval now actually
triggers a lease renewal failure by:

1. Creating a concurrency limit and flow that acquires a slot
2. Patching renew_concurrency_lease to fail on second attempt
3. Waiting for renewal at t=45s (0.75 * 60s lease duration)
4. Flow crashes with real "Concurrency lease renewal failed" message

This ensures agents with the MCP server can diagnose the failure even
with newer versions of Prefect, as requested by reviewer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
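
A minimal sketch of the four steps in the commit message above, using only the standard library. It assumes a loop that renews immediately and then sleeps for 0.75 of the lease duration, so the second attempt lands at t=45s. `maintain_lease`, `LeaseRenewalCrash`, and `fake_sleep` are hypothetical names for illustration, not Prefect APIs, and the builtin `ConnectionError` stands in for whatever error the patched renewal call raises in the actual eval.

```python
import asyncio
from unittest.mock import AsyncMock

LEASE_DURATION_S = 60.0
RENEWAL_FRACTION = 0.75  # renew after 75% of the lease duration has elapsed


class LeaseRenewalCrash(RuntimeError):
    """Hypothetical stand-in for the crash that a failed renewal causes."""


async def maintain_lease(renew, sleep, lease_duration: float = LEASE_DURATION_S) -> None:
    """Renew, then sleep, forever. Assumed loop shape, not Prefect's code."""
    while True:
        try:
            await renew()
        except Exception as exc:
            raise LeaseRenewalCrash("Concurrency lease renewal failed") from exc
        await sleep(lease_duration * RENEWAL_FRACTION)


async def demo() -> tuple[str, int, float]:
    # Patched renewal: the first attempt succeeds, the second raises.
    renew = AsyncMock(side_effect=[None, ConnectionError("server unreachable")])
    elapsed = 0.0

    async def fake_sleep(seconds: float) -> None:
        # Advance a virtual clock instead of really waiting.
        nonlocal elapsed
        elapsed += seconds

    try:
        await maintain_lease(renew, fake_sleep)
    except LeaseRenewalCrash as exc:
        return str(exc), renew.await_count, elapsed
    return "no crash", renew.await_count, elapsed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Injecting the sleep function lets the sketch reach the t=45s renewal without waiting in real time; a test that patches the real renewal call and uses a real sleep would have to wait out that interval.
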
- replace complex patching with direct simulation of network failure
- reduce test time from ~60s to ~1s by failing the renewal loop immediately
- simulate httpx.ConnectError, which is a realistic failure mode
- maintains the same evaluation criteria

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
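
The simplification described in the commit message above can be sketched roughly as follows: the renewal call fails on its first attempt, so no renewal interval has to elapse. This is an illustration under assumptions, not the eval's code. `FakeLeaseClient`, `renew_lease`, and `run_with_lease` are hypothetical, and the builtin `ConnectionError` is used as a standard-library stand-in for `httpx.ConnectError`.

```python
import asyncio
import time
from unittest.mock import AsyncMock, patch


class FakeLeaseClient:
    """Hypothetical client for illustration; not the Prefect client API."""

    async def renew_lease(self, lease_id: str) -> None:
        return None


async def run_with_lease(client: FakeLeaseClient, lease_id: str) -> str:
    """Return a final state string, loosely like a flow run holding a lease."""
    try:
        await client.renew_lease(lease_id)
    except ConnectionError as exc:
        # Stand-in for httpx.ConnectError surfacing from the renewal loop.
        return f"Crashed: Concurrency lease renewal failed ({exc})"
    return "Completed"


def simulate() -> tuple[str, float]:
    client = FakeLeaseClient()
    failing_renew = AsyncMock(side_effect=ConnectionError("All connection attempts failed"))
    started = time.monotonic()
    # Fail the very first renewal so the run crashes without any waiting.
    with patch.object(FakeLeaseClient, "renew_lease", new=failing_renew):
        state = asyncio.run(run_with_lease(client, "lease-1"))
    return state, time.monotonic() - started


if __name__ == "__main__":
    print(simulate())
```

The trade-off is the one raised in review: failing immediately keeps the eval fast, while still producing the crash through the renewal path instead of setting the state by hand.
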