
Fix workflow stuck in refining/evaluation loops#120

Merged
neuromechanist merged 3 commits into develop from
119-fix-workflow-stuck-in-refiningevaluation-loops
Mar 4, 2026

Conversation

@neuromechanist
Member

Summary

  • Make evaluation informational-only when run_assessment=False (default), preventing the evaluation-driven refinement loop that caused the "stuck" behavior
  • Add 15s LLM call timeout via request_timeout on ChatLiteLLM to prevent hanging on slow providers
  • Default evaluation parsing to ACCEPT when ambiguous instead of REFINE
  • Derive max_total_iterations from max_validation_attempts + 1 (was hardcoded 10)
  • Add per-node timing (time.monotonic()) to all workflow nodes for diagnosing slowness
  • Switch default evaluation model to openai/gpt-oss-120b on groq
  • Lower LangGraph recursion_limit from 100 to 50
  • Update default max_validation_attempts from 5 to 3
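The per-node timing change can be sketched as a small decorator. This is an illustrative reconstruction only: the PR adds `time.monotonic()` calls inside each workflow node, and the decorator name below is hypothetical, not from the codebase.

```python
import functools
import time


def timed_node(fn):
    """Wrap a workflow node so its wall-clock duration is logged.

    Hypothetical sketch: the PR adds similar time.monotonic() timing
    inside each node for diagnosing slowness; this decorator is an
    illustration, not the actual implementation.
    """
    @functools.wraps(fn)
    def wrapper(state):
        start = time.monotonic()
        result = fn(state)
        elapsed = time.monotonic() - start
        # Emitted to server logs so slow nodes are easy to spot.
        print(f"[timing] {fn.__name__} took {elapsed:.2f}s")
        return result
    return wrapper


@timed_node
def evaluate(state):
    # Stand-in node body; real nodes call agents and return updated state.
    return {**state, "evaluated": True}
```

Using `time.monotonic()` rather than `time.time()` keeps the measurements immune to system-clock adjustments.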

Behavior Change

| Setting | Before | After |
| --- | --- | --- |
| run_assessment=False | Evaluation could loop back to refinement | Evaluation is informational only; always ends |
| run_assessment=True | Same as above | Evaluation can trigger refinement (capped) |
| Default max iterations | 10 (hardcoded) | max_validation_attempts + 1 (4 by default) |
| LLM call timeout | None (infinite) | 15 seconds |
| Eval model | qwen/qwen3-235b on Cerebras | openai/gpt-oss-120b on groq |

All settings remain tunable via the frontend (Max Validation Attempts dropdown, Run Assessment checkbox).
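The routing rule summarized in the table can be sketched as a single conditional-edge function. This is a hedged sketch: the state keys (`run_assessment`, `max_total_iterations`, `max_validation_attempts`) follow the PR text, but the function itself and its return labels are illustrative, not the actual code.

```python
def route_after_evaluation(state):
    """Decide the next workflow step after the evaluation node.

    Illustrative sketch of the rule this PR describes: when
    run_assessment is False, evaluation is informational only and the
    graph always ends; when True, a REFINE decision may loop back to
    refinement until the iteration cap is reached.
    """
    # Cap auto-derives from the validation budget when not set explicitly.
    cap = state.get("max_total_iterations") or state["max_validation_attempts"] + 1
    if not state.get("run_assessment", False):
        return "end"  # informational-only: never loop back
    if state["decision"] == "REFINE" and state["iteration"] < cap:
        return "refine"
    return "end"
```

In LangGraph terms, a function like this would be wired in via `add_conditional_edges`, with `"end"` mapped to `END`.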

Test plan

  • 415 unit tests pass, 0 failures
  • Manual test: simple description completes in 1-2 iterations
  • Manual test: run_assessment=False never triggers refinement from evaluation
  • Manual test: run_assessment=True allows refinement but caps at max_validation_attempts + 1
  • Verify per-node timing appears in server logs

Closes #119

- Make evaluation informational-only when run_assessment=False
- Add 15s LLM call timeout via request_timeout on ChatLiteLLM
- Default evaluation parsing to ACCEPT when ambiguous
- Derive max_total_iterations from max_validation_attempts + 1
- Add per-node timing to all workflow nodes
- Switch eval model default to openai/gpt-oss-120b on groq
- Lower recursion_limit from 100 to 50
- Update default max_validation_attempts from 5 to 3

Closes #119
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 4, 2026

Deploying hedit with Cloudflare Pages

Latest commit: 6cf0986
Status: ✅  Deploy successful!
Preview URL: https://228724b8.hedit.pages.dev
Branch Preview URL: https://119-fix-workflow-stuck-in-re.hedit.pages.dev


- Move re import to module level in evaluation_agent.py
- Add missing "Entering assess node" log for consistency
- Centralize max_total_iterations derivation in state.py and workflow.py
  (was duplicated 3x in main.py, now defaults to max_validation_attempts + 1)
- Update create_initial_state defaults (was stale at 5/10)
- Fix ty warnings: remove unused type: ignore comments
- Fix ty errors: add type: ignore for LangGraph/Starlette typing limitations
- Fix return type on get_default_path (-> str | None)
- Update test_state to match new default
@codecov

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 26.53061% with 72 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/agents/workflow.py 13.33% 25 Missing and 1 partial ⚠️
src/api/main.py 31.57% 26 Missing ⚠️
src/agents/evaluation_agent.py 23.07% 10 Missing ⚠️
src/agents/assessment_agent.py 28.57% 5 Missing ⚠️
src/agents/feedback_summarizer.py 28.57% 5 Missing ⚠️


- Add try/except with logging to evaluation, assessment, and feedback agents
- Map timeouts to HTTP 504, rate limits to HTTP 429 in API endpoints
- Add error_type field to streaming SSE error events
- Sanitize error messages to avoid leaking internal details
- Add debug log for silent ACCEPT fallback in evaluation parsing
@neuromechanist
Member Author

PR Review Summary

Three specialized review agents analyzed this PR. Here is a summary of all findings and how each was addressed.

Silent Failure Hunter

CRITICAL - No timeout error handling in LLM calls (evaluation, assessment, feedback agents)

  • Fixed in commit 6cf0986: Added try/except with logging around llm.ainvoke() calls in evaluation_agent.py, assessment_agent.py, and feedback_summarizer.py. Errors are logged with full traceback and re-raised.
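The pattern this fix describes can be sketched as a thin async wrapper. Illustrative only: `llm` stands for any client exposing an async `ainvoke()` (such as ChatLiteLLM), and the helper name is hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


async def call_llm_with_logging(llm, messages):
    """Invoke an LLM, logging failures with full traceback before re-raising.

    Sketch of the commit's pattern: errors are never swallowed — they are
    logged (logger.exception captures the traceback) and then re-raised so
    the workflow surfaces them instead of failing silently.
    """
    try:
        return await llm.ainvoke(messages)
    except Exception:
        logger.exception("LLM call failed")
        raise
```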

HIGH - API endpoints return generic 500 for all errors

  • Fixed in commit 6cf0986: Added specific exception handlers for APITimeoutError (maps to HTTP 504) and RateLimitError (maps to HTTP 429) in all four annotation endpoints. Streaming SSE error events now include an error_type field (timeout, rate_limit, internal). Raw exception strings no longer leak to clients.
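A minimal sketch of the status mapping described above. Assumptions are flagged: the real handlers catch litellm's `APITimeoutError` and `RateLimitError` types directly, whereas this self-contained example matches by class name so it runs without the library.

```python
def classify_llm_error(exc):
    """Map a provider exception to an (HTTP status, error_type) pair.

    Illustrative sketch of the commit's mapping: timeouts become 504,
    rate limits become 429, everything else is a generic 500 "internal".
    Matching by class name is a stand-in for catching litellm's
    APITimeoutError / RateLimitError, used here only to keep the
    example dependency-free.
    """
    name = type(exc).__name__
    if name == "APITimeoutError":
        return 504, "timeout"
    if name == "RateLimitError":
        return 429, "rate_limit"
    return 500, "internal"
```

The `error_type` half of the pair is what the streaming SSE error events carry, so clients can distinguish retryable conditions without seeing raw exception strings.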

MEDIUM - Silent ACCEPT fallback in evaluation parsing

  • Fixed in commit 6cf0986: Added logger.debug() when _parse_decision() falls through to the default ACCEPT, so ambiguous responses are traceable in logs.
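A hypothetical reconstruction of the parsing fallback. The real `_parse_decision()` may differ; the regex and function name below are assumptions made to illustrate the default-to-ACCEPT behavior and its debug log.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed pattern: the evaluation model is expected to emit ACCEPT or REFINE.
_DECISION_RE = re.compile(r"\b(ACCEPT|REFINE)\b", re.IGNORECASE)


def parse_decision(response_text):
    """Extract an ACCEPT/REFINE decision from an evaluation response.

    Illustrative sketch: an ambiguous reply defaults to ACCEPT (with a
    debug log) rather than REFINE, so a mis-formatted LLM response cannot
    trap the workflow in a refinement loop.
    """
    match = _DECISION_RE.search(response_text)
    if match:
        return match.group(1).upper()
    logger.debug("Ambiguous evaluation response; defaulting to ACCEPT")
    return "ACCEPT"
```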

Code Reviewer

Stale defaults in create_initial_state() (max_validation_attempts=5, max_total_iterations=10)

  • Fixed in commit c89dcac: Updated to max_validation_attempts=3, max_total_iterations=None (auto-derived as max_validation_attempts + 1).

import re inside method body instead of module-level

  • Fixed in commit c89dcac: Moved to module-level import.

Missing "Entering assess node" log

  • Fixed in commit c89dcac: Added a print statement consistent with the other nodes.

Test assertion using old default (5 instead of 3)

  • Fixed in commit c89dcac: Updated test_create_initial_state assertion.

Code Simplifier

max_total_iterations derivation duplicated 3x in main.py

  • Fixed in commit c89dcac: Centralized derivation in create_initial_state() and workflow.run() with None default and auto-derivation. All call sites in main.py simplified.
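The centralized derivation could look like the helper below. This is a sketch: the actual signature of `create_initial_state()` is not shown in this PR, and the helper name is hypothetical.

```python
def resolve_max_total_iterations(max_validation_attempts=3, max_total_iterations=None):
    """Derive the overall iteration cap from the validation budget.

    Illustrative sketch of the centralized rule: a None cap auto-derives
    as max_validation_attempts + 1, so callers no longer repeat the
    formula (it was previously duplicated 3x in main.py).
    """
    if max_total_iterations is None:
        return max_validation_attempts + 1
    return max_total_iterations
```

With the new defaults (`max_validation_attempts=3`), this yields a cap of 4, matching the "4 by default" figure in the behavior table.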

Pre-existing Issues (not introduced by this PR)

  • test_cli_integration.py::test_annotate_complex_description fails on develop (server error, unrelated to this PR)
  • Dependabot vulnerability on default branch (pre-existing)

@neuromechanist neuromechanist merged commit e67d2e4 into develop Mar 4, 2026
12 of 13 checks passed
@neuromechanist neuromechanist deleted the 119-fix-workflow-stuck-in-refiningevaluation-loops branch March 4, 2026 10:52