context fixes

Zvi Fried · Zvi Fried · commit b056d7d3a7d0 · 2025-09-14T00:06:28.000+03:00
diff --git a/src/mcp_as_a_judge/prompts/shared/critical_tool_warnings.md b/src/mcp_as_a_judge/prompts/shared/critical_tool_warnings.md
@@ -2,3 +2,4 @@
 
 - Skipping this tool causes severe token inefficiency and wasted iterations.
 - Always invoke this tool at the appropriate stage to avoid extreme token loss and redundant processing.
+- Do not rely on assistant memory for identifiers. Always pass the exact `task_id` and recover it via `get_current_coding_task` if missing.
diff --git a/src/mcp_as_a_judge/prompts/system/judge_coding_plan.md b/src/mcp_as_a_judge/prompts/system/judge_coding_plan.md
@@ -42,6 +42,7 @@ Evaluate submissions against the following comprehensive SWE best practices:
 - Is there evidence of understanding industry best practices?
 - Are trade-offs between different approaches analyzed?
 - Does the research demonstrate avoiding reinventing the wheel?
+ - Does research explicitly cover all major aspects implied by the user requirements, not just a subset (e.g., cover each system, protocol, framework, or integration mentioned)?
 
 **🏗️ Internal Codebase Analysis (ONLY evaluate if Status: REQUIRED):**
 - Validate that existing codebase patterns are properly considered
@@ -156,7 +157,7 @@ IMPORTANT applicability rule:
 ### 1. User Requirements Alignment
 
 - Does the plan directly address the user's stated requirements?
-- Are all user requirements covered in the implementation plan?
+- Are all user requirements decomposed into explicit sub-aspects (components, integrations, protocols, patterns) and covered in the implementation plan and research?
 - Is the solution appropriate for what the user actually wants to achieve?
 - Flag any misalignment between user needs and proposed solution
 
diff --git a/src/mcp_as_a_judge/prompts/system/research_requirements_analysis.md b/src/mcp_as_a_judge/prompts/system/research_requirements_analysis.md
@@ -86,6 +86,7 @@ Always emphasize research quality over pure quantity:
 - Recency and relevance to current technology versions
 - Practical applicability to the specific task context
 - Coverage of implementation details and edge cases
+ - Multi-aspect coverage: Ensure the research plan explicitly maps to ALL major aspects implied by the user requirements (each referenced system, framework, protocol, integration), rather than focusing on a single subset.
 
 ## Analysis Output Requirements
 
@@ -107,4 +108,4 @@ You must respond with a JSON object that matches this schema:
 - **Context Sensitivity**: Consider the specific repository and project needs
 - **Practical Balance**: Don't over-research simple tasks or under-research complex ones
 - **Clear Reasoning**: Always explain why a specific count is recommended
-- **Adaptive Approach**: Different tasks need different research strategies
+- **Adaptive Approach**: Different tasks need different research strategies
diff --git a/src/mcp_as_a_judge/prompts/system/research_validation.md b/src/mcp_as_a_judge/prompts/system/research_validation.md
@@ -43,6 +43,7 @@ Evaluate if the research is comprehensive enough and if the design is properly b
 - **REJECT IF MISSING**: No URLs provided means no online research was performed - REJECT immediately
 - **ONLINE RESEARCH EVIDENCE**: Do URLs demonstrate actual online research into implementation approaches and existing libraries?
 - **EXISTING SOLUTIONS FOCUS**: Do URLs show research into current repo capabilities, well-known libraries, and best practices?
+- **FULL REQUIREMENTS COVERAGE**: Do the provided URLs collectively cover ALL major aspects implied by the user requirements (each named system, framework, protocol, integration), rather than focusing on a single subset?
 - **REJECT IMMEDIATELY**: Missing URLs, insufficient online research, or failure to investigate existing solutions first
 
 ## Response Requirements
diff --git a/src/mcp_as_a_judge/prompts/system/workflow_guidance.md b/src/mcp_as_a_judge/prompts/system/workflow_guidance.md
@@ -62,9 +62,9 @@ CREATED → PLANNING → PLAN_APPROVED → IMPLEMENTING → REVIEW_READY → TES
   - For XS/S tasks: Skip planning, proceed to implementation (next_tool: null, but guidance must explain: implement → judge_code_change → judge_testing_implementation → judge_coding_task_completion)
   - For M/L/XL tasks: Recommend planning tools (judge_coding_plan)
 - **PLANNING** → Validate plan or gather more requirements
-- **PLAN_APPROVED** → Start implementation (implement ALL code AND tests, ensure tests pass)
-- **IMPLEMENTING** → Continue implementation until ALL code AND tests are complete and passing, then call judge_code_change
-- **REVIEW_READY** → Validate implementation code (judge_code_change for code review ONLY, not tests)
+- **PLAN_APPROVED** → Start implementation (begin coding; tests may be written before or after review)
+- **IMPLEMENTING** → After code changes are ready, call judge_code_change to review implementation; then proceed to testing
+- **REVIEW_READY** → Optional state if used by client; otherwise proceed directly from IMPLEMENTING to judge_code_change
 - **TESTING** → Validate test results and coverage (judge_testing_implementation ONLY)
 - **COMPLETED** → Workflow finished (next_tool: null)
 - **BLOCKED** → Resolve obstacles (raise_obstacle)
@@ -86,6 +86,7 @@ When recommending judge_coding_plan, the preparation_needed MUST include ALL ele
 - Detailed implementation plan with code examples
 - System design with architecture and data flow
 - List of files to be modified or created
+ - Research coverage plan that maps to ALL major aspects in the user requirements (each referenced system, framework, protocol, integration). Avoid focusing on a single subset; ensure multi-aspect coverage.
 
 **Conditionally Required (check task metadata):**
 - **If research_required = true**: Gather research URLs (minimum based on research_scope)
@@ -130,14 +131,13 @@ preparation_needed: [
 - judge_code_change has been approved
 - Code review is complete and implementation approved
 - Ready for test results and coverage validation
-- The task is transitioning from REVIEW_READY to TESTING state
+- The task is in or transitioning to TESTING state
 
 **DO NOT call judge_code_change for:**
-- Individual file changes during implementation
-- Partial implementations
-- Work-in-progress code
-- Single file modifications
-- Before testing validation is complete
+- Clearly incomplete, non-compilable, or placeholder code
+- Changes unrelated to the approved plan
+
+Note: You may call judge_code_change for a logical code change even if tests are not yet written or are failing. Tests are validated separately after code review.
 
 ### Task Completion Logic
 
diff --git a/src/mcp_as_a_judge/prompts/tool_descriptions/get_current_coding_task.md b/src/mcp_as_a_judge/prompts/tool_descriptions/get_current_coding_task.md
@@ -3,6 +3,8 @@
 ## Description
 Retrieve the most recently active coding task UUID (task_id) and metadata from conversation history. Use when the task_id is missing from context.
 
+{% include 'shared/critical_tool_warnings.md' %}
+
 ## When to use
 - Need the task_id for follow-up tool calls
 - Want to resume the last active coding task
diff --git a/src/mcp_as_a_judge/prompts/tool_descriptions/judge_code_change.md b/src/mcp_as_a_judge/prompts/tool_descriptions/judge_code_change.md
@@ -1,12 +1,12 @@
 # Judge Code Change
 
 ## Description
-Review implementation code (not tests) once all implementation work is complete and tests are passing. Called when `workflow_guidance.next_tool == "judge_code_change"`.
+Review implementation code (not tests) when implementation changes are ready for review. Tests are validated separately by `judge_testing_implementation`. Called when `workflow_guidance.next_tool == "judge_code_change"`.
 
 {% include 'shared/critical_tool_warnings.md' %}
 
 ## When to use
-- All implementation files written/modified, tests exist and pass; ready for review
+- After creating or modifying implementation code and a review is needed. Tests may be written before or after review; they are validated via `judge_testing_implementation`.
 
 ## Human-in-the-Loop (HITL) checks
 - If foundational choices are unclear or need confirmation (e.g., framework/library, UI vs CLI, web vs desktop, API style, auth, hosting), first call `raise_missing_requirements` to elicit the user’s intent
@@ -25,6 +25,6 @@ Review implementation code (not tests) once all implementation work is complete
 {{ JUDGE_RESPONSE_SCHEMA }}
 ```
 
-## Notes
-- Review only implementation code here; tests are validated via `judge_testing_implementation`. Always use the exact `task_id`.
+- Review only implementation code here; tests are validated via `judge_testing_implementation`.
+- Always use the exact `task_id`; recover it via `get_current_coding_task` if missing.
 - If HITL was performed, update the task description/requirements via `set_coding_task` if text needs to be clarified for future steps
diff --git a/src/mcp_as_a_judge/prompts/tool_descriptions/judge_coding_plan.md b/src/mcp_as_a_judge/prompts/tool_descriptions/judge_coding_plan.md
@@ -3,6 +3,8 @@
 ## Description
 Validate a proposed plan and design against requirements, research needs, and risks. Called when `workflow_guidance.next_tool == "judge_coding_plan"`.
 
+{% include 'shared/critical_tool_warnings.md' %}
+
 ## Prerequisites
 - Thoroughly analyze requirements, propose a concrete plan, and produce a system design
 
diff --git a/src/mcp_as_a_judge/prompts/tool_descriptions/judge_coding_task_completion.md b/src/mcp_as_a_judge/prompts/tool_descriptions/judge_coding_task_completion.md
@@ -24,4 +24,5 @@ Final validation gate before declaring a task complete. Called when `workflow_gu
 ```
 
 ## Notes
-- Do not present completion summaries to the user without calling this tool. Always use the exact `task_id`.
+- The AI coding assistant MUST NOT present or claim task completion, or provide a final completion summary to the user, without successfully calling this tool and receiving approval.
+- Always use the exact `task_id`; if missing due to memory limits, recover it via `get_current_coding_task`.
diff --git a/src/mcp_as_a_judge/prompts/tool_descriptions/judge_testing_implementation.md b/src/mcp_as_a_judge/prompts/tool_descriptions/judge_testing_implementation.md
@@ -3,6 +3,8 @@
 ## Description
 Validate test quality, coverage, and execution results after code review is approved. Called when `workflow_guidance.next_tool == "judge_testing_implementation"`.
 
+{% include 'shared/critical_tool_warnings.md' %}
+
 ## Args
 - `task_id`: string — Task UUID (required)
 - `test_summary`: string — Summary of the implemented tests (required)
@@ -21,4 +23,5 @@ Validate test quality, coverage, and execution results after code review is appr
 ```
 
 ## Notes
-- Use after `judge_code_change` is approved. Follow `workflow_guidance.next_tool` for the next step. Always use the exact `task_id`.
+- Use after `judge_code_change` is approved. Follow `workflow_guidance.next_tool` for the next step.
+- Always use the exact `task_id`; recover it via `get_current_coding_task` if missing.
diff --git a/src/mcp_as_a_judge/prompts/tool_descriptions/set_coding_task.md b/src/mcp_as_a_judge/prompts/tool_descriptions/set_coding_task.md
@@ -15,6 +15,7 @@ Create or update coding task metadata and receive dynamic workflow guidance. Thi
 - `task_size`: enum — One of `xs|s|m|l|xl` (default `m`)
 - `task_id`: string — Task UUID when updating an existing task (optional)
 - `user_requirements`: string — Updated requirements (optional)
+- `state`: enum — Optional state transition when updating an existing task. Valid transitions are enforced (e.g., `plan_approved` → `implementing`).
 - `tags`: list[string] — Task tags (optional)
 
 ## Returns
diff --git a/src/mcp_as_a_judge/prompts/user/judge_coding_plan.md b/src/mcp_as_a_judge/prompts/user/judge_coding_plan.md
@@ -100,7 +100,7 @@ Note: Internal analysis is marked required but no repository-local components we
 
 As part of your evaluation, you must analyze the task requirements and update the task metadata with conditional requirements:
 
-1. **External Research Analysis**: Determine if external research is needed based on task complexity, specialized domains, or technologies
+1. **External Research Analysis**: Determine if external research is needed based on task complexity, specialized domains, or technologies. Ensure research coverage maps to ALL major aspects implied by the user requirements (each named framework, protocol, pattern, integration, system), not just a subset.
 2. **Internal Codebase Analysis**: Determine if understanding existing codebase patterns is needed
 3. **Risk Assessment**: Determine if the task poses risks to existing functionality or system stability
 
diff --git a/src/mcp_as_a_judge/prompts/user/research_requirements_analysis.md b/src/mcp_as_a_judge/prompts/user/research_requirements_analysis.md
@@ -58,8 +58,8 @@ Provide specific recommendations including:
 
 1. **Expected URL Count**: The recommended number for optimal research coverage
 2. **Minimum URL Count**: The absolute minimum for basic adequacy
-3. **Detailed Reasoning**: Comprehensive explanation of your analysis and recommendations
+3. **Detailed Reasoning**: Comprehensive explanation of your analysis and recommendations, including how the research plan maps to ALL major aspects implied by the user requirements (each referenced system, framework, protocol, integration)
 4. **Complexity Factors**: Breakdown of the factors that influenced your assessment
 5. **Quality Requirements**: Specific guidance on the types and quality of sources needed
 
-Focus on providing actionable, context-specific guidance that balances research thoroughness with implementation efficiency.
+Focus on providing actionable, context-specific guidance that balances research thoroughness with implementation efficiency.
diff --git a/src/mcp_as_a_judge/prompts/user/workflow_guidance.md b/src/mcp_as_a_judge/prompts/user/workflow_guidance.md
@@ -53,18 +53,9 @@ Each state has specific requirements and valid next steps:
 
 {{ operation_context }}
 
-### Test Status Validation
+### Test Status Considerations
 
-**CRITICAL**: Before recommending judge_code_change, verify:
-- Test coverage summary shows all_tests_passing: true
-- If all_tests_passing is false, tests are failing
-- **NEVER** proceed to code review with failing tests
-
-**When tests are failing:**
-- Set next_tool to null (do not proceed to code review)
-- Provide specific guidance to fix test failures
-- Include details about which tests are failing and why
-- Guide the AI to install missing dependencies, fix imports, or correct test logic
+Tests are validated by `judge_testing_implementation` after code review. You may recommend `judge_code_change` even if tests are not yet written or are failing. If tests exist and are failing, call out likely failures and suggest fixes in guidance, but prioritize the code review when implementation changes are ready.
 
 ## Navigation Analysis
 
@@ -94,6 +85,7 @@ When recommending judge_coding_plan, you MUST check the task metadata and includ
 - Detailed implementation plan with code examples
 - System design with architecture and data flow
 - List of files to be modified or created
+ - Research coverage plan that maps to ALL major aspects in the user requirements (each referenced system, framework, protocol, integration). Avoid focusing on a single subset; ensure multi-aspect coverage.
 
 **Check Task Metadata for Conditional Requirements:**
 - **research_required = true**: Include "Research [domain] and gather [X] authoritative URLs"
@@ -140,19 +132,9 @@ If task has risk_assessment_required=true:
 - If state is **TESTING** → Next tool should be "judge_testing_implementation" for test validation, then "judge_coding_task_completion"
 - If state is **COMPLETED** → Workflow is finished (next_tool: null)
 
-### CRITICAL RULE: judge_code_change Usage
-
-**NEVER recommend judge_code_change unless:**
-- Task state is REVIEW_READY
-- ALL implementation work AND tests are complete and passing
-- Ready for code review (implementation code only, not tests)
-- Tests have been written and are passing before code review
-- **MANDATORY**: all_tests_passing must be true in test coverage summary
+### Code Review Timing
 
-**If tests are failing:**
-- Set next_tool to "judge_testing_implementation"
-- Provide guidance to fix test failures first
-- Do NOT proceed to code review until all tests pass
+Recommend `judge_code_change` when implementation changes are ready for review, even if tests are not yet written or are failing. Tests are evaluated after code review via `judge_testing_implementation`.
 
 ### TASK COMPLETION RULE
 
@@ -200,10 +182,10 @@ You MUST respond with ONLY a valid JSON object that exactly matches the Workflow
 
 {% elif current_state == "implementing" %}
 **Current Scenario: Implementation Phase**
-- **next_tool**: "judge_code_change" (when implementation complete) OR "judge_testing_implementation" (if tests failing)
-- **reasoning**: Based on whether implementation is complete and tests are passing
-- **preparation_needed**: Focus on completing implementation and ensuring tests pass
-- **guidance**: Continue implementation or proceed to code review if ready
+- **next_tool**: "judge_code_change" when implementation changes are ready; after review approval, proceed to "judge_testing_implementation"
+- **reasoning**: Code review should happen promptly after changes; tests are validated next
+- **preparation_needed**: Ensure code compiles and is cohesive for review; outline test plan
+- **guidance**: Proceed to code review for the implemented changes; then validate tests
 
 {% else %}
 **Current Scenario: {{ current_state.title() }} State**
diff --git a/src/mcp_as_a_judge/server.py b/src/mcp_as_a_judge/server.py
@@ -83,6 +83,7 @@ async def set_coding_task(
     # FOR UPDATING EXISTING TASKS ONLY
     task_id: str | None = None,  # REQUIRED when updating existing task
     user_requirements: str | None = None,  # Updates current requirements
+    state: TaskState | None = None,  # Optional: update task state with validation when updating existing task
     # OPTIONAL
     tags: list[str] = [],
 ) -> TaskAnalysisResult:
@@ -102,6 +103,7 @@ async def set_coding_task(
         "task_id": task_id,
         "user_requirements": user_requirements,
         "tags": tags,
+        "state": state.value if isinstance(state, TaskState) else state,
     }
 
     try:
@@ -112,7 +114,7 @@ async def set_coding_task(
                 task_title=task_title,
                 task_description=task_description,
                 user_requirements=user_requirements,
-                state=None,  # State updates not allowed via set_coding_task
+                state=state,  # Allow optional state transition with validation
                 tags=tags,
                 conversation_service=conversation_service,
             )
@@ -1483,9 +1485,12 @@ async def judge_coding_plan(
             workflow_guidance=workflow_guidance,
         )
 
-        # STEP 3: Save tool interaction to conversation history
+        # STEP 3: Save tool interaction to conversation history using the REAL task_id
+        save_session_id = (
+            updated_task_metadata.task_id if getattr(updated_task_metadata, "task_id", None) else (task_id or "test_task")
+        )
         await conversation_service.save_tool_interaction_and_cleanup(
-            session_id=task_id or "test_task",  # Use task_id as primary key
+            session_id=save_session_id,  # Always prefer real task_id
             tool_name="judge_coding_plan",
             tool_input=json.dumps(original_input),
             tool_output=json.dumps(
@@ -1727,9 +1732,12 @@ async def judge_code_change(
                 workflow_guidance=workflow_guidance,
             )
 
-            # STEP 4: Save tool interaction to conversation history
+            # STEP 4: Save tool interaction to conversation history using the REAL task_id
+            save_session_id = (
+                task_metadata.task_id if getattr(task_metadata, "task_id", None) else (task_id or "test_task")
+            )
             await conversation_service.save_tool_interaction_and_cleanup(
-                session_id=task_id or "test_task",  # Use task_id as primary key
+                session_id=save_session_id,  # Always prefer real task_id
                 tool_name="judge_code_change",
                 tool_input=json.dumps(original_input),
                 tool_output=json.dumps(
diff --git a/src/mcp_as_a_judge/tasks/manager.py b/src/mcp_as_a_judge/tasks/manager.py
@@ -203,6 +203,26 @@ async def load_task_metadata_from_history(
                     latest_snapshot["state"] = older_md["state"]
                     break
 
+            # As a final safeguard, infer a reasonable state from approval markers
+            # if no explicit state could be found in history.
+            if "state" not in latest_snapshot:
+                try:
+                    # If testing was approved, task must be at least TESTING
+                    if latest_snapshot.get("testing_approved_at"):
+                        latest_snapshot["state"] = TaskState.TESTING.value
+                    # If any code files were approved, the task transitioned to TESTING after review
+                    elif latest_snapshot.get("code_approved_files"):
+                        if isinstance(latest_snapshot.get("code_approved_files"), dict) and len(
+                            latest_snapshot.get("code_approved_files")
+                        ) > 0:
+                            latest_snapshot["state"] = TaskState.TESTING.value
+                    # If plan was approved, set PLAN_APPROVED
+                    elif latest_snapshot.get("plan_approved_at"):
+                        latest_snapshot["state"] = TaskState.PLAN_APPROVED.value
+                except Exception:
+                    # Best-effort inference only
+                    pass
+
             try:
                 return TaskMetadata.model_validate(latest_snapshot)
             except ValidationError:
diff --git a/src/mcp_as_a_judge/workflow/workflow_guidance.py b/src/mcp_as_a_judge/workflow/workflow_guidance.py
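
Note on the `load_task_metadata_from_history` hunk above: the fallback infers a state from approval markers only when no explicit `state` was recovered from history. A minimal standalone sketch of that precedence follows, assuming a plain `dict` snapshot. The `infer_state` helper and the `TaskState` string values are hypothetical stand-ins for illustration, not the project's code. One deliberate difference: in the committed `elif` chain, a truthy but non-dict `code_approved_files` appears to skip the `plan_approved_at` check, whereas this sketch falls through to it.

```python
from __future__ import annotations

from enum import Enum
from typing import Any


class TaskState(str, Enum):
    # Hypothetical subset of the project's TaskState enum; values are assumed.
    PLAN_APPROVED = "plan_approved"
    TESTING = "testing"


def infer_state(snapshot: dict[str, Any]) -> str | None:
    """Return an inferred state value, or None when no approval marker is set."""
    # Testing approval is the strongest marker: the task is at least TESTING.
    if snapshot.get("testing_approved_at"):
        return TaskState.TESTING.value
    # A non-empty dict of approved files means code review already passed.
    approved_files = snapshot.get("code_approved_files")
    if isinstance(approved_files, dict) and approved_files:
        return TaskState.TESTING.value
    # Otherwise fall back to plan approval, if recorded.
    if snapshot.get("plan_approved_at"):
        return TaskState.PLAN_APPROVED.value
    return None


if __name__ == "__main__":
    print(infer_state({"plan_approved_at": "2025-09-14T00:06:28"}))
```

Returning `None` instead of mutating the snapshot leaves the caller to decide whether to write the inferred value back, which keeps the inference best-effort in the same spirit as the hunk.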
