feat(diagnostics): add self-diagnosis system for failed task analysis #300
Implements a comprehensive self-diagnosis system that activates when task execution fails, providing actionable recommendations for recovery.

Core Components:
- RunLogger: Structured logging with categories (AGENT_ACTION, LLM_CALL, ERROR, VERIFICATION, STATE_CHANGE, BLOCKER)
- DiagnosticAgent: Pattern-based + LLM-powered failure analysis
- DiagnosticReport: Root cause, severity, and recommendations

CLI Commands:
- `cf work diagnose <task-id>`: Analyze a failed run and show recommendations
- `cf work retry <task-id>`: Reset a failed task and retry execution
- `cf work update-description <task-id> <desc>`: Update a task description

Features:
- Pattern detection for common failures (missing deps, code quality, blockers, environment issues, task description problems)
- LLM-powered root cause analysis for complex failures
- Severity assessment (CRITICAL, HIGH, MEDIUM, LOW)
- Executable CLI commands in recommendations
- Persistent storage in SQLite (run_logs, diagnostic_reports tables)
- Integration with agent execution via RunLogger

Tests: 71 new tests (29 diagnostics, 19 diagnostic agent, 11 CLI, 12 integration)
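The sketch below illustrates how `RunLogger` feeds the diagnostics tables. The constructor and the `error(category, message, metadata)` signature follow snippets quoted later in this thread; the `info(...)` helper and the surrounding function are hypothetical:

```python
# Hypothetical wiring of RunLogger into an agent step (a sketch, not this PR's code).
from codeframe.core.diagnostics import LogCategory, RunLogger

def run_step_with_logging(workspace, run_id: str, task_id: str) -> None:
    logger = RunLogger(workspace, run_id, task_id)
    logger.info(LogCategory.AGENT_ACTION, "Starting step: implement feature")  # assumed helper
    try:
        ...  # execute the agent step
    except ImportError as exc:
        # ERROR entries are what DiagnosticAgent later mines for failure patterns
        logger.error(LogCategory.ERROR, f"Agent execution failed: {exc}", {"step": 1})
        raise
```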
Note: CodeRabbit detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: Adds a diagnostics subsystem: CLI commands (`diagnose`, `retry`, `update-description`), structured run logging via `RunLogger`, a `DiagnosticAgent` for failure analysis, and persisted diagnostic reports.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant CLI as "work diagnose\n(CLI)"
    participant Agent as "DiagnosticAgent"
    participant DB as "Workspace DB"
    participant LLM as "LLM Provider\n(optional)"
    participant Display as "Report Display"

    User->>CLI: diagnose task_id [--force]
    CLI->>DB: resolve task (partial ID) & find failed run
    CLI->>Agent: analyze(task_id, run_id)
    Agent->>DB: get_run_logs(run_id)
    DB-->>Agent: list[RunLogEntry]
    Agent->>Agent: detect_failure_patterns(logs)
    Agent->>Agent: assess_severity()
    Agent->>Agent: generate_recommendations()
    alt with LLM
        Agent->>LLM: _analyze_with_llm(logs, error_msgs)
        LLM-->>Agent: root_cause text
    else without LLM
        Agent->>Agent: _generate_root_cause()
    end
    Agent->>Agent: summarize_logs()
    Agent->>DB: save_diagnostic_report()
    DB-->>Agent: DiagnosticReport
    Agent-->>CLI: DiagnosticReport
    CLI->>Display: _display_diagnostic_report()
    Display-->>User: formatted report (severity, category, recommendations)
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
Review: Self-Diagnosis System Implementation

Summary

This PR adds a comprehensive self-diagnosis system for failed task analysis with pattern-based and LLM-powered failure detection. The implementation includes structured logging (`RunLogger`), diagnostic analysis (`DiagnosticAgent`), and CLI commands (`diagnose`, `retry`, `update-description`). Test coverage is excellent with 71 new tests.

✅ Strengths
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@codeframe/core/diagnostic_agent.py`:
- Around line 568-610: The _analyze_with_llm method currently calls
llm_provider.complete with a raw string prompt which mismatches the provider
API; change the call to pass messages=[{"role":"user","content": prompt}] and
include the purpose argument (Purpose.EXECUTION) so it matches other usages of
llm_provider.complete and returns the LLM response correctly; update the try
block replacing the single-string call to llm_provider.complete with the
messages+purpose signature and keep the existing response.content handling and
exception fallback.
- Around line 511-548: The code computes severity (assess_severity) and
recommendations (generate_recommendations) before allowing the LLM override of
failure_category from _analyze_with_llm/_extract_category_from_llm; move or
repeat the calls so that after you call self._analyze_with_llm(logs,
error_messages) and possibly update failure_category from
_extract_category_from_llm, you recompute severity = assess_severity(...) and
recommendations = generate_recommendations(...); keep
detect_primary_failure_category, error_messages, summarize_logs as-is and only
reorder/re-call the two functions to ensure they use the final failure_category.
- Around line 224-240: Replace the incorrect CLI strings used in the
TASK_DESCRIPTION remediation recommendations: in the block that checks
FailureCategory.TASK_DESCRIPTION (where DiagnosticRecommendation and
RemediationAction.UPDATE_TASK_DESCRIPTION are used and task_id is available),
change the review command to "codeframe tasks get {task_id}" and change the
update command to "codeframe work update-description {task_id} \"<new
description>\"" (keep the same actions, reasons, and parameters dictionary).
This updates the first recommendation's command to use codeframe tasks get and
the second to use codeframe work update-description so the suggested CLI steps
match the documented commands.
In `@codeframe/core/tasks.py`:
- Around line 254-329: The update function in tasks.py currently bypasses the
repository and issues raw sqlite3 queries via get_db_connection; replace that
logic by calling TaskRepository.update_task(workspace, task_id, title=...,
description=..., priority=...) (or the repository's equivalent method) so all
validation, safety checks, and persistence are handled by the domain layer;
remove the manual cursor/execute/commit/close and manual updated_at assignment,
and ensure the repository's return value (updated Task) is returned from
update(); pass only the fields that are not None.
In `@tests/cli/test_work_diagnose.py`:
- Around line 99-125: The tests test_diagnose_shows_recommendations and
test_diagnose_with_verbose are using the unreliable check "result.exit_code == 0
or result.exit_code is None"; replace this with deterministic assertions: assert
result.exit_code == 0 and assert result.exception is None after the
runner.invoke calls (or, if you expect a handled failure, assert the specific
exit code and inspect result.exception for the expected exception type/message).
Update the assertions in the two test functions that call runner.invoke
(referencing runner.invoke, result.exit_code, and result.exception) to remove
"is None" and use explicit, deterministic checks for success or for a specific
expected failure.
- Around line 70-77: The current weak assertion allows failures to slip by;
update the test in tests/cli/test_work_diagnose.py that checks result from
runner.invoke(app, ["work", "diagnose", task.id[:8], "--workspace",
str(workspace.repo_path)]) to assert explicitly that the command succeeded
(e.g., assert result.exit_code == 0) and additionally verify there is no error
output (e.g., assert "Error" not in result.stdout and/or assert result.exception
is None) so a non-zero exit code cannot pass the test.
🧹 Nitpick comments (11)
codeframe/core/diagnostic_agent.py (1)
351-371: Broaden package-name extraction to handle hyphens/dots. The current regex only matches `\w`, so it misses names like `pydantic-core` or `google.auth`.

♻️ Proposed tweak

```diff
-    match = re.search(r"no module named ['\"]?(\w+)", msg, re.IGNORECASE)
+    match = re.search(r"no module named ['\"]?([\w.-]+)", msg, re.IGNORECASE)
     if match:
         return match.group(1)
-    match = re.search(r"cannot import.*from ['\"]?(\w+)", msg, re.IGNORECASE)
+    match = re.search(r"cannot import.*from ['\"]?([\w.-]+)", msg, re.IGNORECASE)
     if match:
         return match.group(1)
```

codeframe/core/runtime.py (1)
756-773: Include actionable error text in the ERROR log message. `DiagnosticAgent` extracts `error_messages` from `log.message`, but the current message is generic and the details live only in metadata. Consider embedding a short error excerpt in the message to improve downstream recommendations.

🔧 Example adjustment

```diff
-    run_logger.error(LogCategory.ERROR, "Agent execution failed", error_info)
+    message = "Agent execution failed"
+    if error_info.get("last_step_error"):
+        message = f"{message}: {error_info['last_step_error'][:200]}"
+    run_logger.error(LogCategory.ERROR, message, error_info)
```

tests/core/test_diagnostic_agent.py (4)
1-24: Missing `pytestmark` for v2 tests. Other test files in this PR (e.g., `test_work_diagnose.py`, `test_self_diagnosis_workflow.py`) include `pytestmark = pytest.mark.v2` to mark all tests as v2. This file is missing that marker, which may cause inconsistency in test filtering.

♻️ Add v2 marker for consistency

```diff
 from codeframe.core.workspace import create_or_load_workspace

+# Mark all tests as v2
+pytestmark = pytest.mark.v2
+
+
 @pytest.fixture
 def workspace(tmp_path: Path):
```
155-168: Unused import in test method. `DiagnosticAgent` is imported on line 157 but not used in this test. Only `detect_failure_patterns` is used.

♻️ Remove unused import

```diff
     def test_detect_ambiguous_task_pattern(self, workspace, run_id, task_id):
         """Test detection of ambiguous task description pattern."""
-        from codeframe.core.diagnostic_agent import DiagnosticAgent, detect_failure_patterns
+        from codeframe.core.diagnostic_agent import detect_failure_patterns

         logger = RunLogger(workspace, run_id, task_id)
```
218-273: Unused fixture parameters in recommendation tests. The `workspace` fixture is passed to `test_generate_update_description_recommendation`, `test_generate_retry_recommendation`, and `test_generate_dependency_fix_recommendation` but is not used. The `generate_recommendations` function doesn't require a workspace.

♻️ Remove unused workspace fixture from recommendation tests

```diff
-    def test_generate_update_description_recommendation(self, workspace, run_id, task_id):
+    def test_generate_update_description_recommendation(self, run_id, task_id):
         """Test generating UPDATE_TASK_DESCRIPTION recommendation."""
         from codeframe.core.diagnostic_agent import generate_recommendations
```

Apply similar changes to `test_generate_retry_recommendation` and `test_generate_dependency_fix_recommendation`.
275-324: Unused fixture parameters in severity tests. The `workspace`, `run_id`, and `task_id` fixtures are passed to all severity assessment tests but are not used. The `assess_severity` function only takes `failure_category`, `error_count`, and `has_blocker` parameters.

♻️ Remove unused fixtures from severity tests

```diff
-    def test_critical_severity_for_repeated_failures(self, workspace, run_id, task_id):
+    def test_critical_severity_for_repeated_failures(self):
         """Test that repeated failures get CRITICAL severity."""
         from codeframe.core.diagnostic_agent import assess_severity
```

Apply similar changes to the other severity test methods.
tests/cli/test_work_diagnose.py (2)
6-10: Unused imports. `patch` from `unittest.mock` (line 10) is imported but never used in this file.

♻️ Remove unused import

```diff
 import pytest
 import uuid
 from datetime import datetime, timezone
 from pathlib import Path
-from unittest.mock import MagicMock, patch
+from unittest.mock import MagicMock
```
228-243: Test relies on probabilistic UUID matching. The test uses `"a"` as a task ID prefix (line 238), hoping it might match multiple UUIDs. However, UUIDs are randomly generated, so this test's behavior is non-deterministic. The comment acknowledges this, but the test may not reliably exercise the "multiple matches" path. Consider creating tasks with known prefixes or mocking the task lookup to ensure consistent test behavior, e.g. as sketched below.
tests/core/test_diagnostics.py (1)
1-31: Missing `pytestmark` and unused import.

- Similar to `test_diagnostic_agent.py`, this file is missing `pytestmark = pytest.mark.v2` for consistency with other test files in this PR.
- `json` (line 6) is imported but not used anywhere in the file.

♻️ Add v2 marker and remove unused import

```diff
 """Tests for diagnostics module.

 Tests the RunLogger, diagnostic reports, and related functions.
 """

-import json
 import pytest
 import uuid
 from datetime import datetime, timezone
 from pathlib import Path

 from codeframe.core.diagnostics import (
     ...
 )
 from codeframe.core.workspace import create_or_load_workspace, get_db_connection

+# Mark all tests as v2
+pytestmark = pytest.mark.v2
+
+
 @pytest.fixture
```

tests/integration/test_self_diagnosis_workflow.py (2)
6-27: Unused imports. Several imports are not used in this file:

- `datetime` and `timezone` (line 8) - not referenced anywhere
- `get_run_logs` (line 22) - not called in any test

♻️ Remove unused imports

```diff
 import pytest
 import uuid
-from datetime import datetime, timezone
 from pathlib import Path

 from typer.testing import CliRunner

 from codeframe.cli.app import app
 from codeframe.core.diagnostics import (
     DiagnosticReport,
     FailureCategory,
     LogCategory,
     RemediationAction,
     RunLogger,
     Severity,
     get_latest_diagnostic_report,
-    get_run_logs,
 )
```
100-158: CLI tests use loose assertions. Lines 115 and 135 use `exit_code == 0 or exit_code is None`, which is the same pattern flagged in `test_work_diagnose.py`. These assertions may mask unexpected failures; a deterministic variant is shown below. The workaround in `test_retry_command_resets_failed_task` (lines 152-157) of using `tasks.update_status` instead of the actual retry command is pragmatic given the API key dependency, but consider adding a TODO or comment about eventually testing the full retry flow in an environment with credentials.
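A minimal replacement for the loose pattern (`result` comes from `runner.invoke`, as in the tests above):

```python
# Fail loudly and show the CLI output when the command did not succeed.
assert result.exit_code == 0, result.stdout
assert result.exception is None
```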
```python
    if failure_category == FailureCategory.TASK_DESCRIPTION:
        recommendations.append(
            DiagnosticRecommendation(
                action=RemediationAction.UPDATE_TASK_DESCRIPTION,
                reason="Task description may be ambiguous or incomplete. Adding clearer acceptance criteria will help the agent understand requirements.",
                command=f"cf tasks show {task_id}  # Review current description, then update with clearer criteria",
                parameters={"task_id": task_id},
            )
        )
        recommendations.append(
            DiagnosticRecommendation(
                action=RemediationAction.RETRY_WITH_CONTEXT,
                reason="After updating the description, retry the task",
                command=f"cf work start {task_id} --execute",
                parameters={"task_id": task_id},
            )
        )
```
🧩 Analysis chain

(Verification scripts and their output collapsed: the bot searched `codeframe/cli/app.py` and `codeframe/cli/tasks_commands.py` for the `tasks show` / `update-description` command definitions and inspected the recommendation block in `codeframe/core/diagnostic_agent.py`.)
Use `codeframe tasks get` to review, then `codeframe work update-description` to update. The `cf tasks show` command doesn't exist. The available commands are `codeframe tasks get {task_id}` to review the current description and `codeframe work update-description {task_id} "<new description>"` to update it. The second command is documented as the proper way to clarify requirements after a failed run.
🔧 Suggested adjustment

```diff
-                command=f"cf tasks show {task_id}  # Review current description, then update with clearer criteria",
+                command=f"codeframe tasks get {task_id}  # Review current description",
+            )
+        )
+        recommendations.append(
+            DiagnosticRecommendation(
+                action=RemediationAction.UPDATE_TASK_DESCRIPTION,
+                reason="Update the task description with clearer acceptance criteria",
+                command=f"codeframe work update-description {task_id} \"<add acceptance criteria>\"",
```

🤖 Prompt for AI Agents
In `@codeframe/core/diagnostic_agent.py` around lines 224 - 240, Replace the
incorrect CLI strings used in the TASK_DESCRIPTION remediation recommendations:
in the block that checks FailureCategory.TASK_DESCRIPTION (where
DiagnosticRecommendation and RemediationAction.UPDATE_TASK_DESCRIPTION are used
and task_id is available), change the review command to "codeframe tasks get
{task_id}" and change the update command to "codeframe work update-description
{task_id} \"<new description>\"" (keep the same actions, reasons, and parameters
dictionary). This updates the first recommendation's command to use codeframe
tasks get and the second to use codeframe work update-description so the
suggested CLI steps match the documented commands.
```python
    def _analyze_with_llm(
        self,
        logs: list[RunLogEntry],
        error_messages: list[str],
    ) -> str:
        """Use LLM to analyze logs and determine root cause.

        Args:
            logs: Log entries to analyze
            error_messages: Extracted error messages

        Returns:
            Root cause description from LLM
        """
        if not self.llm_provider:
            return "LLM analysis not available"

        # Build prompt
        log_text = "\n".join(
            f"[{log.log_level.value}] {log.category.value}: {log.message}"
            for log in logs[-30:]  # Last 30 logs
        )

        prompt = f"""Analyze the following agent execution logs and determine the root cause of the failure.

Logs:
{log_text}

Error messages:
{chr(10).join(error_messages[:10])}

Provide your analysis in this format:
Root Cause: [One sentence description of the root cause]
Failure Category: [One of: task_description, blocker_unresolved, model_limitation, code_quality, dependency_issue, environment_issue, technical_error, unknown]
Severity: [One of: critical, high, medium, low]

Then provide recommendations."""

        try:
            response = self.llm_provider.complete(prompt)
            return response.content if hasattr(response, 'content') else str(response)
        except Exception as e:
            return f"LLM analysis failed: {e}"
```
🧩 Analysis chain

(Verification scripts and their output collapsed: the bot located the `LLMProvider` class, the `complete` method definition, and all `.complete(` call sites to confirm the expected signature.)
Fix LLMProvider.complete call to use correct API signature. The `complete()` method expects `messages: list[dict]` as its first parameter, but this code passes a raw string prompt. All other call sites in the codebase use the correct pattern: `provider.complete(messages=[{"role": "user", "content": prompt}], purpose=Purpose.XXX)`.
Change to:

```python
response = self.llm_provider.complete(
    messages=[{"role": "user", "content": prompt}],
    purpose=Purpose.EXECUTION,
)
```

🤖 Prompt for AI Agents
In `@codeframe/core/diagnostic_agent.py` around lines 568 - 610, The
_analyze_with_llm method currently calls llm_provider.complete with a raw string
prompt which mismatches the provider API; change the call to pass
messages=[{"role":"user","content": prompt}] and include the purpose argument
(Purpose.EXECUTION) so it matches other usages of llm_provider.complete and
returns the LLM response correctly; update the try block replacing the
single-string call to llm_provider.complete with the messages+purpose signature
and keep the existing response.content handling and exception fallback.
```python
def update(
    workspace: Workspace,
    task_id: str,
    title: Optional[str] = None,
    description: Optional[str] = None,
    priority: Optional[int] = None,
) -> Task:
    """Update a task's title, description, or priority.

    Only provided fields are updated; others are left unchanged.

    Args:
        workspace: Target workspace
        task_id: Task to update
        title: New title (optional)
        description: New description (optional)
        priority: New priority (optional)

    Returns:
        Updated Task

    Raises:
        ValueError: If task not found
    """
    task = get(workspace, task_id)
    if not task:
        raise ValueError(f"Task not found: {task_id}")

    # Build update query dynamically
    updates = []
    params = []

    if title is not None:
        updates.append("title = ?")
        params.append(title)
        task.title = title

    if description is not None:
        updates.append("description = ?")
        params.append(description)
        task.description = description

    if priority is not None:
        updates.append("priority = ?")
        params.append(priority)
        task.priority = priority

    if not updates:
        # Nothing to update
        return task

    now = _utc_now().isoformat()
    updates.append("updated_at = ?")
    params.append(now)

    # Add WHERE clause params
    params.extend([workspace.id, task_id])

    conn = get_db_connection(workspace)
    cursor = conn.cursor()

    cursor.execute(
        f"""
        UPDATE tasks
        SET {', '.join(updates)}
        WHERE workspace_id = ? AND id = ?
        """,
        params,
    )
    conn.commit()
    conn.close()

    task.updated_at = datetime.fromisoformat(now)

    return task
```
🧩 Analysis chain

(Verification scripts and their output collapsed: the bot inspected `codeframe/persistence/repositories/task_repository.py`, `base.py`, and `database.py`, confirmed a synchronous `TaskRepository.update_task()` exists, and verified that `codeframe/core/tasks.py` performs direct `sqlite3` access via `get_db_connection` instead of using the repository.)
Route updates through the task repository layer. This function performs direct sqlite3 updates via `get_db_connection`, but the guideline requires using domain-specific repositories for data access. The `TaskRepository.update_task()` method already exists and handles updates with proper validation and safety checks. Call the repository method here to maintain consistency with the established data access pattern (a sketch follows the AI-agent prompt below).
🤖 Prompt for AI Agents
In `@codeframe/core/tasks.py` around lines 254 - 329, The update function in
tasks.py currently bypasses the repository and issues raw sqlite3 queries via
get_db_connection; replace that logic by calling
TaskRepository.update_task(workspace, task_id, title=..., description=...,
priority=...) (or the repository's equivalent method) so all validation, safety
checks, and persistence are handled by the domain layer; remove the manual
cursor/execute/commit/close and manual updated_at assignment, and ensure the
repository's return value (updated Task) is returned from update(); pass only
the fields that are not None.
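A minimal sketch of the suggested refactor, assuming `TaskRepository` is constructed from a workspace and `update_task` accepts keyword fields and returns the updated `Task` (both assumptions; check the actual signatures in `codeframe/persistence/repositories/task_repository.py`):

```python
from typing import Optional

from codeframe.persistence.repositories.task_repository import TaskRepository

def update(workspace, task_id: str, title: Optional[str] = None,
           description: Optional[str] = None, priority: Optional[int] = None):
    # Collect only the fields the caller actually provided.
    fields = {
        key: value
        for key, value in {"title": title, "description": description,
                           "priority": priority}.items()
        if value is not None
    }
    if not fields:
        return get(workspace, task_id)  # nothing to change; reuse the existing lookup
    repo = TaskRepository(workspace)  # construction pattern is an assumption
    # Validation, updated_at handling, and persistence live in the repository.
    return repo.update_task(task_id, **fields)
```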
Review Summary

This PR implements a comprehensive self-diagnosis system for failed task execution with pattern-based detection, LLM analysis, and actionable recommendations. The architecture is well-designed and follows the headless, CLI-first v2 approach. Test coverage is excellent with 71 new tests. However, there are critical issues that must be addressed before merging:

🔴 Critical Issues

1. Severity/Recommendations Stale After LLM Override (diagnostic_agent.py:511-548)

Issue: Severity and recommendations are computed on lines 523-536, but the LLM can override `failure_category` afterwards, leaving them stale.

Fix: Move or recompute severity and recommendations after line 547 (after the potential LLM category update):

```python
# After llm_category is applied (line 547):
failure_category = llm_category

# Now recompute based on final category
severity = assess_severity(
    failure_category=failure_category,
    error_count=len(errors),
    has_blocker=has_blocker,
)
recommendations = generate_recommendations(
    task_id=task_id,
    run_id=run_id,
    failure_category=failure_category,
    error_messages=error_messages,
)
```

2. LLMProvider.complete Call Uses Wrong API (diagnostic_agent.py:607)

Issue: Line 607 passes a raw string prompt, but `complete()` expects a messages list.

Fix:

```python
response = self.llm_provider.complete(
    messages=[{"role": "user", "content": prompt}],
    purpose=Purpose.EXECUTION,
)
```

You'll also need to import `Purpose`.

🟡 Medium Priority

3. Database Connection Leak in tasks.update() (tasks.py:312-324)

Issue: The connection is not closed if `cursor.execute()` or `conn.commit()` raises.

Fix:

```python
conn = get_db_connection(workspace)
try:
    cursor = conn.cursor()
    # ... existing code ...
    conn.commit()
    conn.close()
except Exception as e:
    conn.close()
    raise  # Re-raise after cleanup
```

Note: Consider also routing this through `TaskRepository`.

🟢 Minor / Nitpicks

4. Package Name Extraction Limited (diagnostic_agent.py:362, 367)

Suggestion: The regex only matches `\w`, missing hyphenated/dotted names; broaden it:

```python
match = re.search(r"no module named ['\"]?([\w.-]+)", msg, re.IGNORECASE)
```

5. CLI Recommendation Uses Wrong Commands (diagnostic_agent.py:229, 237)

Issue: Line 229 uses a `cf tasks show` command that doesn't exist. Fix: use the documented `codeframe tasks get` / `codeframe work update-description` commands.

6. Overly Defensive Type Check (runtime.py:769)

Issue: Line 769 defensively checks the status type before reading `.value`. Fix: Simplify to:

```python
error_info["last_step_status"] = last_step.status.value
```

✅ Strengths

📊 Test Coverage Notes

🔒 Security

✅ No security concerns.

Overall Assessment

Status: 🟡 Needs Minor Fixes (Critical) — The implementation is solid and the feature adds significant value. However, issues #1 (stale severity/recommendations) and #2 (wrong LLM API call) are critical bugs that will prevent the feature from working correctly in production. Issue #3 (connection leak) is important for resource management. Please address issues 1-3 before merging. Issues 4-6 can be handled in follow-up PRs or at your discretion.
I've posted a comprehensive review of PR #300 focusing on:

- Critical Issues: stale severity/recommendations after the LLM category override, and the wrong `LLMProvider.complete` call signature
- Medium Priority: the database connection leak in `tasks.update()`
- Minor Issues: limited package-name extraction, wrong CLI commands in recommendations, and an overly defensive type check

The review also highlights the strengths (well-architected, follows v2 patterns, excellent test coverage) and confirms no security concerns. Overall assessment: needs the critical fixes before merging. Review posted at: #300 (comment)
1. Fix severity/recommendations stale after LLM override
- Move assess_severity() and generate_recommendations() calls
to after the LLM category update (lines 536-549)
- Ensures report has consistent category/severity/recommendations
2. Fix LLMProvider.complete() call to use correct API
- Changed from complete(prompt) to complete(messages=[...], purpose=...)
- Added Purpose import from codeframe.adapters.llm
3. Fix database connection leak in tasks.update()
- Added try/finally block to ensure conn.close() always runs
- Consistent with create() function pattern
Code Review Summary

This PR implements a comprehensive self-diagnosis system for failed task analysis. The architecture is well-designed and follows the headless, CLI-first v2 approach.

Overall Assessment: 🟢 Ready with minor improvements recommended

✅ Previously Reported Issues (All Fixed)

The critical issues mentioned in earlier reviews have been addressed in the current code.

🔴 New Issue: Schema Migration Risk

File: `codeframe/core/workspace.py`. The schema upgrade logic creates the `run_logs` and `diagnostic_reports` tables with foreign keys to `runs`, but does not ensure the `runs` table exists first. If a workspace is initialized from an older state where the `runs` table doesn't exist, this will fail.

Recommendation: Add a check to ensure the `runs` table exists before creating the dependent tables (see the sketch after this review).

🟡 Minor Improvements

1. Package Name Extraction (diagnostic_agent.py:363, 368) — The regex doesn't match packages with hyphens or dots (e.g., `pydantic-core`, `google.auth`). Suggestion: Expand it to handle hyphens and dots.

2. Inconsistent CLI Commands (diagnostic_agent.py:230, 238, 278, 296, 306) — Several recommendations use incorrect CLI commands. Suggestion: Update them to match the actual commands defined in the CLI.

3. Missing Connection Cleanup (tasks.py:234-246) — The function doesn't use try/finally for connection cleanup, unlike its siblings; if a query fails, the connection leaks. Suggestion: Wrap it in try/finally like the other functions in the module.

✅ Strengths

🔒 Security

✅ No significant security concerns.

📊 Final Recommendation

Status: 🟢 Ready with recommended fix — The schema migration risk should be addressed before merging to prevent potential failures on older workspaces. The minor improvements (package regex, CLI commands, connection cleanup) can be handled in a follow-up PR or at your discretion. Great work on this feature! The self-diagnosis system significantly improves the developer experience when tasks fail.
Review comment posted successfully at #300 (comment). The review acknowledges that all previously reported critical issues (stale severity/recommendations, wrong LLM API signature, database connection leak) have been fixed in the current code. It identifies a new schema migration risk where the foreign key references to the `runs` table may fail if that table doesn't exist yet on older workspaces.
1. Schema Migration: Ensure runs table exists before dependent tables
- Added runs table creation in _ensure_schema_upgrades() before
creating run_logs and diagnostic_reports tables
- Prevents foreign key errors on older workspaces
2. Package Name Regex: Handle hyphens and dots
- Changed (\w+) to ([\w\-\.]+) in _extract_package_name()
- Now correctly extracts: pydantic-core, google.auth, typing-extensions
3. CLI Commands: Use correct command names
- Changed "cf tasks show" to "cf work update-description"
- Changed "cf work start" to "cf work retry" for retry recommendations
- Updated test to expect new command format
4. Connection Cleanup: Add try/finally to update_status()
- Ensures conn.close() always runs even on exception
- Consistent with create() and update() patterns
Code Review Summary

This PR implements a comprehensive self-diagnosis system with excellent test coverage (71 tests). The architecture follows v2 patterns well. However, there are critical issues that must be addressed before merging.

🔴 Critical Issues

1. Schema Migration Failure Risk (workspace.py:333-380) — The upgrade logic creates tables with foreign keys to `runs` without ensuring that table exists. Fix: Add a check to ensure `runs` exists first.

2. Fragile LLM Category Extraction (diagnostic_agent.py:620-640) — Fix: Parse specifically the "Failure Category:" line using regex pattern matching.

3. String Truncation Bug (diagnostic_agent.py:464) — When `max_length < 3`, the slice index goes negative and truncation misbehaves. Fix: Add a guard such as `max(0, max_length - 3)`.

🟡 Medium Priority Issues

4. Bypass Repository Layer (tasks.py:254-329) — The `update()` function issues raw SQL instead of going through `TaskRepository`.

5. Incorrect CLI Commands (diagnostic_agent.py:230, 238, 278, 296, 306) — Recommendations reference non-existent commands.

6. Weak Test Assertions (tests/cli/test_work_diagnose.py:76, 115, 124) — Tests use `exit_code == 0 or exit_code is None`, which can mask failures.

✅ Strengths

🔒 Security

✅ No security concerns - SQL queries use proper parameterization.

📊 Final Assessment

Status: 🟡 Needs Critical Fixes — Issues #1 (schema), #2 (LLM parsing), and #3 (string truncation) must be fixed before merging. Issues #4-6 can be handled in follow-ups. The implementation is solid overall, but these bugs will cause runtime failures in production scenarios.
Review posted at: #300 (comment)
```python
    verbose_mode = " [dim](verbose)[/dim]" if verbose else ""
    console.print(f"\n[bold]Executing agent...{mode}{verbose_mode}[/bold]")

    state = runtime.execute_agent(workspace, run, dry_run=dry_run, verbose=verbose)
```
`_event_type_to_category` is called in the `on_agent_event` callback but never defined, causing a `NameError` at runtime. Consider adding the function definition or using `LogCategory.AGENT_ACTION` directly. One possible shape is sketched below.
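A sketch of the missing helper; the event-type strings are assumptions, while the `LogCategory` members come from this PR's description, and the default matches the reviewer's suggested fallback:

```python
def _event_type_to_category(event_type: str) -> LogCategory:
    # Map agent event types onto the structured log categories; unknown
    # event types fall back to AGENT_ACTION.
    mapping = {
        "llm_call": LogCategory.LLM_CALL,
        "error": LogCategory.ERROR,
        "verification": LogCategory.VERIFICATION,
        "state_change": LogCategory.STATE_CHANGE,
        "blocker": LogCategory.BLOCKER,
    }
    return mapping.get(event_type, LogCategory.AGENT_ACTION)
```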
1. Fragile LLM Category Extraction (diagnostic_agent.py:618-651)
   - Changed from simple substring matching to regex parsing
   - Now specifically parses the "Failure Category:" line
   - Prevents false matches like "NOT a dependency_issue"

2. String Truncation Bug (diagnostic_agent.py:463-464)
   - Added guard: max(0, max_length - 3) for negative index
   - Prevents incorrect behavior when max_length < 3

3. Weak Test Assertions (test_work_diagnose.py)
   - Changed "exit_code == 0 or exit_code is None" to explicit "exit_code == 0"
   - Added descriptive error messages with stdout output
   - Fixed 6 weak assertions across the test file
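A sketch of what fixes 1 and 2 could look like; `_extract_category_from_llm` matches the name used earlier in this thread, while `_truncate` is a hypothetical stand-in for the helper at diagnostic_agent.py:463-464:

```python
import re
from typing import Optional

def _extract_category_from_llm(response_text: str) -> Optional[str]:
    # Parse only the "Failure Category:" line so stray mentions elsewhere in
    # the response (e.g. "NOT a dependency_issue") cannot match.
    match = re.search(
        r"^Failure Category:\s*(\w+)", response_text, re.IGNORECASE | re.MULTILINE
    )
    return match.group(1).lower() if match else None

def _truncate(text: str, max_length: int) -> str:
    if len(text) <= max_length:
        return text
    # Guard keeps the slice index non-negative when max_length < 3.
    return text[: max(0, max_length - 3)] + "..."
```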
Code Review Summary

This PR implements a comprehensive self-diagnosis system for failed task analysis with pattern-based and LLM-powered failure detection. All previously reported critical issues have been addressed.

Overall Assessment: 🟢 Ready with one minor improvement

✅ Previously Reported Issues (All Fixed)

All critical and medium priority issues from earlier reviews have been addressed in the current code.

🟡 Minor Improvement

CLI Command Prefix Mismatch

File: `codeframe/core/diagnostic_agent.py` (lines 230, 238, 248, 278, 296, 306, 334, 344). The diagnostic recommendations use the `cf` prefix instead of `codeframe`. According to the CLI entry point (app.py:6), the command structure is `codeframe work ...`, not `cf work ...`.

Suggestion: Replace all instances of `cf` with `codeframe`.

ℹ️ Note on Repository Layer

The `update()` function in tasks.py uses direct SQL queries instead of routing through `TaskRepository`. However, this appears to be an intentional design choice: it represents a transitional pattern in the codebase during the v2 refactor and is acceptable for this PR.

✅ Strengths

🔒 Security

✅ No security concerns.

📊 Final Recommendation

Status: 🟢 Ready with minor improvement — The only remaining issue is a CLI command prefix mismatch that affects user-facing documentation in recommendations. This is cosmetic and doesn't affect functionality, but should be fixed for consistency. Excellent work on addressing all the critical feedback from previous reviews. The self-diagnosis system significantly improves the developer experience when tasks fail.
I've posted a comprehensive code review for PR #300. Here's a summary:

Overall Assessment: 🟢 Ready with one minor improvement

All critical issues from previous reviews have been fixed. Remaining issue (minor): the diagnostic recommendations use the `cf` command prefix instead of `codeframe`. The implementation is solid with excellent test coverage (71 tests) and follows v2 patterns well. Only a cosmetic CLI prefix issue remains. Review posted at: #300 (comment)




Summary
New CLI Commands
Core Components
- `RunLogger`
- `DiagnosticAgent`
- `DiagnosticReport`

Failure Categories Detected
Database Schema
Added two new tables:
- `run_logs` - Structured log entries per run
- `diagnostic_reports` - Persisted analysis results

Test Plan
Total: 71 new tests, all passing (795 v2 tests pass)
Summary by CodeRabbit
New Features
Tests