Skip to content

Conversation

@frankbria
Copy link
Owner

@frankbria frankbria commented Jan 22, 2026

Summary

  • Implements a comprehensive self-diagnosis system that activates when task execution fails
  • Provides actionable recommendations for recovery based on pattern detection and LLM analysis
  • Adds structured logging during agent execution for post-mortem analysis

New CLI Commands

cf work diagnose <task-id>              # Analyze failed run and show recommendations
cf work diagnose <task-id> --verbose    # Include full log summary
cf work retry <task-id>                 # Reset failed task and retry execution
cf work update-description <task-id>    # Update task description (common remediation)

Core Components

Component Purpose
RunLogger Structured logging with categories (AGENT_ACTION, LLM_CALL, ERROR, VERIFICATION)
DiagnosticAgent Pattern-based + LLM-powered failure analysis
DiagnosticReport Root cause, severity (CRITICAL/HIGH/MEDIUM/LOW), and recommendations

Failure Categories Detected

  • task_description - Ambiguous or missing requirements
  • dependency_issue - Missing modules/packages
  • code_quality - Test or lint failures
  • blocker_unresolved - Human input needed
  • environment_issue - Permission or config problems
  • model_limitation - Context length or capability issues

Database Schema

Added two new tables:

  • run_logs - Structured log entries per run
  • diagnostic_reports - Persisted analysis results

Test Plan

  • 29 unit tests for RunLogger and diagnostics models
  • 19 unit tests for DiagnosticAgent pattern detection
  • 11 CLI command tests
  • 12 integration tests for full workflow
  • Manual testing on cf-test project with real failures

Total: 71 new tests, all passing (795 v2 tests pass)

Summary by CodeRabbit

  • New Features

    • CLI: diagnose — analyze failed runs, show colored severity, root cause, recommendations, and optional verbose logs
    • CLI: update-description — edit task descriptions and show next-step examples
    • Background diagnostics — headless Diagnostic Agent with persisted diagnostic reports and run-level logging for richer failure insights
    • Database: persistent storage and indexing for run logs and diagnostic reports
  • Tests

    • Added extensive unit, integration, and CLI tests covering diagnostics, reporting, workflows, and formatting

✏️ Tip: You can customize this high-level summary in your review settings.

Implements a comprehensive self-diagnosis system that activates when
task execution fails, providing actionable recommendations for recovery.

Core Components:
- RunLogger: Structured logging with categories (AGENT_ACTION, LLM_CALL,
  ERROR, VERIFICATION, STATE_CHANGE, BLOCKER)
- DiagnosticAgent: Pattern-based + LLM-powered failure analysis
- DiagnosticReport: Root cause, severity, and recommendations

CLI Commands:
- `cf work diagnose <task-id>`: Analyze failed run and show recommendations
- `cf work retry <task-id>`: Reset failed task and retry execution
- `cf work update-description <task-id> <desc>`: Update task description

Features:
- Pattern detection for common failures (missing deps, code quality,
  blockers, environment issues, task description problems)
- LLM-powered root cause analysis for complex failures
- Severity assessment (CRITICAL, HIGH, MEDIUM, LOW)
- Executable CLI commands in recommendations
- Persistent storage in SQLite (run_logs, diagnostic_reports tables)
- Integration with agent execution via RunLogger

Tests: 71 new tests (29 diagnostics, 19 diagnostic agent, 11 CLI, 12 integration)
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 22, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a diagnostics subsystem: CLI commands (work diagnose, work update-description), a headless DiagnosticAgent (pattern-based with optional LLM), run-level RunLogger and DB schemas for run logs and diagnostic reports, runtime instrumentation to emit diagnostic logs, task update API, and comprehensive unit and integration tests.

Changes

Cohort / File(s) Change Summary
CLI Commands
\codeframe/cli/app.py``
Added work diagnose (resolve partial task ID, pick most-recent failed run, reuse/regenerate DiagnosticReport with --force) and work update-description. Added _display_diagnostic_report and public CLI entrypoints; watch CLI output formatting and exit codes.
Diagnostic Engine
\codeframe/core/diagnostic_agent.py``
New DiagnosticAgent implementing pattern-based failure detection, optional LLM root-cause analysis, severity assessment, recommendation generation, log summarization, and persistence. Exposed helpers: detect_failure_patterns, detect_primary_failure_category, generate_recommendations, assess_severity, summarize_logs.
Diagnostics Data & Logging
\codeframe/core/diagnostics.py``
New enums/dataclasses (LogLevel, LogCategory, RemediationAction, FailureCategory, Severity, RunLogEntry, DiagnosticRecommendation, DiagnosticReport), RunLogger (buffered per-run logging + DB persistence), and helpers (get_run_logs, get_run_errors, count_logs_by_level, save_diagnostic_report, get_latest_diagnostic_report, list_diagnostic_reports). DB schema and serialization logic added—review JSON/timestamp handling.
Runtime Instrumentation
\codeframe/core/runtime.py``
Instrumented runtime to log agent events and final run state transitions via RunLogger; added helper _event_type_to_category(event_type: str) -> LogCategory. Check event-to-category mapping and extra log volume.
Task Management
\codeframe/core/tasks.py``
Added update(workspace, task_id, title=None, description=None, priority=None) -> Task to apply selective field updates and persist updated_at. Ensure SQL parameterization and not updating when no fields provided.
Workspace Schema
\codeframe/core/workspace.py``
Added run_logs and diagnostic_reports tables, indices, constraints, and idempotent schema-upgrade logic. Review migrations and foreign-key/index creation paths.
Tests
\tests/cli/test_work_diagnose.py`, `tests/core/test_diagnostic_agent.py`, `tests/core/test_diagnostics.py`, `tests/integration/test_self_diagnosis_workflow.py``
New unit and integration tests covering CLI diagnose/update-description, DiagnosticAgent (pattern and LLM paths), RunLogger/log retrieval, diagnostic report persistence, and end-to-end self-diagnosis workflows across multiple failure categories and edge cases.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as "work diagnose\n(CLI)"
    participant Agent as "DiagnosticAgent"
    participant DB as "Workspace DB"
    participant LLM as "LLM Provider\n(optional)"
    participant Display as "Report Display"

    User->>CLI: diagnose task_id [--force]
    CLI->>DB: resolve task (partial ID) & find failed run
    CLI->>Agent: analyze(task_id, run_id)
    Agent->>DB: get_run_logs(run_id)
    DB-->>Agent: list[RunLogEntry]
    Agent->>Agent: detect_failure_patterns(logs)
    Agent->>Agent: assess_severity()
    Agent->>Agent: generate_recommendations()
    alt with LLM
        Agent->>LLM: _analyze_with_llm(logs, error_msgs)
        LLM-->>Agent: root_cause text
    else without LLM
        Agent->>Agent: _generate_root_cause()
    end
    Agent->>Agent: summarize_logs()
    Agent->>DB: save_diagnostic_report()
    DB-->>Agent: DiagnosticReport
    Agent-->>CLI: DiagnosticReport
    CLI->>Display: _display_diagnostic_report()
    Display-->>User: formatted report (severity, category, recommendations)
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐇 I nibble logs with careful paws,
I hop through traces, find the cause.
A root revealed, a gentle clue—
I tuck a fix and bounce to you.
Hooray — diagnosis, fresh and new!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'feat(diagnostics): add self-diagnosis system for failed task analysis' directly and clearly summarizes the main change—introducing a self-diagnosis system for analyzing failed tasks—which is the primary purpose of this significant changeset.
Docstring Coverage ✅ Passed Docstring coverage is 99.21% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/self-diagnosis-system

Comment @coderabbitai help to get the list of available commands and usage tips.

@macroscopeapp
Copy link
Contributor

macroscopeapp bot commented Jan 22, 2026

Add self-diagnosis for failed task analysis by introducing DiagnosticAgent.analyze and CLI commands work diagnose, work retry, and work update-description

Add a diagnostics pipeline that writes structured run logs, infers a primary failure category, generates recommendations, and persists DiagnosticReports; wire it into runtime and the CLI. Introduce new tables and migrations. Note: several SQL strings and _extract_package_name contain stray + tokens that cause syntax errors and break imports or DB operations.

📍Where to Start

Start with DiagnosticAgent.analyze in diagnostic_agent.py, then review structured logging in RunLogger.flush and querying in get_run_logs in diagnostics.py, followed by CLI entrypoints in work_diagnose in codeframe/cli/app.py.


Macroscope summarized b166ab0.

@github-actions
Copy link
Contributor

Review: Self-Diagnosis System Implementation

Summary

This PR adds a comprehensive self-diagnosis system for failed task analysis with pattern-based and LLM-powered failure detection. The implementation includes structured logging (), diagnostic analysis (), and CLI commands (, , ). Test coverage is excellent with 71 new tests.

✅ Strengths

  1. Well-structured architecture: Clear separation between data models (), analysis logic (), and CLI ()
  2. Follows existing patterns: Uses the same headless, CLI-first approach as other v2 modules
  3. Comprehensive database schema: Properly indexed tables for and with appropriate foreign keys and CHECK constraints
  4. Great test coverage: 71 tests covering unit, integration, and CLI flows
  5. Rich user experience: Well-formatted CLI output with color-coded severity and actionable recommendations

⚠️ Issues Found

1. Typos in DiagnosticAgent (CRITICAL)

File:

Line 601: Typo in prompt template

Failure Category: [One of: task_description, blocker_unresolved, model_limitation, ...]
#                                                                                   ^^^^^^^^^^^^^^ missing 'i' (should be 'model_limitation')

Line 626: Mismatched category key

category_map = {
    ...
    "blocker_unresolved": FailureCategory.BLOCKER_UNRESOLVED,  # ← double 'e' typo
    ...
}
# Should be "blocker_unresolved" (single 'e') to match enum value BLOCKER_UNRESOLVED

Impact: Pattern-based detection for blocker and model limitation issues will not work correctly.


2. Fragile LLM Category Extraction (MEDIUM)

File:

The function uses simple substring matching:

if key in response_lower:
    return category

Problem: This could match the category string in unexpected contexts. For example:

  • If LLM says "The issue is NOT a dependency_issue", it would still match
  • Matches could occur within longer words or sentences

Recommendation: Use word boundary matching or more robust parsing:

import re
pattern = r'\b' + re.escape(key) + r'\b'
if re.search(pattern, response_lower):
    return category

3. Defensive Type Checking in Runtime (MINOR)

File:

error_info["last_step_status"] = last_step.status.value if hasattr(last_step.status, 'value') else str(last_step.status)

The check is overly defensive. Since is an enum (defined in ), it will always have a attribute.

Recommendation: Simplify to:

error_info["last_step_status"] = last_step.status.value

🚀 Performance Considerations

  1. Database write pattern: defaults to , which writes to SQLite after each log entry. This could impact performance during heavy logging. The existing method is a good escape hatch.

  2. Indexes are well-chosen: The new indexes on and will optimize common queries.


🔒 Security

✅ No security concerns identified:

  • All SQL queries use proper parameterization
  • No user input is directly interpolated into queries
  • Metadata is JSON-serialized before storage

🧪 Test Coverage

✅ Excellent:

  • 29 unit tests for RunLogger and diagnostics models
  • 19 unit tests for DiagnosticAgent pattern detection
  • 11 CLI command tests
  • 12 integration tests for full workflow

Suggestion: Consider adding tests for the typos identified above to prevent regression.


📋 Recommendations

  1. **Fix typos in ** (lines 601, 626) before merging
  2. Add word boundary checks to for robuster pattern matching
  3. Simplify defensive type checking in
  4. Consider adding tests that verify the category extraction handles edge cases correctly

Overall, this is a well-implemented feature that significantly improves the developer experience when tasks fail. The issues are minor and easily addressable.


Overall Assessment: 🟡 Needs Minor Fixes

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@codeframe/core/diagnostic_agent.py`:
- Around line 568-610: The _analyze_with_llm method currently calls
llm_provider.complete with a raw string prompt which mismatches the provider
API; change the call to pass messages=[{"role":"user","content": prompt}] and
include the purpose argument (Purpose.EXECUTION) so it matches other usages of
llm_provider.complete and returns the LLM response correctly; update the try
block replacing the single-string call to llm_provider.complete with the
messages+purpose signature and keep the existing response.content handling and
exception fallback.
- Around line 511-548: The code computes severity (assess_severity) and
recommendations (generate_recommendations) before allowing the LLM override of
failure_category from _analyze_with_llm/_extract_category_from_llm; move or
repeat the calls so that after you call self._analyze_with_llm(logs,
error_messages) and possibly update failure_category from
_extract_category_from_llm, you recompute severity = assess_severity(...) and
recommendations = generate_recommendations(...); keep
detect_primary_failure_category, error_messages, summarize_logs as-is and only
reorder/re-call the two functions to ensure they use the final failure_category.
- Around line 224-240: Replace the incorrect CLI strings used in the
TASK_DESCRIPTION remediation recommendations: in the block that checks
FailureCategory.TASK_DESCRIPTION (where DiagnosticRecommendation and
RemediationAction.UPDATE_TASK_DESCRIPTION are used and task_id is available),
change the review command to "codeframe tasks get {task_id}" and change the
update command to "codeframe work update-description {task_id} \"<new
description>\"" (keep the same actions, reasons, and parameters dictionary).
This updates the first recommendation's command to use codeframe tasks get and
the second to use codeframe work update-description so the suggested CLI steps
match the documented commands.

In `@codeframe/core/tasks.py`:
- Around line 254-329: The update function in tasks.py currently bypasses the
repository and issues raw sqlite3 queries via get_db_connection; replace that
logic by calling TaskRepository.update_task(workspace, task_id, title=...,
description=..., priority=...) (or the repository's equivalent method) so all
validation, safety checks, and persistence are handled by the domain layer;
remove the manual cursor/execute/commit/close and manual updated_at assignment,
and ensure the repository's return value (updated Task) is returned from
update(); pass only the fields that are not None.

In `@tests/cli/test_work_diagnose.py`:
- Around line 99-125: The tests test_diagnose_shows_recommendations and
test_diagnose_with_verbose are using the unreliable check "result.exit_code == 0
or result.exit_code is None"; replace this with deterministic assertions: assert
result.exit_code == 0 and assert result.exception is None after the
runner.invoke calls (or, if you expect a handled failure, assert the specific
exit code and inspect result.exception for the expected exception type/message).
Update the assertions in the two test functions that call runner.invoke
(referencing runner.invoke, result.exit_code, and result.exception) to remove
"is None" and use explicit, deterministic checks for success or for a specific
expected failure.
- Around line 70-77: The current weak assertion allows failures to slip by;
update the test in tests/cli/test_work_diagnose.py that checks result from
runner.invoke(app, ["work", "diagnose", task.id[:8], "--workspace",
str(workspace.repo_path)]) to assert explicitly that the command succeeded
(e.g., assert result.exit_code == 0) and additionally verify there is no error
output (e.g., assert "Error" not in result.stdout and/or assert result.exception
is None) so a non-zero exit code cannot pass the test.
🧹 Nitpick comments (11)
codeframe/core/diagnostic_agent.py (1)

351-371: Broaden package-name extraction to handle hyphens/dots.

Current regex only matches \w, so it misses names like pydantic-core or google.auth.

♻️ Proposed tweak
-        match = re.search(r"no module named ['\"]?(\w+)", msg, re.IGNORECASE)
+        match = re.search(r"no module named ['\"]?([\w.-]+)", msg, re.IGNORECASE)
         if match:
             return match.group(1)

-        match = re.search(r"cannot import.*from ['\"]?(\w+)", msg, re.IGNORECASE)
+        match = re.search(r"cannot import.*from ['\"]?([\w.-]+)", msg, re.IGNORECASE)
         if match:
             return match.group(1)
codeframe/core/runtime.py (1)

756-773: Include actionable error text in the ERROR log message.

DiagnosticAgent extracts error_messages from log.message, but the current message is generic and the details live only in metadata. Consider embedding a short error excerpt in the message to improve downstream recommendations.

🔧 Example adjustment
-        run_logger.error(LogCategory.ERROR, "Agent execution failed", error_info)
+        message = "Agent execution failed"
+        if error_info.get("last_step_error"):
+            message = f"{message}: {error_info['last_step_error'][:200]}"
+        run_logger.error(LogCategory.ERROR, message, error_info)
tests/core/test_diagnostic_agent.py (4)

1-24: Missing pytestmark for v2 tests.

Other test files in this PR (e.g., test_work_diagnose.py, test_self_diagnosis_workflow.py) include pytestmark = pytest.mark.v2 to mark all tests as v2. This file is missing that marker, which may cause inconsistency in test filtering.

♻️ Add v2 marker for consistency
 from codeframe.core.workspace import create_or_load_workspace
 
 
+# Mark all tests as v2
+pytestmark = pytest.mark.v2
+
+
 `@pytest.fixture`
 def workspace(tmp_path: Path):

155-168: Unused import in test method.

DiagnosticAgent is imported on line 157 but not used in this test. Only detect_failure_patterns is used.

♻️ Remove unused import
     def test_detect_ambiguous_task_pattern(self, workspace, run_id, task_id):
         """Test detection of ambiguous task description pattern."""
-        from codeframe.core.diagnostic_agent import DiagnosticAgent, detect_failure_patterns
+        from codeframe.core.diagnostic_agent import detect_failure_patterns
 
         logger = RunLogger(workspace, run_id, task_id)

218-273: Unused fixture parameters in recommendation tests.

The workspace fixture is passed to test_generate_update_description_recommendation, test_generate_retry_recommendation, and test_generate_dependency_fix_recommendation but is not used. The generate_recommendations function doesn't require a workspace.

♻️ Remove unused workspace fixture from recommendation tests
-    def test_generate_update_description_recommendation(self, workspace, run_id, task_id):
+    def test_generate_update_description_recommendation(self, run_id, task_id):
         """Test generating UPDATE_TASK_DESCRIPTION recommendation."""
         from codeframe.core.diagnostic_agent import generate_recommendations

Apply similar changes to test_generate_retry_recommendation and test_generate_dependency_fix_recommendation.


275-324: Unused fixture parameters in severity tests.

The workspace, run_id, and task_id fixtures are passed to all severity assessment tests but are not used. The assess_severity function only takes failure_category, error_count, and has_blocker parameters.

♻️ Remove unused fixtures from severity tests
-    def test_critical_severity_for_repeated_failures(self, workspace, run_id, task_id):
+    def test_critical_severity_for_repeated_failures(self):
         """Test that repeated failures get CRITICAL severity."""
         from codeframe.core.diagnostic_agent import assess_severity

Apply similar changes to other severity test methods.

tests/cli/test_work_diagnose.py (2)

6-10: Unused imports.

patch from unittest.mock (line 10) is imported but never used in this file.

♻️ Remove unused import
 import pytest
 import uuid
 from datetime import datetime, timezone
 from pathlib import Path
-from unittest.mock import MagicMock, patch
+from unittest.mock import MagicMock

228-243: Test relies on probabilistic UUID matching.

The test uses "a" as a task ID prefix (line 238), hoping it might match multiple UUIDs. However, UUIDs are randomly generated, so this test's behavior is non-deterministic. The comment acknowledges this but the test may not reliably exercise the "multiple matches" path.

Consider creating tasks with known prefixes or mocking the task lookup to ensure consistent test behavior.

tests/core/test_diagnostics.py (1)

1-31: Missing pytestmark and unused import.

  1. Similar to test_diagnostic_agent.py, this file is missing pytestmark = pytest.mark.v2 for consistency with other test files in this PR.
  2. json (line 6) is imported but not used anywhere in the file.
♻️ Add v2 marker and remove unused import
 """Tests for diagnostics module.
 
 Tests the RunLogger, diagnostic reports, and related functions.
 """
 
-import json
 import pytest
 import uuid
 from datetime import datetime, timezone
 from pathlib import Path
 
 from codeframe.core.diagnostics import (
     ...
 )
 from codeframe.core.workspace import create_or_load_workspace, get_db_connection
 
 
+# Mark all tests as v2
+pytestmark = pytest.mark.v2
+
+
 `@pytest.fixture`
tests/integration/test_self_diagnosis_workflow.py (2)

6-27: Unused imports.

Several imports are not used in this file:

  • datetime and timezone (line 8) - not referenced anywhere
  • get_run_logs (line 22) - not called in any test
♻️ Remove unused imports
 import pytest
 import uuid
-from datetime import datetime, timezone
 from pathlib import Path
 
 from typer.testing import CliRunner
 
 from codeframe.cli.app import app
 from codeframe.core.diagnostics import (
     DiagnosticReport,
     FailureCategory,
     LogCategory,
     RemediationAction,
     RunLogger,
     Severity,
     get_latest_diagnostic_report,
-    get_run_logs,
 )

100-158: CLI tests use loose assertions.

Lines 115 and 135 use exit_code == 0 or exit_code is None, which is the same pattern flagged in test_work_diagnose.py. These assertions may mask unexpected failures.

The workaround in test_retry_command_resets_failed_task (lines 152-157) using tasks.update_status instead of the actual retry command is pragmatic given the API key dependency, but consider adding a TODO or comment about eventually testing the full retry flow in an environment with credentials.

Comment on lines 224 to 240
if failure_category == FailureCategory.TASK_DESCRIPTION:
recommendations.append(
DiagnosticRecommendation(
action=RemediationAction.UPDATE_TASK_DESCRIPTION,
reason="Task description may be ambiguous or incomplete. Adding clearer acceptance criteria will help the agent understand requirements.",
command=f"cf tasks show {task_id} # Review current description, then update with clearer criteria",
parameters={"task_id": task_id},
)
)
recommendations.append(
DiagnosticRecommendation(
action=RemediationAction.RETRY_WITH_CONTEXT,
reason="After updating the description, retry the task",
command=f"cf work start {task_id} --execute",
parameters={"task_id": task_id},
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find CLI-related files
find . -type f -name "*cli*" -o -name "*command*" -o -name "*app.py" | head -20

Repository: frankbria/codeframe

Length of output: 182


🏁 Script executed:

# Search for tasks show command definition
rg -n "tasks.*show|\"show\"|command.*show" --type py | grep -i "task\|cli" | head -30

Repository: frankbria/codeframe

Length of output: 1540


🏁 Script executed:

# Look for work update-description command
rg -n "update.description|update_description" --type py | head -20

Repository: frankbria/codeframe

Length of output: 782


🏁 Script executed:

# Search for CLI app structure and command definitions
fd -e py . | xargs grep -l "click\|typer\|argparse\|@app\|@command" | head -10

Repository: frankbria/codeframe

Length of output: 373


🏁 Script executed:

cat -n codeframe/cli/tasks_commands.py

Repository: frankbria/codeframe

Length of output: 10733


🏁 Script executed:

# Search for tasks_app and its command definitions
rg -n "tasks_app\s*=" --type py

Repository: frankbria/codeframe

Length of output: 173


🏁 Script executed:

# Look for all command definitions in app.py for tasks
rg -n "@tasks_app\.command|def.*task" codeframe/cli/app.py | head -20

Repository: frankbria/codeframe

Length of output: 1070


🏁 Script executed:

sed -n '220,245p' codeframe/core/diagnostic_agent.py

Repository: frankbria/codeframe

Length of output: 1245


🏁 Script executed:

# Check what work update-description expects as parameters
sed -n '2124,2160p' codeframe/cli/app.py

Repository: frankbria/codeframe

Length of output: 1361


Use codeframe tasks get to review, then codeframe work update-description to update.

The cf tasks show command doesn't exist. The available commands are codeframe tasks get {task_id} to review the current description and codeframe work update-description {task_id} "<new description>" to update it. The second command is documented as the proper way to clarify requirements after a failed run.

🔧 Suggested adjustment
-                command=f"cf tasks show {task_id}  # Review current description, then update with clearer criteria",
+                command=f"codeframe tasks get {task_id}  # Review current description",
+            )
+        )
+        recommendations.append(
+            DiagnosticRecommendation(
+                action=RemediationAction.UPDATE_TASK_DESCRIPTION,
+                reason="Update the task description with clearer acceptance criteria",
+                command=f"codeframe work update-description {task_id} \"<add acceptance criteria>\"",
🤖 Prompt for AI Agents
In `@codeframe/core/diagnostic_agent.py` around lines 224 - 240, Replace the
incorrect CLI strings used in the TASK_DESCRIPTION remediation recommendations:
in the block that checks FailureCategory.TASK_DESCRIPTION (where
DiagnosticRecommendation and RemediationAction.UPDATE_TASK_DESCRIPTION are used
and task_id is available), change the review command to "codeframe tasks get
{task_id}" and change the update command to "codeframe work update-description
{task_id} \"<new description>\"" (keep the same actions, reasons, and parameters
dictionary). This updates the first recommendation's command to use codeframe
tasks get and the second to use codeframe work update-description so the
suggested CLI steps match the documented commands.

Comment on lines 568 to 610
def _analyze_with_llm(
self,
logs: list[RunLogEntry],
error_messages: list[str],
) -> str:
"""Use LLM to analyze logs and determine root cause.
Args:
logs: Log entries to analyze
error_messages: Extracted error messages
Returns:
Root cause description from LLM
"""
if not self.llm_provider:
return "LLM analysis not available"

# Build prompt
log_text = "\n".join(
f"[{log.log_level.value}] {log.category.value}: {log.message}"
for log in logs[-30:] # Last 30 logs
)

prompt = f"""Analyze the following agent execution logs and determine the root cause of the failure.
Logs:
{log_text}
Error messages:
{chr(10).join(error_messages[:10])}
Provide your analysis in this format:
Root Cause: [One sentence description of the root cause]
Failure Category: [One of: task_description, blocker_unresolved, model_limitation, code_quality, dependency_issue, environment_issue, technical_error, unknown]
Severity: [One of: critical, high, medium, low]
Then provide recommendations."""

try:
response = self.llm_provider.complete(prompt)
return response.content if hasattr(response, 'content') else str(response)
except Exception as e:
return f"LLM analysis failed: {e}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find LLMProvider class definition and complete method
find . -type f -name "*.py" | xargs grep -l "class.*LLMProvider" | head -20

Repository: frankbria/codeframe

Length of output: 166


🏁 Script executed:

#!/bin/bash
# Search for complete method definitions in provider-related files
rg "def complete" -A 5 -g "*.py"

Repository: frankbria/codeframe

Length of output: 2519


🏁 Script executed:

#!/bin/bash
# Find all usages of .complete( to see the calling patterns
rg "\.complete\(" -B 2 -A 2 -g "*.py"

Repository: frankbria/codeframe

Length of output: 7603


Fix LLMProvider.complete call to use correct API signature.

The complete() method expects messages: list[dict] as its first parameter, but this code passes a raw string prompt. All other call sites in the codebase use the correct pattern: provider.complete(messages=[{"role": "user", "content": prompt}], purpose=Purpose.XXX).

Change to:

response = self.llm_provider.complete(
    messages=[{"role": "user", "content": prompt}],
    purpose=Purpose.EXECUTION,
)
🤖 Prompt for AI Agents
In `@codeframe/core/diagnostic_agent.py` around lines 568 - 610, The
_analyze_with_llm method currently calls llm_provider.complete with a raw string
prompt which mismatches the provider API; change the call to pass
messages=[{"role":"user","content": prompt}] and include the purpose argument
(Purpose.EXECUTION) so it matches other usages of llm_provider.complete and
returns the LLM response correctly; update the try block replacing the
single-string call to llm_provider.complete with the messages+purpose signature
and keep the existing response.content handling and exception fallback.

Comment on lines 254 to 329
def update(
workspace: Workspace,
task_id: str,
title: Optional[str] = None,
description: Optional[str] = None,
priority: Optional[int] = None,
) -> Task:
"""Update a task's title, description, or priority.
Only provided fields are updated; others are left unchanged.
Args:
workspace: Target workspace
task_id: Task to update
title: New title (optional)
description: New description (optional)
priority: New priority (optional)
Returns:
Updated Task
Raises:
ValueError: If task not found
"""
task = get(workspace, task_id)
if not task:
raise ValueError(f"Task not found: {task_id}")

# Build update query dynamically
updates = []
params = []

if title is not None:
updates.append("title = ?")
params.append(title)
task.title = title

if description is not None:
updates.append("description = ?")
params.append(description)
task.description = description

if priority is not None:
updates.append("priority = ?")
params.append(priority)
task.priority = priority

if not updates:
# Nothing to update
return task

now = _utc_now().isoformat()
updates.append("updated_at = ?")
params.append(now)

# Add WHERE clause params
params.extend([workspace.id, task_id])

conn = get_db_connection(workspace)
cursor = conn.cursor()

cursor.execute(
f"""
UPDATE tasks
SET {', '.join(updates)}
WHERE workspace_id = ? AND id = ?
""",
params,
)
conn.commit()
conn.close()

task.updated_at = datetime.fromisoformat(now)

return task

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "repository" codeframe/ | head -20

Repository: frankbria/codeframe

Length of output: 1067


🏁 Script executed:

ls -la codeframe/persistence/repositories/ 2>/dev/null || echo "Directory not found, searching..."
fd -t f "\.py" codeframe/persistence/ | head -20

Repository: frankbria/codeframe

Length of output: 2602


🏁 Script executed:

# Check the structure of the codeframe directory
find codeframe -type d -maxdepth 3 | head -30

Repository: frankbria/codeframe

Length of output: 680


🏁 Script executed:

# Look for task-related repository files
fd -t f "task.*\.py" codeframe/ --case-sensitive

Repository: frankbria/codeframe

Length of output: 273


🏁 Script executed:

# Check if get_db_connection is sync or async
rg "def get_db_connection" codeframe/

Repository: frankbria/codeframe

Length of output: 157


🏁 Script executed:

# Check the actual update() function implementation in tasks.py
cat -n codeframe/core/tasks.py | sed -n '254,329p'

Repository: frankbria/codeframe

Length of output: 2344


🏁 Script executed:

head -100 codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 2494


🏁 Script executed:

rg "async def|aiosqlite|def update" codeframe/persistence/repositories/task_repository.py | head -30

Repository: frankbria/codeframe

Length of output: 569


🏁 Script executed:

head -50 codeframe/persistence/repositories/base.py

Repository: frankbria/codeframe

Length of output: 1942


🏁 Script executed:

# Check if there's an update method in task_repository.py
rg "def update" codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 212


🏁 Script executed:

# Check imports to see if using aiosqlite or sqlite3
head -30 codeframe/persistence/repositories/base.py

Repository: frankbria/codeframe

Length of output: 918


🏁 Script executed:

# Check the database module to understand connection handling
cat -n codeframe/persistence/database.py | head -50

Repository: frankbria/codeframe

Length of output: 1634


🏁 Script executed:

# Get the full update_task method from task_repository.py
rg -A 20 "def update_task\(" codeframe/persistence/repositories/task_repository.py | head -40

Repository: frankbria/codeframe

Length of output: 736


🏁 Script executed:

# Check how the repository is being used in codeframe/core/tasks.py
head -50 codeframe/core/tasks.py

Repository: frankbria/codeframe

Length of output: 1487


🏁 Script executed:

# Check what imports are in codeframe/core/tasks.py
rg "^import|^from" codeframe/core/tasks.py | head -20

Repository: frankbria/codeframe

Length of output: 385


🏁 Script executed:

# Check if get() function uses repository
rg -B 5 -A 15 "^def get\(" codeframe/core/tasks.py

Repository: frankbria/codeframe

Length of output: 610


🏁 Script executed:

# Check the full update_task method in task_repository
sed -n '100,200p' codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 3249


🏁 Script executed:

# Check if TaskRepository is imported or used in codeframe/core/tasks.py
rg "TaskRepository|from.*repositories" codeframe/core/tasks.py

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Check the full get_db_connection function
rg -A 10 "def get_db_connection" codeframe/core/workspace.py

Repository: frankbria/codeframe

Length of output: 333


🏁 Script executed:

# Check if there are other functions in tasks.py that also use direct DB access
rg "get_db_connection\|conn.cursor\|conn.commit" codeframe/core/tasks.py | head -20

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Check if there's a pattern of using repositories in other core modules
rg "Repository\(" codeframe/core/*.py

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Check if update_task has async version
rg "async def update_task" codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Get more context on the repository methods - check if any are async
rg "async def" codeframe/persistence/repositories/task_repository.py | head -10

Repository: frankbria/codeframe

Length of output: 171


🏁 Script executed:

# Check how other modules in codeframe/core use repositories
find codeframe/core -name "*.py" -type f -exec grep -l "Repository" {} \;

Repository: frankbria/codeframe

Length of output: 200


🏁 Script executed:

# Look at other functions in tasks.py to see the overall pattern
wc -l codeframe/core/tasks.py

Repository: frankbria/codeframe

Length of output: 90


🏁 Script executed:

# Check how other similar operations are structured in the codebase
rg "def create\(" codeframe/core/tasks.py -A 30 | head -50

Repository: frankbria/codeframe

Length of output: 888


🏁 Script executed:

# Check if there's async version of create/update in tasks.py
rg "async def" codeframe/core/tasks.py

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Check how artifacts.py uses repositories since it showed up in the search
head -100 codeframe/core/artifacts.py | grep -A 5 -B 5 "Repository\|get_db_connection"

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Check if there's a pattern in how repositories are used
rg "from codeframe.persistence.repositories" codeframe/core/*.py

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Look at the workspace.py to see how it uses repositories
grep -n "Repository\|database\|Database" codeframe/core/workspace.py | head -20

Repository: frankbria/codeframe

Length of output: 715


🏁 Script executed:

# Check the full TaskRepository class to see all methods
wc -l codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 120


🏁 Script executed:

# Search for all method definitions in task_repository
rg "^\s*def |^\s*async def" codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 1612


🏁 Script executed:

# Check if there's an async version of update_task
cat codeframe/persistence/repositories/task_repository.py | grep -n "update_task\|async def"

Repository: frankbria/codeframe

Length of output: 336


🏁 Script executed:

# Look at the exact section around update_task to understand its signature
sed -n '90,150p' codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 1922


🏁 Script executed:

# Check if tasks.py functions are async or sync
rg "^async def|^def " codeframe/core/tasks.py | head -20

Repository: frankbria/codeframe

Length of output: 723


🏁 Script executed:

# Verify the signature and calling pattern for update_task in task_repository
sed -n '97,145p' codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 1655


🏁 Script executed:

# Check how the Database class instantiates repositories
grep -n "TaskRepository\|self.tasks" codeframe/persistence/database.py | head -10

Repository: frankbria/codeframe

Length of output: 692


Route updates through the task repository layer.

This function performs direct sqlite3 updates via get_db_connection, but the guideline requires using domain-specific repositories for data access. The TaskRepository.update_task() method already exists and handles updates with proper validation and safety checks. Call the repository method here to maintain consistency with the established data access pattern.

🤖 Prompt for AI Agents
In `@codeframe/core/tasks.py` around lines 254 - 329, The update function in
tasks.py currently bypasses the repository and issues raw sqlite3 queries via
get_db_connection; replace that logic by calling
TaskRepository.update_task(workspace, task_id, title=..., description=...,
priority=...) (or the repository's equivalent method) so all validation, safety
checks, and persistence are handled by the domain layer; remove the manual
cursor/execute/commit/close and manual updated_at assignment, and ensure the
repository's return value (updated Task) is returned from update(); pass only
the fields that are not None.

@frankbria frankbria linked an issue Jan 22, 2026 that may be closed by this pull request
@github-actions
Copy link
Contributor

Review Summary

This PR implements a comprehensive self-diagnosis system for failed task execution with pattern-based detection, LLM analysis, and actionable recommendations. The architecture is well-designed and follows the headless, CLI-first v2 approach. Test coverage is excellent with 71 new tests.

However, there are critical issues that must be addressed before merging:

🔴 Critical Issues

1. Severity/Recommendations Stale After LLM Override (diagnostic_agent.py:511-548)

Issue: Severity and recommendations are computed on lines 523-536, but the LLM can override failure_category on lines 545-547. This means the report may contain mismatched severity/recommendations relative to the final category.

Fix: Move or recompute severity and recommendations after line 547 (after potential LLM category update):

# After llm_category is applied (line 547):
failure_category = llm_category

# Now recompute based on final category
severity = assess_severity(
    failure_category=failure_category,
    error_count=len(errors),
    has_blocker=has_blocker,
)
recommendations = generate_recommendations(
    task_id=task_id,
    run_id=run_id,
    failure_category=failure_category,
    error_messages=error_messages,
)

2. LLMProvider.complete Call Uses Wrong API (diagnostic_agent.py:607)

Issue: Line 607 passes a raw string prompt directly, but LLMProvider.complete expects messages: list[dict] and purpose parameters (see line 554 in tasks.py).

Fix:

response = self.llm_provider.complete(
    messages=[{"role": "user", "content": prompt}],
    purpose=Purpose.EXECUTION,
)

You'll also need to import Purpose from codeframe.adapters.llm.

🟡 Medium Priority

3. Database Connection Leak in tasks.update() (tasks.py:312-324)

Issue: The update() function (lines 312-324) doesn't use try/finally for connection cleanup, unlike the create() function (lines 81-93). If cursor.execute() or conn.commit() raises an exception, the connection remains open.

Fix:

conn = get_db_connection(workspace)
try:
    cursor = conn.cursor()
    # ... existing code ...
    conn.commit()
    conn.close()
except Exception as e:
    conn.close()
    raise  # Re-raise after cleanup

Note: Consider also routing this through TaskRepository if that exists, for consistency with the v2 architecture.

🟢 Minor / Nitpicks

4. Package Name Extraction Limited (diagnostic_agent.py:362, 367)

Suggestion: The regex r"(\w+)" doesn't match packages with hyphens (e.g., pydantic-core, google.auth). Consider expanding:

match = re.search(r"no module named ['\"]?([\w.-]+)", msg, re.IGNORECASE)

5. CLI Recommendation Uses Wrong Commands (diagnostic_agent.py:229, 237)

Issue: Line 229 uses cf tasks show but the actual command is codeframe tasks get. Line 237 uses cf work start but retry uses codeframe work retry.

Fix:

  • Line 229: "codeframe tasks get {task_id} # Review current description"
  • Add a second recommendation for updating: "codeframe work update-description {task_id} \"<add acceptance criteria>\""

6. Overly Defensive Type Check (runtime.py:769)

Issue: Line 769 checks hasattr(last_step.status, 'value'), but status is an enum (from AgentStepStatus) and always has the value attribute.

Fix: Simplify to:

error_info["last_step_status"] = last_step.status.value

✅ Strengths

  1. Well-architected: Clear separation between data models (diagnostics.py), analysis logic (diagnostic_agent.py), and CLI (app.py)
  2. Follows v2 patterns: Headless, CLI-first design consistent with Golden Path
  3. Database schema: Well-indexed tables with proper constraints
  4. Great test coverage: 71 tests covering unit, integration, and CLI flows
  5. Rich UX: Well-formatted CLI output with color-coded severity

📊 Test Coverage Notes

  • Some test files are missing pytestmark = pytest.mark.v2 marker (test_diagnostic_agent.py, test_diagnostics.py)
  • A few unused imports across test files
  • Some tests use probabilistic UUID matching that may be flaky (test_work_diagnose.py:238)

🔒 Security

✅ No security concerns:

  • SQL queries use proper parameterization
  • No user input interpolation in queries
  • Metadata is JSON-serialized before storage

Overall Assessment

Status: 🟡 Needs Minor Fixes (Critical)

The implementation is solid and the feature adds significant value. However, issues #1 (stale severity/recommendations) and #2 (wrong LLM API call) are critical bugs that will prevent the feature from working correctly in production. Issue #3 (connection leak) is important for resource management.

Please address issues 1-3 before merging. Issues 4-6 can be handled in follow-up PRs or at your discretion.

@github-actions
Copy link
Contributor

I've posted a comprehensive review of PR #300 focusing on:

Critical Issues:

  1. Stale severity/recommendations - computed before LLM can override failure_category
  2. Wrong LLM API signature - passing raw string instead of messages dict

Medium Priority:
3. Database connection leak in tasks.update() - missing try/finally

Minor Issues:

  • Limited package name regex
  • Incorrect CLI command references
  • Overly defensive type checking

The review also highlights the strengths (well-architected, follows v2 patterns, excellent test coverage) and confirms no security concerns. Overall assessment: needs the critical fixes before merging.

Review posted at: #300 (comment)

New%20session%20-%202026-01-22T21%3A28%3A37.919Z
opencode session  |  github run

1. Fix severity/recommendations stale after LLM override
   - Move assess_severity() and generate_recommendations() calls
     to after the LLM category update (lines 536-549)
   - Ensures report has consistent category/severity/recommendations

2. Fix LLMProvider.complete() call to use correct API
   - Changed from complete(prompt) to complete(messages=[...], purpose=...)
   - Added Purpose import from codeframe.adapters.llm

3. Fix database connection leak in tasks.update()
   - Added try/finally block to ensure conn.close() always runs
   - Consistent with create() function pattern
@github-actions
Copy link
Contributor

Code Review Summary

This PR implements a comprehensive self-diagnosis system for failed task analysis. The architecture is well-designed and follows the v2 headless, CLI-first approach. Overall Assessment: 🟢 Ready with minor improvements recommended

✅ Previously Reported Issues (All Fixed)

The critical issues mentioned in earlier reviews have been addressed in the current code:

  1. Stale severity/recommendations - FIXED: Severity and recommendations are now correctly computed AFTER the potential LLM category update (diagnostic_agent.py:537-550)
  2. Wrong LLM API signature - FIXED: The LLM call now uses the correct signature with dict and (diagnostic_agent.py:608-611)
  3. Database connection leak - FIXED: The function properly uses try/finally for connection cleanup (tasks.py:312-325)

🔴 New Issue: Schema Migration Risk

File:

The schema upgrade logic creates and tables with foreign keys to , but does not ensure the table exists first. If a workspace is initialized from an older state where the table doesn't exist, this will fail.

Recommendation: Add a check to ensure the table exists before creating dependent tables:

🟡 Minor Improvements

1. Package Name Extraction (diagnostic_agent.py:363, 368)

The regex doesn't match packages with hyphens (e.g., , ).

Suggestion: Expand to to handle hyphens and dots.

2. Inconsistent CLI Commands (diagnostic_agent.py:230, 238, 278, 296, 306)

Several recommendations use incorrect CLI commands:

  • Line 230: Uses but the actual command is
  • Lines 238, 278, 296: Use but retry command is

Suggestion: Update to match actual CLI commands in .

3. Missing Connection Cleanup (tasks.py:234-246)

The function doesn't use try/finally for connection cleanup, unlike and . If fails, the connection leaks.

Suggestion: Wrap in try/finally like other functions in the module.

✅ Strengths

  1. Well-architected: Clear separation between data models, analysis logic, and CLI
  2. Follows v2 patterns: Headless, CLI-first design consistent with Golden Path
  3. Comprehensive test coverage: 71 tests covering unit, integration, and CLI flows
  4. Proper database design: Indexed tables with CHECK constraints and foreign keys
  5. Rich UX: Well-formatted CLI output with color-coded severity and actionable recommendations

🔒 Security

✅ No security concerns:

  • SQL queries use proper parameterization (values)
  • However, the function uses f-string for column names (tasks.py:316) - while currently safe as only controlled values (title, description, priority) are used, consider validating column names against a whitelist.

📊 Final Recommendation

Status: 🟢 Ready with recommended fix

The schema migration risk should be addressed before merging to prevent potential failures on older workspaces. The minor improvements (package regex, CLI commands, connection cleanup) can be handled in a follow-up PR or at your discretion.

Great work on this feature! The self-diagnosis system significantly improves the developer experience when tasks fail.

@github-actions
Copy link
Contributor

Code Review Summary

This PR implements a comprehensive self-diagnosis system for failed task analysis. The architecture is well-designed and follows v2 headless, CLI-first approach. Overall Assessment: 🟢 Ready with minor improvements recommended

✅ Previously Reported Issues (All Fixed)

The critical issues mentioned in earlier reviews have been addressed in current code:

  1. Stale severity/recommendations - FIXED: Severity and recommendations are now correctly computed AFTER potential LLM category update (diagnostic_agent.py:537-550)
  2. Wrong LLM API signature - FIXED: The LLM call now uses correct signature with messages dict and purpose=Purpose.GENERATION (diagnostic_agent.py:608-611)
  3. Database connection leak - FIXED: The update() function properly uses try/finally for connection cleanup (tasks.py:312-325)

🔴 New Issue: Schema Migration Risk

File: workspace.py:333-380

The schema upgrade logic creates run_logs and diagnostic_reports tables with foreign keys to runs(id), but does not ensure the runs table exists first. If a workspace is initialized from an older state where the runs table doesn't exist, this will fail.

Recommendation: Add a check to ensure the runs table exists before creating dependent tables. Check for the runs table and create it with proper schema before creating run_logs and diagnostic_reports.

🟡 Minor Improvements

1. Package Name Extraction

The regex patterns at diagnostic_agent.py:363 and 368 don't match packages with hyphens (e.g., pydantic-core, google.auth).

Suggestion: Expand the regex to handle hyphens and dots in package names.

2. Inconsistent CLI Commands

Several recommendations use incorrect CLI commands:

  • Line 230: Uses cf tasks show but the actual command is codeframe tasks get
  • Lines 238, 278, 296: Use cf work start but the retry command is codeframe work retry

Suggestion: Update the commands to match the actual CLI commands defined in app.py.

3. Missing Connection Cleanup

The update_status() function in tasks.py:234-246 doesn't use try/finally for connection cleanup, unlike create() and update(). If cursor.execute() fails, the connection leaks.

Suggestion: Wrap the connection in a try/finally block like the other functions in the module.

✅ Strengths

  1. Well-architected: Clear separation between data models, analysis logic, and CLI
  2. Follows v2 patterns: Headless, CLI-first design consistent with Golden Path
  3. Comprehensive test coverage: 71 tests covering unit, integration, and CLI flows
  4. Proper database design: Indexed tables with CHECK constraints and foreign keys
  5. Rich UX: Well-formatted CLI output with color-coded severity and actionable recommendations

🔒 Security

✅ No significant security concerns:

  • SQL queries use proper parameterization for values
  • Note: The update() function uses f-string for column names (tasks.py:316) - while currently safe as only controlled values (title, description, priority) are used, consider validating column names against a whitelist for extra safety.

📊 Final Recommendation

Status: 🟢 Ready with recommended fix

The schema migration risk should be addressed before merging to prevent potential failures on older workspaces. The minor improvements (package regex, CLI commands, connection cleanup) can be handled in a follow-up PR or at your discretion.

Great work on this feature! The self-diagnosis system significantly improves the developer experience when tasks fail.

@github-actions
Copy link
Contributor

Review comment posted successfully at #300 (comment)

The review acknowledges that all previously reported critical issues (stale severity/recommendations, wrong LLM API signature, database connection leak) have been fixed in the current code. It identifies a new schema migration risk where the foreign key references to the runs table may fail if that table doesn't exist, and notes several minor improvements (package name regex, CLI command inconsistencies, missing connection cleanup). Overall assessment: Ready with recommended fix for the schema issue.

New%20session%20-%202026-01-22T21%3A35%3A46.808Z
opencode session  |  github run

1. Schema Migration: Ensure runs table exists before dependent tables
   - Added runs table creation in _ensure_schema_upgrades() before
     creating run_logs and diagnostic_reports tables
   - Prevents foreign key errors on older workspaces

2. Package Name Regex: Handle hyphens and dots
   - Changed (\w+) to ([\w\-\.]+) in _extract_package_name()
   - Now correctly extracts: pydantic-core, google.auth, typing-extensions

3. CLI Commands: Use correct command names
   - Changed "cf tasks show" to "cf work update-description"
   - Changed "cf work start" to "cf work retry" for retry recommendations
   - Updated test to expect new command format

4. Connection Cleanup: Add try/finally to update_status()
   - Ensures conn.close() always runs even on exception
   - Consistent with create() and update() patterns
@github-actions
Copy link
Contributor

Code Review Summary

This PR implements a comprehensive self-diagnosis system with excellent test coverage (71 tests). The architecture follows v2 patterns well. However, there are critical issues that must be addressed before merging.

🔴 Critical Issues

1. Schema Migration Failure Risk (workspace.py:333-380)

The _ensure_schema_upgrades function creates run_logs and diagnostic_reports tables with foreign keys to runs(id), but does not ensure the runs table exists first. If a workspace is initialized from an older state where runs table doesn't exist, the migration will fail with a foreign key constraint error.

Fix: Add a check to ensure runs table exists before creating dependent tables.

2. Fragile LLM Category Extraction (diagnostic_agent.py:620-640)

_extract_category_from_llm uses simple substring matching. This can match category strings in unexpected contexts (e.g., 'The issue is NOT a dependency_issue' would still match). It should parse only the 'Failure Category:' line.

Fix: Parse specifically the category line using regex pattern matching.

3. String Truncation Bug (diagnostic_agent.py:464)

When max_length < 3, the code summary[:max_length - 3] produces a negative index, returning characters from the end instead of truncating properly.

Fix: Add guard: summary = summary[:max(0, max_length - 3)] + '...'

🟡 Medium Priority Issues

4. Bypass Repository Layer (tasks.py:254-329)

The update() function bypasses TaskRepository and uses direct sqlite3 queries. This violates the domain-layer pattern used elsewhere and loses validation/safety checks.

5. Incorrect CLI Commands (diagnostic_agent.py:230, 238, 278, 296, 306)

Recommendations reference non-existent commands:

  • Line 230: cf tasks show → should be codeframe tasks get
  • Lines 238, 278, 296: cf work start → retry uses codeframe work retry

6. Weak Test Assertions (tests/cli/test_work_diagnose.py:76, 115, 124)

Tests use assert result.exit_code == 0 or result.exit_code is None which allows failures to pass. Should be explicit: assert result.exit_code == 0.

✅ Strengths

  • Well-architected with clear separation of concerns
  • Excellent test coverage (71 new tests)
  • Follows v2 CLI-first patterns
  • Proper database schema with indexes and constraints
  • Rich CLI output with color-coded severity

🔒 Security

✅ No security concerns - SQL queries use proper parameterization.

📊 Final Assessment

Status: 🟡 Needs Critical Fixes

Issues #1 (schema), #2 (LLM parsing), and #3 (string truncation) must be fixed before merging. Issue #4-6 can be handled in follow-ups.

The implementation is solid overall, but these bugs will cause runtime failures in production scenarios.

@github-actions
Copy link
Contributor

Review posted at: #300 (comment)

New%20session%20-%202026-01-22T21%3A55%3A32.356Z
opencode session  |  github run

verbose_mode = " [dim](verbose)[/dim]" if verbose else ""
console.print(f"\n[bold]Executing agent...{mode}{verbose_mode}[/bold]")

state = runtime.execute_agent(workspace, run, dry_run=dry_run, verbose=verbose)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_event_type_to_category is called in the on_agent_event callback but never defined, causing a NameError at runtime. Consider adding the function definition or using LogCategory.AGENT_ACTION directly.

🚀 Want me to fix this? Reply ex: "fix it for me".

1. Fragile LLM Category Extraction (diagnostic_agent.py:618-651)
   - Changed from simple substring matching to regex parsing
   - Now specifically parses "Failure Category:" line
   - Prevents false matches like "NOT a dependency_issue"

2. String Truncation Bug (diagnostic_agent.py:463-464)
   - Added guard: max(0, max_length - 3) for negative index
   - Prevents incorrect behavior when max_length < 3

3. Weak Test Assertions (test_work_diagnose.py)
   - Changed "exit_code == 0 or exit_code is None" to explicit "exit_code == 0"
   - Added descriptive error messages with stdout output
   - Fixed 6 weak assertions across the test file
@github-actions
Copy link
Contributor

Code Review Summary

This PR implements a comprehensive self-diagnosis system for failed task analysis with pattern-based and LLM-powered failure detection. All previously reported critical issues have been addressed. Overall Assessment: 🟢 Ready with one minor improvement

✅ Previously Reported Issues (All Fixed)

All critical and medium priority issues from earlier reviews have been addressed in the current code:

  1. Stale severity/recommendations - FIXED: Severity and recommendations are now computed AFTER the LLM category update (diagnostic_agent.py:539-552)
  2. Wrong LLM API signature - FIXED: Uses correct signature with dict and (diagnostic_agent.py:611-613)
  3. Database connection leak - FIXED: The function uses try/finally for proper cleanup (tasks.py:313-326)
  4. Schema migration risk - FIXED: Checks for table existence before creating dependent tables (workspace.py:333-351)
  5. String truncation bug - FIXED: Uses to guard against negative indices (diagnostic_agent.py:464)
  6. _event_type_to_category undefined - FIXED: Function exists and is properly defined (runtime.py:790-816)
  7. Weak test assertions - FIXED: All tests now use proper assertions with error messages (test_work_diagnose.py:74, 121)

🟡 Minor Improvement

CLI Command Prefix Mismatch

File: (lines 230, 238, 248, 278, 296, 306, 334, 344)

The diagnostic recommendations use prefix instead of . According to the CLI entry point (app.py:6), the command structure is , not .

Current examples:

  • Line 230:
  • Line 238:
  • Line 248:

Suggestion: Replace all instances of with :

ℹ️ Note on Repository Layer

The function in tasks.py uses direct SQL queries instead of routing through . However, this appears to be an intentional design choice:

  • uses aiosqlite (async) with integer task IDs
  • The core layer uses synchronous sqlite3 with UUID string task IDs
  • The sync layer provides simpler data access for CLI commands

This represents a transitional pattern in the codebase during the v2 refactor and is acceptable for this PR.

✅ Strengths

  1. Excellent architecture: Clear separation between data models (diagnostics.py), analysis logic (diagnostic_agent.py), and CLI (app.py)
  2. Follows v2 patterns: Headless, CLI-first design consistent with Golden Path
  3. Comprehensive test coverage: 71 tests covering unit, integration, and CLI flows
  4. Proper database design: Indexed tables with CHECK constraints and foreign keys
  5. Rich UX: Well-formatted CLI output with color-coded severity and actionable recommendations
  6. All critical issues fixed: Previous review feedback has been thoroughly addressed

🔒 Security

✅ No security concerns:

  • SQL queries use proper parameterization for values
  • Column names in dynamic updates are from controlled function parameters (title, description, priority)

📊 Final Recommendation

Status: 🟢 Ready with minor improvement

The only remaining issue is a CLI command prefix mismatch that affects user-facing documentation in recommendations. This is cosmetic and doesn't affect functionality, but should be fixed for consistency.

Excellent work on addressing all the critical feedback from previous reviews. The self-diagnosis system significantly improves the developer experience when tasks fail.

@github-actions
Copy link
Contributor

I've posted a comprehensive code review for PR #300. Here's a summary:

Overall Assessment: 🟢 Ready with one minor improvement

All critical issues from previous reviews have been fixed:

  • ✅ Stale severity/recommendations - now computed after LLM category update
  • ✅ Wrong LLM API signature - uses correct messages/purpose format
  • ✅ Database connection leak - proper try/finally cleanup
  • ✅ Schema migration risk - checks for runs table first
  • ✅ String truncation bug - uses max(0, max_length - 3)
  • ✅ _event_type_to_category undefined - function exists
  • ✅ Weak test assertions - proper assertions with error messages

Remaining Issue (Minor):

  • CLI recommendations use cf prefix instead of codeframe (lines 230, 238, 248, 278, 296, 306, 334, 344 in diagnostic_agent.py)

The implementation is solid with excellent test coverage (71 tests) and follows v2 patterns well. Only a cosmetic CLI prefix issue remains.

Review posted at: #300 (comment)

New%20session%20-%202026-01-22T22%3A03%3A15.741Z
opencode session  |  github run

@frankbria frankbria merged commit f203d35 into main Jan 22, 2026
11 checks passed
@frankbria frankbria deleted the feature/self-diagnosis-system branch January 22, 2026 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[V2] Add self-diagnosis when a run fails

2 participants