feat: Error Classification & Recovery + OODA-Optimized Memory Prefetch Skills #1330

@rjmurillo-bot

Overview

Two skill proposals to improve agent resilience and responsiveness, designed to work with both Copilot CLI and Claude Code.


Skill 1: Error Classification & Recovery

Problem

Agents fail in predictable ways but lack systematic recovery strategies. Failures cascade instead of self-correcting.

Research Basis

  • 39% loop recovery rate with classification + hint injection
  • 85% precision in failure type detection
  • Antifragility: systems that get stronger from failures

Error Taxonomy

| Type | Detection Signal | Recovery Strategy |
| --- | --- | --- |
| Tool Failure | Non-zero exit, API error | Retry with backoff, fallback tool |
| Reasoning Drift | Output diverges from intent | Re-anchor with original prompt |
| Infinite Loop | Repeated tool calls (3+) | Break loop, summarize progress, escalate |
| Scope Creep | Task expands beyond original | Checkpoint, confirm with user |
| Context Overflow | Token limit warnings | Compress context, archive old turns |
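As a rough illustration of the "Retry with backoff, fallback tool" strategy in the Tool Failure row, the sketch below retries each callable with exponential backoff before moving to the next one. The function name `run_with_recovery`, its parameters, and the default retry count and delay are assumptions for illustration, not part of the proposal.

```python
import time
from typing import Callable, Sequence


def run_with_recovery(
    tools: Sequence[Callable[[], str]],
    retries: int = 3,          # assumed default, not specified in the proposal
    base_delay: float = 1.0,   # assumed default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each tool in order; retry each with exponential backoff before falling back."""
    last_error = None
    for tool in tools:
        for attempt in range(retries):
            try:
                return tool()
            except Exception as exc:  # a real hook would catch narrower errors
                last_error = exc
                if attempt < retries - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("all tools failed") from last_error
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.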

Implementation

1. Error Observer Hook

Wrap tool execution to classify failures:

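# .agents/hooks/error_observer.py
from enum import Enum, auto

class ErrorType(Enum):
    TRANSIENT = auto()
    CONFIG = auto()
    LOOP = auto()
    LOGIC = auto()

def classify_error(tool_name: str, exit_code: int, stderr: str) -> ErrorType:
    """Map a failed tool call to an error type; checks run in priority order."""
    if "rate limit" in stderr.lower():  # transient: safe to retry with backoff
        return ErrorType.TRANSIENT
    if exit_code == 2:  # Config error (ADR-035)
        return ErrorType.CONFIG
    if is_repeated_call(tool_name, count=3):  # helper not shown; tracks recent calls
        return ErrorType.LOOP
    return ErrorType.LOGIC  # default when nothing more specific matches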

2. Recovery Hint Injection

Store failure→recovery mappings:

# .agents/recovery-hints.yaml
tool_failures:
  gh:
    - pattern: "GraphQL: Could not resolve"
      hint: "Issue/PR number may not exist. Verify with `gh issue list`."
    - pattern: "HTTP 403"
      hint: "Rate limited. Wait 60s or cache responses with `gh api --cache <duration>`."
      
reasoning_drift:
  - signal: "Let me also add..."
    hint: "STOP. Check if this is in scope. Original task was: {original_task}"
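A minimal lookup over these mappings might look like the sketch below. To stay self-contained it hard-codes one entry from each section as Python dicts instead of parsing the YAML file. The function names are hypothetical, and treating the trailing "..." in the drift signal as a placeholder rather than literal text is an assumption.

```python
from typing import Optional

# One entry per section, copied from recovery-hints.yaml. A real hook would
# load the file with a YAML parser instead of hard-coding it.
TOOL_HINTS = {
    "gh": [
        {
            "pattern": "GraphQL: Could not resolve",
            "hint": "Issue/PR number may not exist. Verify with `gh issue list`.",
        },
    ],
}
DRIFT_HINTS = [
    {
        "signal": "Let me also add...",
        "hint": "STOP. Check if this is in scope. Original task was: {original_task}",
    },
]


def find_tool_hint(tool: str, stderr: str) -> Optional[str]:
    """Return the first hint whose pattern occurs in stderr, if any."""
    for entry in TOOL_HINTS.get(tool, []):
        if entry["pattern"] in stderr:
            return entry["hint"]
    return None


def find_drift_hint(output: str, original_task: str) -> Optional[str]:
    """Return a re-anchoring hint when the output contains a drift signal."""
    for entry in DRIFT_HINTS:
        signal = entry["signal"].rstrip(".")  # assumption: "..." is a placeholder
        if signal in output:
            return entry["hint"].format(original_task=original_task)
    return None
```

Plain substring matching keeps the hint file readable. Regular expressions would be more flexible but make the patterns harder to maintain.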

3. Integration Points

Copilot CLI:

  • Hook into copilot-cli wrapper script
  • Log errors to .agents/sessions/errors.jsonl
  • Inject hints via system prompt modification

Claude Code:

  • Use a post-tool hook (Claude Code's PostToolUse event) rather than a pre-tool hook, since classification needs the tool result
  • Error classification runs post-tool-execution
  • Recovery hints added to next turn context

4. Pattern Learning

After recovery, log what worked:

{"timestamp": "2026-02-26T22:00:00Z", "error_type": "LOOP", "tool": "gh", "recovery": "break_and_summarize", "success": true}

Graduate patterns with 3+ successful recoveries to MEMORY.md.
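The logging and graduation steps could be sketched as follows, assuming the JSONL record shape shown above and grouping by error type, tool, and recovery. The function names and the grouping key are assumptions, and writing the result into MEMORY.md is left out.

```python
import json
from collections import Counter
from pathlib import Path

GRADUATION_THRESHOLD = 3  # "3+ successful recoveries"


def log_recovery(path: Path, record: dict) -> None:
    """Append one recovery record as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def graduation_candidates(path: Path) -> list:
    """Return (error_type, tool, recovery) keys with enough successful recoveries."""
    wins = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines in the log
            rec = json.loads(line)
            if rec.get("success"):
                wins[(rec["error_type"], rec["tool"], rec["recovery"])] += 1
    return sorted(key for key, n in wins.items() if n >= GRADUATION_THRESHOLD)
```

Failed recoveries are logged too but do not count toward graduation, so a pattern that works only intermittently takes longer to reach MEMORY.md.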


Skill 2: OODA-Optimized Memory Prefetch

Problem

Agents spend the "Orient" phase of the OODA loop (Observe, Orient, Decide, Act) gathering context that is predictable and could be fetched ahead of time. Research shows 88-99% cycle time reduction is possible.

Research Basis

  • OODA loop optimization: fastest cycle wins
  • Edge deployment principle: precompute, don't fetch on demand
  • Predictable context: git status, open PRs, recent sessions are almost always needed

Prefetch Targets

| Context | Trigger | Cache Duration |
| --- | --- | --- |
| Git status + branch | Session start | 5 min |
| Open PRs (assigned) | Session start | 15 min |
| Recent commits (5) | Session start | 15 min |
| CI status (last run) | Session start | 10 min |
| Open issues (assigned) | Session start | 30 min |
| Last session summary | Session start | Until next session |
| HEARTBEAT.md tasks | Heartbeat poll | 1 min |
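The cache durations above translate into a simple freshness check. The sketch below is one way to express it. The key names and the `is_fresh` helper are hypothetical, and the "Last session summary" row is omitted because it expires on an event, not after a time interval.

```python
import time
from typing import Optional

# Cache durations in seconds, taken from the table; key names are assumptions.
CACHE_TTL_SECONDS = {
    "git_status": 5 * 60,
    "open_prs": 15 * 60,
    "recent_commits": 15 * 60,
    "ci_status": 10 * 60,
    "open_issues": 30 * 60,
    "heartbeat_tasks": 1 * 60,
}


def is_fresh(key: str, fetched_at: float, now: Optional[float] = None) -> bool:
    """True while a cached entry is younger than its configured duration."""
    ttl = CACHE_TTL_SECONDS.get(key)
    if ttl is None:
        return False  # unknown keys are always refetched
    current = time.time() if now is None else now
    return current - fetched_at < ttl
```

Storing a `fetched_at` timestamp next to each cached value is what makes this check possible, so the cache file would need one per key.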

Implementation

1. Prefetch Script

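# .agents/hooks/session_prefetch.py
import json
import shlex
import subprocess
from pathlib import Path

CACHE_DIR = Path(".agents/cache")

def run(cmd: str, timeout: float = 10.0) -> str:
    """Run a command without a shell; return stripped stdout, or "" on any failure."""
    try:
        # The 10s timeout is an arbitrary default so a hung call cannot block startup.
        proc = subprocess.run(shlex.split(cmd), capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return ""  # e.g. gh is not installed, or the call hung
    return proc.stdout.strip() if proc.returncode == 0 else ""

def prefetch_context() -> dict:
    """Run at session start, cache results."""
    context = {
        "git_branch": run("git branch --show-current"),
        "git_status": run("git status --short"),
        "open_prs": run("gh pr list --author @me --json number,title,url --limit 5"),
        "recent_commits": run("git log --oneline -5"),
        "ci_status": run("gh run list --limit 1 --json status,conclusion,name"),
    }

    CACHE_DIR.mkdir(parents=True, exist_ok=True)  # cache dir may not exist yet
    cache_path = CACHE_DIR / "session_context.json"
    cache_path.write_text(json.dumps(context, indent=2))
    return context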

2. Context Injection

Copilot CLI:

  • Prefetch runs in .copilot/hooks/pre-session.sh
  • Results injected via environment variable or temp file
  • CLI reads cache before first prompt

Claude Code:

  • Prefetch runs in CLAUDE.md session init block
  • Results added to passive context (high in prompt)
  • Cache invalidation on git operations
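For either integration, the cached results need to be turned into text that can sit high in the prompt. A minimal sketch, assuming the cache is a flat JSON object of string values like the one the prefetch script writes, is shown below. The function name and the output layout are assumptions.

```python
import json
from pathlib import Path


def render_context(cache_path: Path) -> str:
    """Format cached prefetch results as a plain-text block for the prompt."""
    if not cache_path.exists():
        return ""  # no cache yet: inject nothing
    context = json.loads(cache_path.read_text(encoding="utf-8"))
    lines = ["Prefetched session context:"]
    for key, value in context.items():
        if not value:
            continue  # skip empty results, e.g. a clean working tree
        lines.append(f"{key}:")
        lines.extend(f"  {row}" for row in str(value).splitlines())
    return "\n".join(lines)
```

Skipping empty values keeps the injected block short, which matters when it is added to every session.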

3. Smart Invalidation

# Invalidate on relevant operations
INVALIDATION_TRIGGERS = {
    "git_*": ["git commit", "git push", "git pull", "git checkout"],
    "open_prs": ["gh pr create", "gh pr merge", "gh pr close"],
    "ci_status": ["gh workflow run", "git push"],
}
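One way to apply these triggers is to match each executed command against the trigger prefixes and expand the key patterns with glob matching. The sketch below repeats the mapping so it runs on its own. `keys_to_invalidate` and the prefix-matching rule are assumptions.

```python
from fnmatch import fnmatchcase

INVALIDATION_TRIGGERS = {
    "git_*": ["git commit", "git push", "git pull", "git checkout"],
    "open_prs": ["gh pr create", "gh pr merge", "gh pr close"],
    "ci_status": ["gh workflow run", "git push"],
}


def keys_to_invalidate(command: str, cached_keys) -> set:
    """Return the cached keys that a just-executed command makes stale."""
    command = command.strip()
    stale = set()
    for key_pattern, triggers in INVALIDATION_TRIGGERS.items():
        if any(command.startswith(trigger) for trigger in triggers):
            # Expand glob-style patterns such as "git_*" against real cache keys.
            stale.update(k for k in cached_keys if fnmatchcase(k, key_pattern))
    return stale
```

With the key names used by the prefetch script, `recent_commits` does not match `git_*`, so it would likely need its own entry to be refreshed after a commit or pull.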

4. Latency Targets

| Metric | Before | After | Target |
| --- | --- | --- | --- |
| Time to first tool call | 3-5s | <1s | 80% reduction |
| Context gathering turns | 2-3 | 0 | Eliminate |
| Redundant API calls | 5+/session | 1 (prefetch) | 80% reduction |

Shared Infrastructure

Both skills share:

  1. .agents/cache/ - Ephemeral context cache
  2. .agents/hooks/ - Pre/post execution hooks
  3. .agents/sessions/errors.jsonl - Error pattern log
  4. MEMORY.md graduation - Patterns with 3+ occurrences

Success Criteria

Error Classification & Recovery

  • Error taxonomy implemented with 5 types
  • Recovery hints for top 10 failure patterns
  • Loop detection breaks 80% of infinite loops
  • Pattern graduation to MEMORY.md working

OODA-Optimized Prefetch

  • Session start prefetches 5 context types
  • Cache invalidation on relevant git/gh ops
  • Measured 50%+ reduction in "Orient" phase time
  • Works in both Copilot CLI and Claude Code

Open Questions

  1. Should error classification run synchronously (blocking) or async (background)?
  2. How to handle prefetch in offline/air-gapped environments?
  3. Should recovery hints be agent-specific or shared across all agents?

References

  • MEMORY.md: OODA Loop Optimization (2026-02-23)
  • MEMORY.md: Multi-Agent Self-Correction (2026-02-25)
  • MEMORY.md: Behavioral Drift Detection (2026-02-25)
  • ADR-035: Exit codes (0=success, 1=logic, 2=config, 3=external, 4=auth)

Labels

  • agent-memory: Context persistence agent
  • area-infrastructure: Build, CI/CD, configuration
  • area-prompts: Agent prompts and templates
  • area-skills: Skills documentation and patterns
  • area-workflows: GitHub Actions workflows
  • bug: Something isn't working
  • enhancement: New feature or request
  • priority:P2: Normal: Standard enhancement or bug fix, moderate impact
  • question: Further information is requested