Reliability & recovery patterns

This document covers the production reliability mechanisms that keep the autonomous pipeline healthy: failure classification, model escalation, circuit breakers, health sweeps, escalation routing, PR remediation, trajectory storage, per-feature reflections, crash recovery, and git workflow error surfacing.

Failure classification

When an agent execution fails, RecoveryService.analyzeFailure() categorizes the error and determines a recovery strategy.

Failure categories

Category	Examples	Default strategy
`transient`	Network timeout, DNS failure, socket hang up	Retry with exponential backoff
`rate_limit`	API throttle (429), quota warning	Pause and wait (5s base delay)
`quota`	Monthly usage cap, spending limit	Escalate to user
`validation`	Invalid input, schema mismatch	Escalate to user
`tool_error`	Bash command failed, file not found	Alternative approach
`test_failure`	Unit test failure, build error	Retry with error context
`merge_conflict`	Git conflict on rebase	Escalate to user
`dependency`	Missing npm package, unresolved import	Retry with context
`authentication`	API key expired, token revoked	Escalate to user
`unknown`	Unclassified error	Escalate to user

Recovery strategies

Six strategies, applied based on category:

retry — Simple retry with delay (transient errors)
retry_with_context — Retry with previous error output injected into the agent prompt (test failures, dependency issues)
alternative_approach — Try a different tool or command (tool errors)
rollback_and_retry — Clear changes, start fresh (corrupted state)
pause_and_wait — Hold for API recovery (rate limits)
escalate_to_user — Emit recovery_escalated event, stop retrying (terminal)
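The category-to-strategy defaults can be written down as a lookup table. The following is a minimal sketch transcribed from the category table; the type names, the `DEFAULT_STRATEGY` constant, and the `isTerminal` helper are illustrative assumptions, not the actual RecoveryService code.

```typescript
type FailureCategory =
  | 'transient' | 'rate_limit' | 'quota' | 'validation' | 'tool_error'
  | 'test_failure' | 'merge_conflict' | 'dependency' | 'authentication' | 'unknown';

type RecoveryStrategy =
  | 'retry' | 'retry_with_context' | 'alternative_approach'
  | 'rollback_and_retry' | 'pause_and_wait' | 'escalate_to_user';

// Default strategy per category, as listed in the category table.
// rollback_and_retry is not a default for any category; it covers corrupted state.
const DEFAULT_STRATEGY: Record<FailureCategory, RecoveryStrategy> = {
  transient: 'retry',
  rate_limit: 'pause_and_wait',
  quota: 'escalate_to_user',
  validation: 'escalate_to_user',
  tool_error: 'alternative_approach',
  test_failure: 'retry_with_context',
  merge_conflict: 'escalate_to_user',
  dependency: 'retry_with_context',
  authentication: 'escalate_to_user',
  unknown: 'escalate_to_user',
};

// Hypothetical helper: a terminal strategy stops the retry loop.
function isTerminal(strategy: RecoveryStrategy): boolean {
  return strategy === 'escalate_to_user';
}
```

Using `Record<FailureCategory, RecoveryStrategy>` makes the compiler reject the table if a category is added without a default.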

Exponential backoff

Transient retries use exponential backoff: base × 2^retryCount, capped at maxDelay.

Agent-level backoff (RecoveryService):

Parameter	Value
Base delay	1,000 ms
Max delay	30,000 ms
Max transient retries	3
Max test failure retries	2
Rate limit base delay	5,000 ms
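As a worked example of the formula, here is a small sketch of the delay calculation using the agent-level values from the table. The function name `backoffDelayMs` is hypothetical, and the sketch assumes `retryCount` starts at 0 for the first retry.

```typescript
// Delay before a retry: base × 2^retryCount, capped at maxDelay.
// Defaults mirror the agent-level table (1,000 ms base, 30,000 ms cap).
function backoffDelayMs(
  retryCount: number,
  baseDelayMs: number = 1_000,
  maxDelayMs: number = 30_000
): number {
  return Math.min(baseDelayMs * 2 ** retryCount, maxDelayMs);
}

// First three transient retries wait 1,000 ms, 2,000 ms, and 4,000 ms.
const transientDelays = [0, 1, 2].map((n) => backoffDelayMs(n));
```

With these defaults the cap only takes effect from the sixth retry onward (2^5 × 1,000 ms = 32,000 ms), so it never triggers within the three allowed transient retries.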

Git workflow backoff (git-workflow-service):

Parameter	Value
Base delay	2,000 ms
Max retries	3
Backoff	2s → 4s → 8s
Applies to	`git push`, `gh pr create` operations

The retryWithExponentialBackoff<T>() helper in git-workflow-service.ts wraps push and PR creation calls. This prevents transient GitHub/network errors from causing silent git workflow failures.
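A wrapper of this kind might look like the sketch below. The parameter list, the injectable `sleep` argument, and the default values are assumptions made for illustration; the real signature in git-workflow-service.ts may differ.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying up to `maxRetries` times with delays of
// baseDelayMs × 2^attempt (2s, 4s, 8s with the defaults).
async function retryWithExponentialBackoff<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 2_000,
  sleep: Sleep = realSleep // injectable so tests do not wait on real timers
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // retries exhausted
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  // Surface the last failure instead of swallowing it.
  throw lastError;
}
```

Rethrowing the final error is what lets the caller record the failure rather than letting the git workflow fail silently.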

Source: apps/server/src/services/git-workflow-service.ts

Lesson generation

After 3+ failures of the same category for a project, RecoveryService.checkAndGenerateLessons() writes a guidance context file to .automaker/context/failure-lessons-{category}.md. Future agents automatically receive this guidance via the context loading system.
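The threshold check can be sketched as a count over recorded failure categories. This is an illustration only: `categoriesNeedingLessons`, `lessonFilePath`, and the flat `failureLog` array are hypothetical, and the real service may track failures differently.

```typescript
const LESSON_THRESHOLD = 3;

// Path of the guidance file for a category, following the pattern in the text.
function lessonFilePath(category: string): string {
  return `.automaker/context/failure-lessons-${category}.md`;
}

// Returns the categories that have reached the threshold in a project's failure log.
function categoriesNeedingLessons(failureLog: string[]): string[] {
  const counts = new Map<string, number>();
  failureLog.forEach((category) => {
    counts.set(category, (counts.get(category) ?? 0) + 1);
  });
  return Array.from(counts.entries())
    .filter(([, count]) => count >= LESSON_THRESHOLD)
    .map(([category]) => category);
}
```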

Source: apps/server/src/services/recovery-service.ts

Model auto-escalation

The model tier isn't fixed for a feature's lifetime. The escalation chain:

Haiku → Sonnet → Opus → ESCALATE (human)

When escalation triggers

Feature fails 2+ times at the current tier
Test failures persist after retry with context
Agent hits turn limit without completing

How it works

The Lead Engineer state machine tracks failureCount per feature. On the 2nd+ failure:

Feature enters ESCALATE state, FailureClassifierService categorizes the error
INTAKE phase on retry selects the next model tier (Haiku → Sonnet, Sonnet → Opus)
Feature retries with the higher-capability model
If Opus also fails → stays in ESCALATE, human intervention required
FeatureScheduler circuit breaker pauses auto-mode after 3 consecutive failures

This captures the human pattern: "This is harder than I thought, let me think more carefully."
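The tier selection can be sketched as a walk along the chain. The names `ModelTier`, `nextTier`, and `tierForRetry` are hypothetical, and the sketch assumes escalation applies once `failureCount` reaches 2.

```typescript
type ModelTier = 'haiku' | 'sonnet' | 'opus';

const ESCALATION_CHAIN: readonly ModelTier[] = ['haiku', 'sonnet', 'opus'];

// Next tier in the chain, or null when the chain is exhausted.
function nextTier(current: ModelTier): ModelTier | null {
  const index = ESCALATION_CHAIN.indexOf(current);
  return ESCALATION_CHAIN[index + 1] ?? null;
}

// Tier to use on the next attempt; 'human' means the feature stays in ESCALATE.
function tierForRetry(current: ModelTier, failureCount: number): ModelTier | 'human' {
  if (failureCount < 2) return current; // first failure: retry at the same tier
  return nextTier(current) ?? 'human';
}
```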

Circuit breaker

The auto-mode orchestration loop includes a circuit breaker that prevents cascading failures.

Behavior

Parameter	Value
Failure threshold	2 failures in 60 seconds
Action	Pause auto-mode
Resume after	5 minutes (automatic)

When 2 features fail within a 60-second window, auto-mode pauses. This prevents burning API credits on a systemic issue (e.g., API outage, broken build on main).

After 5 minutes, auto-mode resumes automatically. If the issue persists, the circuit breaker trips again.
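The behavior above can be sketched as a sliding-window counter with a cooldown. The class and method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to test; the real tick loop may structure this differently.

```typescript
class AutoModeCircuitBreaker {
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number;
  private failureTimes: number[] = [];
  private pausedUntil = 0;

  constructor(threshold = 2, windowMs = 60_000, cooldownMs = 5 * 60_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  // Record a feature failure; trips the breaker when the window fills up.
  recordFailure(now: number): void {
    this.failureTimes = this.failureTimes.filter((t) => now - t < this.windowMs);
    this.failureTimes.push(now);
    if (this.failureTimes.length >= this.threshold) {
      this.pausedUntil = now + this.cooldownMs;
      this.failureTimes = []; // start counting fresh after the pause
    }
  }

  // Checked on every tick; auto-mode resumes once the cooldown has elapsed.
  isPaused(now: number): boolean {
    return now < this.pausedUntil;
  }
}
```

If the underlying problem persists after the cooldown, the next two failures inside a 60-second window trip the breaker again.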

Integration

The circuit breaker is evaluated in the auto-mode tick loop, not in the Lead Engineer. The orchestration loop is the scheduler; the state machine is the executor.

Health sweep

Every ~100 seconds (50 iterations at a 2-second interval), the auto-mode loop runs FeatureHealthService.audit() with auto-fix enabled. This catches structural drift on the board.

Issue types

Issue type	Detection	Auto-fix
`orphaned_epic_ref`	Feature references non-existent or non-epic parent	Clear `epicId` reference
`dangling_dependency`	Feature depends on deleted features	Remove non-existent dep IDs
`epic_children_done`	All child features done, but epic still in-progress	Set epic status to `done`
`stale_running`	Feature marked `in_progress` with no active agent	Reset to `backlog`
`stale_gate`	Feature awaiting pipeline gate for >1 hour	Move to `blocked`
`merged_not_done`	Branch merged to main but feature not marked done	Set status to `done`

How it works

const report = await featureHealthService.audit(projectPath, true); // autoFix=true

// report.issues — all detected problems
// report.fixed  — problems that were auto-corrected

Each detected issue emits an escalation:signal-received event with a deduplication key, so the escalation router can alert without flooding.
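To make one of the checks concrete, here is a sketch of how `dangling_dependency` detection and auto-fix could work. The `BoardFeature` shape, the `dependencies` field name, and `auditDependencies` are assumptions for illustration; only the issue type and the `issues`/`fixed` report fields come from the text above.

```typescript
interface BoardFeature {
  id: string;
  dependencies: string[]; // hypothetical field name
}

interface HealthIssue {
  type: 'dangling_dependency';
  featureId: string;
  missing: string[];
}

// Finds dependencies on features that no longer exist and, when autoFix
// is true, removes the missing IDs from the feature in place.
function auditDependencies(
  features: BoardFeature[],
  autoFix: boolean
): { issues: HealthIssue[]; fixed: HealthIssue[] } {
  const knownIds = new Set(features.map((f) => f.id));
  const issues: HealthIssue[] = [];
  const fixed: HealthIssue[] = [];

  features.forEach((feature) => {
    const missing = feature.dependencies.filter((dep) => !knownIds.has(dep));
    if (missing.length === 0) return;
    const issue: HealthIssue = { type: 'dangling_dependency', featureId: feature.id, missing };
    issues.push(issue);
    if (autoFix) {
      feature.dependencies = feature.dependencies.filter((dep) => knownIds.has(dep));
      fixed.push(issue);
    }
  });

  return { issues, fixed };
}
```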

Safety

Uses execFileAsync (not shell) for git operations — prevents injection
Detects both main and master as default branches
Caches epic branch --merged results to reduce git calls

Source: apps/server/src/services/feature-health-service.ts

Escalation router

When recovery fails or the health sweep finds unfixable issues, signals are routed to notification channels via EscalationRouter.

Signal flow

Recovery failure / Health issue / Lead Engineer escalation
    ↓
EscalationRouter.routeSignal(signal)
    ├── Deduplication check (30-min window)
    │   └── Duplicate? → emit 'escalation:signal-deduplicated', skip
    ├── Severity filter
    │   └── Low severity? → log only, no routing
    ├── Per-channel rate limit check
    │   └── Rate limited? → add to rateLimited list, skip channel
    └── Send to matching channels
        └── emit 'escalation:signal-sent' per channel

Signal severity

Severity	Behavior
`low`	Logged only, not routed to channels
`medium`	Routed to matching channels
`high`	Routed to all matching channels
`critical`	Routed to all channels, bypasses rate limits

Deduplication

Signals carry a deduplicationKey (e.g., "escalation:feature-123:test-failure"). If the same key was seen within the last 30 minutes, the signal is deduplicated — logged but not re-routed.
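The window check can be sketched with a map from key to last-routed time. `DedupWindow` and its methods are hypothetical names, and the sketch assumes a duplicate does not extend the window.

```typescript
class DedupWindow {
  private readonly windowMs: number;
  private readonly lastRouted = new Map<string, number>();

  constructor(windowMs = 30 * 60_000) {
    this.windowMs = windowMs;
  }

  // True when the key was routed within the window. A fresh key is recorded.
  isDuplicate(key: string, now: number): boolean {
    const seenAt = this.lastRouted.get(key);
    if (seenAt !== undefined && now - seenAt < this.windowMs) {
      return true; // logged by the caller, not re-routed
    }
    this.lastRouted.set(key, now);
    return false;
  }

  // Lets an acknowledgment reopen the key before the window expires.
  clear(key: string): void {
    this.lastRouted.delete(key);
  }
}
```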

Rate limiting

Each channel can define a rate limit:

interface EscalationChannel {
  name: string;
  canHandle(signal: EscalationSignal): boolean;
  send(signal: EscalationSignal): Promise<void>;
  rateLimit?: { maxSignals: number; windowMs: number };
}

Example: Discord might limit to 5 signals per hour. The router tracks per-channel counters and skips channels that exceed their limit.
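A per-channel counter of this kind might be sketched as follows. `ChannelRateLimiter` and `allow` are hypothetical names; the sketch also assumes that `critical` signals bypass the limit without consuming quota, which the source does not specify.

```typescript
type Severity = 'low' | 'medium' | 'high' | 'critical';

interface RateLimit {
  maxSignals: number;
  windowMs: number;
}

class ChannelRateLimiter {
  private readonly sentAt = new Map<string, number[]>();

  // Returns true when the channel may receive the signal at time `now`.
  allow(channel: string, severity: Severity, now: number, limit?: RateLimit): boolean {
    if (!limit || severity === 'critical') return true; // no limit, or bypass
    const recent = (this.sentAt.get(channel) ?? []).filter((t) => now - t < limit.windowMs);
    const allowed = recent.length < limit.maxSignals;
    if (allowed) recent.push(now);
    this.sentAt.set(channel, recent);
    return allowed;
  }
}
```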

Acknowledgment

Signals can be acknowledged via acknowledgeSignal(deduplicationKey, acknowledgedBy, notes?, clearDedup?). This marks the signal as handled in the escalation log and optionally clears the deduplication window.

Audit log

The router maintains a log of up to 1,000 entries (most recent first). Each entry records:

The signal and its severity
Which channels received it
Whether it was deduplicated or rate-limited
Acknowledgment status

Source: apps/server/src/services/escalation-router.ts

PR remediation loop

When a PR fails CI or receives review feedback, the system enters a remediation loop.

Flow

PR created → CI runs + CodeRabbit reviews
    ├── CI passes + approved → MERGE
    ├── CI fails → extract failure context → back to EXECUTE
    ├── changes_requested → collect feedback → send to agent for fixes
    └── Max retries exceeded → ESCALATE

Limits

Parameter	Value
Max CI retry cycles	2 (back to EXECUTE with failure context)
Max feedback cycles	2 (agent addresses reviewer comments)
Max total remediation	4 cycles before escalation
PR poll interval	60 seconds

How feedback flows

PRFeedbackService polls GitHub every 60 seconds for new review activity
On changes_requested: feedback is collected and sent to the agent
The agent addresses feedback in the worktree and pushes
CI re-runs, CodeRabbit re-reviews
On approved + CI passing → MERGE

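Within the limits above, the loop resolves CI failures and reviewer feedback without human intervention: failure context or review comments go back to the agent, which pushes fixes until the PR is mergeable or the retry budget runs out.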
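The flow and limits can be condensed into a single decision function. This is a sketch under assumed names (`PrSnapshot`, `nextRemediationAction`, and the action labels `WAIT` and `ADDRESS_FEEDBACK` are illustrative), not the state machine's actual code.

```typescript
interface PrSnapshot {
  ciStatus: 'pending' | 'passing' | 'failing';
  review: 'pending' | 'approved' | 'changes_requested';
  ciRetries: number;      // CI retry cycles already used
  feedbackCycles: number; // feedback cycles already used
}

type RemediationAction = 'WAIT' | 'MERGE' | 'EXECUTE' | 'ADDRESS_FEEDBACK' | 'ESCALATE';

const MAX_CI_RETRIES = 2;
const MAX_FEEDBACK_CYCLES = 2;

// Decides what to do after each 60-second poll of the PR.
function nextRemediationAction(pr: PrSnapshot): RemediationAction {
  if (pr.ciStatus === 'failing') {
    return pr.ciRetries < MAX_CI_RETRIES ? 'EXECUTE' : 'ESCALATE';
  }
  if (pr.review === 'changes_requested') {
    return pr.feedbackCycles < MAX_FEEDBACK_CYCLES ? 'ADDRESS_FEEDBACK' : 'ESCALATE';
  }
  if (pr.ciStatus === 'passing' && pr.review === 'approved') {
    return 'MERGE';
  }
  return 'WAIT'; // CI or review still pending
}
```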

Trajectory store

TrajectoryStoreService persists verified execution trajectories for learning.

Storage

.automaker/trajectory/{featureId}/attempt-{N}.json

Each trajectory records:

Feature metadata (ID, title, complexity)
Execution outcome (success/failure)
Key decisions the agent made
Recovery strategies that worked
Failure patterns encountered
Duration and token usage

Non-blocking writes

Trajectory writes are fire-and-forget. They never block the agent execution loop. If the write fails (disk full, permissions), the feature still completes normally.
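A fire-and-forget write can be sketched by starting the write and attaching an error handler without awaiting it. The `Writer` type, `saveTrajectory`, and the `onError` callback are hypothetical; the writer is injected so the sketch stays independent of any particular file API.

```typescript
type Writer = (path: string, data: string) => Promise<void>;

// Storage path for an attempt, following the layout shown above.
function trajectoryPath(featureId: string, attempt: number): string {
  return `.automaker/trajectory/${featureId}/attempt-${attempt}.json`;
}

// Starts the write and returns immediately. A failed write is reported
// through onError and never propagates to the caller.
function saveTrajectory(
  write: Writer,
  featureId: string,
  attempt: number,
  trajectory: unknown,
  onError: (error: unknown) => void = () => {}
): string {
  const path = trajectoryPath(featureId, attempt);
  void write(path, JSON.stringify(trajectory)).catch(onError); // intentionally not awaited
  return path;
}
```

Attaching `.catch` matters even though the result is ignored: without it, a failed write would surface as an unhandled promise rejection.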

Sibling reflections

When a feature enters EXECUTE, the Lead Engineer loads trajectories from recently completed sibling features:

const siblings = features
  .filter((f) => f.status === 'verified' && f.lastExecutionTime)
  .sort((a, b) => (b.lastExecutionTime || 0) - (a.lastExecutionTime || 0))
  .slice(0, 3); // max 3 reflections

Sibling matching: a feature's siblings share its epicId when it belongs to an epic, or its projectSlug when it is standalone.

These reflections are injected into the agent's context as "Lessons from Similar Features" (max ~500 tokens), giving each agent the benefit of what prior agents learned.
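Combining the matching rule with the selection shown earlier gives roughly the following. The `FeatureSummary` shape, `isSibling`, and `selectSiblings` are assumptions for illustration; in particular, excluding the feature itself is a guess about the real behavior.

```typescript
interface FeatureSummary {
  id: string;
  status: string;
  projectSlug: string;
  epicId?: string;
  lastExecutionTime?: number;
}

// Same epic when the current feature is in an epic, otherwise same project.
function isSibling(candidate: FeatureSummary, current: FeatureSummary): boolean {
  if (candidate.id === current.id) return false;
  return current.epicId !== undefined
    ? candidate.epicId === current.epicId
    : candidate.projectSlug === current.projectSlug;
}

// Most recently executed verified siblings, capped at `max` reflections.
function selectSiblings(
  features: FeatureSummary[],
  current: FeatureSummary,
  max: number = 3
): FeatureSummary[] {
  return features
    .filter((f) => isSibling(f, current) && f.status === 'verified' && Boolean(f.lastExecutionTime))
    .sort((a, b) => (b.lastExecutionTime || 0) - (a.lastExecutionTime || 0))
    .slice(0, max);
}
```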

Source: apps/server/src/services/trajectory-store-service.ts

Per-feature reflection loop

After each feature reaches DONE, a lightweight reflection is generated.

How it works

DeployProcessor.generateReflection() fires non-blocking after marking a feature done
Reads the tail of agent-output.md (last 2,000 chars) plus execution metadata
Calls simpleQuery() with Haiku (maxTurns: 1, no tools) to produce a structured reflection under 200 words
Writes result to .automaker/features/{id}/reflection.md
Emits feature:reflection:complete event

Feed-forward

Reflections from completed siblings are loaded during EXECUTE (see Trajectory Store above). This creates an in-project learning loop — each feature benefits from the last.

Cost

~$0.001 per reflection (Haiku, single turn, no tools). Fire-and-forget — failure does not block the state machine.

Observability

Reflection LLM calls are traced in Langfuse with:

Tags: feature:{id}, role:reflection
Metadata: featureId, featureName, agentRole: 'reflection'

FailureClassifierService

Pattern-matches escalation reason strings to structured failure categories and recovery strategies.

Purpose

When the Lead Engineer's ESCALATE state receives an escalation reason string (e.g., "Rate limit exceeded", "Tests failed after 3 retries"), the classifier maps it to a FailureCategory and suggests a RecoveryStrategy.
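A pattern-matching classifier along these lines could be sketched as an ordered rule list. The regular expressions below are illustrative guesses, not the patterns the service actually uses, and `classifyReason` is a hypothetical name; only the category names come from this document.

```typescript
type ReasonCategory =
  | 'transient' | 'rate_limit' | 'quota' | 'test_failure'
  | 'merge_conflict' | 'authentication' | 'unknown';

// Rules are checked in order; the first match wins.
const CLASSIFICATION_RULES: Array<{ pattern: RegExp; category: ReasonCategory }> = [
  { pattern: /rate limit|\b429\b|throttl/i, category: 'rate_limit' },
  { pattern: /usage cap|spending limit/i, category: 'quota' },
  { pattern: /tests? failed|build error/i, category: 'test_failure' },
  { pattern: /merge conflict/i, category: 'merge_conflict' },
  { pattern: /api key|token (expired|revoked)/i, category: 'authentication' },
  { pattern: /timeout|socket hang up|dns/i, category: 'transient' },
];

// Maps a free-text escalation reason to a category, defaulting to 'unknown'.
function classifyReason(reason: string): ReasonCategory {
  for (const rule of CLASSIFICATION_RULES) {
    if (rule.pattern.test(reason)) return rule.category;
  }
  return 'unknown';
}
```

Falling back to `unknown` keeps unrecognized reasons on the escalate-to-user path rather than retrying blindly.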

Integration

Called by EscalateProcessor.process() in the Lead Engineer state machine. The classified category determines:

Whether to retry or escalate
Which model tier to use on retry
What context to inject into the agent prompt

Source: apps/server/src/services/failure-classifier-service.ts

Crash recovery scan

On server startup, a non-blocking worktree scan detects stranded work from crashed agent sessions.

How it works

After resumeInterruptedFeatures() completes, scanWorktreesForCrashRecovery() runs via setImmediate():

Lists all worktrees via git worktree list --porcelain
Cross-references each worktree with its feature status
For features in verified/done with uncommitted or unpushed work:
- Commits stranded changes via ensureCleanWorktree()
- Pushes to remote
- Triggers runPostCompletionWorkflow() (PR creation)
For features in other states with stranded work: logs a warning
Emits maintenance:crash_recovery_scan_completed with summary
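The first steps of the scan can be sketched as parsing the porcelain output and deciding per worktree. `parseWorktreeList` and `recoveryAction` are hypothetical helpers; the parser relies on the porcelain format of one `worktree`, `HEAD`, and `branch` line per entry with blank lines between entries.

```typescript
interface WorktreeEntry {
  path: string;
  head?: string;
  branch?: string; // short branch name; absent for detached worktrees
}

// Parses the output of `git worktree list --porcelain`.
function parseWorktreeList(output: string): WorktreeEntry[] {
  const entries: WorktreeEntry[] = [];
  for (const block of output.trim().split(/\n\s*\n/)) {
    const entry: WorktreeEntry = { path: '' };
    for (const line of block.split('\n')) {
      if (line.startsWith('worktree ')) entry.path = line.slice('worktree '.length);
      else if (line.startsWith('HEAD ')) entry.head = line.slice('HEAD '.length);
      else if (line.startsWith('branch ')) {
        entry.branch = line.slice('branch '.length).replace(/^refs\/heads\//, '');
      }
    }
    if (entry.path) entries.push(entry);
  }
  return entries;
}

// Mirrors the rules above: recover finished features, warn on anything else.
function recoveryAction(status: string, hasStrandedWork: boolean): 'recover' | 'warn' | 'skip' {
  if (!hasStrandedWork) return 'skip';
  return status === 'verified' || status === 'done' ? 'recover' : 'warn';
}
```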

When it triggers

Server startup only (not a cron task)
Non-blocking — does not delay server initialization or request handling
Fire-and-forget — scan failures are logged but don't crash the server

Source: apps/server/src/services/maintenance-tasks.ts

Git workflow error surfacing

When git operations (commit, push, PR creation) fail after agent execution, the error is stored on the feature for UI visibility.

The `gitWorkflowError` field

interface Feature {
  gitWorkflowError?: {
    message: string; // Error description
    timestamp: string; // ISO 8601 when the error occurred
  };
}

All four git workflow catch blocks in auto-mode-service.ts persist errors to feature.json instead of only logging them. Feature status remains unchanged (e.g., stays verified); the error field provides a separate visibility channel.
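What a catch block records can be sketched as a pure function that attaches the error and leaves the status alone. `FeatureRecord` and `recordGitWorkflowError` are hypothetical names, and writing the result to feature.json is left to the caller.

```typescript
interface GitWorkflowError {
  message: string;   // Error description
  timestamp: string; // ISO 8601 when the error occurred
}

interface FeatureRecord {
  id: string;
  status: string;
  gitWorkflowError?: GitWorkflowError;
}

// Returns a copy of the feature with the error attached; status is untouched.
function recordGitWorkflowError(
  feature: FeatureRecord,
  error: unknown,
  now: Date = new Date()
): FeatureRecord {
  const message = error instanceof Error ? error.message : String(error);
  return { ...feature, gitWorkflowError: { message, timestamp: now.toISOString() } };
}
```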

Source: libs/types/src/feature.ts, apps/server/src/services/auto-mode-service.ts

Event-driven observability

All reliability services emit events for real-time UI updates and audit logging:

Service	Event prefix	Key events
RecoveryService	`recovery_*`	`analysis`, `started`, `completed`, `recorded`, `escalated`, `lesson_generated`
EscalationRouter	`escalation:*`	`signal-received`, `deduplicated`, `sent`, `failed`, `routed`, `acknowledged`
FeatureHealthService	(via auto-mode)	Issues surface through escalation events
AutoModeService	`feature:*`	`status-changed`, `completed`, `error`
Lead Engineer	`feature:*`	`reflection:complete`, `pr-merged`, `state-changed`
Maintenance	`maintenance:*`	`crash_recovery_scan_completed`

Recovery architecture diagram

Feature Execution Fails
    ↓
RecoveryService.analyzeFailure()
    ├── categorizeFailure() → FailureCategory
    ├── determineStrategy() → RecoveryStrategy
    ├── recordRecoveryAttempt() → JSONL log + emit events
    └── checkAndGenerateLessons() (after 3+ failures)
        └── Write failure-lessons-{category}.md to context/
    ↓
Recovery result: { success, shouldRetry, actionTaken }
    ├── If retryable → AutoModeService.retry() with injected context
    └── If escalate → EscalationRouter.routeSignal()
        ├── Dedup check (30-min window)
        ├── Rate limit check (per-channel)
        └── Send to registered channels
            └── EscalationLogEntry recorded for audit trail
    ↓
(Parallel) FeatureHealthService.audit()
    ├── Check for drift: orphaned refs, stale running, merged branches
    └── Auto-fix if enabled → Update feature status
    ↓
Lead Engineer State Machine
    ├── [EXECUTE] Load sibling reflections from trajectory store
    ├── [REMEDIATION] Inject failure context + review feedback
    └── [ESCALATE] FailureClassifierService maps reason → category
