Production reliability mechanisms that keep the autonomous pipeline healthy. Covers failure classification, model escalation, circuit breakers, health sweeps, escalation routing, PR remediation, trajectory storage, and per-feature reflections.
When an agent execution fails, RecoveryService.analyzeFailure() categorizes the error and determines a recovery strategy.
| Category | Examples | Default strategy |
|---|---|---|
| `transient` | Network timeout, DNS failure, socket hang up | Retry with exponential backoff |
| `rate_limit` | API throttle (429), quota warning | Pause and wait (5s base delay) |
| `quota` | Monthly usage cap, spending limit | Escalate to user |
| `validation` | Invalid input, schema mismatch | Escalate to user |
| `tool_error` | Bash command failed, file not found | Alternative approach |
| `test_failure` | Unit test failure, build error | Retry with error context |
| `merge_conflict` | Git conflict on rebase | Escalate to user |
| `dependency` | Missing npm package, unresolved import | Retry with context |
| `authentication` | API key expired, token revoked | Escalate to user |
| `unknown` | Unclassified error | Escalate to user |
Six strategies, applied based on category:
- retry — Simple retry with delay (transient errors)
- retry_with_context — Retry with previous error output injected into the agent prompt (test failures, dependency issues)
- alternative_approach — Try a different tool or command (tool errors)
- rollback_and_retry — Clear changes, start fresh (corrupted state)
- pause_and_wait — Hold for API recovery (rate limits)
- escalate_to_user — Emit a recovery_escalated event and stop retrying (terminal)
Transient retries use exponential backoff: base × 2^retryCount, capped at maxDelay.
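The schedule this formula produces can be sketched as follows (a minimal sketch; the helper name is illustrative, and the defaults match the RecoveryService parameters of a 1s base and 30s cap):

```typescript
// Illustrative backoff calculator. Only the formula (base * 2^retryCount,
// capped at maxDelay) and the default values come from the documentation;
// the function itself is a hypothetical stand-in for the real service code.
function backoffDelay(retryCount: number, baseMs = 1_000, maxMs = 30_000): number {
  return Math.min(baseMs * 2 ** retryCount, maxMs);
}

// Retries 0..5 → 1s, 2s, 4s, 8s, 16s, then capped at 30s
const schedule = [0, 1, 2, 3, 4, 5].map((n) => backoffDelay(n));
```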
Agent-level backoff (RecoveryService):
| Parameter | Value |
|---|---|
| Base delay | 1,000 ms |
| Max delay | 30,000 ms |
| Max transient retries | 3 |
| Max test failure retries | 2 |
| Rate limit base delay | 5,000 ms |
Git workflow backoff (git-workflow-service):
| Parameter | Value |
|---|---|
| Base delay | 2,000 ms |
| Max retries | 3 |
| Backoff | 2s → 4s → 8s |
| Applies to | git push, gh pr create operations |
The retryWithExponentialBackoff<T>() helper in git-workflow-service.ts wraps push and PR creation calls. This prevents transient GitHub/network errors from causing silent git workflow failures.
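A plausible shape for this helper, shown as a sketch: the parameter names and error handling below are assumptions, not the actual git-workflow-service signature, but the values mirror the table above (2s base, 3 retries, 2s → 4s → 8s).

```typescript
// Hypothetical sketch of a generic retry wrapper in the spirit of
// retryWithExponentialBackoff<T>(). Delays double on each attempt.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 2_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseDelayMs * 2 ** attempt; // 2s, 4s, 8s
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```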
Source: apps/server/src/services/git-workflow-service.ts
After 3+ failures of the same category for a project, RecoveryService.checkAndGenerateLessons() writes a guidance context file to .automaker/context/failure-lessons-{category}.md. Future agents automatically receive this guidance via the context loading system.
Source: apps/server/src/services/recovery-service.ts
The model tier isn't fixed for a feature's lifetime. The escalation chain:

Haiku → Sonnet → Opus → ESCALATE (human)

Escalation triggers when any of the following occurs:
- Feature fails 2+ times at the current tier
- Test failures persist after retry with context
- Agent hits the turn limit without completing
The Lead Engineer state machine tracks failureCount per feature. On the 2nd+ failure:
- Feature enters the ESCALATE state; FailureClassifierService categorizes the error
- The INTAKE phase on retry selects the next model tier (Haiku → Sonnet, Sonnet → Opus)
- Feature retries with the higher-capability model
- If Opus also fails → stays in ESCALATE, human intervention required
- The FeatureScheduler circuit breaker pauses auto-mode after 3 consecutive failures
This captures the human pattern: "This is harder than I thought, let me think more carefully."
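The tier selection on retry can be sketched as follows (illustrative only; the helper name and types are assumptions, while the chain itself comes from the documentation):

```typescript
type ModelTier = 'haiku' | 'sonnet' | 'opus';

// Hypothetical helper mirroring the Haiku → Sonnet → Opus → ESCALATE chain.
// Returns the next tier to try, or null when human intervention is required.
function nextTier(current: ModelTier): ModelTier | null {
  const chain: ModelTier[] = ['haiku', 'sonnet', 'opus'];
  const next = chain[chain.indexOf(current) + 1];
  return next ?? null; // past Opus → stay in ESCALATE
}
```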
The auto-mode orchestration loop includes a circuit breaker that prevents cascading failures.
| Parameter | Value |
|---|---|
| Failure threshold | 2 failures in 60 seconds |
| Action | Pause auto-mode |
| Resume after | 5 minutes (automatic) |
When 2 features fail within a 60-second window, auto-mode pauses. This prevents burning API credits on a systemic issue (e.g., API outage, broken build on main).
After 5 minutes, auto-mode resumes automatically. If the issue persists, the circuit breaker trips again.
The circuit breaker is evaluated in the auto-mode tick loop, not in the Lead Engineer. The orchestration loop is the scheduler; the state machine is the executor.
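The window check can be sketched as a small class (the class shape and sliding-window bookkeeping are assumptions; only the threshold, window, and cooldown values come from the table above):

```typescript
// Hypothetical sliding-window circuit breaker with the documented parameters:
// pause after 2 failures within 60s, resume automatically after 5 minutes.
class CircuitBreaker {
  private failures: number[] = []; // failure timestamps (ms)
  private pausedUntil = 0;

  constructor(
    private threshold = 2,
    private windowMs = 60_000,
    private cooldownMs = 300_000,
  ) {}

  recordFailure(now: number): void {
    this.failures.push(now);
    // Keep only failures inside the sliding window
    this.failures = this.failures.filter((t) => now - t <= this.windowMs);
    if (this.failures.length >= this.threshold) {
      this.pausedUntil = now + this.cooldownMs;
      this.failures = [];
    }
  }

  isPaused(now: number): boolean {
    return now < this.pausedUntil;
  }
}
```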
Every ~100 seconds (50 iterations at a 2-second interval), the auto-mode loop runs FeatureHealthService.audit() with auto-fix enabled. This catches structural drift on the board.
| Issue type | Detection | Auto-fix |
|---|---|---|
| `orphaned_epic_ref` | Feature references a non-existent or non-epic parent | Clear the epicId reference |
| `dangling_dependency` | Feature depends on deleted features | Remove non-existent dep IDs |
| `epic_children_done` | All child features done, but epic still in-progress | Set epic status to done |
| `stale_running` | Feature marked in_progress with no active agent | Reset to backlog |
| `stale_gate` | Feature awaiting a pipeline gate for >1 hour | Move to blocked |
| `merged_not_done` | Branch merged to main but feature not marked done | Set status to done |
```typescript
const report = await featureHealthService.audit(projectPath, true); // autoFix=true
// report.issues — all detected problems
// report.fixed — problems that were auto-corrected
```

Each detected issue emits an `escalation:signal-received` event with a deduplication key, so the escalation router can alert without flooding.
- Uses `execFileAsync` (not shell) for git operations — prevents injection
- Detects both `main` and `master` as default branches
- Caches epic branch `--merged` results to reduce git calls
Source: apps/server/src/services/feature-health-service.ts
When recovery fails or health sweep finds unfixable issues, signals route to notification channels via EscalationRouter.
Recovery failure / Health issue / Lead Engineer escalation
↓
EscalationRouter.routeSignal(signal)
├── Deduplication check (30-min window)
│ └── Duplicate? → emit 'escalation:signal-deduplicated', skip
├── Severity filter
│ └── Low severity? → log only, no routing
├── Per-channel rate limit check
│ └── Rate limited? → add to rateLimited list, skip channel
└── Send to matching channels
└── emit 'escalation:signal-sent' per channel
| Severity | Behavior |
|---|---|
| `low` | Logged only, not routed to channels |
| `medium` | Routed to matching channels |
| `high` | Routed to all matching channels |
| `critical` | Routed to all channels, bypasses rate limits |
Signals carry a deduplicationKey (e.g., "escalation:feature-123:test-failure"). If the same key was seen within the last 30 minutes, the signal is deduplicated — logged but not re-routed.
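The 30-minute dedup window can be sketched as a small helper (hypothetical names; the router's real bookkeeping lives in escalation-router.ts):

```typescript
// Hypothetical dedup check: a signal whose key was seen within the last
// 30 minutes is suppressed; every call refreshes the key's timestamp.
const DEDUP_WINDOW_MS = 30 * 60 * 1000;
const lastSeen = new Map<string, number>();

function isDuplicate(deduplicationKey: string, now: number): boolean {
  const prev = lastSeen.get(deduplicationKey);
  lastSeen.set(deduplicationKey, now);
  return prev !== undefined && now - prev < DEDUP_WINDOW_MS;
}
```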
Each channel can define a rate limit:
```typescript
interface EscalationChannel {
  name: string;
  canHandle(signal: EscalationSignal): boolean;
  send(signal: EscalationSignal): Promise<void>;
  rateLimit?: { maxSignals: number; windowMs: number };
}
```

Example: Discord might limit to 5 signals per hour. The router tracks per-channel counters and skips channels that exceed their limit.
Signals can be acknowledged via acknowledgeSignal(deduplicationKey, acknowledgedBy, notes?, clearDedup?). This marks the signal as handled in the escalation log and optionally clears the deduplication window.
The router maintains a log of up to 1,000 entries (most recent first). Each entry records:
- The signal and its severity
- Which channels received it
- Whether it was deduplicated or rate-limited
- Acknowledgment status
Source: apps/server/src/services/escalation-router.ts
When a PR fails CI or receives review feedback, the system enters a remediation loop.
PR created → CI runs + CodeRabbit reviews
├── CI passes + approved → MERGE
├── CI fails → extract failure context → back to EXECUTE
├── changes_requested → collect feedback → send to agent for fixes
└── Max retries exceeded → ESCALATE
| Parameter | Value |
|---|---|
| Max CI retry cycles | 2 (back to EXECUTE with failure context) |
| Max feedback cycles | 2 (agent addresses reviewer comments) |
| Max total remediation | 4 cycles before escalation |
| PR poll interval | 60 seconds |
- `PRFeedbackService` polls GitHub every 60 seconds for new review activity
- On `changes_requested`: feedback is collected and sent to the agent
- The agent addresses feedback in the worktree and pushes
- CI re-runs, CodeRabbit re-reviews
- On `approved` + CI passing → MERGE
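The retry budget from the table can be sketched as a predicate (the state shape and field names are assumptions; only the limits come from the table):

```typescript
// Hypothetical remediation budget mirroring the documented limits:
// max 2 CI retry cycles, max 2 feedback cycles, max 4 total before ESCALATE.
interface RemediationState {
  ciRetries: number;
  feedbackCycles: number;
}

function shouldEscalate(s: RemediationState): boolean {
  return (
    s.ciRetries > 2 ||
    s.feedbackCycles > 2 ||
    s.ciRetries + s.feedbackCycles >= 4
  );
}
```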
The PR remediation loop handles CI failures automatically by analyzing feedback and pushing fixes.
TrajectoryStoreService persists verified execution trajectories for learning.
.automaker/trajectory/{featureId}/attempt-{N}.json
Each trajectory records:
- Feature metadata (ID, title, complexity)
- Execution outcome (success/failure)
- Key decisions the agent made
- Recovery strategies that worked
- Failure patterns encountered
- Duration and token usage
Trajectory writes are fire-and-forget. They never block the agent execution loop. If the write fails (disk full, permissions), the feature still completes normally.
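The fire-and-forget pattern can be sketched as follows (a hypothetical helper; the path layout is the one shown above, but the function and its error handling are illustrative, not the real TrajectoryStoreService API):

```typescript
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

// Hypothetical fire-and-forget write: the returned promise is deliberately
// not awaited, and errors are logged rather than thrown, so a failed write
// (disk full, permissions) cannot block feature completion.
function writeTrajectory(
  projectPath: string,
  featureId: string,
  attempt: number,
  trajectory: unknown,
): void {
  const dir = path.join(projectPath, '.automaker', 'trajectory', featureId);
  void fs
    .mkdir(dir, { recursive: true })
    .then(() =>
      fs.writeFile(
        path.join(dir, `attempt-${attempt}.json`),
        JSON.stringify(trajectory, null, 2),
      ),
    )
    .catch((err) => console.warn('trajectory write failed:', err));
}
```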
When a feature enters EXECUTE, the Lead Engineer loads trajectories from recently completed sibling features:
```typescript
const siblings = features
  .filter((f) => f.status === 'verified' && f.lastExecutionTime)
  .sort((a, b) => (b.lastExecutionTime || 0) - (a.lastExecutionTime || 0))
  .slice(0, 3); // max 3 reflections
```

Sibling matching: same epicId (if in an epic) or same projectSlug (if standalone).
These reflections are injected into the agent's context as "Lessons from Similar Features" (max ~500 tokens), giving each agent the benefit of what prior agents learned.
Source: apps/server/src/services/trajectory-store-service.ts
After each feature reaches DONE, a lightweight reflection is generated.
- `DeployProcessor.generateReflection()` fires non-blocking after marking a feature done
- Reads the tail of `agent-output.md` (last 2,000 chars) plus execution metadata
- Calls `simpleQuery()` with Haiku (maxTurns: 1, no tools) to produce a structured reflection under 200 words
- Writes the result to `.automaker/features/{id}/reflection.md`
- Emits a `feature:reflection:complete` event
Reflections from completed siblings are loaded during EXECUTE (see Trajectory Store above). This creates an in-project learning loop — each feature benefits from the last.
~$0.001 per reflection (Haiku, single turn, no tools). Fire-and-forget — failure does not block the state machine.
Reflection LLM calls are traced in Langfuse with:
- Tags: `feature:{id}`, `role:reflection`
- Metadata: `featureId`, `featureName`, `agentRole: 'reflection'`
Pattern-matches escalation reason strings to structured failure categories and recovery strategies.
When the Lead Engineer's ESCALATE state receives an escalation reason string (e.g., "Rate limit exceeded", "Tests failed after 3 retries"), the classifier maps it to a FailureCategory and suggests a RecoveryStrategy.
Called by EscalateProcessor.process() in the Lead Engineer state machine. The classified category determines:
- Whether to retry or escalate
- Which model tier to use on retry
- What context to inject into the agent prompt
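The classifier's pattern matching can be sketched as follows (a hypothetical table covering a few categories; the real service maps all ten categories and also suggests a recovery strategy):

```typescript
type FailureCategory =
  | 'transient'
  | 'rate_limit'
  | 'test_failure'
  | 'merge_conflict'
  | 'unknown';

// Illustrative pattern table — first match wins, unmatched reasons fall
// through to 'unknown' (which escalates to the user).
const PATTERNS: Array<[RegExp, FailureCategory]> = [
  [/rate limit|429/i, 'rate_limit'],
  [/timeout|ECONNRESET|socket hang up/i, 'transient'],
  [/tests? failed|build error/i, 'test_failure'],
  [/merge conflict/i, 'merge_conflict'],
];

function classify(reason: string): FailureCategory {
  for (const [pattern, category] of PATTERNS) {
    if (pattern.test(reason)) return category;
  }
  return 'unknown';
}
```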
Source: apps/server/src/services/failure-classifier-service.ts
On server startup, a non-blocking worktree scan detects stranded work from crashed agent sessions.
After resumeInterruptedFeatures() completes, scanWorktreesForCrashRecovery() runs via setImmediate():
- Lists all worktrees via `git worktree list --porcelain`
- Cross-references each worktree with its feature status
- For features in `verified`/`done` with uncommitted or unpushed work:
  - Commits stranded changes via `ensureCleanWorktree()`
  - Pushes to remote
  - Triggers `runPostCompletionWorkflow()` (PR creation)
- For features in other states with stranded work: logs a warning
- Emits `maintenance:crash_recovery_scan_completed` with a summary
- Server startup only (not a cron task)
- Non-blocking — does not delay server initialization or request handling
- Fire-and-forget — scan failures are logged but don't crash the server
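The scheduling pattern can be sketched as follows (a hypothetical wrapper; the real scan logic lives in maintenance-tasks.ts, and only the setImmediate + swallow-errors behavior comes from the text):

```typescript
// Hypothetical startup hook: the scan is deferred with setImmediate so it
// runs after startup work completes, and its errors are logged, never
// rethrown, so a failing scan cannot crash the server.
function scheduleCrashRecoveryScan(scan: () => Promise<void>): void {
  setImmediate(() => {
    scan().catch((err) => console.warn('crash recovery scan failed:', err));
  });
}
```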
Source: apps/server/src/services/maintenance-tasks.ts
When git operations (commit, push, PR creation) fail after agent execution, the error is stored on the feature for UI visibility.
```typescript
interface Feature {
  gitWorkflowError?: {
    message: string; // Error description
    timestamp: string; // ISO 8601 when the error occurred
  };
}
```

All 4 git workflow catch blocks in auto-mode-service.ts persist errors to feature.json instead of silently logging them. Feature status remains unchanged (e.g., stays verified) — the error field provides a separate visibility channel.
Source: libs/types/src/feature.ts, apps/server/src/services/auto-mode-service.ts
All reliability services emit events for real-time UI updates and audit logging:
| Service | Event prefix | Key events |
|---|---|---|
| RecoveryService | `recovery_*` | analysis, started, completed, recorded, escalated, lesson_generated |
| EscalationRouter | `escalation:*` | signal-received, deduplicated, sent, failed, routed, acknowledged |
| FeatureHealthService | (via auto-mode) | Issues surface through escalation events |
| AutoModeService | `feature:*` | status-changed, completed, error |
| Lead Engineer | `feature:*` | reflection:complete, pr-merged, state-changed |
| Maintenance | `maintenance:*` | crash_recovery_scan_completed |
Feature Execution Fails
↓
RecoveryService.analyzeFailure()
├── categorizeFailure() → FailureCategory
├── determineStrategy() → RecoveryStrategy
├── recordRecoveryAttempt() → JSONL log + emit events
└── checkAndGenerateLessons() (after 3+ failures)
└── Write failure-lessons-{category}.md to context/
↓
Recovery result: { success, shouldRetry, actionTaken }
├── If retryable → AutoModeService.retry() with injected context
└── If escalate → EscalationRouter.routeSignal()
├── Dedup check (30-min window)
├── Rate limit check (per-channel)
└── Send to registered channels
└── EscalationLogEntry recorded for audit trail
↓
(Parallel) FeatureHealthService.audit()
└── Check for drift: orphaned refs, stale running, merged branches
└── Auto-fix if enabled → Update feature status
↓
Lead Engineer State Machine
├── [EXECUTE] Load sibling reflections from trajectory store
├── [REMEDIATION] Inject failure context + review feedback
└── [ESCALATE] FailureClassifierService maps reason → category
- Agent Philosophy — Why the system is designed this way
- Idea to Production — Full pipeline with escalation points
- Langfuse Integration — Tracing and cost tracking