|
| 1 | +# Decision: Error Context Strategy in Commands |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Accepted |
| 6 | + |
| 7 | +## Date |
| 8 | + |
| 9 | +2025-10-06 |
| 10 | + |
| 11 | +## Context |
| 12 | + |
| 13 | +Commands need to persist error context when failures occur. We need to decide what information to store and where. |
| 14 | + |
| 15 | +### The Core Problem |
| 16 | + |
| 17 | +When a command like `ProvisionCommand` fails: |
| 18 | + |
| 19 | +- We need to know **what** failed (which step) |
| 20 | +- We need to know **why** it failed (error details) |
| 21 | +- We need **complete information** for debugging |
| 22 | +- We need **actionable guidance** for users |
| 23 | + |
| 24 | +### Key Requirements |
| 25 | + |
| 26 | +1. Type-safe error handling (not string-based) |
| 27 | +2. Complete error information (full error chain) |
| 28 | +3. No serialization constraints on error types |
| 29 | +4. Error history preservation (multiple attempts) |
| 30 | +5. Actionable user guidance |
| 31 | + |
| 32 | +## Decision |
| 33 | + |
| 34 | +**Use structured error context with independent trace files**. |
| 35 | + |
| 36 | +### Architecture |
| 37 | + |
| 38 | +**State File** (`data/{env}/state.json`) contains: |
| 39 | + |
| 40 | +- Enum-based context (type-safe) |
| 41 | +- Essential metadata (timing, step, kind) |
| 42 | +- Reference to trace file |
| 43 | + |
| 44 | +**Trace Files** (`data/{env}/traces/{timestamp}-{command}.log`) contain: |
| 45 | + |
| 46 | +- Complete error chain (all nested errors) |
| 47 | +- Root cause analysis |
| 48 | +- Suggested actions |
| 49 | +- Full debugging context |
| 50 | + |
| 51 | +### Key Innovation: `Traceable` Trait |
| 52 | + |
| 53 | +Instead of requiring errors to implement `Serialize`, we use a custom `Traceable` trait: |
| 54 | + |
| 55 | +```rust |
| 56 | +trait Traceable { |
| 57 | + fn trace_format(&self) -> String; |
| 58 | +} |
| 59 | +``` |
| 60 | + |
| 61 | +This allows: |
| 62 | + |
| 63 | +- Custom formatting per error type |
| 64 | +- No serialization constraints |
| 65 | +- Errors can contain non-serializable data (file handles, sockets, etc.) |
| 66 | + |
| 67 | +### Storage Details |
| 68 | + |
| 69 | +**Location**: `data/{env}/traces/` (not `build/`) |
| 70 | + |
| 71 | +- `data/` = internal application state |
| 72 | +- `build/` = generated artifacts (OpenTofu, Ansible configs) |
| 73 | + |
| 74 | +**Naming**: `{timestamp}-{command}.log` (timestamp first) |
| 75 | + |
| 76 | +- Example: `20251003-103045-provision.log` |
| 77 | +- Chronological sorting by default |
| 78 | +- Easy to find latest/oldest |
| 79 | + |
| 80 | +**Generation**: Only on command failure (not success) |
| 81 | + |
| 82 | +- Traces are for debugging failures |
| 83 | +- Success cases don't need error details |
| 84 | +- Keeps trace directory focused |
| 85 | + |
| 86 | +## Rationale |
| 87 | + |
| 88 | +### Why Separate Trace Files? |
| 89 | + |
| 90 | +✅ **No Serialization Constraints** |
| 91 | + |
| 92 | +- Error types don't need `Serialize + Deserialize` |
| 93 | +- Can use custom `Traceable` trait instead |
| 94 | +- Errors can contain non-serializable data |
| 95 | + |
| 96 | +✅ **Complete Error History** |
| 97 | + |
| 98 | +- Multiple failures preserved |
| 99 | +- Not overwritten on retry |
| 100 | +- Full audit trail maintained |
| 101 | + |
| 102 | +✅ **Type-Safe Context** |
| 103 | + |
| 104 | +- Pattern matching on enums |
| 105 | +- Compile-time guarantees |
| 106 | +- No string typos |
| 107 | + |
| 108 | +✅ **Independent Concern** |
| 109 | + |
| 110 | +- Trace generation separate from state management |
| 111 | +- Any command can generate traces |
| 112 | +- Flexible implementation |
| 113 | + |
| 114 | +### Why `data/` Directory? |
| 115 | + |
| 116 | +The `data/` vs `build/` distinction is important: |
| 117 | + |
| 118 | +- **`data/{env}/`** = Internal application state |
| 119 | + |
| 120 | + - `state.json` - current environment state |
| 121 | + - `traces/` - error trace history |
| 122 | + - Managed by the application |
| 123 | + - Should be backed up |
| 124 | + |
| 125 | +- **`build/{env}/`** = Generated artifacts |
| 126 | + - OpenTofu `.tf` files |
| 127 | + - Ansible inventory files |
| 128 | + - Can be regenerated from templates |
| 129 | + - Safe to delete |
| 130 | + |
| 131 | +### Why Timestamp-First Naming? |
| 132 | + |
| 133 | +`{timestamp}-{command}.log` format enables: |
| 134 | + |
| 135 | +```bash |
| 136 | +# List all traces chronologically |
| 137 | +ls data/e2e-full/traces/ |
| 138 | +20251003-103045-provision.log |
| 139 | +20251003-105230-provision.log |
| 140 | +20251003-110015-configure.log |
| 141 | + |
| 142 | +# Find latest trace |
| 143 | +ls -1 data/e2e-full/traces/ | tail -1 |
| 144 | + |
| 145 | +# Find all provision failures |
| 146 | +ls data/e2e-full/traces/*-provision.log |
| 147 | +``` |
| 148 | + |
| 149 | +## Alternatives Considered |
| 150 | + |
| 151 | +### 1. String-Based Context (Current Phase 5 Implementation) |
| 152 | + |
| 153 | +```rust |
| 154 | +pub struct ProvisionFailed { |
| 155 | + failed_step: String, |
| 156 | +} |
| 157 | +``` |
| 158 | + |
| 159 | +**Pros**: Simple, easy to implement |
| 160 | +**Cons**: No type safety, no error details, typo-prone |
| 161 | +**Verdict**: Good for MVP, insufficient for production |
| 162 | + |
| 163 | +### 2. Full Error Serialization |
| 164 | + |
| 165 | +```rust |
| 166 | +pub struct ProvisionFailed { |
| 167 | + error: ProvisionCommandError, // Requires Serialize |
| 168 | +} |
| 169 | +``` |
| 170 | + |
| 171 | +**Pros**: Complete information, type-safe |
| 172 | +**Cons**: Forces all errors to be serializable, tight coupling |
| 173 | +**Verdict**: Too restrictive |
| 174 | + |
| 175 | +### 3. Captured Logs |
| 176 | + |
| 177 | +```rust |
| 178 | +pub struct ProvisionFailureContext { |
| 179 | + captured_logs: Vec<LogRecord>, |
| 180 | +} |
| 181 | +``` |
| 182 | + |
| 183 | +**Pros**: Complete execution context |
| 184 | +**Cons**: Complex, indirect, format-dependent |
| 185 | +**Verdict**: Error trace is more direct |
| 186 | + |
| 187 | +### 4. Error Reference (Trace ID Only) |
| 188 | + |
| 189 | +```rust |
| 190 | +pub struct ProvisionFailureContext { |
| 191 | + trace_id: String, // Search logs manually |
| 192 | +} |
| 193 | +``` |
| 194 | + |
| 195 | +**Pros**: Minimal state |
| 196 | +**Cons**: Requires log retention, not self-contained |
| 197 | +**Verdict**: Trace files provide better UX |
| 198 | + |
| 199 | +## Consequences |
| 200 | + |
| 201 | +### Positive |
| 202 | + |
| 203 | +✅ Type-safe error handling with enums |
| 204 | +✅ Complete error information preserved |
| 205 | +✅ No serialization constraints via `Traceable` trait |
| 206 | +✅ Error history maintained |
| 207 | +✅ Better user experience |
| 208 | +✅ Independent trace management |
| 209 | + |
| 210 | +### Negative |
| 211 | + |
| 212 | +⚠️ Additional files to manage |
| 213 | +⚠️ Two-phase storage (state + traces) |
| 214 | +⚠️ Initial implementation effort (~10-13 hours) |
| 215 | + |
| 216 | +### Neutral |
| 217 | + |
| 218 | +ℹ️ Will need trace cleanup policy eventually |
| 219 | +ℹ️ Requires writable filesystem |
| 220 | + |
| 221 | +## Implementation Plan |
| 222 | + |
| 223 | +This will be implemented in a **separate refactor after Phase 5**: |
| 224 | + |
| 225 | +1. Complete Phase 5 with string-based approach |
| 226 | +2. Define `ProvisionFailureContext` struct with enums |
| 227 | +3. Define `TraceId` newtype (wrapping `Uuid`) for type safety |
| 228 | +4. Define `Traceable` trait for error formatting |
| 229 | +5. Implement trace file writer with `PathBuf` for file paths |
| 230 | +6. Update commands to generate traces on failure |
| 231 | +7. Add tests and documentation |
| 232 | + |
| 233 | +**Refactor plan**: `docs/refactors/error-context-with-trace-files.md` |
| 234 | + |
| 235 | +## Related Decisions |
| 236 | + |
| 237 | +- [Command State Return Pattern](./command-state-return-pattern.md) |
| 238 | +- [Type Erasure for Environment States](./type-erasure-for-environment-states.md) |
| 239 | +- [Actionable Error Messages](./actionable-error-messages.md) |
| 240 | + |
| 241 | +## Summary |
| 242 | + |
| 243 | +We chose **structured context + independent trace files** because it: |
| 244 | + |
| 245 | +1. Provides type-safe pattern matching (enums) |
| 246 | +2. Captures complete error chains (`Traceable` trait, not `Serialize`) |
| 247 | +3. Maintains lightweight state files |
| 248 | +4. Preserves error history |
| 249 | +5. Stores traces in `data/` (internal state), not `build/` (artifacts) |
| 250 | +6. Uses timestamp-first naming for easy sorting |
| 251 | +7. Generates traces only on failure |
| 252 | + |
| 253 | +This balances simplicity, completeness, and usability better than alternatives. |
0 commit comments