Skip to content

Commit 698d85a

Browse files
committed
docs: add ADR and refactor plan for error context with trace files
Adds comprehensive documentation for improving error handling: Architecture Decision: - Use structured error context (enums) in state files - Generate independent trace files only on command failure - Custom Traceable trait (not Serialize) for error formatting - Store traces in data/{env}/traces/ (not build/) - Use {timestamp}-{command}.log naming for chronological sorting Key Improvements: - TraceId newtype wrapping Uuid for type safety - PathBuf for trace_file_path (not String) - Type-safe pattern matching with enums - Complete error chains via Traceable trait - No serialization constraints on error types - Error history preserved (multiple trace files) Implementation: - Will be done in separate refactor after Phase 5 completion - Estimated 13 hours (~2 days) of work - Backward compatible with current string-based approach Related: - Supersedes simple string-based error context from Phase 5 - Builds on Command State Return Pattern ADR - Aligns with Error Handling principles (observability, actionability)
1 parent 524ede6 commit 698d85a

File tree

3 files changed

+624
-0
lines changed

3 files changed

+624
-0
lines changed

docs/decisions/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ This directory contains architectural decision records for the Torrust Tracker D
66

77
| Status | Date | Decision | Summary |
88
| ----------- | ---------- | --------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
9+
| ✅ Accepted | 2025-10-03 | [Error Context Strategy](./error-context-strategy.md) | Use structured error context with trace files for complete error information |
910
| ✅ Accepted | 2025-10-03 | [Command State Return Pattern](./command-state-return-pattern.md) | Commands return typed states (Environment<S> → Environment<T>) for compile-time safety |
1011
| ✅ Accepted | 2025-10-03 | [Actionable Error Messages](./actionable-error-messages.md) | Use tiered help system with brief tips + .help() method for detailed troubleshooting |
1112
| ✅ Accepted | 2025-10-01 | [Type Erasure for Environment States](./type-erasure-for-environment-states.md) | Use enum-based type erasure to enable runtime handling and serialization of typed states |
Lines changed: 253 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,253 @@
1+
# Decision: Error Context Strategy in Commands
2+
3+
## Status
4+
5+
Accepted
6+
7+
## Date
8+
9+
2025-10-06
10+
11+
## Context
12+
13+
Commands need to persist error context when failures occur. We need to decide what information to store and where.
14+
15+
### The Core Problem
16+
17+
When a command like `ProvisionCommand` fails:
18+
19+
- We need to know **what** failed (which step)
20+
- We need to know **why** it failed (error details)
21+
- We need **complete information** for debugging
22+
- We need **actionable guidance** for users
23+
24+
### Key Requirements
25+
26+
1. Type-safe error handling (not string-based)
27+
2. Complete error information (full error chain)
28+
3. No serialization constraints on error types
29+
4. Error history preservation (multiple attempts)
30+
5. Actionable user guidance
31+
32+
## Decision
33+
34+
**Use structured error context with independent trace files**.
35+
36+
### Architecture
37+
38+
**State File** (`data/{env}/state.json`) contains:
39+
40+
- Enum-based context (type-safe)
41+
- Essential metadata (timing, step, kind)
42+
- Reference to trace file
43+
44+
**Trace Files** (`data/{env}/traces/{timestamp}-{command}.log`) contain:
45+
46+
- Complete error chain (all nested errors)
47+
- Root cause analysis
48+
- Suggested actions
49+
- Full debugging context
50+
51+
### Key Innovation: `Traceable` Trait
52+
53+
Instead of requiring errors to implement `Serialize`, we use a custom `Traceable` trait:
54+
55+
```rust
56+
trait Traceable {
57+
fn trace_format(&self) -> String;
58+
}
59+
```
60+
61+
This allows:
62+
63+
- Custom formatting per error type
64+
- No serialization constraints
65+
- Errors can contain non-serializable data (file handles, sockets, etc.)
66+
67+
### Storage Details
68+
69+
**Location**: `data/{env}/traces/` (not `build/`)
70+
71+
- `data/` = internal application state
72+
- `build/` = generated artifacts (OpenTofu, Ansible configs)
73+
74+
**Naming**: `{timestamp}-{command}.log` (timestamp first)
75+
76+
- Example: `20251003-103045-provision.log`
77+
- Chronological sorting by default
78+
- Easy to find latest/oldest
79+
80+
**Generation**: Only on command failure (not success)
81+
82+
- Traces are for debugging failures
83+
- Success cases don't need error details
84+
- Keeps trace directory focused
85+
86+
## Rationale
87+
88+
### Why Separate Trace Files?
89+
90+
**No Serialization Constraints**
91+
92+
- Error types don't need `Serialize + Deserialize`
93+
- Can use custom `Traceable` trait instead
94+
- Errors can contain non-serializable data
95+
96+
**Complete Error History**
97+
98+
- Multiple failures preserved
99+
- Not overwritten on retry
100+
- Full audit trail maintained
101+
102+
**Type-Safe Context**
103+
104+
- Pattern matching on enums
105+
- Compile-time guarantees
106+
- No string typos
107+
108+
**Independent Concern**
109+
110+
- Trace generation separate from state management
111+
- Any command can generate traces
112+
- Flexible implementation
113+
114+
### Why `data/` Directory?
115+
116+
The `data/` vs `build/` distinction is important:
117+
118+
- **`data/{env}/`** = Internal application state
119+
120+
- `state.json` - current environment state
121+
- `traces/` - error trace history
122+
- Managed by the application
123+
- Should be backed up
124+
125+
- **`build/{env}/`** = Generated artifacts
126+
- OpenTofu `.tf` files
127+
- Ansible inventory files
128+
- Can be regenerated from templates
129+
- Safe to delete
130+
131+
### Why Timestamp-First Naming?
132+
133+
`{timestamp}-{command}.log` format enables:
134+
135+
```bash
136+
# List all traces chronologically
137+
ls data/e2e-full/traces/
138+
20251003-103045-provision.log
139+
20251003-105230-provision.log
140+
20251003-110015-configure.log
141+
142+
# Find latest trace
143+
ls -1 data/e2e-full/traces/ | tail -1
144+
145+
# Find all provision failures
146+
ls data/e2e-full/traces/*-provision.log
147+
```
148+
149+
## Alternatives Considered
150+
151+
### 1. String-Based Context (Current Phase 5 Implementation)
152+
153+
```rust
154+
pub struct ProvisionFailed {
155+
failed_step: String,
156+
}
157+
```
158+
159+
**Pros**: Simple, easy to implement
160+
**Cons**: No type safety, no error details, typo-prone
161+
**Verdict**: Good for MVP, insufficient for production
162+
163+
### 2. Full Error Serialization
164+
165+
```rust
166+
pub struct ProvisionFailed {
167+
error: ProvisionCommandError, // Requires Serialize
168+
}
169+
```
170+
171+
**Pros**: Complete information, type-safe
172+
**Cons**: Forces all errors to be serializable, tight coupling
173+
**Verdict**: Too restrictive
174+
175+
### 3. Captured Logs
176+
177+
```rust
178+
pub struct ProvisionFailureContext {
179+
captured_logs: Vec<LogRecord>,
180+
}
181+
```
182+
183+
**Pros**: Complete execution context
184+
**Cons**: Complex, indirect, format-dependent
185+
**Verdict**: Error trace is more direct
186+
187+
### 4. Error Reference (Trace ID Only)
188+
189+
```rust
190+
pub struct ProvisionFailureContext {
191+
trace_id: String, // Search logs manually
192+
}
193+
```
194+
195+
**Pros**: Minimal state
196+
**Cons**: Requires log retention, not self-contained
197+
**Verdict**: Trace files provide better UX
198+
199+
## Consequences
200+
201+
### Positive
202+
203+
✅ Type-safe error handling with enums
204+
✅ Complete error information preserved
205+
✅ No serialization constraints via `Traceable` trait
206+
✅ Error history maintained
207+
✅ Better user experience
208+
✅ Independent trace management
209+
210+
### Negative
211+
212+
⚠️ Additional files to manage
213+
⚠️ Two-phase storage (state + traces)
214+
⚠️ Initial implementation effort (~10-13 hours)
215+
216+
### Neutral
217+
218+
ℹ️ Will need trace cleanup policy eventually
219+
ℹ️ Requires writable filesystem
220+
221+
## Implementation Plan
222+
223+
This will be implemented in a **separate refactor after Phase 5**:
224+
225+
1. Complete Phase 5 with string-based approach
226+
2. Define `ProvisionFailureContext` struct with enums
227+
3. Define `TraceId` newtype (wrapping `Uuid`) for type safety
228+
4. Define `Traceable` trait for error formatting
229+
5. Implement trace file writer with `PathBuf` for file paths
230+
6. Update commands to generate traces on failure
231+
7. Add tests and documentation
232+
233+
**Refactor plan**: `docs/refactors/error-context-with-trace-files.md`
234+
235+
## Related Decisions
236+
237+
- [Command State Return Pattern](./command-state-return-pattern.md)
238+
- [Type Erasure for Environment States](./type-erasure-for-environment-states.md)
239+
- [Actionable Error Messages](./actionable-error-messages.md)
240+
241+
## Summary
242+
243+
We chose **structured context + independent trace files** because it:
244+
245+
1. Provides type-safe pattern matching (enums)
246+
2. Captures complete error chains (`Traceable` trait, not `Serialize`)
247+
3. Maintains lightweight state files
248+
4. Preserves error history
249+
5. Stores traces in `data/` (internal state), not `build/` (artifacts)
250+
6. Uses timestamp-first naming for easy sorting
251+
7. Generates traces only on failure
252+
253+
This balances simplicity, completeness, and usability better than alternatives.

0 commit comments

Comments
 (0)