@@ -82,3 +82,201 @@ encore test
82 82 - Never use `console.log` (use `encore.dev/log`)
83 83 - Always include structured context
84 84
85+ ---
86+
87+ ## BUG-010 Case Study: Advanced Debugging Techniques
88+
89+ ### Diagnostic Scripts Arsenal
90+
91+ Created during the BUG-010 investigation (all in `backend/scripts/`):
92+
93+ 1. **`inspect-run.ts`** - Complete run event timeline
94+ ```bash
95+ bunx tsx backend/scripts/inspect-run.ts <runId>
96+ # Shows: events, graph outcomes, cursor state, run record
97+ ```
98+
99+ 2. **`check-agent-state.ts`** - Agent state snapshots
100+ ```bash
101+ bunx tsx backend/scripts/check-agent-state.ts <runId>
102+ # Shows: nodeName, status, counters, budgets, timestamps
103+ ```
104+
105+ 3. **`check-cursor-ordering.ts`** - Projector cursor health
106+ ```bash
107+ bunx tsx backend/scripts/check-cursor-ordering.ts
108+ # Reveals: cursor limit issues, stuck cursors, ordering problems
109+ ```
110+
111+ 4. **`find-completed-runs.ts`** / **`find-latest-run.ts`**
112+ ```bash
113+ bunx tsx backend/scripts/find-completed-runs.ts # Successful runs
114+ bunx tsx backend/scripts/find-latest-run.ts # Recent runs (any status)
115+ ```
116+
117+ 5. **`test-projector.ts`** - Isolated projector testing
118+ ```bash
119+ bunx tsx backend/scripts/test-projector.ts <runId>
120+ # Tests: cursor hydration, event fetch, screen projection
121+ ```
122+
123+ ### Git Forensics for Regressions
124+
125+ **Timeline Method:**
126+ ```bash
127+ # 1. Find last successful run
128+ bunx tsx backend/scripts/find-completed-runs.ts
129+ # Example: 01K9G8YXY6MG7J7875A5AM9Z4H at 2025-11-07 17:03
130+
131+ # 2. Find first failed run
132+ bunx tsx backend/scripts/find-latest-run.ts
133+ # Example: 01K9GDQF9JQFM8A4Q5WGMARPAT at 2025-11-07 18:26
134+
135+ # 3. Identify commits in regression window
136+ git log --oneline --since="Nov 7 17:00" --until="Nov 7 19:00"
137+
138+ # 4. Examine suspect commits
139+ git show <commit_hash> --stat # Files changed
140+ git show <commit_hash> <file_path> # Detailed diff
141+ git show <commit_hash>~1:<file_path> # Before version
142+ ```
143+
144+ **Binary Search Method:**
145+ ```bash
146+ git bisect start
147+ git bisect bad HEAD # Current broken state
148+ git bisect good <last_known_good_commit> # From timeline
149+ # Test each checked-out commit and mark it with `git bisect good` or `git bisect bad`
149+ # until the culprit is found (or automate with `git bisect run <test_command>`)
150+ git bisect reset # Exit bisect mode
151+ ```
152+
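The timeline method depends on knowing when the last good run and the first bad run happened. The example run IDs above have the shape of ULIDs (26 Crockford base32 characters), and if that assumption holds, the first 10 characters encode a millisecond timestamp that can be read without touching the database. The sketch below is illustrative only: `ulidTimestampMs` and `regressionWindow` are hypothetical helper names, not part of `backend/scripts/`.

```typescript
// Crockford base32 alphabet used by the ULID spec (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Decode the millisecond timestamp stored in the first 10 characters of a ULID.
// ASSUMPTION: run IDs are ULIDs; the source does not state this explicitly.
function ulidTimestampMs(ulid: string): number {
  if (ulid.length !== 26) {
    throw new Error(`expected 26 characters, got ${ulid.length}`);
  }
  let ms = 0;
  for (const ch of ulid.slice(0, 10).toUpperCase()) {
    const digit = CROCKFORD.indexOf(ch);
    if (digit < 0) throw new Error(`invalid ULID character: ${ch}`);
    ms = ms * 32 + digit; // 10 digits of base 32 stay below 2^53, so this is exact
  }
  return ms;
}

// Hypothetical helper: the UTC window between the last good and first bad run.
function regressionWindow(lastGoodRunId: string, firstBadRunId: string): { since: string; until: string } {
  return {
    since: new Date(ulidTimestampMs(lastGoodRunId)).toISOString(),
    until: new Date(ulidTimestampMs(firstBadRunId)).toISOString(),
  };
}

console.log(regressionWindow("01K9G8YXY6MG7J7875A5AM9Z4H", "01K9GDQF9JQFM8A4Q5WGMARPAT"));
```

The printed timestamps are in UTC, so they may need converting to local time before being adapted into the `--since` and `--until` values used in step 3.
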
153+ ### Database Query Analysis
154+
155+ **Stop Node Hang (BUG-010 Example):**
156+ ```typescript
157+ // PROBLEM: Query inside node execution blocks XState machine
158+ const rows = await db.query`SELECT COUNT(*) FROM graph_persistence_outcomes WHERE run_id = ${runId}`;
159+
160+ // SYMPTOMS:
161+ // - Worker times out after 30s lease
162+ // - Agent state shows "running" but stuck
163+ // - No "agent.node.finished" event emitted
164+
165+ // DIAGNOSIS:
166+ // 1. Check worker lease timeout logs
167+ // 2. Inspect agent state (last snapshot shows incomplete node)
168+ // 3. Test query in isolation (encore exec bunx tsx test-query.ts)
169+ // 4. Profile query execution time
170+
171+ // FIX:
172+ // Move heavy queries OUTSIDE critical execution path
173+ // Use lightweight operations in terminal nodes
174+ ```
175+
176+ ### Cursor Limit Investigation
177+
178+ **Projector Stalling Pattern:**
179+ ```typescript
180+ // SYMPTOM: Recent runs never get graph_persistence_outcomes
181+ // CHECK: backend/graph/projector.ts
182+ const CURSOR_LIMIT = 50; // ❌ Only processes the 50 oldest cursors
183+
184+ // DIAGNOSIS (shell):
185+ //   bunx tsx backend/scripts/check-cursor-ordering.ts
186+ //   Output: 75 total cursors, positions 51-75 never processed
187+
188+ // VALIDATION (SQL):
189+ //   SELECT COUNT(*) FROM graph_projection_cursors; -- Shows 75
190+ //   SELECT * FROM graph_projection_cursors ORDER BY updated_at ASC LIMIT 50; -- Oldest 50 (processed)
191+ //   SELECT * FROM graph_projection_cursors ORDER BY updated_at DESC LIMIT 10; -- Most recent (excluded)
192+
193+ // FIX (replaces the declaration above):
194+ //   const CURSOR_LIMIT = 200; // Scale with concurrent runs
195+ ```
196+
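To make the stalling pattern concrete, the following sketch mimics an `ORDER BY updated_at ASC LIMIT n` selection in memory. It is a simplified model rather than the projector's actual code: the `ProjectionCursor` shape and `partitionByLimit` are hypothetical, and it assumes a skipped cursor's `updated_at` never changes, so the same recent cursors are excluded on every pass.

```typescript
// Hypothetical, simplified shape of a row in graph_projection_cursors.
interface ProjectionCursor {
  runId: string;
  updatedAt: number; // epoch milliseconds
}

// Mimics `ORDER BY updated_at ASC LIMIT <limit>`:
// the oldest `limit` cursors are selected, everything newer is skipped.
function partitionByLimit(
  cursors: ProjectionCursor[],
  limit: number,
): { selected: ProjectionCursor[]; starved: ProjectionCursor[] } {
  const ordered = [...cursors].sort((a, b) => a.updatedAt - b.updatedAt);
  return { selected: ordered.slice(0, limit), starved: ordered.slice(limit) };
}

// 75 cursors, one per second; a higher index means a more recent run.
const sampleCursors: ProjectionCursor[] = Array.from({ length: 75 }, (_, i) => ({
  runId: `run-${i + 1}`,
  updatedAt: i * 1000,
}));

const atFifty = partitionByLimit(sampleCursors, 50);
const atTwoHundred = partitionByLimit(sampleCursors, 200);

// With a limit of 50, the 25 most recent cursors are never selected;
// with a limit of 200, all 75 are.
console.log(atFifty.starved.length, atTwoHundred.starved.length);
```

Raising the limit only moves the threshold, which is why the fix above notes that it should scale with the number of concurrent runs.
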
197+ ### Worker State Inspection
198+
199+ **Understanding Worker Lifecycle:**
200+ ```sql
201+ -- 1. Check run claim status
202+ SELECT processing_by, processing_started_at FROM runs WHERE run_id = '<runId>';
203+
204+ -- 2. Verify lease heartbeat
205+ -- Watch Encore logs for "extending lease" messages
206+
207+ -- 3. Inspect final disposition
208+ SELECT status, stop_reason FROM runs WHERE run_id = '<runId>';
209+ -- status=failed indicates worker crash/timeout before Stop node
210+ ```
211+
212+ ### Phase 11: Advanced Regression Analysis (NEW)
213+
214+ When standard phases 1-10 don't reveal the issue:
215+
216+ 1. **Compare successful vs failed run events side-by-side**
217+ ```bash
218+ diff <(bunx tsx backend/scripts/inspect-run.ts <good_run>) \
219+ <(bunx tsx backend/scripts/inspect-run.ts <bad_run>)
220+ ```
221+
222+ 2. **Identify missing events in sequence**
223+ - Successful run: 19 events (includes Stop at step 6)
224+ - Failed run: 15 events (stops at WaitIdle step 5)
225+ - Missing: `agent.node.started Stop`, `agent.run.finished`
226+
227+ 3. **Trace XState machine transitions**
228+ - Add logging to guards and actions in `agent.machine.factory.ts`
229+ - Monitor which guards evaluate true/false
230+ - Identify unexpected state transitions
231+
232+ 4. **Test node execution in isolation**
233+ ```typescript
234+ // scripts/test-node-isolation.ts
235+ import { stop } from "../agent/nodes/terminal/Stop/node";
236+ const input = { /* build input from failed run state */ };
237+ const result = await stop(input);
238+ console.log("Node output:", result);
239+ ```
240+
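Step 2 above (identifying missing events) can also be done programmatically once both timelines are available as lists of event names. The sketch below is one possible approach, not an existing script: `missingEvents` is a hypothetical helper, and the sample timelines are shortened, partly invented stand-ins for real `inspect-run.ts` output (only `agent.node.started Stop`, `agent.node.finished` and `agent.run.finished` come from this guide).

```typescript
// Returns events present in the good run but absent from the bad run.
// Occurrences are matched by count, so a repeated event that fired
// fewer times in the bad run is still reported.
function missingEvents(goodRun: string[], badRun: string[]): string[] {
  const remaining = new Map<string, number>();
  for (const name of badRun) {
    remaining.set(name, (remaining.get(name) ?? 0) + 1);
  }
  const missing: string[] = [];
  for (const name of goodRun) {
    const left = remaining.get(name) ?? 0;
    if (left > 0) {
      remaining.set(name, left - 1); // consume one matching occurrence
    } else {
      missing.push(name);
    }
  }
  return missing;
}

// Illustrative, shortened timelines (event names partly hypothetical).
const goodTimeline = [
  "agent.run.started",
  "agent.node.started WaitIdle",
  "agent.node.finished WaitIdle",
  "agent.node.started Stop",
  "agent.node.finished Stop",
  "agent.run.finished",
];
const badTimeline = goodTimeline.slice(0, 3); // stalls after WaitIdle

console.log(missingEvents(goodTimeline, badTimeline));
```

Unlike a plain `diff`, this ignores ordering and payload differences, so it is best used as a quick first pass before reading the side-by-side output.
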
241+ ### Common Backend Regression Patterns
242+
243+ | Issue | Symptom | Investigation | Common Cause |
244+ |-------|---------|---------------|--------------|
245+ | **Cursor Limit** | Recent runs stuck at seq=1 | `check-cursor-ordering.ts` | `CURSOR_LIMIT` too low |
246+ | **Node Hangs** | Agent state "running" indefinitely | `check-agent-state.ts` | DB query blocks execution |
247+ | **Lease Timeout** | Run fails after 30s | Worker logs, database `processing_by` | Heavy sync operations |
248+ | **Missing Events** | Timeline incomplete | `inspect-run.ts`, compare with baseline | Event not emitted or lost |
249+ | **State Machine Stuck** | No transitions after event | XState logs, guard evaluation | Guard logic error |
250+
251+ ### Lesson: Avoid Heavy Operations in Critical Path
252+
253+ **Bad Pattern (BUG-010):**
254+ ```typescript
255+ export async function stop(input: StopInput) {
256+ // ❌ DB query inside terminal node execution
257+ const rows = await db.query`SELECT COUNT(*) ...`;
258+ // If query hangs, entire machine stalls
259+ }
260+ ```
261+
262+ **Good Pattern:**
263+ ```typescript
264+ export async function stop(input: StopInput) {
265+ // ✅ Use pre-computed metrics from input
266+ const metrics = input.finalRunMetrics;
267+ // Terminal nodes must be lightweight and deterministic
268+ }
269+ ```
270+
271+ **Rationale:**
272+ - Terminal nodes finalize run state → must complete reliably
273+ - Heavy queries → post-run analytics layer
274+ - Critical path → optimized for latency; exact counts can come from pre-computed metrics or post-run analysis
275+
276+ ---
277+
278+ ## References
279+ - BUG-010 RCA: `jira/bugs/BUG-010-run-page-regressions/RCA.md`
280+ - Diagnostic Scripts: `backend/scripts/`
281+ - Encore Debugging: `backend_coding_rules.mdc`
282+