# Debugging Roachtest Job Failures

Roachtests that run specific jobs verify both successful completion and correct behavior, including
performance characteristics. This guide provides a systematic approach to investigating job-related
test failures.

## Find the Job ID

Identifying the specific job ID is critical for investigation. If the job ID is not found in
`test.log`, search `debug/system.jobs.txt` for the relevant job type:
- `c2c/*` (cluster-to-cluster) → `STREAM INGESTION` (in `dest_debug`).
- `ldr/*` → `LOGICAL REPLICATION`.
- `cdc/*`, `changefeed/*` → `CHANGEFEED`.
- `backup/*` → `BACKUP`, `restore/*` → `RESTORE`.
- `import/*` → `IMPORT`.
- `schemachange/*` → `SCHEMA CHANGE`.

Note for c2c tests: the cluster is split into source and destination clusters, with separate debug
folders `source_debug/` and `dest_debug/`.

Verify the job is correctly identified before proceeding:
- If multiple jobs are found: stop and ask which job ID is relevant for this test failure.
- If no jobs are found: stop and ask if this test involves a different job type not listed above.
- Record the identified job ID for reference throughout the investigation.

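For example, a quick grep over the dumped jobs table usually surfaces candidate job IDs. A minimal
sketch, assuming a restore test and that the `debug/` dump sits under the fetched artifacts:

```bash
# Swap RESTORE for the job type from the list above.
grep "RESTORE" PATH/TO/FETCHED/artifacts/debug/system.jobs.txt
```
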
## General Investigation Approach

Document all observations with citations and references to their locations in the test output.
Create an investigation.md file to track findings systematically.

### Collect job-related logs

Create a focused, combined log containing all entries from all nodes that mention the job ID:

```bash
# cut -c2- drops the leading severity letter so the merged lines sort by timestamp.
grep -h "JOB_ID" PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log | cut -c2- | sort > job_JOB_ID.log
```

NB: Use the unredacted logs, not the redacted ones in `artifacts/logs/*.cockroach.log`, or you may
miss details.

Review these logs for unusual patterns, error messages, retries, and notable events and their
timestamps.

### Review system tables

Examine these system tables for job-specific information:

- `debug/crdb_internal.jobs.txt`: job status and errors.
  - Error messages are almost always significant.
  - Start and end times help build a timeline.
- `debug/system.job_message.txt`: status messages recorded by the job.
  - Include relevant messages in your investigation timeline.
- `debug/system.job_progress_history.txt`: historical progress values sorted by time.
  - Some jobs track "fraction_completed" progress; others use "resolved" timestamp progress.
  - Jobs may switch between progress types during execution.
  - Analyze progress patterns: is it linear? does it stall? are there inflection points?
  - For jobs that time out or get stuck, use `./scripts/job-progress-plot <job-id> <path>` to
    generate an ASCII plot and insert it in your investigation document below the timeline.
  - Identify periods with different progress rates and document them in your timeline:
    - Sudden changes: X%/min immediately shifts to Y%/min at time T.
    - Gradual changes: progress slows from X%/min at time T1 to Y%/min by time T2 over N minutes.
  - When rate changes occur (sudden or gradual), investigate what else happened around those
    times and document correlations in the timeline.
  - Compare before/after periods: examine node distribution, retry patterns, flow control
    blocking, and infrastructure events to identify what differs between fast and slow periods.
- `debug/jobs/<jobID>/*`: sometimes contains zipped-up job trace data.

Note: C2C tests have separate system tables for source and destination clusters. The destination
cluster typically contains the most relevant events.

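To pull just the relevant rows out of these dumps, grep for the job ID; the paths below assume the
fetched-artifacts layout used in the log command above (for c2c tests, substitute the
`source_debug`/`dest_debug` folders as appropriate):

```bash
# Hypothetical one-liners; replace JOB_ID and adjust paths to your artifact layout.
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/crdb_internal.jobs.txt
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/system.job_message.txt
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/system.job_progress_history.txt
```
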
### Investigation strategies

- Never assume job success means correct behavior; tests exist because jobs can succeed while
  behaving incorrectly.
- If a job succeeds but the test fails, investigate both the job behavior and the test logic.
- Retries and replannings:
  - Jobs with retry loops may "get stuck" if an underlying issue causes them to retry without
    making progress.
  - A series of retries without progress is an indicator that something might be wrong _even
    though the job might not fail outright_.
- Investigate progress stalls: when a job stops making progress, examine events around that time
  (a quick way to spot bursts is sketched after this list):
  - Increased retry attempts.
  - Logged errors.
  - Node failures or restarts.
  - KV range splits or other cluster events.
- If you find yourself digging into a slow job, look for a job trace zip.
  - The combined trace has a count of, and the cumulative time spent per, operation.
  - Note: many jobs run many processors, each with many workers that perform operations
    concurrently; cumulative time often exceeds wall time.
  - For distributed jobs, coordination nodes aggregate cumulative times from all worker nodes; a
    single node showing 100+ hours may actually represent work distributed across many nodes.
- Many jobs use a "frontier" to track progress happening across many tasks, spans, ranges, etc.
  - The frontier tracks and reports the _furthest behind_ task or span, so if the "frontier is
    lagging" it means a span is behind, not that there is a problem with the frontier itself.

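One rough way to spot such bursts or stalls, assuming the combined `job_JOB_ID.log` built earlier
(the leading severity letter was stripped, so each line starts with the date and time):

```bash
# Count job-related log lines per minute; spikes often line up with retry loops or
# error bursts, and gaps often line up with stalls.
awk '{print $1, substr($2, 1, 5)}' job_JOB_ID.log | uniq -c
```
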
Timestamp handling: timestamps appear in two different formats:
- `<seconds>.<fractional-seconds>,<logical>`.
- `<nanoseconds>.<logical>`.

When searching logs, use only the leading (seconds) digits so that you match both formats. Keep
your notes consistent by using one format throughout.

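For example, a (made-up) timestamp of `1755123456.000000000,1` in the first format may appear as
`1755123456000000000.1` in the second; searching on just the leading seconds digits matches both:

```bash
# Match either timestamp format by searching only the seconds digits.
grep "1755123456" job_JOB_ID.log
```
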
### Processor Start Messages

Most distributed jobs use "processors" that run on multiple nodes to execute work in parallel.

Analyzing processor distribution and behavior can be key to spotting issues in work distribution,
and can signal replannings or retry loops if those aren't explicitly logged already.

Processor log messages are already in the job-focused log, since they are tagged with the job ID.

Processor start messages indicate that a job was planned or replanned: if that (re)planning wasn't
otherwise observed in the logs, it is often worth noting in the timeline (along with whether the
start message is seen on all nodes or just some).

Sometimes these messages include the number of spans or work units assigned; if this is not
relatively balanced across nodes, or if some nodes do not report starting processors when others
do, that is notable.

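To get a rough picture of how work was distributed, count processor start messages per node. The
exact message text varies by job type, so the `processor` pattern below is only a placeholder to
adapt:

```bash
# Per-node counts of processor-related messages for the job; uneven counts, or nodes
# with zero, are worth noting in the timeline.
for f in PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log; do
  echo "$f $(grep "JOB_ID" "$f" | grep -ci "processor")"
done
```
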
### Data Writing Job Behavior

- Jobs that write data, such as restore, import, and schema changes, are expected to perform many
  splits, particularly when they begin.
- Heavy splitting during initial phases is normal and expected behavior.
- `SSTable cannot be added spanning range bounds` errors are automatically handled by retrying
  each half of the SST.
- These errors are unlikely to be the cause of a failure and are usually benign.
- However, if a very high proportion of all SSTable requests hit spanning errors, it suggests the
  SSTable-adding process is hitting splits it did not expect, which can slow down the addition of
  SSTables (a quick check is sketched below).

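For a rough sense of proportion, count these retries in the combined job log; the pattern is taken
from the error text quoted above, which may be phrased slightly differently in some versions:

```bash
# Occasional hits are benign; a count rivaling the job's overall AddSSTable volume
# suggests unexpected splits are slowing ingestion.
grep -c "cannot be added spanning range bounds" job_JOB_ID.log
```
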
## Job-Specific Investigation Approaches

This section is continuously evolving. If your job type isn't covered here:
- Ask for specific guidance on what to investigate.
- Document useful patterns you discover for future additions to this guide.

### Backup and Restore Jobs (backup/* and restore/* tests)

For BACKUP and RESTORE jobs, always record these key infrastructure details in your investigation:

- **Cloud provider**: GCE, AWS, Azure, etc.
- **Cluster region(s)/zone(s)**: where the CockroachDB nodes are located.
- **Bucket location**: the cloud storage bucket region/location.
- **Cross-region setup**: note whether the cluster and bucket are in different regions.

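The storage URI in the job's description is usually the quickest way to confirm the provider and
bucket. A minimal sketch, assuming the dumped jobs table includes the full BACKUP/RESTORE statement
and the usual `gs`/`s3`/`azure` URI schemes; cluster regions/zones generally come from the cluster
spec in test.log:

```bash
# Extract storage URIs from the job description to identify the provider and bucket.
grep -oE "(gs|s3|azure[a-z-]*)://[^ ']+" PATH/TO/FETCHED/artifacts/debug/crdb_internal.jobs.txt | sort -u
```
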
### Physical Replication Stream Ingestion Jobs (c2c/* tests)

Job phases:
1. Initial scan: copies all data as of the initial scan timestamp.
2. Replication: applies streamed changes, updating the replicated time.
3. Cutover: rewinds applied changes to the chosen cutover timestamp (≤ replicated time).

Key terminology:
- Replicated time ("resolved timestamp" internally): the highest time to which cutover is allowed.
- "Lag"/"latency": how far the replicated time trails behind the current time.
- Retained time: a protected timestamp trailing resolved time; the earliest allowed cutover time.

Expected behaviors:
- Initial scan phase: replicated time remains zero; lag shows as "thousands of years" (normal).
- Replication phase:
  - Replicated/resolved time should steadily increase, or "advance", when running normally.
  - If resolved time is increasing, lag remains small and constant (usually ~30 seconds).
  - Minutes or more of lag suggests replication is behind and trying to catch up.
  - High *and steadily increasing* lag suggests replication has stalled and may be stuck.
  - NB: constant lag == increasing resolved time; constant resolved time == increasing lag.
- Cutover phase: progress shows as the fraction of the rewind completed.
- Job completion: when the cutover to the target timestamp finishes.

Replication components:
- Source side: stream producer (event_stream and rangefeeds).
- Destination side: stream ingestion job (stream ingestion processor, AddSSTable).

C2C-specific checklists:
1. Augment the investigation timeline to document:
   - Initial scan start/completion and the value of the initial scan timestamp.
   - Replication phase.
   - Cutover start/completion, and the cutover timestamp.
   - Key replicated/resolved timestamp values around any notable events.
   Be sure to note these phases and which phase the job was in when the test failed.

2. Resolved progress and lag analysis:
   Look at the progress history, the logged resolved timestamps, and the logged lag observations.

   During the replication phase:
   - Are there periods where lag increases?
   - Is resolved time advancing consistently, or does it stall at a constant value for extended
     periods?
   - Do stalls correlate with other cluster events or errors?

   Update the timeline with any observations, particularly periods of constant vs. increasing lag,
   or sudden changes in resolved time.

3. Component investigation: if replication stalls, examine both source and destination components
   for issues.

4. Understanding frontier stalls: if resolved time stops advancing during replication:
   - The frontier algorithm takes the minimum timestamp across all spans.
   - A stalled frontier indicates one or more spans are stuck, not a coordination failure.
   - Look for patterns in which nodes are "lagging behind the frontier"; this indicates which spans
     are stuck.
   - Check for replanning events: frequent replanning (search "replanning") suggests persistently
     stuck spans.
   - Examine AdminSplit operations and SSTable retries on specific tables that correlate with stuck
     spans.
   (A couple of these checks are sketched after this checklist.)

5. Data validation: tests typically compare fingerprints between clusters.
   - Fingerprint mismatches indicate data corruption or replication errors.
   - "Start timestamp greater than end timestamp" errors mean the cutover time was below the
     retained time.
   - The start time, i.e. the retained time, should usually be minutes or hours before the end
     time.
   - This is usually caused by extreme replication lag; re-examine the replication phase more
     closely.

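Hedged sketches for a couple of the frontier-stall checks in item 4, run against the combined job
log built earlier; exact log phrasings vary between versions, so treat these patterns as starting
points:

```bash
# Frequent replanning suggests persistently stuck spans.
grep -ci "replanning" job_JOB_ID.log

# Which spans/nodes are reported as lagging behind the frontier?
grep -i "lagging" job_JOB_ID.log | tail -n 20
```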