
Commit a6a632c

Committed by craig[bot], dt, and paulniziolek

152443: docs: add roachtest investigation tech notes r=dt a=dt

These documentation files are intended to be readable and usable by both AI and humans, focusing on
a repeatable, process-based approach to understanding what happened during roachtest failures. The
guides emphasize systematic data collection and analysis rather than jumping to conclusions.

Release note: none
Epic: none

152453: sql: fix first contains/contained by array op null panic r=paulniziolek a=paulniziolek

#### sql: fix first contains/contained by array op null panic

We encountered an error in the binary op evaluator for NULL values in an array with the first
contains (`?@>`) and first contained by (`?<@`) operators for ltree. This is now handled, returning
the appropriate `pgerror`.

Fixes: #152323
Epic: None
Release note: None

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Paul Niziolek <[email protected]>

3 parents 4582d77 + 84ea447 + e9de003 commit a6a632c

8 files changed, +593 -0 lines changed
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Roachtest Investigation Tips

This directory contains guides for investigating common roachtest failures.

## General Investigation Workflow

1. **Gather artifacts** using `./scripts/fetch-roachtest-artifacts`.

2. **Identify the failure type** from test logs and error messages.

3. **Follow the appropriate guide** based on the failure type:
   - [**Debugging Job Failures**](debugging-a-job.md) - How to investigate failed or misbehaving
     jobs (backups, imports, CDC, etc.).
   - [**Node Exits, Crashes, and OOMs**](exits_crashes_ooms.md) - Investigating unexpected node
     exits and system-level failures.
   - [**Workload Failures**](workload-failure.md) - How to investigate failures originating from
     workload commands (use in concert with other guides).
   - [**Synthetic Failure Injection**](synthetic-failure-injection.md) - Investigating tests with
     network partitions, node kills, and other chaos engineering patterns (use in concert with
     other guides).

4. **Document findings**
   Edit and extend `investigation.md` to capture your findings:
   - Include citations to specific file locations or log files, or the grep/awk/etc. command used
     to find them.
   - Update or add to the unified timeline as described below.
   - Before adding new sections to the investigation notes, review existing sections to see if
     there are already sections that should be amended or expanded instead.

5. **Review the code**: based on the observations above, perform a cursory review of the code
   involved in the errors, jobs, etc. observed:
   - What are the major components, key functions, etc.?
   - Are there conditionals or checks, settings or comments in the code that seem related?
   - What is the test trying to test? What is it doing to test that?
   - Map the call chain, request senders, and component layers leading up to the site of failure to
     determine which layer or component was intended to be responsible for ensuring the condition
     that was ultimately violated.
   - In particular, is there logic specific to the condition that apparently failed? Or is it
     missing?

6. **Search git history** for recent changes related to the failure (see the combined example after
   this list):
   - Check recent changes to the test: `git log --oneline -n 10 pkg/path/to/test`.
   - Check recent changes to relevant packages: `git log --oneline -n 20 pkg/path/to/relevant/package/`.
   - Look for related setting/feature/error names: `git log -S "keyword" --oneline`.
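
For example, a sketch combining these; both paths are placeholders for the actual test and package
under investigation, and the `--since` window is just one way to narrow the search:

```bash
# Placeholder paths: list recent commits touching either the test or a suspect package.
git log --oneline --since="2 weeks ago" -- pkg/path/to/test pkg/path/to/relevant/package/
```
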
## Timeline Hygiene

Maintain a single, unified timeline for ease of reference throughout the investigation.
- Use a consistent format: one line per entry, each prefixed with HH:MM:SS in UTC.
- Do not create separate timelines for different components or log sources.
- All events should be integrated into one chronological sequence.
- Don't put too much detail in an entry; leave that to the detailed observations in the rest of the
  document.
  - For example, say you found hundreds of AddSSTable errors in the logs over 30 minutes during
    which a job makes no progress. Your timeline might show:
    `08:55:24 - job 456 records last progress change to 22% before stalling there for 34mins`
    `08:55:27 - first AddSSTable 'Command too large' error appears in logs on n2`
    `08:55:27-09:28:04 AddSSTable 'Command too large' errors on all nodes for 33mins`
    And there would be a section later in the document describing the AddSSTable errors with
    complete examples, their number, the time period and locations observed, etc.
- Distinguish test framework actions (network partitions, node kills) from cluster behavior
  responses.
- Note periods of baseline/normal behavior before or after issues to establish what changed, e.g.
  if a job progresses steadily at 1.5%/min for 20 min before stalling.
- Align test.log steps with node logs and system events in the same timeline.

## Systematic Event Correlation Analysis

When investigating performance changes or failures, use systematic correlation to identify what
differs between working and non-working periods:

- **Identify transition periods**: Look for times when behavior changes (e.g., job progress rate
  shifts, workload performance degrades, error patterns emerge).
- **Define before/after windows**: Establish specific time ranges for comparison (e.g., "at 11:32
  we're going fast and at 11:37 we're slow"); see the sketch at the end of this section.
- **Compare multiple dimensions systematically**:
  - Node distribution patterns (which nodes handle work before vs. after).
  - Retry frequencies and error message patterns.
  - Resource indicators (flow control blocking, admission control, memory pressure).
  - Infrastructure events (replica changes, range splits, rebalancing operations).
- **Look for timing correlations**: Document what events occurred during or just before transition
  periods, but present findings as "timing correlations" rather than definitive causation.
- **Document pattern changes**: Describe how workload distribution, failure patterns, or
  performance characteristics differ between periods.

This technique is particularly useful when investigating jobs that show changes in progress rates
or workloads that show performance degradation over time.
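
As a concrete, hypothetical sketch of the before/after comparison, the loop below counts one
recurring error per node in each window. The error string, time windows, and log paths are
placeholders, and the `awk` filters assume the standard cockroach log prefix in which the second
field is the HH:MM:SS.micros time:

```bash
# Hypothetical sketch: per-node counts of one error before vs. after a transition at ~11:35.
# Placeholders: error string, time windows, and log paths; assumes lines like
# "I250101 11:32:04.123456 ..." so $2 is the time field.
for f in logs/*.unredacted/cockroach.log; do
  before=$(awk '$2 >= "11:25:00" && $2 < "11:35:00"' "$f" | grep -c "Command too large")
  after=$(awk '$2 >= "11:35:00" && $2 < "11:45:00"' "$f" | grep -c "Command too large")
  echo "$f: before=$before after=$after"
done
```
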
## Related Tools

- `./scripts/fetch-roachtest-artifacts [issue_num|issue_link|issue_comment_link]` - Downloads test
  artifacts for investigation.
- `.claude/commands/roachtest-failure.md` - Claude Code automation instructions for following this
  guide.

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# Debugging Roachtest Job Failures

Roachtests that run specific jobs verify both successful completion and correct behavior, including
performance characteristics. This guide provides a systematic approach to investigating job-related
test failures.

## Find the Job ID

Identifying the specific job ID is critical for investigation. If it is not found in test.log,
search `debug/system.jobs.txt` for the relevant job type (see the sketch at the end of this
section):
- `c2c/*` (cluster-to-cluster) → `STREAM INGESTION` (in `dest_debug`).
- `ldr/*` → `LOGICAL REPLICATION`.
- `cdc/*`, `changefeed/*` → `CHANGEFEED`.
- `backup/*` → `BACKUP`, `restore/*` → `RESTORE`.
- `import/*` → `IMPORT`.
- `schemachange/*` → `SCHEMA CHANGE`.

Note for c2c tests: The cluster is split into source and destination clusters, with separate debug
folders `source_debug` and `dest_debug`.

Verify the job is correctly identified before proceeding:
- If multiple jobs are found: Stop and ask which job ID is relevant for this test failure.
- If no jobs are found: Stop and ask if this test involves a different job type not listed above.
- Record the identified job ID for reference throughout the investigation.
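
A minimal sketch of that lookup, using a hypothetical restore test; substitute whichever job type
from the list above matches the failing test's family:

```bash
# Hypothetical example for a restore/* test: list candidate RESTORE jobs in the jobs dump.
grep -i "RESTORE" debug/system.jobs.txt
```
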
## General Investigation Approach

Document all observations with citations and references to their locations in the test output.
Create an investigation.md file to track findings systematically.

### Collect job-related logs

Create a focused, combined log containing all entries from all nodes that mention the job ID:

```bash
grep -h "JOB_ID" PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log | cut -c2- | sort > job_JOB_ID.log
```

NB: Use the unredacted logs, not the redacted ones in `artifacts/logs/*.cockroach.log`, or you may
be missing details.

Review these logs for unusual patterns, error messages, retries, and notable events and their
timestamps.

### Review system tables

Examine these system tables for job-specific information:

- `debug/crdb_internal.jobs.txt`: job status and errors.
  - Error messages are almost always significant.
  - Start and end times help build a timeline.
- `debug/system.job_message.txt`: status messages recorded by the job.
  - Include relevant messages in your investigation timeline.
- `debug/system.job_progress_history.txt`: historical progress values sorted by time.
  - Some jobs track "fraction_completed" progress; others use "resolved" timestamp progress.
  - Jobs may switch between progress types during execution.
  - Analyze progress patterns: Is it linear? Does it stall? Are there inflection points?
  - For jobs that time out or get stuck, use `./scripts/job-progress-plot <job-id> <path>` to
    generate an ASCII plot and insert it in your investigation document below the timeline.
  - Identify periods with different progress rates and document them in your timeline:
    - Sudden changes: X%/min immediately shifts to Y%/min at time T.
    - Gradual changes: progress slows from X%/min at time T1 to Y%/min by time T2 over N minutes.
    - When rate changes occur (sudden or gradual), investigate what else happened around those
      time periods and document correlations in the timeline.
    - Compare before/after periods: examine node distribution, retry patterns, flow control
      blocking, and infrastructure events to identify what differs between fast and slow periods.
- `debug/jobs/<jobID>/*` sometimes contains zipped-up job trace data.

Note: C2C tests have separate system tables for source and destination clusters. The destination
cluster typically contains the most relevant events.

### Investigation strategies

- Never assume job success means correct behavior; tests exist because jobs can succeed while
  behaving incorrectly.
- If a job succeeds but the test fails, investigate both the job behavior and the test logic.
- Retries and replannings:
  - Jobs with retry loops may "get stuck" if an underlying issue causes them to retry without
    making progress.
  - A series of retries without progress is an indicator that something might be wrong _even though
    the job might not fail outright_.
- Investigate progress stalls: When a job stops making progress, examine events around that time:
  - Increased retry attempts.
  - Logged errors.
  - Node failures or restarts.
  - KV range splits or other cluster events.
- If you find yourself digging into a slow job, look for a job trace zip.
  - The combined trace has a count of, and the cumulative time spent in, each operation.
  - Note: many jobs run many processors, each with many workers that perform operations
    concurrently; cumulative time often exceeds wall time.
  - For distributed jobs: the coordinating node aggregates cumulative times from all worker nodes -
    a single node showing 100+ hours may actually represent work distributed across many nodes.
- Many jobs utilize a "frontier" to track progress happening across many tasks, spans, ranges, etc.
  - The frontier tracks and reports the _furthest behind_ task or span, so if the "frontier is
    lagging" it means a span is behind, not that there is a problem with the frontier itself.

Timestamp handling: Timestamps appear in different formats:
- `<seconds>.<fractional-seconds>,<logical>`.
- `<nanoseconds>.<logical>`.

When searching logs, use only the higher (seconds) digits to match both formats. Maintain
consistency in your notes by using one format throughout.
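
For example (the timestamp value below is made up for illustration):

```bash
# Searching only the leading seconds digits matches both the <seconds>.<fractional>,<logical>
# form and the <nanoseconds>.<logical> form, since the seconds digits are a prefix of the
# nanoseconds digits.
grep "1755612345" job_JOB_ID.log
```
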
### Processor Start Messages

Most distributed jobs utilize "processors" that run on multiple nodes to execute work in parallel.

Analyzing processor distribution and behavior can be key to spotting issues in work distribution,
and can be a signal of replannings or retry loops if those aren't explicitly logged already.

Processor log messages are already in the job-focused log since they are tagged with the job ID.

Processor start messages indicate a job was planned or replanned: if this wasn't observed in the
logs, this is often worth noting in the timeline (along with whether the start message is seen on
all nodes or just some).

Sometimes these messages include the number of spans or work units assigned; if this is not
relatively balanced across nodes, or some nodes do not report starting processors when others do,
this is notable.
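
One rough way to gauge distribution, sketched below, is to count job-tagged log lines per node in
the fetched logs. This is only a proxy, since the exact processor start message text varies by job
type:

```bash
# Rough proxy for work distribution: count job-tagged log lines per node.
# A node reporting far fewer (or zero) lines than its peers is worth a closer look.
grep -c "JOB_ID" PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log
```
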
### Data Writing Job Behavior

- Jobs that write data, such as restores, imports, and schema changes, are expected to perform many
  splits, particularly when they begin.
  - Heavy splitting during initial phases is normal and expected behavior.
- `SSTable cannot be added spanning range bounds` errors are automatically handled by retrying
  each half of the SST.
  - These errors are unlikely to be the cause of a failure and are usually benign.
  - However, if a very high proportion of all SSTable requests hit spanning errors, it suggests the
    SSTable adding process is hitting splits it did not expect, which can slow down the process of
    adding SSTables (a rough way to gauge this is sketched below).
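
A rough, hedged way to gauge that proportion, assuming the combined job log built earlier; the
exact message text can vary, so treat the counts only as indicators:

```bash
# Spanning-bounds retries vs. all lines mentioning AddSSTable in the combined job log.
grep -c "cannot be added spanning range bounds" job_JOB_ID.log
grep -ci "addsstable" job_JOB_ID.log
```
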
## Job-Specific Investigation Approaches

This section is continuously evolving. If your job type isn't covered here:
- Ask for specific guidance on what to investigate.
- Document useful patterns you discover for future additions to this guide.

### Backup and Restore Jobs (backup/* and restore/* tests)

For BACKUP and RESTORE jobs, always record these key infrastructure details in your investigation:

- **Cloud provider**: (GCE, AWS, Azure, etc.).
- **Cluster region(s)/zone(s)**: Where the CockroachDB nodes are located.
- **Bucket location**: The cloud storage bucket region/location.
- **Cross-region setup**: Note if the cluster and bucket are in different regions.

### Physical Replication Stream Ingestion Jobs (c2c/* tests)

Job phases:
1. Initial scan: copies all data as of the initial scan timestamp.
2. Replication: applies streamed changes, updating the replicated time.
3. Cutover: rewinds applied changes to the chosen cutover timestamp (≤ replicated time).

Key terminology:
- Replicated time ("resolved timestamp" internally): the highest time to which cutover is allowed.
- "Lag"/"latency": how far replicated time trails behind current time.
- Retained time: protected timestamp trailing resolved time; the earliest cutover time allowed.

Expected behaviors:
- Initial scan phase: replicated time remains zero; lag shows as "thousands of years" (normal).
- Replication phase:
  - Replicated/resolved time should steadily increase, or "advance", when running normally.
  - If resolved time is increasing, lag remains small and constant (usually ~30 seconds).
  - Minutes or more of lag suggests replication is behind and trying to catch up.
  - High *and steadily increasing* lag suggests replication has stalled and may be stuck.
  - NB: constant lag == increasing resolved time; constant resolved time == increasing lag.
- Cutover phase: progress shows as the fraction of the rewind completed.
- Job completion: when cutover to the target timestamp finishes.

Replication components:
- Source side: stream producer (event_stream and rangefeeds).
- Destination side: stream ingestion job (stream ingestion processor, AddSSTable).

C2C-specific checklists:
1. Augment the investigation timeline to document:
   - Initial scan start/completion and the value of the initial scan timestamp.
   - Replication phase.
   - Cutover start/completion, and the cutover timestamp.
   - Key replicated/resolved timestamp values around any notable events.
   Be sure to note these phases and what phase the job was in when the test failed.

2. Resolved progress and lag analysis:
   Look at the progress history, the logged resolved timestamps, and the logged lag observations.

   During the replication phase:
   - Are there periods where lag increases?
   - Is resolved time advancing consistently, or does it stall at a constant value for extended
     periods?
   - Do stalls correlate with other cluster events or errors?

   Update the timeline with any observations, particularly periods of constant vs. increasing lag,
   or sudden changes in resolved time.

3. Component investigation: if replication stalls, examine both source and destination components
   for issues.

4. Understanding frontier stalls: If resolved time stops advancing during replication:
   - The frontier algorithm takes the minimum timestamp across all spans.
   - A stalled frontier indicates one or more spans are stuck, not a coordination failure.
   - Look for patterns in which nodes are "lagging behind the frontier" - this indicates which
     spans are stuck.
   - Check for replanning events: frequent replanning (search "replanning") suggests persistently
     stuck spans.
   - Examine AdminSplit operations and SSTable retries on specific tables that correlate with stuck
     spans.

5. Data validation: tests typically compare fingerprints between clusters.
   - Fingerprint mismatches indicate data corruption or replication errors.
   - "Start timestamp greater than end timestamp" errors mean the cutover time was below the
     retained time.
     - The start time, i.e. the retained time, should usually be minutes or hours before the end
       time.
     - This is usually caused by extreme replication lag; re-examine the replication phase more
       closely.

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
# Investigating Node Exits, Crashes, and OOMs

When investigating a node reported as exiting unexpectedly, look for messages, stack traces, and
exit codes, either logged by the node itself as it crashed or by the system if it killed it.

## Common Exit Scenarios

- **Panic/log.Fatal**: Check `logs/N.unredacted/cockroach.stderr.log` for stack traces and fatal
  messages.
- **Disk Stall**: Look for "disk stall detected: unable to sync log files" in stderr/main logs.

## Investigation Steps

1. Check `logs/N.unredacted/cockroach.exit.log` files for exit codes (only nodes that crashed will
   have these); see the sketch after this list.
2. Search stderr logs: `grep -n "panic\|fatal\|abort" logs/*.unredacted/cockroach.stderr.log`.
3. Check for disk stalls: `grep "disk stall detected" logs/*.unredacted/cockroach.log`.
4. Look for OOM-killer messages in `N.dmesg.txt` files, and messages about cockroach in
   `artifacts/*.journalctl.txt`.
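
A minimal sketch combining steps 1 and 4 above; the dmesg file location and the exact OOM-killer
phrasing can vary by machine image, so adjust as needed:

```bash
# Print any per-node exit-code files (only crashed nodes have one).
cat logs/*.unredacted/cockroach.exit.log 2>/dev/null
# Look for OOM-killer activity in the dmesg captures (path and phrasing may vary).
grep -il "out of memory\|oom" *.dmesg.txt 2>/dev/null
```
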
## Timeline and Context Analysis

Before diving into crash details, establish the timeline context:

**Timing Analysis:**
- When did crashes occur relative to test start and test failure?
- Did the test fail immediately after crashes or continue running?
- For tests that ran significantly longer after crashes, the crash may not be the direct cause.

**Test Type Context:**
- **Node shutdown tests**: Crashes are often expected behavior - focus on whether the test
  properly handled the shutdown.
- **Chaos tests**: Some crashes may be intentionally induced - check if the crash matches the
  chaos scenario.
- **Performance tests**: Crashes may indicate resource exhaustion under load.
- **Restore/backup tests**: Crashes during long-running operations may indicate coordination
  issues.

**Check for Intentional Node Stops:**
Many tests intentionally stop nodes as part of their test scenario. Before investigating a crash
as a problem:
- Search the test log for messages about stopping/killing nodes (e.g., "stopping node", "killing
  node", "shutdown"), as in the sketch below.
- Look for test framework calls that intentionally terminate nodes.
- Check if the crash timing aligns with expected test operations.
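
For example, a quick pass over the test log using the phrases above:

```bash
# Check whether the test framework intentionally stopped or killed nodes, and when.
grep -in "stopping node\|killing node\|shutdown" artifacts/test.log
```
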
**Failure Mode Assessment:**
- **Immediate failures after crash**: Crash is likely the direct cause.
- **Timeout failures with early crashes**: Investigate what prevented the test from completing
  or failing fast.
- **Connection errors to crashed nodes**: Expected after crash, but investigate what was trying
  to connect and why.

## Exit Code Reference

Check `artifacts/test.log` and the crashed node's stderr for exit codes:

- **Exit code 7**: Disk stall detected - look for "disk stall detected: unable to sync log
  files".
- **Exit code 10**: Disk full - check disk space and storage errors.
- **Exit code 132**: Illegal instruction (SIGILL) - hardware/build issue.
- **Exit code 134**: Assertion failure (SIGABRT) - internal consistency check failed.
- **Exit code 137**: SIGKILL - process was forcibly terminated (OOM killer, manual kill, or
  system limits).
- **Exit code 139**: Segmentation fault (SIGSEGV) - look for stack traces in stderr.

## Investigation Priority

When multiple issues are present, investigate in this order:

1. **Direct test failure mechanism first** - timeout, assertion failure, etc.
2. **Timeline correlation** - do crashes correlate with the direct failure?
3. **Crash details** - only after understanding the failure context.
