# Debugging Roachtest Job Failures

Roachtests that run specific jobs verify both successful completion and correct behavior, including
performance characteristics. This guide provides a systematic approach to investigating job-related
test failures.

## Find the Job ID

Identifying the specific job ID is critical for investigation. If the job ID is not found in
`test.log`, search `debug/system.jobs.txt` for the relevant job type:
- `c2c/*` (cluster-to-cluster) → `STREAM INGESTION` (in `dest_debug`).
- `ldr/*` → `LOGICAL REPLICATION`.
- `cdc/*`, `changefeed/*` → `CHANGEFEED`.
- `backup/*` → `BACKUP`, `restore/*` → `RESTORE`.
- `import/*` → `IMPORT`.
- `schemachange/*` → `SCHEMA CHANGE`.

Note for c2c tests: the cluster is split into source and destination clusters, with separate debug
folders `source_debug/` and `dest_debug/`.

Verify the job is correctly identified before proceeding:
- If multiple jobs are found: stop and ask which job ID is relevant for this test failure.
- If no jobs are found: stop and ask if this test involves a different job type not listed above.
- Record the identified job ID for reference throughout the investigation.

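For example, a quick grep over the dumped jobs table usually surfaces candidate job IDs. A minimal
sketch, assuming a restore test and that the `debug/` dump sits under the fetched artifacts:

```bash
# Swap RESTORE for the job type from the list above.
grep "RESTORE" PATH/TO/FETCHED/artifacts/debug/system.jobs.txt
```
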
## General Investigation Approach

Document all observations with citations and references to their locations in the test output.
Create an investigation.md file to track findings systematically.

### Collect job-related logs

Create a focused, combined log containing all entries from all nodes that mention the job ID:

```bash
# cut -c2- drops the leading severity letter so the merged lines sort by timestamp.
grep -h "JOB_ID" PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log | cut -c2- | sort > job_JOB_ID.log
```

NB: Use the unredacted logs, not the redacted ones in `artifacts/logs/*.cockroach.log`, or you may
miss details.

Review these logs for unusual patterns, error messages, retries, and notable events and their
timestamps.

### Review system tables

Examine these system tables for job-specific information:

- `debug/crdb_internal.jobs.txt`: job status and errors.
  - Error messages are almost always significant.
  - Start and end times help build a timeline.
- `debug/system.job_message.txt`: status messages recorded by the job.
  - Include relevant messages in your investigation timeline.
- `debug/system.job_progress_history.txt`: historical progress values sorted by time.
  - Some jobs track "fraction_completed" progress; others use "resolved" timestamp progress.
  - Jobs may switch between progress types during execution.
  - Analyze progress patterns: is it linear? does it stall? are there inflection points?
  - For jobs that time out or get stuck, use `./scripts/job-progress-plot <job-id> <path>` to
    generate an ASCII plot and insert it in your investigation document below the timeline.
  - Identify periods with different progress rates and document them in your timeline:
    - Sudden changes: X%/min immediately shifts to Y%/min at time T.
    - Gradual changes: progress slows from X%/min at time T1 to Y%/min by time T2 over N minutes.
  - When rate changes occur (sudden or gradual), investigate what else happened around those
    times and document correlations in the timeline.
  - Compare before/after periods: examine node distribution, retry patterns, flow control
    blocking, and infrastructure events to identify what differs between fast and slow periods.
- `debug/jobs/<jobID>/*`: sometimes contains zipped-up job trace data.

Note: C2C tests have separate system tables for source and destination clusters. The destination
cluster typically contains the most relevant events.

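To pull just the relevant rows out of these dumps, grep for the job ID; the paths below assume the
fetched-artifacts layout used in the log command above (for c2c tests, substitute the
`source_debug`/`dest_debug` folders as appropriate):

```bash
# Hypothetical one-liners; replace JOB_ID and adjust paths to your artifact layout.
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/crdb_internal.jobs.txt
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/system.job_message.txt
grep "JOB_ID" PATH/TO/FETCHED/artifacts/debug/system.job_progress_history.txt
```
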
### Investigation strategies

- Never assume job success means correct behavior; tests exist because jobs can succeed while
  behaving incorrectly.
- If a job succeeds but the test fails, investigate both the job behavior and the test logic.
- Retries and replannings:
  - Jobs with retry loops may "get stuck" if an underlying issue causes them to retry without
    making progress.
  - A series of retries without progress is an indicator that something might be wrong _even
    though the job might not fail outright_.
- Investigate progress stalls: when a job stops making progress, examine events around that time
  (a quick way to spot bursts is sketched after this list):
  - Increased retry attempts.
  - Logged errors.
  - Node failures or restarts.
  - KV range splits or other cluster events.
- If you find yourself digging into a slow job, look for a job trace zip.
  - The combined trace has a count of, and the cumulative time spent per, operation.
  - Note: many jobs run many processors, each with many workers that perform operations
    concurrently; cumulative time often exceeds wall time.
  - For distributed jobs, coordination nodes aggregate cumulative times from all worker nodes; a
    single node showing 100+ hours may actually represent work distributed across many nodes.
- Many jobs use a "frontier" to track progress happening across many tasks, spans, ranges, etc.
  - The frontier tracks and reports the _furthest behind_ task or span, so if the "frontier is
    lagging" it means a span is behind, not that there is a problem with the frontier itself.

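One rough way to spot such bursts or stalls, assuming the combined `job_JOB_ID.log` built earlier
(the leading severity letter was stripped, so each line starts with the date and time):

```bash
# Count job-related log lines per minute; spikes often line up with retry loops or
# error bursts, and gaps often line up with stalls.
awk '{print $1, substr($2, 1, 5)}' job_JOB_ID.log | uniq -c
```
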
Timestamp handling: timestamps appear in two different formats:
- `<seconds>.<fractional-seconds>,<logical>`.
- `<nanoseconds>.<logical>`.

When searching logs, use only the leading (seconds) digits so that you match both formats. Keep
your notes consistent by using one format throughout.

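For example, a (made-up) timestamp of `1755123456.000000000,1` in the first format may appear as
`1755123456000000000.1` in the second; searching on just the leading seconds digits matches both:

```bash
# Match either timestamp format by searching only the seconds digits.
grep "1755123456" job_JOB_ID.log
```
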
### Processor Start Messages

Most distributed jobs use "processors" that run on multiple nodes to execute work in parallel.

Analyzing processor distribution and behavior can be key to spotting issues in work distribution,
and can signal replannings or retry loops if those aren't explicitly logged already.

Processor log messages are already in the job-focused log, since they are tagged with the job ID.

Processor start messages indicate that a job was planned or replanned: if that (re)planning wasn't
otherwise observed in the logs, it is often worth noting in the timeline (along with whether the
start message is seen on all nodes or just some).

Sometimes these messages include the number of spans or work units assigned; if this is not
relatively balanced across nodes, or if some nodes do not report starting processors when others
do, that is notable.

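To get a rough picture of how work was distributed, count processor start messages per node. The
exact message text varies by job type, so the `processor` pattern below is only a placeholder to
adapt:

```bash
# Per-node counts of processor-related messages for the job; uneven counts, or nodes
# with zero, are worth noting in the timeline.
for f in PATH/TO/FETCHED/artifacts/logs/*.unredacted/cockroach.log; do
  echo "$f $(grep "JOB_ID" "$f" | grep -ci "processor")"
done
```
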
### Data Writing Job Behavior

- Jobs that write data, such as restore, import, and schema changes, are expected to perform many
  splits, particularly when they begin.
- Heavy splitting during initial phases is normal and expected behavior.
- `SSTable cannot be added spanning range bounds` errors are automatically handled by retrying
  each half of the SST.
- These errors are unlikely to be the cause of a failure and are usually benign.
- However, if a very high proportion of all SSTable requests hit spanning errors, it suggests the
  SSTable-adding process is hitting splits it did not expect, which can slow down the addition of
  SSTables (a quick check is sketched below).

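For a rough sense of proportion, count these retries in the combined job log; the pattern is taken
from the error text quoted above, which may be phrased slightly differently in some versions:

```bash
# Occasional hits are benign; a count rivaling the job's overall AddSSTable volume
# suggests unexpected splits are slowing ingestion.
grep -c "cannot be added spanning range bounds" job_JOB_ID.log
```
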
## Job-Specific Investigation Approaches

This section is continuously evolving. If your job type isn't covered here:
- Ask for specific guidance on what to investigate.
- Document useful patterns you discover for future additions to this guide.

### Backup and Restore Jobs (backup/* and restore/* tests)

For BACKUP and RESTORE jobs, always record these key infrastructure details in your investigation:

- **Cloud provider**: GCE, AWS, Azure, etc.
- **Cluster region(s)/zone(s)**: where the CockroachDB nodes are located.
- **Bucket location**: the cloud storage bucket region/location.
- **Cross-region setup**: note whether the cluster and bucket are in different regions.

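The storage URI in the job's description is usually the quickest way to confirm the provider and
bucket. A minimal sketch, assuming the dumped jobs table includes the full BACKUP/RESTORE statement
and the usual `gs`/`s3`/`azure` URI schemes; cluster regions/zones generally come from the cluster
spec in test.log:

```bash
# Extract storage URIs from the job description to identify the provider and bucket.
grep -oE "(gs|s3|azure[a-z-]*)://[^ ']+" PATH/TO/FETCHED/artifacts/debug/crdb_internal.jobs.txt | sort -u
```
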
### Physical Replication Stream Ingestion Jobs (c2c/* tests)

Job phases:
1. Initial scan: copies all data as of the initial scan timestamp.
2. Replication: applies streamed changes, updating the replicated time.
3. Cutover: rewinds applied changes to the chosen cutover timestamp (≤ replicated time).

Key terminology:
- Replicated time ("resolved timestamp" internally): the highest time to which cutover is allowed.
- "Lag"/"latency": how far the replicated time trails behind the current time.
- Retained time: a protected timestamp trailing resolved time; the earliest allowed cutover time.

Expected behaviors:
- Initial scan phase: replicated time remains zero; lag shows as "thousands of years" (normal).
- Replication phase:
  - Replicated/resolved time should steadily increase, or "advance", when running normally.
  - If resolved time is increasing, lag remains small and constant (usually ~30 seconds).
  - Minutes or more of lag suggests replication is behind and trying to catch up.
  - High *and steadily increasing* lag suggests replication has stalled and may be stuck.
  - NB: constant lag == increasing resolved time; constant resolved time == increasing lag.
- Cutover phase: progress shows as the fraction of the rewind completed.
- Job completion: when the cutover to the target timestamp finishes.

Replication components:
- Source side: stream producer (event_stream and rangefeeds).
- Destination side: stream ingestion job (stream ingestion processor, AddSSTable).

C2C-specific checklists:
1. Augment the investigation timeline to document:
   - Initial scan start/completion and the value of the initial scan timestamp.
   - Replication phase.
   - Cutover start/completion, and the cutover timestamp.
   - Key replicated/resolved timestamp values around any notable events.
   Be sure to note these phases and which phase the job was in when the test failed.

2. Resolved progress and lag analysis:
   Look at the progress history, the logged resolved timestamps, and the logged lag observations.

   During the replication phase:
   - Are there periods where lag increases?
   - Is resolved time advancing consistently, or does it stall at a constant value for extended
     periods?
   - Do stalls correlate with other cluster events or errors?

   Update the timeline with any observations, particularly periods of constant vs. increasing lag,
   or sudden changes in resolved time.

3. Component investigation: if replication stalls, examine both source and destination components
   for issues.

4. Understanding frontier stalls: if resolved time stops advancing during replication:
   - The frontier algorithm takes the minimum timestamp across all spans.
   - A stalled frontier indicates one or more spans are stuck, not a coordination failure.
   - Look for patterns in which nodes are "lagging behind the frontier"; this indicates which spans
     are stuck.
   - Check for replanning events: frequent replanning (search "replanning") suggests persistently
     stuck spans.
   - Examine AdminSplit operations and SSTable retries on specific tables that correlate with stuck
     spans.
   (A couple of these checks are sketched after this checklist.)

5. Data validation: tests typically compare fingerprints between clusters.
   - Fingerprint mismatches indicate data corruption or replication errors.
   - "Start timestamp greater than end timestamp" errors mean the cutover time was below the
     retained time.
   - The start time, i.e. the retained time, should usually be minutes or hours before the end
     time.
   - This is usually caused by extreme replication lag; re-examine the replication phase more
     closely.

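Hedged sketches for a couple of the frontier-stall checks in item 4, run against the combined job
log built earlier; exact log phrasings vary between versions, so treat these patterns as starting
points:

```bash
# Frequent replanning suggests persistently stuck spans.
grep -ci "replanning" job_JOB_ID.log

# Which spans/nodes are reported as lagging behind the frontier?
grep -i "lagging" job_JOB_ID.log | tail -n 20
```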