| 1 | +# Crash Tracker Unix Socket Communication Protocol |
| 2 | + |
| 3 | +**Date**: September 23, 2025 |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +This document describes the Unix domain socket communication protocol used between the crash tracker's collector and receiver processes. The crash tracker uses a two-process architecture: the collector (a fork of the crashing process) sends crash data to the receiver (a process started via fork+execve) over an anonymous Unix domain socket pair. |
| 8 | + |
| 9 | +## Socket Creation and Setup |
| 10 | + |
| 11 | +The communication channel is established using `socketpair()` to create an anonymous Unix domain socket pair: |
| 12 | + |
| 13 | +```rust |
| 14 | +let (uds_parent, uds_child) = socket::socketpair( |
| 15 | + socket::AddressFamily::Unix, |
| 16 | + socket::SockType::Stream, |
| 17 | + None, |
| 18 | + socket::SockFlag::empty(), |
| 19 | +)?; |
| 20 | +``` |
| 21 | + |
| 22 | +**Location**: `datadog-crashtracker/src/collector/receiver_manager.rs:78-85` |
| 23 | + |
| 24 | +### File Descriptor Management |
| 25 | + |
| 26 | +1. **Parent Process**: Retains `uds_parent` for tracking |
| 27 | +2. **Collector Process**: Inherits `uds_parent` as the write end |
| 28 | +3. **Receiver Process**: Gets `uds_child` redirected to stdin via `dup2(uds_child, 0)` |
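
As a rough illustration of how the two ends of a socket pair behave, the sketch below uses only the standard library and a single process. It assumes `UnixStream::pair()` as a stand-in for the `socketpair()` call above; the `round_trip` helper is hypothetical, and the fork and `dup2` steps are not shown.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Hypothetical helper: writes `payload` to one end of an anonymous
/// socket pair and reads it back from the other end.
fn round_trip(payload: &[u8]) -> std::io::Result<Vec<u8>> {
    // Creates a connected pair of anonymous Unix stream sockets.
    let (mut writer, mut reader) = UnixStream::pair()?;

    // Small payloads fit in the socket buffer, so this does not block.
    writer.write_all(payload)?;

    // Dropping the write end closes it, which lets the reader observe EOF.
    drop(writer);

    let mut received = Vec::new();
    reader.read_to_end(&mut received)?;
    Ok(received)
}
```

In the crash tracker the two ends live in different processes, so closing the collector's end is what lets the other side notice that transmission has finished.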
| 29 | + |
| 30 | +## Communication Protocol |
| 31 | + |
| 32 | +### Data Format |
| 33 | + |
| 34 | +The crash data is transmitted as a structured text stream with distinct sections delimited by markers defined in `datadog-crashtracker/src/shared/constants.rs`. |
| 35 | + |
| 36 | +### Message Structure |
| 37 | + |
| 38 | +Each crash report follows this sequence: |
| 39 | + |
| 40 | +1. **Metadata Section** |
| 41 | +2. **Configuration Section** |
| 42 | +3. **Signal Information Section** |
| 43 | +4. **Process Context Section** |
| 44 | +5. **Process Information Section** |
| 45 | +6. **Counters Section** |
| 46 | +7. **Spans Section** |
| 47 | +8. **Additional Tags Section** |
| 48 | +9. **Traces Section** |
| 49 | +10. **Memory Maps Section** (Linux only) |
| 50 | +11. **Stack Trace Section** |
| 51 | +12. **Completion Marker** |
| 52 | + |
| 53 | +### Section Details |
| 54 | + |
| 55 | +#### 1. Metadata Section |
| 56 | +``` |
| 57 | +DD_CRASHTRACK_BEGIN_METADATA |
| 58 | +{JSON metadata object} |
| 59 | +DD_CRASHTRACK_END_METADATA |
| 60 | +``` |
| 61 | + |
| 62 | +Contains serialized `Metadata` object with application context, tags, and environment information. |
| 63 | + |
| 64 | +#### 2. Configuration Section |
| 65 | +``` |
| 66 | +DD_CRASHTRACK_BEGIN_CONFIG |
| 67 | +{JSON configuration object} |
| 68 | +DD_CRASHTRACK_END_CONFIG |
| 69 | +``` |
| 70 | + |
| 71 | +Contains serialized `CrashtrackerConfiguration` with crash tracking settings, endpoint information, and processing options. |
| 72 | + |
| 73 | +#### 3. Signal Information Section |
| 74 | +``` |
| 75 | +DD_CRASHTRACK_BEGIN_SIGINFO |
| 76 | +{ |
| 77 | + "si_code": <signal_code>, |
| 78 | + "si_code_human_readable": "<description>", |
| 79 | + "si_signo": <signal_number>, |
| 80 | + "si_signo_human_readable": "<signal_name>", |
| 81 | + "si_addr": "<fault_address>" // Optional, for memory faults |
| 82 | +} |
| 83 | +DD_CRASHTRACK_END_SIGINFO |
| 84 | +``` |
| 85 | + |
| 86 | +Contains signal details extracted from `siginfo_t` structure. |
| 87 | + |
| 88 | +**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:223-263` |
| 89 | + |
| 90 | +#### 4. Process Context Section (ucontext) |
| 91 | +``` |
| 92 | +DD_CRASHTRACK_BEGIN_UCONTEXT |
| 93 | +<platform-specific context dump> |
| 94 | +DD_CRASHTRACK_END_UCONTEXT |
| 95 | +``` |
| 96 | + |
| 97 | +Contains processor state at crash time from `ucontext_t`. Format varies by platform: |
| 98 | +- **Linux**: Direct debug print of `ucontext_t` |
| 99 | +- **macOS**: Includes both `ucontext_t` and machine context (`mcontext`) |
| 100 | + |
| 101 | +**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:190-221` |
| 102 | + |
| 103 | +#### 5. Process Information Section |
| 104 | +``` |
| 105 | +DD_CRASHTRACK_BEGIN_PROCINFO |
| 106 | +{"pid": <process_id>} |
| 107 | +DD_CRASHTRACK_END_PROCINFO |
| 108 | +``` |
| 109 | + |
| 110 | +Contains the process ID of the crashing process. |
| 111 | + |
| 112 | +#### 6. Counters Section |
| 113 | +``` |
| 114 | +DD_CRASHTRACK_BEGIN_COUNTERS |
| 115 | +<counter data> |
| 116 | +DD_CRASHTRACK_END_COUNTERS |
| 117 | +``` |
| 118 | + |
| 119 | +Contains internal crash tracker counters and metrics. |
| 120 | + |
| 121 | +#### 7. Spans Section |
| 122 | +``` |
| 123 | +DD_CRASHTRACK_BEGIN_SPANS |
| 124 | +<span data> |
| 125 | +DD_CRASHTRACK_END_SPANS |
| 126 | +``` |
| 127 | + |
| 128 | +Contains active distributed tracing spans at crash time. |
| 129 | + |
| 130 | +#### 8. Additional Tags Section |
| 131 | +``` |
| 132 | +DD_CRASHTRACK_BEGIN_TAGS |
| 133 | +<tag data> |
| 134 | +DD_CRASHTRACK_END_TAGS |
| 135 | +``` |
| 136 | + |
| 137 | +Contains additional tags collected at crash time. |
| 138 | + |
| 139 | +#### 9. Traces Section |
| 140 | +``` |
| 141 | +DD_CRASHTRACK_BEGIN_TRACES |
| 142 | +<trace data> |
| 143 | +DD_CRASHTRACK_END_TRACES |
| 144 | +``` |
| 145 | + |
| 146 | +Contains active trace information. |
| 147 | + |
| 148 | +#### 10. Memory Maps Section (Linux Only) |
| 149 | +``` |
| 150 | +DD_CRASHTRACK_BEGIN_FILE /proc/self/maps |
| 151 | +<contents of /proc/self/maps> |
| 152 | +DD_CRASHTRACK_END_FILE "/proc/self/maps" |
| 153 | +``` |
| 154 | + |
| 155 | +Contains memory mapping information from `/proc/self/maps` for symbol resolution. |
| 156 | + |
| 157 | +**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:184-187` |
| 158 | + |
| 159 | +#### 11. Stack Trace Section |
| 160 | +``` |
| 161 | +DD_CRASHTRACK_BEGIN_STACKTRACE |
| 162 | +{"ip": "<instruction_pointer>", "module_base_address": "<base>", "sp": "<stack_pointer>", "symbol_address": "<addr>"} |
| 163 | +{"ip": "<instruction_pointer>", "module_base_address": "<base>", "sp": "<stack_pointer>", "symbol_address": "<addr>", "function": "<name>", "file": "<path>", "line": <number>} |
| 164 | +... |
| 165 | +DD_CRASHTRACK_END_STACKTRACE |
| 166 | +``` |
| 167 | + |
| 168 | +Each line represents one stack frame. Frame format depends on symbol resolution setting: |
| 169 | + |
| 170 | +- **Disabled/Receiver-only**: Only addresses (`ip`, `sp`, `symbol_address`, optional `module_base_address`) |
| 171 | +- **In-process symbols**: Includes debug information (`function`, `file`, `line`, `column`) |
| 172 | + |
| 173 | +Stack frames with stack pointer less than the fault stack pointer are filtered out to exclude crash tracker frames. |
| 174 | + |
| 175 | +**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:45-117` |
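
The stack pointer filter described above can be sketched as follows. This is a simplified model, not the code in `emitters.rs`: the `Frame` struct and `keep_application_frames` function are hypothetical, and addresses are plain integers rather than the strings used on the wire.

```rust
/// Hypothetical, simplified stack frame with numeric addresses.
#[derive(Debug, PartialEq)]
struct Frame {
    ip: usize,
    sp: usize,
}

/// Drops frames whose stack pointer is below the stack pointer at the
/// fault. On a downward-growing stack those frames were pushed after the
/// crash, i.e. they belong to the signal handler and crash tracker.
fn keep_application_frames(frames: Vec<Frame>, fault_sp: usize) -> Vec<Frame> {
    frames.into_iter().filter(|f| f.sp >= fault_sp).collect()
}
```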
| 176 | + |
| 177 | +#### 12. Completion Marker |
| 178 | +``` |
| 179 | +DD_CRASHTRACK_DONE |
| 180 | +``` |
| 181 | + |
| 182 | +Indicates end of crash report transmission. |
| 183 | + |
| 184 | +## Communication Flow |
| 185 | + |
| 186 | +### 1. Collector Side (Write End) |
| 187 | + |
| 188 | +**File**: `datadog-crashtracker/src/collector/collector_manager.rs:92-102` |
| 189 | + |
| 190 | +```rust |
| 191 | +let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) }; |
| 192 | + |
| 193 | +let report = emit_crashreport( |
| 194 | + &mut unix_stream, |
| 195 | + config, |
| 196 | + config_str, |
| 197 | + metadata_str, |
| 198 | + sig_info, |
| 199 | + ucontext, |
| 200 | + ppid, |
| 201 | +); |
| 202 | +``` |
| 203 | + |
| 204 | +The collector: |
| 205 | +1. Creates `UnixStream` from inherited file descriptor |
| 206 | +2. Calls `emit_crashreport()` to serialize and write all crash data |
| 207 | +3. Flushes the stream after each section for reliability |
| 208 | +4. Exits with `libc::_exit(0)` on completion |
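
A minimal sketch of the write-then-flush pattern from the steps above, assuming every section uses the `DD_CRASHTRACK_BEGIN_<NAME>` / `DD_CRASHTRACK_END_<NAME>` delimiters shown earlier. The helper names `emit_section` and `emit_done` are hypothetical and do not mirror the actual functions in `emitters.rs`.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes one delimited section and flushes it so
/// that the receiver sees complete sections even if the collector dies
/// before the next one is written.
fn emit_section<W: Write>(w: &mut W, name: &str, body: &str) -> io::Result<()> {
    writeln!(w, "DD_CRASHTRACK_BEGIN_{name}")?;
    writeln!(w, "{body}")?;
    writeln!(w, "DD_CRASHTRACK_END_{name}")?;
    w.flush()
}

/// Hypothetical helper: writes the completion marker that ends a report.
fn emit_done<W: Write>(w: &mut W) -> io::Result<()> {
    writeln!(w, "DD_CRASHTRACK_DONE")?;
    w.flush()
}
```

Because the helpers accept any `Write` implementation, the same code can target a `UnixStream` in production and a `Vec<u8>` in tests.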
| 209 | + |
| 210 | +### 2. Receiver Side (Read End) |
| 211 | + |
| 212 | +**File**: `datadog-crashtracker/src/receiver/entry_points.rs:97-119` |
| 213 | + |
| 214 | +```rust |
| 215 | +pub(crate) async fn receiver_entry_point( |
| 216 | + timeout: Duration, |
| 217 | + stream: impl AsyncBufReadExt + std::marker::Unpin, |
| 218 | +) -> anyhow::Result<()> { |
| 219 | + if let Some((config, mut crash_info)) = receive_report_from_stream(timeout, stream).await? { |
| 220 | + // Process crash data |
| 221 | + if let Err(e) = resolve_frames(&config, &mut crash_info) { |
| 222 | + crash_info.log_messages.push(format!("Error resolving frames: {e}")); |
| 223 | + } |
| 224 | + if config.demangle_names() { |
| 225 | + if let Err(e) = crash_info.demangle_names() { |
| 226 | + crash_info.log_messages.push(format!("Error demangling names: {e}")); |
| 227 | + } |
| 228 | + } |
| 229 | + crash_info.async_upload_to_endpoint(config.endpoint()).await?; |
| 230 | + } |
| 231 | + Ok(()) |
| 232 | +} |
| 233 | +``` |
| 234 | + |
| 235 | +The receiver: |
| 236 | +1. Reads from stdin (Unix socket via `dup2`) |
| 237 | +2. Parses the structured stream into `CrashInfo` and `CrashtrackerConfiguration` |
| 238 | +3. Performs symbol resolution if configured |
| 239 | +4. Uploads formatted crash report to backend |
| 240 | + |
| 241 | +### 3. Stream Parsing |
| 242 | + |
| 243 | +**File**: `datadog-crashtracker/src/receiver/receive_report.rs` |
| 244 | + |
| 245 | +The receiver parses the stream by: |
| 246 | +1. Reading line-by-line with timeout protection |
| 247 | +2. Matching delimiter patterns to identify sections |
| 248 | +3. Accumulating section data between delimiters |
| 249 | +4. Deserializing JSON sections into appropriate data structures |
| 250 | +5. Handling the `DD_CRASHTRACK_DONE` completion marker |
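
The parsing steps above can be sketched as a small synchronous state machine. This is only an approximation under stated assumptions: the real receiver reads asynchronously with a timeout and deserializes JSON, whereas this sketch keeps raw lines per section. `ParsedReport`, `section_name`, and `parse_report` are hypothetical names.

```rust
use std::collections::HashMap;
use std::io::BufRead;

const BEGIN_PREFIX: &str = "DD_CRASHTRACK_BEGIN_";
const END_PREFIX: &str = "DD_CRASHTRACK_END_";
const DONE: &str = "DD_CRASHTRACK_DONE";

/// Hypothetical result type: raw lines per section, plus whether the
/// completion marker was seen.
struct ParsedReport {
    sections: HashMap<String, Vec<String>>,
    complete: bool,
}

/// The section name is the first token after the prefix, which also
/// covers markers with an argument such as `..._BEGIN_FILE /proc/self/maps`.
fn section_name(rest: &str) -> String {
    rest.split_whitespace().next().unwrap_or("").to_string()
}

fn parse_report(reader: impl BufRead) -> std::io::Result<ParsedReport> {
    let mut sections: HashMap<String, Vec<String>> = HashMap::new();
    let mut open: Option<(String, Vec<String>)> = None;
    let mut complete = false;

    for line in reader.lines() {
        let line = line?;
        if line == DONE {
            complete = true;
            break;
        }
        // Does this line close the section that is currently open?
        let closes_open = match (&open, line.strip_prefix(END_PREFIX)) {
            (Some((name, _)), Some(rest)) => section_name(rest) == *name,
            _ => false,
        };
        if closes_open {
            if let Some((name, body)) = open.take() {
                sections.insert(name, body);
            }
        } else if let Some((_, body)) = open.as_mut() {
            // Inside a section: accumulate the line verbatim.
            body.push(line);
        } else if let Some(rest) = line.strip_prefix(BEGIN_PREFIX) {
            open = Some((section_name(rest), Vec::new()));
        }
    }
    // A section left open at end of input is dropped as incomplete.
    Ok(ParsedReport { sections, complete })
}
```

A report whose `complete` flag is false was cut off before `DD_CRASHTRACK_DONE`, which is how a truncated transmission can be told apart from a finished one.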
| 251 | + |
| 252 | +## Error Handling and Reliability |
| 253 | + |
| 254 | +### Signal Safety |
| 255 | +- All collector operations use only async-signal-safe functions |
| 256 | +- No memory allocation in signal handler context |
| 257 | +- Pre-prepared data structures (`PreparedExecve`) to avoid allocations |
| 258 | + |
| 259 | +### Timeout Protection |
| 260 | +- Receiver has configurable timeout (default: 4000ms) |
| 261 | +- Environment variable: `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS` |
| 262 | +- Prevents hanging on incomplete/corrupted streams |
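
One way the timeout could be derived from the environment is sketched below. The variable name and the 4000 ms default come from the list above; falling back to the default when the value is missing or unparsable is an assumption, and `timeout_from` and `receiver_timeout` are hypothetical helpers.

```rust
use std::time::Duration;

const TIMEOUT_ENV_VAR: &str = "DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS";
const DEFAULT_TIMEOUT_MS: u64 = 4000;

/// Hypothetical helper: parses a millisecond value, falling back to the
/// default when the value is absent or not a valid unsigned integer
/// (the fallback behaviour is an assumption of this sketch).
fn timeout_from(raw: Option<&str>) -> Duration {
    let ms = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_TIMEOUT_MS);
    Duration::from_millis(ms)
}

/// Hypothetical helper: reads the timeout from the process environment.
fn receiver_timeout() -> Duration {
    timeout_from(std::env::var(TIMEOUT_ENV_VAR).ok().as_deref())
}
```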
| 263 | + |
| 264 | +### Process Cleanup |
| 265 | +- Parent process uses `wait_for_pollhup()` to detect socket closure |
| 266 | +- Kills child processes with `SIGKILL` if needed |
| 267 | +- Reaps zombie processes to prevent resource leaks |
| 268 | + |
| 269 | +**File**: `datadog-crashtracker/src/collector/process_handle.rs:19-40` |
| 270 | + |
| 271 | +### Data Integrity |
| 272 | +- Each section is flushed immediately after writing |
| 273 | +- Structured delimiters allow detection of incomplete transmissions |
| 274 | +- Error messages are accumulated rather than failing fast |
| 275 | + |
| 276 | +## Alternative Communication Modes |
| 277 | + |
| 278 | +### Named Socket Mode |
| 279 | +When `unix_socket_path` is configured, the collector connects to an existing Unix socket instead of using the fork+execve receiver: |
| 280 | + |
| 281 | +```rust |
| 282 | +let receiver = if unix_socket_path.is_empty() { |
| 283 | + Receiver::spawn_from_stored_config()? // Fork+execve mode |
| 284 | +} else { |
| 285 | + Receiver::from_socket(unix_socket_path)? // Named socket mode |
| 286 | +}; |
| 287 | +``` |
| 288 | + |
| 289 | +This allows integration with long-lived receiver processes. |
| 290 | + |
| 291 | +**Linux Abstract Sockets**: On Linux, socket paths not starting with `.` or `/` are treated as abstract socket names. |
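
The selection rules in this section can be summarized in a small classifier. This sketch models the Linux behaviour only; the `ReceiverTarget` enum and `classify` function are hypothetical and are not part of the crash tracker API.

```rust
/// Hypothetical classification of the configured `unix_socket_path`.
#[derive(Debug, PartialEq)]
enum ReceiverTarget<'a> {
    /// Empty path: spawn the receiver with fork+execve.
    ForkExec,
    /// Path starting with `.` or `/`: a socket file on the filesystem.
    Filesystem(&'a str),
    /// Any other non-empty value: an abstract socket name (Linux only).
    Abstract(&'a str),
}

fn classify(unix_socket_path: &str) -> ReceiverTarget<'_> {
    if unix_socket_path.is_empty() {
        ReceiverTarget::ForkExec
    } else if unix_socket_path.starts_with('.') || unix_socket_path.starts_with('/') {
        ReceiverTarget::Filesystem(unix_socket_path)
    } else {
        ReceiverTarget::Abstract(unix_socket_path)
    }
}
```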
| 292 | + |
| 293 | +## Security Considerations |
| 294 | + |
| 295 | +### File Descriptor Isolation |
| 296 | +- Collector closes stdio file descriptors (0, 1, 2) |
| 297 | +- Receiver redirects socket to stdin, stdout/stderr to configured files |
| 298 | +- Minimizes attack surface during crash processing |
| 299 | + |
| 300 | +### Process Isolation |
| 301 | +- Fork+execve provides strong process boundary |
| 302 | +- Crash in collector doesn't affect receiver |
| 303 | +- Signal handlers are reset in receiver child |
| 304 | + |
| 305 | +### Resource Limits |
| 306 | +- Timeout prevents resource exhaustion |
| 307 | +- Fixed buffer sizes for file operations |
| 308 | +- Immediate flushing prevents large memory usage |
| 309 | + |
| 310 | +## Debugging and Monitoring |
| 311 | + |
| 312 | +### Log Output |
| 313 | +- Receiver can be configured with `stdout_filename` and `stderr_filename` |
| 314 | +- Error messages are accumulated in crash report |
| 315 | +- Debug assertions validate critical operations |
| 316 | + |
| 317 | +### Environment Variables |
| 318 | +- `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`: Receiver timeout |
| 319 | +- Standard Unix environment passed through execve |
| 320 | + |
| 321 | +Taken together, per-section flushing, explicit section delimiters, and the receiver timeout allow crash data to be collected and transmitted even when the main process is in an unstable state. |