Commit e18cf50

Add doc to outline UNIX communication
1 parent 0cc3f7f commit e18cf50

# Crash Tracker Unix Socket Communication Protocol

**Date**: September 23, 2025

## Overview

This document describes the Unix domain socket communication protocol used between the crash tracker's collector and receiver processes. The crash tracker uses a two-process architecture where the collector (a fork of the crashing process) communicates crash data to the receiver (a fork+execve process) via an anonymous Unix domain socket pair.

## Socket Creation and Setup

The communication channel is established using `socketpair()` to create an anonymous Unix domain socket pair:

```rust
let (uds_parent, uds_child) = socket::socketpair(
    socket::AddressFamily::Unix,
    socket::SockType::Stream,
    None,
    socket::SockFlag::empty(),
)?;
```

**Location**: `datadog-crashtracker/src/collector/receiver_manager.rs:78-85`

### File Descriptor Management

1. **Parent Process**: Retains `uds_parent` for tracking
2. **Collector Process**: Inherits `uds_parent` as the write end
3. **Receiver Process**: Gets `uds_child` redirected to stdin via `dup2(uds_child, 0)`
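
As a rough illustration of how an anonymous socket pair behaves, the sketch below uses the standard library's `UnixStream::pair()` instead of the `socket::socketpair` call shown above, and keeps both ends in a single process. The helper name `round_trip` is hypothetical; in the crash tracker the two ends live in different processes.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Hypothetical helper: writes `payload` to one end of an anonymous
/// Unix socket pair and reads it back from the other end.
/// Only suitable for small payloads, since nothing drains the socket
/// while the write is in progress.
fn round_trip(payload: &[u8]) -> std::io::Result<Vec<u8>> {
    // Standard-library counterpart of socketpair(AF_UNIX, SOCK_STREAM, 0).
    let (mut writer, mut reader) = UnixStream::pair()?;
    writer.write_all(payload)?;
    // Dropping the writer closes its end, so the reader observes EOF.
    drop(writer);
    let mut received = Vec::new();
    reader.read_to_end(&mut received)?;
    Ok(received)
}
```

Closing the write end is what lets the reading side detect that no more data will arrive, which is the same property the receiver relies on when the collector exits.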

## Communication Protocol

### Data Format

The crash data is transmitted as a structured text stream with distinct sections delimited by markers defined in `datadog-crashtracker/src/shared/constants.rs`.

### Message Structure

Each crash report follows this sequence:

1. **Metadata Section**
2. **Configuration Section**
3. **Signal Information Section**
4. **Process Context Section**
5. **Process Information Section**
6. **Counters Section**
7. **Spans Section**
8. **Additional Tags Section**
9. **Traces Section**
10. **Memory Maps Section** (Linux only)
11. **Stack Trace Section**
12. **Completion Marker**

### Section Details

#### 1. Metadata Section
```
DD_CRASHTRACK_BEGIN_METADATA
{JSON metadata object}
DD_CRASHTRACK_END_METADATA
```

Contains serialized `Metadata` object with application context, tags, and environment information.

#### 2. Configuration Section
```
DD_CRASHTRACK_BEGIN_CONFIG
{JSON configuration object}
DD_CRASHTRACK_END_CONFIG
```

Contains serialized `CrashtrackerConfiguration` with crash tracking settings, endpoint information, and processing options.

#### 3. Signal Information Section
```
DD_CRASHTRACK_BEGIN_SIGINFO
{
  "si_code": <signal_code>,
  "si_code_human_readable": "<description>",
  "si_signo": <signal_number>,
  "si_signo_human_readable": "<signal_name>",
  "si_addr": "<fault_address>" // Optional, for memory faults
}
DD_CRASHTRACK_END_SIGINFO
```

Contains signal details extracted from `siginfo_t` structure.

**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:223-263`

#### 4. Process Context Section (ucontext)
```
DD_CRASHTRACK_BEGIN_UCONTEXT
<platform-specific context dump>
DD_CRASHTRACK_END_UCONTEXT
```

Contains processor state at crash time from `ucontext_t`. Format varies by platform:
- **Linux**: Direct debug print of `ucontext_t`
- **macOS**: Includes both `ucontext_t` and machine context (`mcontext`)

**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:190-221`

#### 5. Process Information Section
```
DD_CRASHTRACK_BEGIN_PROCINFO
{"pid": <process_id>}
DD_CRASHTRACK_END_PROCINFO
```

Contains the process ID of the crashing process.

#### 6. Counters Section
```
DD_CRASHTRACK_BEGIN_COUNTERS
<counter data>
DD_CRASHTRACK_END_COUNTERS
```

Contains internal crash tracker counters and metrics.

#### 7. Spans Section
```
DD_CRASHTRACK_BEGIN_SPANS
<span data>
DD_CRASHTRACK_END_SPANS
```

Contains active distributed tracing spans at crash time.

#### 8. Additional Tags Section
```
DD_CRASHTRACK_BEGIN_TAGS
<tag data>
DD_CRASHTRACK_END_TAGS
```

Contains additional tags collected at crash time.

#### 9. Traces Section
```
DD_CRASHTRACK_BEGIN_TRACES
<trace data>
DD_CRASHTRACK_END_TRACES
```

Contains active trace information.

#### 10. Memory Maps Section (Linux Only)
```
DD_CRASHTRACK_BEGIN_FILE /proc/self/maps
<contents of /proc/self/maps>
DD_CRASHTRACK_END_FILE "/proc/self/maps"
```

Contains memory mapping information from `/proc/self/maps` for symbol resolution.

**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:184-187`

#### 11. Stack Trace Section
```
DD_CRASHTRACK_BEGIN_STACKTRACE
{"ip": "<instruction_pointer>", "module_base_address": "<base>", "sp": "<stack_pointer>", "symbol_address": "<addr>"}
{"ip": "<instruction_pointer>", "module_base_address": "<base>", "sp": "<stack_pointer>", "symbol_address": "<addr>", "function": "<name>", "file": "<path>", "line": <number>}
...
DD_CRASHTRACK_END_STACKTRACE
```

Each line represents one stack frame. Frame format depends on symbol resolution setting:

- **Disabled/Receiver-only**: Only addresses (`ip`, `sp`, `symbol_address`, optional `module_base_address`)
- **In-process symbols**: Includes debug information (`function`, `file`, `line`, `column`)

Stack frames with stack pointer less than the fault stack pointer are filtered out to exclude crash tracker frames.

**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:45-117`
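
The stack-pointer filter described above can be sketched as follows. The `Frame` struct and `drop_tracker_frames` function are hypothetical and carry only the fields needed for the comparison; the actual emitter works on live unwinder frames rather than a pre-built slice.

```rust
/// Hypothetical minimal frame: just the addresses needed for filtering.
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    ip: u64,
    sp: u64,
}

/// Keeps only frames whose stack pointer is at or above the stack pointer
/// recorded at the fault. On a downward-growing stack, frames below that
/// address were pushed after the fault, i.e. by the crash tracker itself.
fn drop_tracker_frames(frames: &[Frame], fault_sp: u64) -> Vec<Frame> {
    frames
        .iter()
        .filter(|frame| frame.sp >= fault_sp)
        .cloned()
        .collect()
}
```

For example, with a fault stack pointer of `0x2000`, a frame at `sp = 0x1000` would be dropped while frames at `0x2000` and `0x3000` would be kept.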

#### 12. Completion Marker
```
DD_CRASHTRACK_DONE
```

Indicates end of crash report transmission.
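
The BEGIN/END framing shared by the sections above can be sketched with a small writer helper. This is an illustration only: `emit_section` and `emit_done` are hypothetical names, and the marker strings are written inline here rather than taken from `shared/constants.rs`.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes one delimited section and flushes it.
/// `name` is the marker suffix, for example "METADATA" or "PROCINFO".
fn emit_section<W: Write>(w: &mut W, name: &str, body: &str) -> io::Result<()> {
    writeln!(w, "DD_CRASHTRACK_BEGIN_{name}")?;
    writeln!(w, "{body}")?;
    writeln!(w, "DD_CRASHTRACK_END_{name}")?;
    // Flush per section so a later failure does not lose earlier sections.
    w.flush()
}

/// Hypothetical helper: writes the completion marker that ends a report.
fn emit_done<W: Write>(w: &mut W) -> io::Result<()> {
    writeln!(w, "DD_CRASHTRACK_DONE")?;
    w.flush()
}
```

Because the helper is generic over `Write`, the same code can target a `Vec<u8>` in tests and a socket in a real process.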

## Communication Flow

### 1. Collector Side (Write End)

**File**: `datadog-crashtracker/src/collector/collector_manager.rs:92-102`

```rust
let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) };

let report = emit_crashreport(
    &mut unix_stream,
    config,
    config_str,
    metadata_str,
    sig_info,
    ucontext,
    ppid,
);
```

The collector:
1. Creates `UnixStream` from inherited file descriptor
2. Calls `emit_crashreport()` to serialize and write all crash data
3. Flushes the stream after each section for reliability
4. Exits with `libc::_exit(0)` on completion

### 2. Receiver Side (Read End)

**File**: `datadog-crashtracker/src/receiver/entry_points.rs:97-119`

```rust
pub(crate) async fn receiver_entry_point(
    timeout: Duration,
    stream: impl AsyncBufReadExt + std::marker::Unpin,
) -> anyhow::Result<()> {
    if let Some((config, mut crash_info)) = receive_report_from_stream(timeout, stream).await? {
        // Process crash data
        if let Err(e) = resolve_frames(&config, &mut crash_info) {
            crash_info.log_messages.push(format!("Error resolving frames: {e}"));
        }
        if config.demangle_names() {
            if let Err(e) = crash_info.demangle_names() {
                crash_info.log_messages.push(format!("Error demangling names: {e}"));
            }
        }
        crash_info.async_upload_to_endpoint(config.endpoint()).await?;
    }
    Ok(())
}
```

The receiver:
1. Reads from stdin (Unix socket via `dup2`)
2. Parses the structured stream into `CrashInfo` and `CrashtrackerConfiguration`
3. Performs symbol resolution if configured
4. Uploads formatted crash report to backend

### 3. Stream Parsing

**File**: `datadog-crashtracker/src/receiver/receive_report.rs`

The receiver parses the stream by:
1. Reading line-by-line with timeout protection
2. Matching delimiter patterns to identify sections
3. Accumulating section data between delimiters
4. Deserializing JSON sections into appropriate data structures
5. Handling the `DD_CRASHTRACK_DONE` completion marker
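
The parsing steps above can be sketched as a small synchronous state machine. This is a simplified illustration under several assumptions: the function name `parse_sections` is hypothetical, section bodies are kept as raw strings instead of being deserialized, and the timeout and async I/O used by the real receiver are left out.

```rust
use std::collections::HashMap;
use std::io::BufRead;

/// Hypothetical sketch: splits a crash-report stream into sections keyed by
/// the text after `DD_CRASHTRACK_BEGIN_` (for the memory-map section the key
/// therefore includes the file path). The boolean reports whether the
/// `DD_CRASHTRACK_DONE` marker was seen before the stream ended.
fn parse_sections<R: BufRead>(reader: R) -> std::io::Result<(HashMap<String, String>, bool)> {
    let mut sections = HashMap::new();
    // Name and accumulated body of the section currently being read, if any.
    let mut current: Option<(String, String)> = None;
    let mut done = false;

    for line in reader.lines() {
        let line = line?;
        if line == "DD_CRASHTRACK_DONE" {
            done = true;
            break;
        }
        if let Some(name) = line.strip_prefix("DD_CRASHTRACK_BEGIN_") {
            current = Some((name.to_string(), String::new()));
        } else if line.starts_with("DD_CRASHTRACK_END_") {
            // Only a closed section is recorded; an unterminated one is dropped.
            if let Some((name, body)) = current.take() {
                sections.insert(name, body);
            }
        } else if let Some((_, body)) = current.as_mut() {
            body.push_str(&line);
            body.push('\n');
        }
    }
    Ok((sections, done))
}
```

A stream that ends without `DD_CRASHTRACK_DONE` yields `false` for the second element, which is how a caller could distinguish a truncated report from a complete one.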

## Error Handling and Reliability

### Signal Safety
- All collector operations use only async-signal-safe functions
- No memory allocation in signal handler context
- Pre-prepared data structures (`PreparedExecve`) to avoid allocations

### Timeout Protection
- Receiver has configurable timeout (default: 4000ms)
- Environment variable: `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`
- Prevents hanging on incomplete/corrupted streams
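
One way to resolve the timeout from the environment is sketched below. The function name `receiver_timeout` is hypothetical, and falling back to the default when the value is missing or not a valid number is an assumption, not something the source states.

```rust
use std::time::Duration;

/// Default receiver timeout in milliseconds, as documented above.
const DEFAULT_TIMEOUT_MS: u64 = 4000;

/// Hypothetical helper: turns the raw value of
/// `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS` into a `Duration`.
/// Assumption: unset or unparsable values fall back to the default.
fn receiver_timeout(raw: Option<&str>) -> Duration {
    let millis = raw
        .and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_TIMEOUT_MS);
    Duration::from_millis(millis)
}
```

A caller could pass `std::env::var("DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS").ok().as_deref()` as the argument; taking the raw string as a parameter keeps the parsing logic testable without touching the process environment.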

### Process Cleanup
- Parent process uses `wait_for_pollhup()` to detect socket closure
- Kills child processes with `SIGKILL` if needed
- Reaps zombie processes to prevent resource leaks

**File**: `datadog-crashtracker/src/collector/process_handle.rs:19-40`

### Data Integrity
- Each section is flushed immediately after writing
- Structured delimiters allow detection of incomplete transmissions
- Error messages are accumulated rather than failing fast

## Alternative Communication Modes

### Named Socket Mode
When `unix_socket_path` is configured, the collector connects to an existing Unix socket instead of using the fork+execve receiver:

```rust
let receiver = if unix_socket_path.is_empty() {
    Receiver::spawn_from_stored_config()? // Fork+execve mode
} else {
    Receiver::from_socket(unix_socket_path)? // Named socket mode
};
```

This allows integration with long-lived receiver processes.

**Linux Abstract Sockets**: On Linux, socket paths not starting with `.` or `/` are treated as abstract socket names.
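
The abstract-socket rule above amounts to a prefix check, sketched here. The `SocketKind` enum and `classify_socket_path` function are hypothetical names for illustration, and the sketch assumes a non-empty path, since an empty path selects fork+execve mode instead.

```rust
/// Hypothetical classification of a configured socket path on Linux.
#[derive(Debug, PartialEq)]
enum SocketKind {
    /// Name in the abstract socket namespace (no filesystem entry).
    Abstract,
    /// Regular path-based Unix socket.
    Filesystem,
}

/// Applies the documented rule: paths starting with `.` or `/` are
/// filesystem sockets, anything else is an abstract socket name.
/// Assumes `path` is non-empty.
fn classify_socket_path(path: &str) -> SocketKind {
    if path.starts_with('.') || path.starts_with('/') {
        SocketKind::Filesystem
    } else {
        SocketKind::Abstract
    }
}
```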

## Security Considerations

### File Descriptor Isolation
- Collector closes stdio file descriptors (0, 1, 2)
- Receiver redirects socket to stdin, stdout/stderr to configured files
- Minimizes attack surface during crash processing

### Process Isolation
- Fork+execve provides strong process boundary
- Crash in collector doesn't affect receiver
- Signal handlers are reset in receiver child

### Resource Limits
- Timeout prevents resource exhaustion
- Fixed buffer sizes for file operations
- Immediate flushing prevents large memory usage

## Debugging and Monitoring

### Log Output
- Receiver can be configured with `stdout_filename` and `stderr_filename`
- Error messages are accumulated in crash report
- Debug assertions validate critical operations

### Environment Variables
- `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`: Receiver timeout
- Standard Unix environment passed through execve

Together, the per-section flushing, delimiter framing, receiver timeout, and process isolation described above are intended to let crash data be collected and transmitted even when the main process is in an unstable state.
