From e18cf50483a1e09c29e5624a6db1c03204f0051e Mon Sep 17 00:00:00 2001
From: Gyuheon Oh
Date: Tue, 23 Sep 2025 19:35:47 +0000
Subject: [PATCH 1/2] Add doc to outline UNIX communication

---
 .../crashtracker-unix-socket-communication.md | 321 ++++++++++++++++++
 1 file changed, 321 insertions(+)
 create mode 100644 docs/crashtracker-unix-socket-communication.md

diff --git a/docs/crashtracker-unix-socket-communication.md b/docs/crashtracker-unix-socket-communication.md
new file mode 100644
index 000000000..a3b1dfe19
--- /dev/null
+++ b/docs/crashtracker-unix-socket-communication.md
@@ -0,0 +1,321 @@
+# Crash Tracker Unix Socket Communication Protocol
+
+**Date**: September 23, 2025
+
+## Overview
+
+This document describes the Unix domain socket communication protocol used between the crash tracker's collector and receiver processes. The crash tracker uses a two-process architecture where the collector (a fork of the crashing process) communicates crash data to the receiver (a fork+execve process) via an anonymous Unix domain socket pair.
+
+## Socket Creation and Setup
+
+The communication channel is established using `socketpair()` to create an anonymous Unix domain socket pair:
+
+```rust
+let (uds_parent, uds_child) = socket::socketpair(
+    socket::AddressFamily::Unix,
+    socket::SockType::Stream,
+    None,
+    socket::SockFlag::empty(),
+)?;
+```
+
+**Location**: `datadog-crashtracker/src/collector/receiver_manager.rs:78-85`
+
+### File Descriptor Management
+
+1. **Parent Process**: Retains `uds_parent` for tracking
+2. **Collector Process**: Inherits `uds_parent` as the write end
+3. **Receiver Process**: Gets `uds_child` redirected to stdin via `dup2(uds_child, 0)`
+
+## Communication Protocol
+
+### Data Format
+
+The crash data is transmitted as a structured text stream with distinct sections delimited by markers defined in `datadog-crashtracker/src/shared/constants.rs`.
+
+### Message Structure
+
+Each crash report follows this sequence:
+
+1. 
**Metadata Section**
+2. **Configuration Section**
+3. **Signal Information Section**
+4. **Process Context Section**
+5. **Process Information Section**
+6. **Counters Section**
+7. **Spans Section**
+8. **Additional Tags Section**
+9. **Traces Section**
+10. **Memory Maps Section** (Linux only)
+11. **Stack Trace Section**
+12. **Completion Marker**
+
+### Section Details
+
+#### 1. Metadata Section
+```
+DD_CRASHTRACK_BEGIN_METADATA
+{JSON metadata object}
+DD_CRASHTRACK_END_METADATA
+```
+
+Contains serialized `Metadata` object with application context, tags, and environment information.
+
+#### 2. Configuration Section
+```
+DD_CRASHTRACK_BEGIN_CONFIG
+{JSON configuration object}
+DD_CRASHTRACK_END_CONFIG
+```
+
+Contains serialized `CrashtrackerConfiguration` with crash tracking settings, endpoint information, and processing options.
+
+#### 3. Signal Information Section
+```
+DD_CRASHTRACK_BEGIN_SIGINFO
+{
+  "si_code": ,
+  "si_code_human_readable": "",
+  "si_signo": ,
+  "si_signo_human_readable": "",
+  "si_addr": "" // Optional, for memory faults
+}
+DD_CRASHTRACK_END_SIGINFO
+```
+
+Contains signal details extracted from `siginfo_t` structure.
+
+**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:223-263`
+
+#### 4. Process Context Section (ucontext)
+```
+DD_CRASHTRACK_BEGIN_UCONTEXT
+
+DD_CRASHTRACK_END_UCONTEXT
+```
+
+Contains processor state at crash time from `ucontext_t`. Format varies by platform:
+- **Linux**: Direct debug print of `ucontext_t`
+- **macOS**: Includes both `ucontext_t` and machine context (`mcontext`)
+
+**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:190-221`
+
+#### 5. Process Information Section
+```
+DD_CRASHTRACK_BEGIN_PROCINFO
+{"pid": }
+DD_CRASHTRACK_END_PROCINFO
+```
+
+Contains the process ID of the crashing process.
+
+#### 6. Counters Section
+```
+DD_CRASHTRACK_BEGIN_COUNTERS
+
+DD_CRASHTRACK_END_COUNTERS
+```
+
+Contains internal crash tracker counters and metrics.
+
+#### 7. 
Spans Section
+```
+DD_CRASHTRACK_BEGIN_SPANS
+
+DD_CRASHTRACK_END_SPANS
+```
+
+Contains active distributed tracing spans at crash time.
+
+#### 8. Additional Tags Section
+```
+DD_CRASHTRACK_BEGIN_TAGS
+
+DD_CRASHTRACK_END_TAGS
+```
+
+Contains additional tags collected at crash time.
+
+#### 9. Traces Section
+```
+DD_CRASHTRACK_BEGIN_TRACES
+
+DD_CRASHTRACK_END_TRACES
+```
+
+Contains active trace information.
+
+#### 10. Memory Maps Section (Linux Only)
+```
+DD_CRASHTRACK_BEGIN_FILE /proc/self/maps
+
+DD_CRASHTRACK_END_FILE "/proc/self/maps"
+```
+
+Contains memory mapping information from `/proc/self/maps` for symbol resolution.
+
+**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:184-187`
+
+#### 11. Stack Trace Section
+```
+DD_CRASHTRACK_BEGIN_STACKTRACE
+{"ip": "", "module_base_address": "", "sp": "", "symbol_address": ""}
+{"ip": "", "module_base_address": "", "sp": "", "symbol_address": "", "function": "", "file": "", "line": }
+...
+DD_CRASHTRACK_END_STACKTRACE
+```
+
+Each line represents one stack frame. Frame format depends on symbol resolution setting:
+
+- **Disabled/Receiver-only**: Only addresses (`ip`, `sp`, `symbol_address`, optional `module_base_address`)
+- **In-process symbols**: Includes debug information (`function`, `file`, `line`, `column`)
+
+Stack frames with stack pointer less than the fault stack pointer are filtered out to exclude crash tracker frames.
+
+**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:45-117`
+
+#### 12. Completion Marker
+```
+DD_CRASHTRACK_DONE
+```
+
+Indicates end of crash report transmission.
+
+## Communication Flow
+
+### 1. Collector Side (Write End)
+
+**File**: `datadog-crashtracker/src/collector/collector_manager.rs:92-102`
+
+```rust
+let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) };
+
+let report = emit_crashreport(
+    &mut unix_stream,
+    config,
+    config_str,
+    metadata_str,
+    sig_info,
+    ucontext,
+    ppid,
+);
+```
+
+The collector:
+1. 
Creates `UnixStream` from inherited file descriptor
+2. Calls `emit_crashreport()` to serialize and write all crash data
+3. Flushes the stream after each section for reliability
+4. Exits with `libc::_exit(0)` on completion
+
+### 2. Receiver Side (Read End)
+
+**File**: `datadog-crashtracker/src/receiver/entry_points.rs:97-119`
+
+```rust
+pub(crate) async fn receiver_entry_point(
+    timeout: Duration,
+    stream: impl AsyncBufReadExt + std::marker::Unpin,
+) -> anyhow::Result<()> {
+    if let Some((config, mut crash_info)) = receive_report_from_stream(timeout, stream).await? {
+        // Process crash data
+        if let Err(e) = resolve_frames(&config, &mut crash_info) {
+            crash_info.log_messages.push(format!("Error resolving frames: {e}"));
+        }
+        if config.demangle_names() {
+            if let Err(e) = crash_info.demangle_names() {
+                crash_info.log_messages.push(format!("Error demangling names: {e}"));
+            }
+        }
+        crash_info.async_upload_to_endpoint(config.endpoint()).await?;
+    }
+    Ok(())
+}
+```
+
+The receiver:
+1. Reads from stdin (Unix socket via `dup2`)
+2. Parses the structured stream into `CrashInfo` and `CrashtrackerConfiguration`
+3. Performs symbol resolution if configured
+4. Uploads formatted crash report to backend
+
+### 3. Stream Parsing
+
+**File**: `datadog-crashtracker/src/receiver/receive_report.rs`
+
+The receiver parses the stream by:
+1. Reading line-by-line with timeout protection
+2. Matching delimiter patterns to identify sections
+3. Accumulating section data between delimiters
+4. Deserializing JSON sections into appropriate data structures
+5. 
Handling the `DD_CRASHTRACK_DONE` completion marker
+
+## Error Handling and Reliability
+
+### Signal Safety
+- All collector operations use only async-signal-safe functions
+- No memory allocation in signal handler context
+- Pre-prepared data structures (`PreparedExecve`) to avoid allocations
+
+### Timeout Protection
+- Receiver has configurable timeout (default: 4000ms)
+- Environment variable: `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`
+- Prevents hanging on incomplete/corrupted streams
+
+### Process Cleanup
+- Parent process uses `wait_for_pollhup()` to detect socket closure
+- Kills child processes with `SIGKILL` if needed
+- Reaps zombie processes to prevent resource leaks
+
+**File**: `datadog-crashtracker/src/collector/process_handle.rs:19-40`
+
+### Data Integrity
+- Each section is flushed immediately after writing
+- Structured delimiters allow detection of incomplete transmissions
+- Error messages are accumulated rather than failing fast
+
+## Alternative Communication Modes
+
+### Named Socket Mode
+When `unix_socket_path` is configured, the collector connects to an existing Unix socket instead of using the fork+execve receiver:
+
+```rust
+let receiver = if unix_socket_path.is_empty() {
+    Receiver::spawn_from_stored_config()? // Fork+execve mode
+} else {
+    Receiver::from_socket(unix_socket_path)? // Named socket mode
+};
+```
+
+This allows integration with long-lived receiver processes.
+
+**Linux Abstract Sockets**: On Linux, socket paths not starting with `.` or `/` are treated as abstract socket names.
+
+## Security Considerations
+
+### File Descriptor Isolation
+- Collector closes stdio file descriptors (0, 1, 2)
+- Receiver redirects socket to stdin, stdout/stderr to configured files
+- Minimizes attack surface during crash processing
+
+### Process Isolation
+- Fork+execve provides strong process boundary
+- Crash in collector doesn't affect receiver
+- Signal handlers are reset in receiver child
+
+### Resource Limits
+- Timeout prevents resource exhaustion
+- Fixed buffer sizes for file operations
+- Immediate flushing prevents large memory usage
+
+## Debugging and Monitoring
+
+### Log Output
+- Receiver can be configured with `stdout_filename` and `stderr_filename`
+- Error messages are accumulated in crash report
+- Debug assertions validate critical operations
+
+### Environment Variables
+- `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`: Receiver timeout
+- Standard Unix environment passed through execve
+
+This communication protocol ensures reliable crash data collection and transmission even when the main process is in an unstable state, providing robust crash reporting capabilities for production systems.
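
The stream-parsing steps listed in the document added by this patch (read line by line, match delimiters, accumulate section data, stop at `DD_CRASHTRACK_DONE`) can be sketched in a few lines of Rust. This is a minimal, synchronous illustration and not the parser in `receive_report.rs`, which per the document is asynchronous, applies a timeout, and deserializes the JSON sections. The names `ParsedReport` and `parse_sections` are hypothetical, and deriving a section name from the first token after `DD_CRASHTRACK_BEGIN_` is an assumption made for the example.

```rust
use std::collections::BTreeMap;
use std::io::{self, BufRead};

// Marker prefixes taken from the section layout described in the document.
const BEGIN_PREFIX: &str = "DD_CRASHTRACK_BEGIN_";
const END_PREFIX: &str = "DD_CRASHTRACK_END_";
const DONE: &str = "DD_CRASHTRACK_DONE";

/// Hypothetical container: raw lines per section, plus whether the
/// completion marker was seen before the stream ended.
#[derive(Debug, Default)]
struct ParsedReport {
    sections: BTreeMap<String, Vec<String>>,
    complete: bool,
}

/// Splits a delimited crash stream into sections (illustrative only).
fn parse_sections(reader: impl BufRead) -> io::Result<ParsedReport> {
    let mut report = ParsedReport::default();
    let mut current: Option<String> = None;

    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim_end();

        if trimmed == DONE {
            // Completion marker: everything after it is ignored.
            report.complete = true;
            break;
        }
        if let Some(rest) = trimmed.strip_prefix(BEGIN_PREFIX) {
            // Assumption: the section name is the first token after the
            // prefix, so "BEGIN_FILE /proc/self/maps" becomes "FILE".
            let name = rest.split_whitespace().next().unwrap_or("").to_string();
            report.sections.entry(name.clone()).or_default();
            current = Some(name);
        } else if trimmed.starts_with(END_PREFIX) {
            current = None;
        } else if let Some(name) = &current {
            // Payload line inside an open section.
            if let Some(lines) = report.sections.get_mut(name) {
                lines.push(trimmed.to_string());
            }
        }
        // Lines outside any section are dropped in this sketch.
    }
    Ok(report)
}

fn main() -> io::Result<()> {
    let stream = "DD_CRASHTRACK_BEGIN_PROCINFO\n{\"pid\": 1234}\n\
                  DD_CRASHTRACK_END_PROCINFO\n\
                  DD_CRASHTRACK_BEGIN_STACKTRACE\n{\"ip\": \"0x1\"}\n{\"ip\": \"0x2\"}\n\
                  DD_CRASHTRACK_END_STACKTRACE\nDD_CRASHTRACK_DONE\n";
    let report = parse_sections(stream.as_bytes())?;
    println!("complete: {}", report.complete);
    for (name, lines) in &report.sections {
        println!("{name}: {} line(s)", lines.len());
    }
    Ok(())
}
```

Tracking `complete` separately from the sections mirrors the document's point that structured delimiters allow detection of incomplete transmissions: a stream that ends without the completion marker still yields whatever sections arrived.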
From e6d47ef26f34e6f2b4510f567cc9686003e8e239 Mon Sep 17 00:00:00 2001 From: Gyuheon Oh Date: Thu, 25 Sep 2025 17:18:05 +0000 Subject: [PATCH 2/2] Rust docs --- .../src/collector/collector_manager.rs | 130 ++++++- .../src/collector/emitters.rs | 79 ++++ .../src/collector/receiver_manager.rs | 125 +++++- .../src/receiver/entry_points.rs | 98 ++++- .../src/receiver/receive_report.rs | 65 +++- datadog-crashtracker/src/shared/constants.rs | 90 ++++- datadog-crashtracker/src/shared/mod.rs | 2 + .../src/shared/unix_socket_communication.rs | 365 ++++++++++++++++++ .../crashtracker-unix-socket-communication.md | 321 --------------- 9 files changed, 909 insertions(+), 366 deletions(-) create mode 100644 datadog-crashtracker/src/shared/unix_socket_communication.rs delete mode 100644 docs/crashtracker-unix-socket-communication.md diff --git a/datadog-crashtracker/src/collector/collector_manager.rs b/datadog-crashtracker/src/collector/collector_manager.rs index 7353ef4fa..1af253ccb 100644 --- a/datadog-crashtracker/src/collector/collector_manager.rs +++ b/datadog-crashtracker/src/collector/collector_manager.rs @@ -1,6 +1,46 @@ // Copyright 2025-Present Datadog, Inc. https://www.datadoghq.com/ // SPDX-License-Identifier: Apache-2.0 +//! Crash data collector process management for Unix socket communication. +//! +//! This module manages the collector process that writes crash data to Unix sockets. +//! The collector runs in a forked child process and is responsible for serializing +//! and transmitting crash information to the receiver process. +//! +//! ## Communication Flow (Collector Side) +//! +//! The collector performs these steps to transmit crash data: +//! +//! 1. **Process Setup**: Forks from crashing process, closes stdio, disables SIGPIPE +//! 2. **Socket Creation**: Creates `UnixStream` from inherited file descriptor +//! 3. **Data Serialization**: Calls [`emit_crashreport()`] to write structured crash data +//! 4. 
**Graceful Exit**: Flushes data and exits with `libc::_exit(0)` +//! +//! ```text +//! ┌─────────────────────┐ ┌──────────────────────┐ +//! │ Signal Handler │ │ Collector Process │ +//! │ (Original Process) │ │ (Forked Child) │ +//! │ │ │ │ +//! │ 1. Catch crash │────fork()──────────►│ 2. Setup stdio │ +//! │ 2. Fork collector │ │ 3. Create UnixStream │ +//! │ 3. Wait for child │ │ 4. Write crash data │ +//! │ │◄────wait()──────────│ 5. Exit cleanly │ +//! └─────────────────────┘ └──────────────────────┘ +//! ``` +//! +//! ## Signal Safety +//! +//! All collector operations use only async-signal-safe functions since the collector +//! runs in a signal handler context: +//! +//! - No memory allocations +//! - Pre-prepared data structures +//! - Only safe system calls +//! +//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. +//! +//! [`emit_crashreport()`]: crate::collector::emitters::emit_crashreport + use super::process_handle::ProcessHandle; use super::receiver_manager::Receiver; use ddcommon::timeout::TimeoutManager; @@ -25,6 +65,42 @@ pub enum CollectorSpawnError { } impl Collector { + /// Spawns a collector process to write crash data to the Unix socket. + /// + /// This method forks a child process that will serialize and transmit crash data + /// to the receiver process via the Unix socket established in the receiver. + /// + /// ## Process Architecture + /// + /// ```text + /// Parent Process (Signal Handler) Child Process (Collector) + /// ┌─────────────────────────────┐ ┌─────────────────────────────┐ + /// │ 1. Catches crash signal │ │ 4. Closes stdio (0,1,2) │ + /// │ 2. Forks collector process │──►│ 5. Disables SIGPIPE │ + /// │ 3. Returns to caller │ │ 6. Creates UnixStream │ + /// │ │ │ 7. Calls emit_crashreport() │ + /// │ │ │ 8. 
Exits with _exit(0) │ + /// └─────────────────────────────┘ └─────────────────────────────┘ + /// ``` + /// + /// ## Arguments + /// + /// * `receiver` - The receiver process that will read crash data from the Unix socket + /// * `config` - Crash tracker configuration + /// * `config_str` - JSON-serialized configuration string + /// * `metadata_str` - JSON-serialized metadata string + /// * `sig_info` - Signal information from the crash + /// * `ucontext` - Process context at crash time + /// + /// ## Returns + /// + /// * `Ok(Collector)` - Handle to the spawned collector process + /// * `Err(CollectorSpawnError::ForkFailed)` - If the fork operation fails + /// + /// ## Safety + /// + /// This function is called from signal handler context and uses only async-signal-safe operations. + /// The child process performs all potentially unsafe operations after fork. pub(crate) fn spawn( receiver: &Receiver, config: &CrashtrackerConfiguration, @@ -33,8 +109,8 @@ impl Collector { sig_info: *const siginfo_t, ucontext: *const ucontext_t, ) -> Result { - // When we spawn the child, our pid becomes the ppid. - // SAFETY: This function has no safety requirements. + // When we spawn the child, our pid becomes the ppid for process tracking. + // SAFETY: getpid() is async-signal-safe. let pid = unsafe { libc::getpid() }; let fork_result = alt_fork(); @@ -66,6 +142,42 @@ impl Collector { } } +/// Collector child process entry point - serializes and transmits crash data via Unix socket. +/// +/// This function runs in the forked collector process and performs the actual crash data +/// transmission. It establishes the Unix socket connection and writes all crash information +/// using the structured protocol. +/// +/// ## Process Flow +/// +/// 1. **Isolate from parent**: Closes stdin, stdout, stderr to prevent interference +/// 2. **Signal handling**: Disables SIGPIPE to handle broken pipe gracefully +/// 3. 
**Socket setup**: Creates `UnixStream` from inherited file descriptor +/// 4. **Data transmission**: Calls [`emit_crashreport()`] to write structured crash data +/// 5. **Clean exit**: Exits with `_exit(0)` to avoid cleanup issues +/// +/// ## Communication Protocol +/// +/// The crash data is written as a structured stream with delimited sections: +/// - Metadata, Configuration, Signal Info, Process Context +/// - Counters, Spans, Tags, Traces, Memory Maps, Stack Trace +/// - Completion marker +/// +/// For details, see [`crate::shared::unix_socket_communication`]. +/// +/// ## Arguments +/// +/// * `config` - Crash tracker configuration object +/// * `config_str` - JSON-serialized configuration for receiver +/// * `metadata_str` - JSON-serialized metadata for receiver +/// * `sig_info` - Signal information from crash context +/// * `ucontext` - Processor context at crash time +/// * `uds_fd` - Unix socket file descriptor for writing crash data +/// * `ppid` - Parent process ID for identification +/// +/// This function never returns - it always exits via `_exit(0)` or `terminate()`. +/// +/// [`emit_crashreport()`]: crate::collector::emitters::emit_crashreport pub(crate) fn run_collector_child( config: &CrashtrackerConfiguration, config_str: &str, @@ -75,12 +187,13 @@ pub(crate) fn run_collector_child( uds_fd: RawFd, ppid: libc::pid_t, ) -> ! 
{ - // Close stdio - let _ = unsafe { libc::close(0) }; - let _ = unsafe { libc::close(1) }; - let _ = unsafe { libc::close(2) }; + // Close stdio to isolate from parent process and prevent interference with crash data transmission + let _ = unsafe { libc::close(0) }; // stdin + let _ = unsafe { libc::close(1) }; // stdout + let _ = unsafe { libc::close(2) }; // stderr - // Disable SIGPIPE + // Disable SIGPIPE - if receiver closes socket early, we want to handle it gracefully + // rather than being killed by SIGPIPE let _ = unsafe { signal::sigaction( signal::SIGPIPE, @@ -88,9 +201,10 @@ pub(crate) fn run_collector_child( ) }; - // Emit crashreport + // Create Unix socket stream for crash data transmission let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) }; + // Serialize and transmit all crash data using structured protocol let report = emit_crashreport( &mut unix_stream, config, diff --git a/datadog-crashtracker/src/collector/emitters.rs b/datadog-crashtracker/src/collector/emitters.rs index 4794a1410..236a0cd88 100644 --- a/datadog-crashtracker/src/collector/emitters.rs +++ b/datadog-crashtracker/src/collector/emitters.rs @@ -1,6 +1,39 @@ // Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/ // SPDX-License-Identifier: Apache-2.0 +//! Crash data emission and Unix socket protocol serialization. +//! +//! This module implements the collector-side serialization of crash data using the +//! Unix socket communication protocol. It writes structured crash information to +//! Unix domain sockets for consumption by receiver processes. +//! +//! ## Protocol Emission +//! +//! The emitter writes crash data as a series of delimited sections: +//! +//! 1. **Section Delimiters**: Uses constants from [`crate::shared::constants`] to mark boundaries +//! 2. **Structured Data**: Writes JSON, text, or binary data within sections +//! 3. **Immediate Flushing**: Flushes each section to ensure data integrity +//! 4. 
**Completion Marker**: Ends transmission with `DD_CRASHTRACK_DONE` +//! +//! ## Section Format Implementation +//! +//! Each section follows this pattern: +//! ```text +//! DD_CRASHTRACK_BEGIN_[SECTION] +//! [section data - JSON, text, or binary] +//! DD_CRASHTRACK_END_[SECTION] +//! ``` +//! +//! ### Key Sections +//! +//! - **Stack Trace** (`emit_backtrace_by_frames`): Stack frames with optional symbol resolution +//! - **Signal Info** (`emit_siginfo`): Signal details from `siginfo_t` +//! - **Process Context** (`emit_ucontext`): Processor state from `ucontext_t` +//! - **Memory Maps** (`emit_file`): `/proc/self/maps` for symbol resolution +//! +//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. + use crate::collector::additional_tags::consume_and_emit_additional_tags; use crate::collector::counters::emit_counters; use crate::collector::spans::{emit_spans, emit_traces}; @@ -116,6 +149,52 @@ unsafe fn emit_backtrace_by_frames( Ok(()) } +/// Emits a complete crash report using the Unix socket communication protocol. +/// +/// This is the main function that orchestrates the emission of all crash data sections +/// to the Unix socket. It writes the structured crash report in the order specified +/// by the protocol, with proper delimiters and flushing for data integrity. +/// +/// ## Section Emission Order +/// +/// The crash report is written in this specific order: +/// 1. **Metadata** - Application context, tags, environment info +/// 2. **Configuration** - Crash tracker settings and endpoint info +/// 3. **Signal Information** - Details from `siginfo_t` +/// 4. **Process Context** - CPU state from `ucontext_t` +/// 5. **Process Information** - Process ID +/// 6. **Counters** - Internal crash tracker metrics +/// 7. **Spans** - Active distributed tracing spans +/// 8. **Additional Tags** - Extra tags collected at crash time +/// 9. **Traces** - Active trace information +/// 10. 
**Memory Maps** (Linux only) - `/proc/self/maps` content +/// 11. **Stack Trace** - Stack frames with symbol resolution +/// 12. **Completion Marker** - `DD_CRASHTRACK_DONE` +/// +/// ## Data Integrity +/// +/// Each section is immediately flushed after writing to ensure the receiver +/// can process partial data even if the collector crashes during transmission. +/// +/// ## Arguments +/// +/// * `pipe` - Write stream (typically Unix socket) +/// * `config` - Crash tracker configuration object +/// * `config_str` - JSON-serialized configuration for receiver +/// * `metadata_string` - JSON-serialized metadata +/// * `sig_info` - Signal information from crash context +/// * `ucontext` - Processor context at crash time +/// * `ppid` - Parent process ID +/// +/// ## Returns +/// +/// * `Ok(())` - All crash data written successfully +/// * `Err(EmitterError)` - I/O error or data serialization failure +/// +/// ## Signal Safety +/// +/// This function is designed to be called from signal handler context and uses +/// only async-signal-safe operations where possible. pub(crate) fn emit_crashreport( pipe: &mut impl Write, config: &CrashtrackerConfiguration, diff --git a/datadog-crashtracker/src/collector/receiver_manager.rs b/datadog-crashtracker/src/collector/receiver_manager.rs index cf567a411..4e58cd720 100644 --- a/datadog-crashtracker/src/collector/receiver_manager.rs +++ b/datadog-crashtracker/src/collector/receiver_manager.rs @@ -1,6 +1,31 @@ // Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/ // SPDX-License-Identifier: Apache-2.0 +//! Crash tracker receiver process management and Unix socket communication setup. +//! +//! This module handles the creation and management of receiver processes that collect +//! crash data via Unix domain sockets. The crash tracker uses a two-process architecture: +//! +//! 1. **Collector Process**: Forks from the crashing process and writes crash data to a Unix socket +//! 2. 
**Receiver Process**: Created via fork+execve, reads from Unix socket and processes/uploads crash data +//! +//! ## Socket Communication Architecture +//! +//! The communication uses anonymous Unix domain socket pairs created with [`socketpair()`]: +//! +//! ```text +//! ┌─────────────────┐ socketpair() ┌─────────────────┐ +//! │ Collector │◄───────────────────►│ Receiver │ +//! │ (Crashing proc) │ │ (Fork+execve) │ +//! │ │ Write End │ │ +//! │ uds_parent ────────────────────────────────► stdin (fd=0) │ +//! └─────────────────┘ └─────────────────┘ +//! ``` +//! +//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. +//! +//! [`socketpair()`]: nix::sys::socket::socketpair + use super::process_handle::ProcessHandle; use ddcommon::timeout::TimeoutManager; @@ -40,10 +65,26 @@ pub(crate) struct Receiver { } impl Receiver { + /// Creates a receiver that connects to an existing Unix socket. + /// + /// This mode is used when connecting to a long-lived receiver process instead of + /// spawning a new one via fork+execve. The collector will write crash data directly + /// to the provided Unix socket. + /// + /// ## Socket Path Formats + /// + /// - **File system sockets**: Paths starting with `.` or `/` (e.g., `/tmp/receiver.sock`) + /// - **Abstract sockets** (Linux only): Names not starting with `.` or `/` (e.g., `crashtracker-receiver`) + /// + /// ## Arguments + /// + /// * `unix_socket_path` - Path to the Unix socket (file system or abstract name) + /// + /// ## Errors + /// + /// * [`ReceiverError::NoReceiverPath`] - If the path is empty + /// * [`ReceiverError::ConnectionError`] - If connection to the socket fails pub(crate) fn from_socket(unix_socket_path: &str) -> Result { - // Creates a fake "Receiver", which can be waited on like a normal receiver. - // This is intended to support configurations where the collector is speaking to a - // long-lived, async receiver process. 
if unix_socket_path.is_empty() { return Err(ReceiverError::NoReceiverPath); } @@ -66,6 +107,46 @@ impl Receiver { }) } + /// Spawns a new receiver process using fork+execve with Unix socket communication. + /// + /// This is the primary method for creating receiver processes. It: + /// + /// 1. **Creates socket pair**: Uses [`socketpair()`] to establish anonymous Unix domain socket communication + /// 2. **Forks process**: Creates child process that will become the receiver + /// 3. **Sets up file descriptors**: Redirects socket to stdin, configures stdout/stderr + /// 4. **Executes receiver**: Child process executes the receiver binary + /// + /// ## Socket Architecture + /// + /// ```text + /// Parent Process Child Process + /// ┌─────────────────────┐ ┌─────────────────────┐ + /// │ Collector │ │ Receiver │ + /// │ │ │ │ + /// │ uds_parent (write) ─────────────────► stdin (fd=0) │ + /// │ │ │ stdout (configured) │ + /// │ │ │ stderr (configured) │ + /// └─────────────────────┘ └─────────────────────┘ + /// ``` + /// + /// ## File Descriptor Management + /// + /// - **Parent**: Keeps `uds_parent` for writing crash data + /// - **Child**: `uds_child` redirected to stdin, original socket closed + /// - **Stdio**: Child's stdout/stderr redirected to configured files or `/dev/null` + /// + /// ## Arguments + /// + /// * `config` - Receiver configuration including binary path and I/O redirection + /// * `prepared_exec` - Pre-prepared execve arguments and environment (avoids allocations in signal handler) + /// + /// ## Errors + /// + /// * [`ReceiverError::FileOpenError`] - Failed to open stdout/stderr files + /// * [`ReceiverError::SocketPairError`] - Failed to create socket pair + /// * [`ReceiverError::ForkFailed`] - Fork operation failed + /// + /// [`socketpair()`]: nix::sys::socket::socketpair pub(crate) fn spawn_from_config( config: &CrashtrackerReceiverConfig, prepared_exec: &PreparedExecve, @@ -75,7 +156,10 @@ impl Receiver { let stdout = 
open_file_or_quiet(config.stdout_filename.as_deref()) .map_err(ReceiverError::FileOpenError)?; - // Create anonymous Unix domain socket pair for communication + // Create anonymous Unix domain socket pair for communication between collector and receiver. + // This establishes a bidirectional communication channel where: + // - uds_parent: Used by collector (parent/grandparent) process for writing crash data + // - uds_child: Used by receiver process, redirected to stdin for reading crash data let (uds_parent, uds_child) = socket::socketpair( socket::AddressFamily::Unix, socket::SockType::Stream, @@ -148,18 +232,41 @@ impl Receiver { } } -/// Wrapper around the child process that will run the crash receiver +/// Child process entry point that sets up file descriptors and executes the receiver binary. +/// +/// This function is called only in the child process after fork. It performs critical +/// file descriptor setup to establish the Unix socket communication channel: +/// +/// ## File Descriptor Setup +/// +/// 1. **stdin (fd=0)**: Redirected to `uds_child` socket for receiving crash data +/// 2. **stdout (fd=1)**: Redirected to configured output file or `/dev/null` +/// 3. **stderr (fd=2)**: Redirected to configured error file or `/dev/null` +/// +/// ## Signal Handler Reset +/// +/// Signal handlers are reset to default disposition to ensure clean receiver operation. +/// +/// ## Arguments +/// +/// * `prepared_exec` - Pre-prepared execve arguments and environment +/// * `uds_child` - Unix socket file descriptor for reading crash data +/// * `stderr` - File descriptor for stderr redirection +/// * `stdout` - File descriptor for stdout redirection +/// +/// This function never returns - it either successfully executes the receiver binary +/// or terminates the process. fn run_receiver_child( prepared_exec: &PreparedExecve, uds_child: RawFd, stderr: RawFd, stdout: RawFd, ) -> ! 
{ - // File descriptor management + // File descriptor management: Redirect Unix socket to stdin so receiver can read crash data unsafe { - let _ = libc::dup2(uds_child, 0); - let _ = libc::dup2(stdout, 1); - let _ = libc::dup2(stderr, 2); + let _ = libc::dup2(uds_child, 0); // stdin = Unix socket (crash data input) + let _ = libc::dup2(stdout, 1); // stdout = configured output file + let _ = libc::dup2(stderr, 2); // stderr = configured error file } // Close unused file descriptors diff --git a/datadog-crashtracker/src/receiver/entry_points.rs b/datadog-crashtracker/src/receiver/entry_points.rs index 4aabcddd8..0ef125af8 100644 --- a/datadog-crashtracker/src/receiver/entry_points.rs +++ b/datadog-crashtracker/src/receiver/entry_points.rs @@ -1,6 +1,45 @@ // SPDX-License-Identifier: Apache-2.0 // Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/ +//! Crash tracker receiver entry points for Unix socket communication. +//! +//! This module provides the receiver-side implementation of the crash tracker's +//! Unix socket communication protocol. The receiver processes crash data sent +//! by the collector via Unix domain sockets. +//! +//! ## Receiver Architecture +//! +//! The receiver operates in multiple modes: +//! +//! ### 1. Fork+Execve Mode (Primary) +//! ```text +//! ┌─────────────────────┐ Unix Socket ┌─────────────────────┐ +//! │ Collector Process │──────────────────►│ Receiver Process │ +//! │ (Write crash data) │ │ (Read via stdin) │ +//! └─────────────────────┘ └─────────────────────┘ +//! ``` +//! +//! ### 2. Named Socket Mode (Long-lived receiver) +//! ```text +//! ┌─────────────────────┐ Named Socket ┌─────────────────────┐ +//! │ Collector Process │──────────────────►│ Long-lived Receiver │ +//! │ (Connect to socket) │ │ (Listen on socket) │ +//! └─────────────────────┘ └─────────────────────┘ +//! ``` +//! +//! ## Processing Pipeline +//! +//! The receiver performs these operations on crash data: +//! +//! 1. 
**Parse Stream**: Read structured crash data using [`receive_report_from_stream()`] +//! 2. **Symbol Resolution**: Resolve stack frame symbols if configured +//! 3. **Name Demangling**: Demangle C++/Rust symbol names if enabled +//! 4. **Upload/Output**: Send formatted crash report to configured endpoint +//! +//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. +//! +//! [`receive_report_from_stream()`]: crate::receiver::receive_report::receive_report_from_stream + use super::receive_report::receive_report_from_stream; use crate::{crash_info::CrashInfo, CrashtrackerConfiguration, StacktraceCollection}; use anyhow::Context; @@ -85,25 +124,68 @@ pub fn get_receiver_unix_socket(socket_path: impl AsRef) -> anyhow::Result< unix_listener.context("Could not create the unix socket") } -/// Receives data from a crash collector via a stream, formats it into -/// `CrashInfo` json, and emits it to the endpoint/file defined in `config`. +/// Core receiver entry point that processes crash data from Unix socket stream. +/// +/// This is the main processing function that handles crash data received via Unix domain sockets. +/// It parses the structured crash data stream, performs symbol resolution and name demangling, +/// and uploads the formatted crash report to the configured endpoint. +/// +/// ## Processing Pipeline +/// +/// 1. **Stream Parsing**: Reads crash data using the structured Unix socket protocol +/// 2. **Symbol Resolution**: Resolves memory addresses to function names, file names, and line numbers +/// 3. **Name Demangling**: Converts mangled C++/Rust symbols to readable names +/// 4. **Error Accumulation**: Collects any processing errors in the crash report +/// 5. 
**Upload**: Transmits the formatted crash report to the backend endpoint +/// +/// ## Protocol Flow +/// +/// ```text +/// Unix Socket Stream → Parse Sections → Resolve Symbols → Demangle Names → Upload +/// │ │ │ │ │ +/// v v v v v +/// ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────┐ ┌─────────┐ +/// │ Delimited │ │ CrashInfo + │ │ Enriched │ │ Readable│ │ Backend │ +/// │ Sections │──►│ Config │──►│ Stack Frames│──►│ Symbols │──►│ Upload │ +/// │ (Protocol) │ │ Objects │ │ (Addresses) │ │ (Names) │ │ (JSON) │ +/// └─────────────┘ └─────────────┘ └─────────────┘ └────────┘ └─────────┘ +/// ``` +/// +/// ## Arguments /// -/// At a high-level, this exists because doing anything in a -/// signal handler is dangerous, so we fork a sidecar to do the stuff we aren't -/// allowed to do in the handler. +/// * `timeout` - Maximum time to wait for complete crash data stream +/// * `stream` - Async buffered stream containing crash data (usually Unix socket via stdin) /// -/// See comments in [datadog-crashtracker/lib.rs] for a full architecture -/// description. +/// ## Returns +/// +/// * `Ok(())` - Crash report processed and uploaded successfully +/// * `Err(anyhow::Error)` - Stream parsing, processing, or upload failed +/// +/// ## Timeout Behavior +/// +/// If the crash data stream is incomplete or corrupted, the function will timeout +/// after the specified duration to prevent hanging indefinitely. The timeout can +/// be configured via `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS` environment variable. +/// +/// ## Error Handling +/// +/// Processing errors (symbol resolution, demangling) are non-fatal and are accumulated +/// in the crash report's log messages. Only stream parsing and upload errors cause +/// the function to return an error. 
pub(crate) async fn receiver_entry_point( timeout: Duration, stream: impl AsyncBufReadExt + std::marker::Unpin, ) -> anyhow::Result<()> { + // Parse structured crash data stream into configuration and crash information if let Some((config, mut crash_info)) = receive_report_from_stream(timeout, stream).await? { + // Attempt symbol resolution - errors are accumulated, not fatal if let Err(e) = resolve_frames(&config, &mut crash_info) { crash_info .log_messages .push(format!("Error resolving frames: {e}")); } + + // Attempt name demangling if enabled - errors are accumulated, not fatal if config.demangle_names() { if let Err(e) = crash_info.demangle_names() { crash_info @@ -111,6 +193,8 @@ pub(crate) async fn receiver_entry_point( .push(format!("Error demangling names: {e}")); } } + + // Upload formatted crash report to backend endpoint crash_info .async_upload_to_endpoint(config.endpoint()) .await?; diff --git a/datadog-crashtracker/src/receiver/receive_report.rs b/datadog-crashtracker/src/receiver/receive_report.rs index f96e972fa..6f55fbc72 100644 --- a/datadog-crashtracker/src/receiver/receive_report.rs +++ b/datadog-crashtracker/src/receiver/receive_report.rs @@ -1,6 +1,42 @@ // Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/ // SPDX-License-Identifier: Apache-2.0 +//! Unix socket stream parsing for crash tracker receiver. +//! +//! This module implements the receiver-side parsing of the Unix socket communication protocol. +//! It reads the structured crash data stream sent by the collector and reconstructs the +//! crash information and configuration objects. +//! +//! ## Stream Parsing Process +//! +//! The parser operates as a state machine that processes the delimited sections: +//! +//! 1. **Line-by-line reading**: Reads the stream with timeout protection +//! 2. **Delimiter matching**: Identifies section boundaries using protocol markers +//! 3. **Section accumulation**: Collects data between begin/end delimiters +//! 4. 
**JSON deserialization**: Converts section data into appropriate data structures +//! 5. **State transitions**: Moves between parsing states until completion marker +//! +//! ```text +//! ┌─────────────────┐ Read Line ┌─────────────────┐ Match Delimiter +//! │ Unix Socket │─────────────────►│ Line Buffer │─────────────────────┐ +//! │ Stream │ │ │ │ +//! └─────────────────┘ └─────────────────┘ │ +//! │ +//! v +//! ┌─────────────────┐ Build Objects ┌─────────────────┐ Accumulate Data │ +//! │ CrashInfo + │◄─────────────────│ Section Data │◄────────────────────┘ +//! │ Configuration │ │ Collection │ +//! └─────────────────┘ └─────────────────┘ +//! ``` +//! +//! ## State Machine +//! +//! The [`StdinState`] enum tracks the current parsing state and accumulates data +//! for multi-line sections until complete. +//! +//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. + use crate::{ crash_info::{CrashInfo, CrashInfoBuilder, ErrorKind, Span, TelemetryCrashUploader}, shared::constants::*, @@ -35,18 +71,39 @@ async fn send_crash_ping_to_url( Ok(()) } -/// The crashtracker collector sends data in blocks. -/// This enum tracks which block we're currently in, and, for multi-line blocks, -/// collects the partial data until the block is closed and it can be appended -/// to the CrashReport. +/// State machine for parsing Unix socket crash data stream. +/// +/// This enum tracks the current parsing state as the receiver processes the structured +/// crash data stream. Each variant represents a different section of the crash report +/// protocol, and for multi-line sections, accumulates partial data until the section +/// is complete. 
+/// +/// ## State Transitions +/// +/// The parser transitions between states based on delimiter markers: +/// - `DD_CRASHTRACK_BEGIN_*` markers transition to data collection states +/// - `DD_CRASHTRACK_END_*` markers complete sections and process accumulated data +/// - `DD_CRASHTRACK_DONE` transitions to the final Done state +/// +/// ## Multi-line Section Handling +/// +/// Some states like `File` and `Stacktrace` accumulate multiple lines of data +/// between their begin/end delimiters before processing. #[derive(Debug)] pub(crate) enum StdinState { + /// Parsing additional tags section AdditionalTags, + /// Parsing configuration section (JSON) Config, + /// Parsing internal counters section Counters, + /// Parsing complete - crash report transmission finished Done, + /// Parsing file section (filename, content lines) File(String, Vec), + /// Parsing metadata section (JSON) Metadata, + /// Parsing process information section (JSON) ProcInfo, SigInfo, SpanIds, diff --git a/datadog-crashtracker/src/shared/constants.rs b/datadog-crashtracker/src/shared/constants.rs index e2da98c1a..922d6132e 100644 --- a/datadog-crashtracker/src/shared/constants.rs +++ b/datadog-crashtracker/src/shared/constants.rs @@ -1,30 +1,86 @@ // Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/ // SPDX-License-Identifier: Apache-2.0 +//! Constants used for the Unix socket communication protocol between crash tracker collector and receiver. +//! +//! This module contains all the delimiter constants that structure the crash report data stream. +//! These constants are used to mark the beginning and end of different sections in the crash report, +//! allowing the receiver to properly parse and reconstruct the crash information. +//! +//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. 
+ use std::time::Duration; -pub const DD_CRASHTRACK_BEGIN_ADDITIONAL_TAGS: &str = "DD_CRASHTRACK_BEGIN_ADDITIONAL_TAGS"; -pub const DD_CRASHTRACK_BEGIN_CONFIG: &str = "DD_CRASHTRACK_BEGIN_CONFIG"; -pub const DD_CRASHTRACK_BEGIN_COUNTERS: &str = "DD_CRASHTRACK_BEGIN_COUNTERS"; -pub const DD_CRASHTRACK_BEGIN_FILE: &str = "DD_CRASHTRACK_BEGIN_FILE"; +// Section delimiters for the crash report stream protocol + +/// Marks the beginning of the metadata section containing application context, tags, and environment information. +/// The section contains a JSON-serialized `Metadata` object. pub const DD_CRASHTRACK_BEGIN_METADATA: &str = "DD_CRASHTRACK_BEGIN_METADATA"; -pub const DD_CRASHTRACK_BEGIN_PROCINFO: &str = "DD_CRASHTRACK_BEGIN_PROCESSINFO"; +/// Marks the end of the metadata section. +pub const DD_CRASHTRACK_END_METADATA: &str = "DD_CRASHTRACK_END_METADATA"; + +/// Marks the beginning of the configuration section containing crash tracking settings. +/// The section contains a JSON-serialized `CrashtrackerConfiguration` object with endpoint information and processing options. +pub const DD_CRASHTRACK_BEGIN_CONFIG: &str = "DD_CRASHTRACK_BEGIN_CONFIG"; +/// Marks the end of the configuration section. +pub const DD_CRASHTRACK_END_CONFIG: &str = "DD_CRASHTRACK_END_CONFIG"; + +/// Marks the beginning of the signal information section containing crash signal details. +/// The section contains JSON with signal code, number, human-readable names, and fault address (if applicable). pub const DD_CRASHTRACK_BEGIN_SIGINFO: &str = "DD_CRASHTRACK_BEGIN_SIGINFO"; -pub const DD_CRASHTRACK_BEGIN_SPAN_IDS: &str = "DD_CRASHTRACK_BEGIN_SPAN_IDS"; -pub const DD_CRASHTRACK_BEGIN_STACKTRACE: &str = "DD_CRASHTRACK_BEGIN_STACKTRACE"; -pub const DD_CRASHTRACK_BEGIN_TRACE_IDS: &str = "DD_CRASHTRACK_BEGIN_TRACE_IDS"; +/// Marks the end of the signal information section. 
+pub const DD_CRASHTRACK_END_SIGINFO: &str = "DD_CRASHTRACK_END_SIGINFO"; + +/// Marks the beginning of the process context section containing processor state at crash time. +/// The section contains platform-specific context dump from `ucontext_t`. pub const DD_CRASHTRACK_BEGIN_UCONTEXT: &str = "DD_CRASHTRACK_BEGIN_UCONTEXT"; -pub const DD_CRASHTRACK_DONE: &str = "DD_CRASHTRACK_DONE"; -pub const DD_CRASHTRACK_END_ADDITIONAL_TAGS: &str = "DD_CRASHTRACK_END_ADDITIONAL_TAGS"; -pub const DD_CRASHTRACK_END_CONFIG: &str = "DD_CRASHTRACK_END_CONFIG"; -pub const DD_CRASHTRACK_END_COUNTERS: &str = "DD_CRASHTRACK_END_COUNTERS"; -pub const DD_CRASHTRACK_END_FILE: &str = "DD_CRASHTRACK_END_FILE"; -pub const DD_CRASHTRACK_END_METADATA: &str = "DD_CRASHTRACK_END_METADATA"; +/// Marks the end of the process context section. +pub const DD_CRASHTRACK_END_UCONTEXT: &str = "DD_CRASHTRACK_END_UCONTEXT"; + +/// Marks the beginning of the process information section containing the PID of the crashing process. +/// The section contains JSON with the process ID. +pub const DD_CRASHTRACK_BEGIN_PROCINFO: &str = "DD_CRASHTRACK_BEGIN_PROCESSINFO"; +/// Marks the end of the process information section. pub const DD_CRASHTRACK_END_PROCINFO: &str = "DD_CRASHTRACK_END_PROCESSINFO"; -pub const DD_CRASHTRACK_END_SIGINFO: &str = "DD_CRASHTRACK_END_SIGINFO"; + +/// Marks the beginning of the counters section containing internal crash tracker metrics. +pub const DD_CRASHTRACK_BEGIN_COUNTERS: &str = "DD_CRASHTRACK_BEGIN_COUNTERS"; +/// Marks the end of the counters section. +pub const DD_CRASHTRACK_END_COUNTERS: &str = "DD_CRASHTRACK_END_COUNTERS"; + +/// Marks the beginning of the spans section containing active distributed tracing spans at crash time. +pub const DD_CRASHTRACK_BEGIN_SPAN_IDS: &str = "DD_CRASHTRACK_BEGIN_SPAN_IDS"; +/// Marks the end of the spans section. 
pub const DD_CRASHTRACK_END_SPAN_IDS: &str = "DD_CRASHTRACK_END_SPAN_IDS"; -pub const DD_CRASHTRACK_END_STACKTRACE: &str = "DD_CRASHTRACK_END_STACKTRACE"; + +/// Marks the beginning of the additional tags section containing extra tags collected at crash time. +pub const DD_CRASHTRACK_BEGIN_ADDITIONAL_TAGS: &str = "DD_CRASHTRACK_BEGIN_ADDITIONAL_TAGS"; +/// Marks the end of the additional tags section. +pub const DD_CRASHTRACK_END_ADDITIONAL_TAGS: &str = "DD_CRASHTRACK_END_ADDITIONAL_TAGS"; + +/// Marks the beginning of the traces section containing active trace information. +pub const DD_CRASHTRACK_BEGIN_TRACE_IDS: &str = "DD_CRASHTRACK_BEGIN_TRACE_IDS"; +/// Marks the end of the traces section. pub const DD_CRASHTRACK_END_TRACE_IDS: &str = "DD_CRASHTRACK_END_TRACE_IDS"; -pub const DD_CRASHTRACK_END_UCONTEXT: &str = "DD_CRASHTRACK_END_UCONTEXT"; +/// Marks the beginning of a file section (e.g., `/proc/self/maps` on Linux). +/// Used for memory mapping information needed for symbol resolution. +pub const DD_CRASHTRACK_BEGIN_FILE: &str = "DD_CRASHTRACK_BEGIN_FILE"; +/// Marks the end of a file section. +pub const DD_CRASHTRACK_END_FILE: &str = "DD_CRASHTRACK_END_FILE"; + +/// Marks the beginning of the stack trace section containing stack frames. +/// Each line in this section represents a stack frame with addresses and optional debug information. +/// Frame format depends on symbol resolution settings. +pub const DD_CRASHTRACK_BEGIN_STACKTRACE: &str = "DD_CRASHTRACK_BEGIN_STACKTRACE"; +/// Marks the end of the stack trace section. +pub const DD_CRASHTRACK_END_STACKTRACE: &str = "DD_CRASHTRACK_END_STACKTRACE"; + +/// Marks the completion of the entire crash report transmission. +/// This is the final marker sent by the collector to indicate all data has been transmitted. +pub const DD_CRASHTRACK_DONE: &str = "DD_CRASHTRACK_DONE"; + +/// Default timeout for receiver operations in milliseconds. 
+/// This prevents the receiver from hanging indefinitely on incomplete or corrupted streams. +/// Can be overridden by the `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS` environment variable. pub const DD_CRASHTRACK_DEFAULT_TIMEOUT: Duration = Duration::from_millis(5_000); diff --git a/datadog-crashtracker/src/shared/mod.rs b/datadog-crashtracker/src/shared/mod.rs index 0628ecabf..cd7623676 100644 --- a/datadog-crashtracker/src/shared/mod.rs +++ b/datadog-crashtracker/src/shared/mod.rs @@ -10,3 +10,5 @@ pub(crate) mod constants; #[cfg(feature = "benchmarking")] pub mod constants; + +pub mod unix_socket_communication; diff --git a/datadog-crashtracker/src/shared/unix_socket_communication.rs b/datadog-crashtracker/src/shared/unix_socket_communication.rs new file mode 100644 index 000000000..9170b63bf --- /dev/null +++ b/datadog-crashtracker/src/shared/unix_socket_communication.rs @@ -0,0 +1,365 @@ +// Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/ +// SPDX-License-Identifier: Apache-2.0 + +//! # Crash Tracker Unix Socket Communication Protocol +//! +//! This module documents the Unix domain socket communication protocol used between the crash +//! tracker's collector and receiver processes. The crash tracker uses a two-process architecture +//! where the collector (a fork of the crashing process) communicates crash data to the receiver +//! (a fork+execve process) via an anonymous Unix domain socket pair. +//! +//! ## Overview +//! +//! The communication protocol ensures reliable crash data collection and transmission even when +//! the main process is in an unstable state, providing robust crash reporting capabilities for +//! production systems. +//! +//! ## Socket Creation and Setup +//! +//! The communication channel is established using [`socketpair()`] to create an anonymous Unix +//! domain socket pair: +//! +//! ```rust,no_run +//! use nix::sys::socket; +//! +//! let (uds_parent, uds_child) = socket::socketpair( +//! socket::AddressFamily::Unix, +//! 
socket::SockType::Stream, +//! None, +//! socket::SockFlag::empty(), +//! )?; +//! # Ok::<(), nix::Error>(()) +//! ``` +//! +//! **Location**: [`collector::receiver_manager`] +//! +//! ### File Descriptor Management +//! +//! 1. **Parent Process**: Retains `uds_parent` for tracking +//! 2. **Collector Process**: Inherits `uds_parent` as the write end +//! 3. **Receiver Process**: Gets `uds_child` redirected to stdin via `dup2(uds_child, 0)` +//! +//! ## Communication Protocol +//! +//! ### Data Format +//! +//! The crash data is transmitted as a structured text stream with distinct sections delimited +//! by markers defined in [`shared::constants`]. +//! +//! ### Message Structure +//! +//! Each crash report follows this sequence: +//! +//! 1. **Metadata Section** - Application context, tags, and environment information +//! 2. **Configuration Section** - Crash tracking settings, endpoint information, processing options +//! 3. **Signal Information Section** - Signal details from `siginfo_t` structure +//! 4. **Process Context Section** - Processor state at crash time from `ucontext_t` +//! 5. **Process Information Section** - Process ID of the crashing process +//! 6. **Counters Section** - Internal crash tracker counters and metrics +//! 7. **Spans Section** - Active distributed tracing spans at crash time +//! 8. **Additional Tags Section** - Additional tags collected at crash time +//! 9. **Traces Section** - Active trace information +//! 10. **Memory Maps Section** (Linux only) - Memory mapping information from `/proc/self/maps` +//! 11. **Stack Trace Section** - Stack frames with optional symbol resolution +//! 12. **Completion Marker** - End of crash report transmission +//! +//! ### Section Details +//! +//! #### 1. Metadata Section +//! ```text +//! DD_CRASHTRACK_BEGIN_METADATA +//! {JSON metadata object} +//! DD_CRASHTRACK_END_METADATA +//! ``` +//! +//! Contains serialized `Metadata` object with application context, tags, and environment information. 
+//! +//! #### 2. Configuration Section +//! ```text +//! DD_CRASHTRACK_BEGIN_CONFIG +//! {JSON configuration object} +//! DD_CRASHTRACK_END_CONFIG +//! ``` +//! +//! Contains serialized `CrashtrackerConfiguration` with crash tracking settings, endpoint +//! information, and processing options. +//! +//! #### 3. Signal Information Section +//! ```text +//! DD_CRASHTRACK_BEGIN_SIGINFO +//! { +//! "si_code": , +//! "si_code_human_readable": "", +//! "si_signo": , +//! "si_signo_human_readable": "", +//! "si_addr": "" // Optional, for memory faults +//! } +//! DD_CRASHTRACK_END_SIGINFO +//! ``` +//! +//! Contains signal details extracted from `siginfo_t` structure. +//! **Implementation**: [`collector::emitters`] (lines 223-263) +//! +//! #### 4. Process Context Section (ucontext) +//! ```text +//! DD_CRASHTRACK_BEGIN_UCONTEXT +//! +//! DD_CRASHTRACK_END_UCONTEXT +//! ``` +//! +//! Contains processor state at crash time from `ucontext_t`. Format varies by platform: +//! - **Linux**: Direct debug print of `ucontext_t` +//! - **macOS**: Includes both `ucontext_t` and machine context (`mcontext`) +//! +//! **Implementation**: [`collector::emitters`] (lines 190-221) +//! +//! #### 5. Process Information Section +//! ```text +//! DD_CRASHTRACK_BEGIN_PROCESSINFO +//! {"pid": } +//! DD_CRASHTRACK_END_PROCESSINFO +//! ``` +//! +//! Contains the process ID of the crashing process. The on-wire markers are the string values of the `DD_CRASHTRACK_BEGIN_PROCINFO` and `DD_CRASHTRACK_END_PROCINFO` constants. +//! +//! #### 6. Counters Section +//! ```text +//! DD_CRASHTRACK_BEGIN_COUNTERS +//! +//! DD_CRASHTRACK_END_COUNTERS +//! ``` +//! +//! Contains internal crash tracker counters and metrics. +//! +//! #### 7. Spans Section +//! ```text +//! DD_CRASHTRACK_BEGIN_SPAN_IDS +//! +//! DD_CRASHTRACK_END_SPAN_IDS +//! ``` +//! +//! Contains active distributed tracing spans at crash time. +//! +//! #### 8. Additional Tags Section +//! ```text +//! DD_CRASHTRACK_BEGIN_ADDITIONAL_TAGS +//! +//! DD_CRASHTRACK_END_ADDITIONAL_TAGS +//! ``` +//! +//! Contains additional tags collected at crash time. +//! +//! #### 9. Traces Section +//! 
```text +//! DD_CRASHTRACK_BEGIN_TRACE_IDS +//! +//! DD_CRASHTRACK_END_TRACE_IDS +//! ``` +//! +//! Contains active trace information. +//! +//! #### 10. Memory Maps Section (Linux Only) +//! ```text +//! DD_CRASHTRACK_BEGIN_FILE /proc/self/maps +//! +//! DD_CRASHTRACK_END_FILE "/proc/self/maps" +//! ``` +//! +//! Contains memory mapping information from `/proc/self/maps` for symbol resolution. +//! **Implementation**: [`collector::emitters`] (lines 184-187) +//! +//! #### 11. Stack Trace Section +//! ```text +//! DD_CRASHTRACK_BEGIN_STACKTRACE +//! {"ip": "", "module_base_address": "", "sp": "", "symbol_address": ""} +//! {"ip": "", "module_base_address": "", "sp": "", "symbol_address": "", "function": "", "file": "", "line": } +//! ... +//! DD_CRASHTRACK_END_STACKTRACE +//! ``` +//! +//! Each line represents one stack frame. The frame format depends on the symbol resolution setting: +//! +//! - **Disabled/Receiver-only**: Only addresses (`ip`, `sp`, `symbol_address`, optional `module_base_address`) +//! - **In-process symbols**: Includes debug information (`function`, `file`, `line`, `column`) +//! +//! Stack frames whose stack pointer is less than the fault stack pointer are filtered out to exclude crash tracker frames. +//! **Implementation**: [`collector::emitters`] (lines 45-117) +//! +//! #### 12. Completion Marker +//! ```text +//! DD_CRASHTRACK_DONE +//! ``` +//! +//! Indicates end of crash report transmission. +//! +//! ## Communication Flow +//! +//! ### 1. Collector Side (Write End) +//! +//! **File**: [`collector::collector_manager`] +//! +//! ```rust,no_run +//! use std::os::unix::net::UnixStream; +//! use std::os::unix::io::FromRawFd; +//! +//! let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) }; +//! +//! let report = emit_crashreport( +//! &mut unix_stream, +//! config, +//! config_str, +//! metadata_str, +//! sig_info, +//! ucontext, +//! ppid, +//! ); +//! # let _: () = report; // suppress unused warning for doc test +//! ``` +//! +//! 
The collector: +//! 1. Creates `UnixStream` from inherited file descriptor +//! 2. Calls `emit_crashreport()` to serialize and write all crash data +//! 3. Flushes the stream after each section for reliability +//! 4. Exits with `libc::_exit(0)` on completion +//! +//! ### 2. Receiver Side (Read End) +//! +//! **File**: [`receiver::entry_points`] +//! +//! ```rust,no_run +//! use std::time::Duration; +//! use tokio::io::AsyncBufReadExt; +//! +//! pub async fn receiver_entry_point( +//! timeout: Duration, +//! stream: impl AsyncBufReadExt + std::marker::Unpin, +//! ) -> anyhow::Result<()> { +//! if let Some((config, mut crash_info)) = receive_report_from_stream(timeout, stream).await? { +//! // Process crash data +//! if let Err(e) = resolve_frames(&config, &mut crash_info) { +//! crash_info.log_messages.push(format!("Error resolving frames: {e}")); +//! } +//! if config.demangle_names() { +//! if let Err(e) = crash_info.demangle_names() { +//! crash_info.log_messages.push(format!("Error demangling names: {e}")); +//! } +//! } +//! crash_info.async_upload_to_endpoint(config.endpoint()).await?; +//! } +//! Ok(()) +//! } +//! # fn resolve_frames(_config: &(), _crash_info: &mut ()) -> Result<(), &'static str> { Ok(()) } +//! # fn receive_report_from_stream(_timeout: Duration, _stream: impl AsyncBufReadExt + std::marker::Unpin) -> impl std::future::Future>> { async { Ok(None) } } +//! # struct CrashInfo { log_messages: Vec } +//! # impl CrashInfo { +//! # fn demangle_names(&mut self) -> Result<(), &'static str> { Ok(()) } +//! # async fn async_upload_to_endpoint(&self, _endpoint: ()) -> anyhow::Result<()> { Ok(()) } +//! # } +//! ``` +//! +//! The receiver: +//! 1. Reads from stdin (Unix socket via `dup2`) +//! 2. Parses the structured stream into `CrashInfo` and `CrashtrackerConfiguration` +//! 3. Performs symbol resolution if configured +//! 4. Uploads formatted crash report to backend +//! +//! ### 3. Stream Parsing +//! +//! 
**File**: [`receiver::receive_report`] +//! +//! The receiver parses the stream by: +//! 1. Reading line-by-line with timeout protection +//! 2. Matching delimiter patterns to identify sections +//! 3. Accumulating section data between delimiters +//! 4. Deserializing JSON sections into appropriate data structures +//! 5. Handling the `DD_CRASHTRACK_DONE` completion marker +//! +//! ## Error Handling and Reliability +//! +//! ### Signal Safety +//! - All collector operations use only async-signal-safe functions +//! - No memory allocation in signal handler context +//! - Pre-prepared data structures (`PreparedExecve`) to avoid allocations +//! +//! ### Timeout Protection +//! - Receiver has configurable timeout (default: 5000ms, from `DD_CRASHTRACK_DEFAULT_TIMEOUT`) +//! - Environment variable: `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS` +//! - Prevents hanging on incomplete/corrupted streams +//! +//! ### Process Cleanup +//! - Parent process uses `wait_for_pollhup()` to detect socket closure +//! - Kills child processes with `SIGKILL` if needed +//! - Reaps zombie processes to prevent resource leaks +//! +//! **File**: [`collector::process_handle`] +//! +//! ### Data Integrity +//! - Each section is flushed immediately after writing +//! - Structured delimiters allow detection of incomplete transmissions +//! - Error messages are accumulated rather than failing fast +//! +//! ## Alternative Communication Modes +//! +//! ### Named Socket Mode +//! When `unix_socket_path` is configured, the collector connects to an existing Unix socket +//! instead of using the fork+execve receiver: +//! +//! ```rust,no_run +//! # struct Receiver; +//! # impl Receiver { +//! # fn spawn_from_stored_config() -> Result { Ok(Receiver) } +//! # fn from_socket(_path: &str) -> Result { Ok(Receiver) } +//! # } +//! # let unix_socket_path = ""; +//! let receiver = if unix_socket_path.is_empty() { +//! Receiver::spawn_from_stored_config()? // Fork+execve mode +//! } else { +//! Receiver::from_socket(unix_socket_path)?
// Named socket mode +//! }; +//! # let _: Receiver = receiver; // suppress unused warning +//! # Ok::<(), &str>(()) +//! ``` +//! +//! This allows integration with long-lived receiver processes. +//! +//! **Linux Abstract Sockets**: On Linux, socket paths not starting with `.` or `/` are treated +//! as abstract socket names. +//! +//! ## Security Considerations +//! +//! ### File Descriptor Isolation +//! - Collector closes stdio file descriptors (0, 1, 2) +//! - Receiver redirects socket to stdin, stdout/stderr to configured files +//! - Minimizes attack surface during crash processing +//! +//! ### Process Isolation +//! - Fork+execve provides strong process boundary +//! - Crash in collector doesn't affect receiver +//! - Signal handlers are reset in receiver child +//! +//! ### Resource Limits +//! - Timeout prevents resource exhaustion +//! - Fixed buffer sizes for file operations +//! - Immediate flushing prevents large memory usage +//! +//! ## Debugging and Monitoring +//! +//! ### Log Output +//! - Receiver can be configured with `stdout_filename` and `stderr_filename` +//! - Error messages are accumulated in crash report +//! - Debug assertions validate critical operations +//! +//! ### Environment Variables +//! - `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`: Receiver timeout +//! - Standard Unix environment passed through execve +//! +//! [`socketpair()`]: nix::sys::socket::socketpair +//! [`collector::receiver_manager`]: crate::collector::receiver_manager +//! [`shared::constants`]: crate::shared::constants +//! [`collector::emitters`]: crate::collector::emitters +//! [`collector::collector_manager`]: crate::collector::collector_manager +//! [`receiver::entry_points`]: crate::receiver::entry_points +//! [`receiver::receive_report`]: crate::receiver::receive_report +//! 
[`collector::process_handle`]: crate::collector::process_handle + +// This module is pure documentation - no actual code needed \ No newline at end of file diff --git a/docs/crashtracker-unix-socket-communication.md b/docs/crashtracker-unix-socket-communication.md deleted file mode 100644 index a3b1dfe19..000000000 --- a/docs/crashtracker-unix-socket-communication.md +++ /dev/null @@ -1,321 +0,0 @@ -# Crash Tracker Unix Socket Communication Protocol - -**Date**: September 23, 2025 - -## Overview - -This document describes the Unix domain socket communication protocol used between the crash tracker's collector and receiver processes. The crash tracker uses a two-process architecture where the collector (a fork of the crashing process) communicates crash data to the receiver (a fork+execve process) via an anonymous Unix domain socket pair. - -## Socket Creation and Setup - -The communication channel is established using `socketpair()` to create an anonymous Unix domain socket pair: - -```rust -let (uds_parent, uds_child) = socket::socketpair( - socket::AddressFamily::Unix, - socket::SockType::Stream, - None, - socket::SockFlag::empty(), -)?; -``` - -**Location**: `datadog-crashtracker/src/collector/receiver_manager.rs:78-85` - -### File Descriptor Management - -1. **Parent Process**: Retains `uds_parent` for tracking -2. **Collector Process**: Inherits `uds_parent` as the write end -3. **Receiver Process**: Gets `uds_child` redirected to stdin via `dup2(uds_child, 0)` - -## Communication Protocol - -### Data Format - -The crash data is transmitted as a structured text stream with distinct sections delimited by markers defined in `datadog-crashtracker/src/shared/constants.rs`. - -### Message Structure - -Each crash report follows this sequence: - -1. **Metadata Section** -2. **Configuration Section** -3. **Signal Information Section** -4. **Process Context Section** -5. **Process Information Section** -6. **Counters Section** -7. **Spans Section** -8. 
**Additional Tags Section**
-9. **Traces Section**
-10. **Memory Maps Section** (Linux only)
-11. **Stack Trace Section**
-12. **Completion Marker**
-
-### Section Details
-
-#### 1. Metadata Section
-```
-DD_CRASHTRACK_BEGIN_METADATA
-{JSON metadata object}
-DD_CRASHTRACK_END_METADATA
-```
-
-Contains serialized `Metadata` object with application context, tags, and environment information.
-
-#### 2. Configuration Section
-```
-DD_CRASHTRACK_BEGIN_CONFIG
-{JSON configuration object}
-DD_CRASHTRACK_END_CONFIG
-```
-
-Contains serialized `CrashtrackerConfiguration` with crash tracking settings, endpoint information, and processing options.
-
-#### 3. Signal Information Section
-```
-DD_CRASHTRACK_BEGIN_SIGINFO
-{
-  "si_code": ,
-  "si_code_human_readable": "",
-  "si_signo": ,
-  "si_signo_human_readable": "",
-  "si_addr": "" // Optional, for memory faults
-}
-DD_CRASHTRACK_END_SIGINFO
-```
-
-Contains signal details extracted from `siginfo_t` structure.
-
-**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:223-263`
-
-#### 4. Process Context Section (ucontext)
-```
-DD_CRASHTRACK_BEGIN_UCONTEXT
-
-DD_CRASHTRACK_END_UCONTEXT
-```
-
-Contains processor state at crash time from `ucontext_t`. Format varies by platform:
-- **Linux**: Direct debug print of `ucontext_t`
-- **macOS**: Includes both `ucontext_t` and machine context (`mcontext`)
-
-**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:190-221`
-
-#### 5. Process Information Section
-```
-DD_CRASHTRACK_BEGIN_PROCINFO
-{"pid": }
-DD_CRASHTRACK_END_PROCINFO
-```
-
-Contains the process ID of the crashing process.
-
-#### 6. Counters Section
-```
-DD_CRASHTRACK_BEGIN_COUNTERS
-
-DD_CRASHTRACK_END_COUNTERS
-```
-
-Contains internal crash tracker counters and metrics.
-
-#### 7. Spans Section
-```
-DD_CRASHTRACK_BEGIN_SPANS
-
-DD_CRASHTRACK_END_SPANS
-```
-
-Contains active distributed tracing spans at crash time.
-
-#### 8. Additional Tags Section
-```
-DD_CRASHTRACK_BEGIN_TAGS
-
-DD_CRASHTRACK_END_TAGS
-```
-
-Contains additional tags collected at crash time.
-
-#### 9. Traces Section
-```
-DD_CRASHTRACK_BEGIN_TRACES
-
-DD_CRASHTRACK_END_TRACES
-```
-
-Contains active trace information.
-
-#### 10. Memory Maps Section (Linux Only)
-```
-DD_CRASHTRACK_BEGIN_FILE /proc/self/maps
-
-DD_CRASHTRACK_END_FILE "/proc/self/maps"
-```
-
-Contains memory mapping information from `/proc/self/maps` for symbol resolution.
-
-**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:184-187`
-
-#### 11. Stack Trace Section
-```
-DD_CRASHTRACK_BEGIN_STACKTRACE
-{"ip": "", "module_base_address": "", "sp": "", "symbol_address": ""}
-{"ip": "", "module_base_address": "", "sp": "", "symbol_address": "", "function": "", "file": "", "line": }
-...
-DD_CRASHTRACK_END_STACKTRACE
-```
-
-Each line represents one stack frame. Frame format depends on symbol resolution setting:
-
-- **Disabled/Receiver-only**: Only addresses (`ip`, `sp`, `symbol_address`, optional `module_base_address`)
-- **In-process symbols**: Includes debug information (`function`, `file`, `line`, `column`)
-
-Stack frames with stack pointer less than the fault stack pointer are filtered out to exclude crash tracker frames.
-
-**Implementation**: `datadog-crashtracker/src/collector/emitters.rs:45-117`
-
-#### 12. Completion Marker
-```
-DD_CRASHTRACK_DONE
-```
-
-Indicates end of crash report transmission.
-
-## Communication Flow
-
-### 1. Collector Side (Write End)
-
-**File**: `datadog-crashtracker/src/collector/collector_manager.rs:92-102`
-
-```rust
-let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) };
-
-let report = emit_crashreport(
-    &mut unix_stream,
-    config,
-    config_str,
-    metadata_str,
-    sig_info,
-    ucontext,
-    ppid,
-);
-```
-
-The collector:
-1. Creates `UnixStream` from inherited file descriptor
-2. Calls `emit_crashreport()` to serialize and write all crash data
-3. Flushes the stream after each section for reliability
-4. Exits with `libc::_exit(0)` on completion
-
-### 2. Receiver Side (Read End)
-
-**File**: `datadog-crashtracker/src/receiver/entry_points.rs:97-119`
-
-```rust
-pub(crate) async fn receiver_entry_point(
-    timeout: Duration,
-    stream: impl AsyncBufReadExt + std::marker::Unpin,
-) -> anyhow::Result<()> {
-    if let Some((config, mut crash_info)) = receive_report_from_stream(timeout, stream).await? {
-        // Process crash data
-        if let Err(e) = resolve_frames(&config, &mut crash_info) {
-            crash_info.log_messages.push(format!("Error resolving frames: {e}"));
-        }
-        if config.demangle_names() {
-            if let Err(e) = crash_info.demangle_names() {
-                crash_info.log_messages.push(format!("Error demangling names: {e}"));
-            }
-        }
-        crash_info.async_upload_to_endpoint(config.endpoint()).await?;
-    }
-    Ok(())
-}
-```
-
-The receiver:
-1. Reads from stdin (Unix socket via `dup2`)
-2. Parses the structured stream into `CrashInfo` and `CrashtrackerConfiguration`
-3. Performs symbol resolution if configured
-4. Uploads formatted crash report to backend
-
-### 3. Stream Parsing
-
-**File**: `datadog-crashtracker/src/receiver/receive_report.rs`
-
-The receiver parses the stream by:
-1. Reading line-by-line with timeout protection
-2. Matching delimiter patterns to identify sections
-3. Accumulating section data between delimiters
-4. Deserializing JSON sections into appropriate data structures
-5. Handling the `DD_CRASHTRACK_DONE` completion marker
-
-## Error Handling and Reliability
-
-### Signal Safety
-- All collector operations use only async-signal-safe functions
-- No memory allocation in signal handler context
-- Pre-prepared data structures (`PreparedExecve`) to avoid allocations
-
-### Timeout Protection
-- Receiver has configurable timeout (default: 4000ms)
-- Environment variable: `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`
-- Prevents hanging on incomplete/corrupted streams
-
-### Process Cleanup
-- Parent process uses `wait_for_pollhup()` to detect socket closure
-- Kills child processes with `SIGKILL` if needed
-- Reaps zombie processes to prevent resource leaks
-
-**File**: `datadog-crashtracker/src/collector/process_handle.rs:19-40`
-
-### Data Integrity
-- Each section is flushed immediately after writing
-- Structured delimiters allow detection of incomplete transmissions
-- Error messages are accumulated rather than failing fast
-
-## Alternative Communication Modes
-
-### Named Socket Mode
-When `unix_socket_path` is configured, the collector connects to an existing Unix socket instead of using the fork+execve receiver:
-
-```rust
-let receiver = if unix_socket_path.is_empty() {
-    Receiver::spawn_from_stored_config()? // Fork+execve mode
-} else {
-    Receiver::from_socket(unix_socket_path)? // Named socket mode
-};
-```
-
-This allows integration with long-lived receiver processes.
-
-**Linux Abstract Sockets**: On Linux, socket paths not starting with `.` or `/` are treated as abstract socket names.
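
As a rough illustration of the path rule just stated, the following sketch classifies a configured socket path. The `SocketTarget` enum and the `classify_socket_path` function are hypothetical names introduced here and are not part of the crash tracker's API. Only the rule itself comes from this document: an empty path selects the fork+execve receiver, and on Linux a path that does not start with `.` or `/` is treated as an abstract socket name.

```rust
/// How a configured `unix_socket_path` would be interpreted.
/// Hypothetical type, used only for this illustration.
#[derive(Debug, PartialEq, Eq)]
enum SocketTarget<'a> {
    /// Empty path: spawn the receiver with fork+execve.
    Spawn,
    /// Path starting with `.` or `/`: a socket file on the filesystem.
    Filesystem(&'a str),
    /// Any other non-empty path: an abstract socket name (Linux rule).
    Abstract(&'a str),
}

/// Classify a socket path according to the rule described in the text.
/// This models the Linux behavior only; other platforms are out of scope.
fn classify_socket_path(path: &str) -> SocketTarget<'_> {
    if path.is_empty() {
        SocketTarget::Spawn
    } else if path.starts_with('.') || path.starts_with('/') {
        SocketTarget::Filesystem(path)
    } else {
        SocketTarget::Abstract(path)
    }
}
```

Under this sketch, `classify_socket_path("crashtracker.sock")` yields `SocketTarget::Abstract("crashtracker.sock")`, whereas `./crashtracker.sock` and `/tmp/crashtracker.sock` are both classified as filesystem paths.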
-
-## Security Considerations
-
-### File Descriptor Isolation
-- Collector closes stdio file descriptors (0, 1, 2)
-- Receiver redirects socket to stdin, stdout/stderr to configured files
-- Minimizes attack surface during crash processing
-
-### Process Isolation
-- Fork+execve provides strong process boundary
-- Crash in collector doesn't affect receiver
-- Signal handlers are reset in receiver child
-
-### Resource Limits
-- Timeout prevents resource exhaustion
-- Fixed buffer sizes for file operations
-- Immediate flushing prevents large memory usage
-
-## Debugging and Monitoring
-
-### Log Output
-- Receiver can be configured with `stdout_filename` and `stderr_filename`
-- Error messages are accumulated in crash report
-- Debug assertions validate critical operations
-
-### Environment Variables
-- `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`: Receiver timeout
-- Standard Unix environment passed through execve
-
-This communication protocol ensures reliable crash data collection and transmission even when the main process is in an unstable state, providing robust crash reporting capabilities for production systems.
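
The receiver timeout listed under Environment Variables (default 4000 ms, overridable through `DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS`) could be resolved along the following lines. This is a minimal sketch, not the crash tracker's actual code: both helper names are hypothetical, and falling back to the default when the value is missing or unparsable is an assumption rather than documented behavior.

```rust
use std::time::Duration;

/// Default receiver timeout in milliseconds, as stated in this document.
const DEFAULT_TIMEOUT_MS: u64 = 4000;

/// Turn an optional raw string into a timeout.
/// Assumption: missing or invalid values fall back to the default.
fn parse_receiver_timeout(raw: Option<&str>) -> Duration {
    let ms = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_TIMEOUT_MS);
    Duration::from_millis(ms)
}

/// Read the timeout from the environment variable named in this document.
/// Hypothetical helper, shown only to tie the parser to the variable name.
fn receiver_timeout_from_env() -> Duration {
    let raw = std::env::var("DD_CRASHTRACKER_RECEIVER_TIMEOUT_MS").ok();
    parse_receiver_timeout(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback rule easy to test without mutating process-wide state.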