130 changes: 122 additions & 8 deletions datadog-crashtracker/src/collector/collector_manager.rs
@@ -1,6 +1,46 @@
// Copyright 2025-Present Datadog, Inc. https://www.datadoghq.com/
// SPDX-License-Identifier: Apache-2.0

//! Crash data collector process management for Unix socket communication.
//!
//! This module manages the collector process that writes crash data to Unix sockets.
//! The collector runs in a forked child process and is responsible for serializing
//! and transmitting crash information to the receiver process.
//!
//! ## Communication Flow (Collector Side)
//!
//! The collector performs these steps to transmit crash data:
//!
//! 1. **Process Setup**: Forks from crashing process, closes stdio, disables SIGPIPE
//! 2. **Socket Creation**: Creates `UnixStream` from inherited file descriptor
//! 3. **Data Serialization**: Calls [`emit_crashreport()`] to write structured crash data
//! 4. **Graceful Exit**: Flushes data and exits with `libc::_exit(0)`
//!
//! ```text
//! ┌─────────────────────┐ ┌──────────────────────┐
//! │ Signal Handler │ │ Collector Process │
//! │ (Original Process) │ │ (Forked Child) │
//! │ │ │ │
//! │ 1. Catch crash │────fork()──────────►│ 2. Setup stdio │
//! │ 2. Fork collector │ │ 3. Create UnixStream │
//! │ 3. Wait for child │ │ 4. Write crash data │
//! │ │◄────wait()──────────│ 5. Exit cleanly │
//! └─────────────────────┘ └──────────────────────┘
Comment on lines +20 to +28
Contributor

Should you line up the fork and wait arrows with the lines of code they represent?

//! ```
//!
//! ## Signal Safety
//!
//! All collector operations use only async-signal-safe functions since the collector
//! runs in a signal handler context:
//!
//! - No memory allocations
//! - Pre-prepared data structures
//! - Only safe system calls
//!
//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`].
//!
//! [`emit_crashreport()`]: crate::collector::emitters::emit_crashreport

use super::process_handle::ProcessHandle;
use super::receiver_manager::Receiver;
use ddcommon::timeout::TimeoutManager;
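
The "Socket Creation" step in the module docs above (wrapping an inherited file descriptor in a `UnixStream`, writing, then flushing) can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: `round_trip` is a hypothetical helper, and a socket pair created in the same process stands in for the descriptor the collector would inherit across `fork()`.

```rust
use std::io::{self, Read, Write};
use std::os::unix::io::{FromRawFd, IntoRawFd, RawFd};
use std::os::unix::net::UnixStream;

/// Hypothetical demo: sends `payload` through a raw descriptor and returns
/// what the other end of the socket receives.
pub fn round_trip(payload: &[u8]) -> io::Result<Vec<u8>> {
    // Assumption: a local socket pair plays the role of the
    // receiver/collector connection set up before the fork.
    let (collector_end, mut receiver_end) = UnixStream::pair()?;

    // Give up ownership so only the bare descriptor remains, as it would
    // in a child that inherited the fd number.
    let inherited_fd: RawFd = collector_end.into_raw_fd();

    // SAFETY: `inherited_fd` came from `into_raw_fd` just above, so it is
    // an open socket and nothing else owns it.
    let mut stream = unsafe { UnixStream::from_raw_fd(inherited_fd) };
    stream.write_all(payload)?;
    stream.flush()?;
    // Dropping the stream closes the descriptor, so the reader sees EOF.
    drop(stream);

    let mut received = Vec::new();
    receiver_end.read_to_end(&mut received)?;
    Ok(received)
}
```

Because `from_raw_fd` takes ownership of the descriptor, the stream's drop closes it. The real collector exits with `_exit(0)`, which skips destructors, so it would rely on an explicit flush rather than on drop.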
@@ -25,6 +65,42 @@ pub enum CollectorSpawnError {
}

impl Collector {
/// Spawns a collector process to write crash data to the Unix socket.
///
/// This method forks a child process that will serialize and transmit crash data
/// to the receiver process via the Unix socket established in the receiver.
///
/// ## Process Architecture
///
/// ```text
/// Parent Process (Signal Handler) Child Process (Collector)
/// ┌─────────────────────────────┐ ┌─────────────────────────────┐
/// │ 1. Catches crash signal │ │ 4. Closes stdio (0,1,2) │
/// │ 2. Forks collector process │──►│ 5. Disables SIGPIPE │
/// │ 3. Returns to caller │ │ 6. Creates UnixStream │
/// │ │ │ 7. Calls emit_crashreport() │
/// │ │ │ 8. Exits with _exit(0) │
/// └─────────────────────────────┘ └─────────────────────────────┘
Comment on lines +77 to +83
Contributor

The interleaving of steps here is a little confusing. Can we have whitespace to indicate which steps are concurrent?

/// ```
///
/// ## Arguments
///
/// * `receiver` - The receiver process that will read crash data from the Unix socket
/// * `config` - Crash tracker configuration
/// * `config_str` - JSON-serialized configuration string
/// * `metadata_str` - JSON-serialized metadata string
/// * `sig_info` - Signal information from the crash
/// * `ucontext` - Process context at crash time
///
/// ## Returns
///
/// * `Ok(Collector)` - Handle to the spawned collector process
/// * `Err(CollectorSpawnError::ForkFailed)` - If the fork operation fails
///
/// ## Safety
///
/// This function is called from signal handler context and uses only async-signal-safe operations.
/// The child process performs all potentially unsafe operations after fork.
pub(crate) fn spawn(
receiver: &Receiver,
config: &CrashtrackerConfiguration,
@@ -33,8 +109,8 @@ impl Collector {
sig_info: *const siginfo_t,
ucontext: *const ucontext_t,
) -> Result<Self, CollectorSpawnError> {
// When we spawn the child, our pid becomes the ppid.
// SAFETY: This function has no safety requirements.
// When we spawn the child, our pid becomes the ppid for process tracking.
// SAFETY: getpid() is async-signal-safe.
let pid = unsafe { libc::getpid() };

let fork_result = alt_fork();
@@ -66,6 +142,42 @@ impl Collector {
}
}

/// Collector child process entry point - serializes and transmits crash data via Unix socket.
///
/// This function runs in the forked collector process and performs the actual crash data
/// transmission. It establishes the Unix socket connection and writes all crash information
/// using the structured protocol.
///
/// ## Process Flow
///
/// 1. **Isolate from parent**: Closes stdin, stdout, stderr to prevent interference
/// 2. **Signal handling**: Disables SIGPIPE to handle broken pipe gracefully
/// 3. **Socket setup**: Creates `UnixStream` from inherited file descriptor
/// 4. **Data transmission**: Calls [`emit_crashreport()`] to write structured crash data
/// 5. **Clean exit**: Exits with `_exit(0)` to avoid cleanup issues
///
/// ## Communication Protocol
///
/// The crash data is written as a structured stream with delimited sections:
/// - Metadata, Configuration, Signal Info, Process Context
/// - Counters, Spans, Tags, Traces, Memory Maps, Stack Trace
/// - Completion marker
///
/// For details, see [`crate::shared::unix_socket_communication`].
///
/// ## Arguments
///
/// * `config` - Crash tracker configuration object
/// * `config_str` - JSON-serialized configuration for receiver
/// * `metadata_str` - JSON-serialized metadata for receiver
/// * `sig_info` - Signal information from crash context
/// * `ucontext` - Processor context at crash time
/// * `uds_fd` - Unix socket file descriptor for writing crash data
/// * `ppid` - Parent process ID for identification
///
/// This function never returns - it always exits via `_exit(0)` or `terminate()`.
///
/// [`emit_crashreport()`]: crate::collector::emitters::emit_crashreport
pub(crate) fn run_collector_child(
config: &CrashtrackerConfiguration,
config_str: &str,
@@ -75,22 +187,24 @@ pub(crate) fn run_collector_child(
uds_fd: RawFd,
ppid: libc::pid_t,
) -> ! {
// Close stdio
let _ = unsafe { libc::close(0) };
let _ = unsafe { libc::close(1) };
let _ = unsafe { libc::close(2) };
// Close stdio to isolate from parent process and prevent interference with crash data transmission
let _ = unsafe { libc::close(0) }; // stdin
let _ = unsafe { libc::close(1) }; // stdout
let _ = unsafe { libc::close(2) }; // stderr

// Disable SIGPIPE
// Disable SIGPIPE - if receiver closes socket early, we want to handle it gracefully
// rather than being killed by SIGPIPE
let _ = unsafe {
signal::sigaction(
signal::SIGPIPE,
&SigAction::new(SigHandler::SigIgn, SaFlags::empty(), SigSet::empty()),
)
};

// Emit crashreport
// Create Unix socket stream for crash data transmission
Contributor

nice catch

let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) };

// Serialize and transmit all crash data using structured protocol
let report = emit_crashreport(
&mut unix_stream,
config,
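
The SIGPIPE comment in this hunk describes wanting a write error rather than a fatal signal when the receiver closes the socket early. The sketch below shows what that looks like from the writer's side. It is an illustration under assumptions: `write_after_peer_closed` is a hypothetical helper, and the example relies on the Rust standard runtime already ignoring SIGPIPE at startup in an ordinary Rust binary or test. The collector is forked from whatever process crashed, so it presumably cannot rely on that default, which would explain the explicit `sigaction` call in the diff.

```rust
use std::io::{self, Write};
use std::os::unix::net::UnixStream;

/// Hypothetical demo: writes to a socket whose peer has already gone away.
pub fn write_after_peer_closed() -> io::Result<()> {
    let (mut collector_end, receiver_end) = UnixStream::pair()?;

    // Simulate the receiver closing its end before the collector writes.
    drop(receiver_end);

    // With SIGPIPE ignored, the kernel reports EPIPE and the call returns
    // an error instead of the process being killed.
    collector_end.write_all(b"DD_CRASHTRACK_DONE\n")
}
```

With the signal ignored, the failure surfaces as an `io::Error` of kind `BrokenPipe`, which the caller can handle or report like any other write failure.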
79 changes: 79 additions & 0 deletions datadog-crashtracker/src/collector/emitters.rs
@@ -1,6 +1,39 @@
// Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/
// SPDX-License-Identifier: Apache-2.0

//! Crash data emission and Unix socket protocol serialization.
//!
//! This module implements the collector-side serialization of crash data using the
//! Unix socket communication protocol. It writes structured crash information to
//! Unix domain sockets for consumption by receiver processes.
//!
//! ## Protocol Emission
//!
//! The emitter writes crash data as a series of delimited sections:
//!
//! 1. **Section Delimiters**: Uses constants from [`crate::shared::constants`] to mark boundaries
//! 2. **Structured Data**: Writes JSON, text, or binary data within sections
//! 3. **Immediate Flushing**: Flushes each section to ensure data integrity
//! 4. **Completion Marker**: Ends transmission with `DD_CRASHTRACK_DONE`
Contributor

I found this a little confusing, because the delimiters wrap the structured data, but the list here makes them seem like they only come before.

//!
//! ## Section Format Implementation
//!
//! Each section follows this pattern:
//! ```text
//! DD_CRASHTRACK_BEGIN_[SECTION]
//! [section data - JSON, text, or binary]
//! DD_CRASHTRACK_END_[SECTION]
//! ```
//!
//! ### Key Sections
//!
//! - **Stack Trace** (`emit_backtrace_by_frames`): Stack frames with optional symbol resolution
Contributor

Do we still resolve symbols in process?

Contributor

We can pass a configuration option to allow this.

//! - **Signal Info** (`emit_siginfo`): Signal details from `siginfo_t`
//! - **Process Context** (`emit_ucontext`): Processor state from `ucontext_t`
//! - **Memory Maps** (`emit_file`): `/proc/self/maps` for symbol resolution
//!
//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`].

use crate::collector::additional_tags::consume_and_emit_additional_tags;
use crate::collector::counters::emit_counters;
use crate::collector::spans::{emit_spans, emit_traces};
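
The section format described in the module docs above (a begin delimiter, the section data, an end delimiter, then a flush) can be sketched as follows. This is a minimal illustration under assumptions: `emit_section` and `emit_done` are hypothetical helpers, and the section name used in the usage note is made up. The real delimiter constants live in `crate::shared::constants` and are not shown in this diff; only the `DD_CRASHTRACK_BEGIN_`/`DD_CRASHTRACK_END_` pattern and `DD_CRASHTRACK_DONE` are taken from the docs.

```rust
use std::io::{self, Write};

/// Hypothetical helper: wraps `body` in begin/end delimiters and flushes,
/// so a reader can consume the section even if nothing else arrives.
pub fn emit_section(w: &mut impl Write, section: &str, body: &str) -> io::Result<()> {
    writeln!(w, "DD_CRASHTRACK_BEGIN_{section}")?;
    writeln!(w, "{body}")?;
    writeln!(w, "DD_CRASHTRACK_END_{section}")?;
    // Flush per section rather than once at the end of the report.
    w.flush()
}

/// Hypothetical helper: writes the completion marker that ends a report.
pub fn emit_done(w: &mut impl Write) -> io::Result<()> {
    writeln!(w, "DD_CRASHTRACK_DONE")?;
    w.flush()
}
```

Calling `emit_section(&mut buf, "SIGINFO", "{\"signum\":11}")` followed by `emit_done(&mut buf)` on a `Vec<u8>` yields the three delimited lines followed by the completion marker. The delimiters wrap the data on both sides, which is the point raised in the review comment above.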
@@ -116,6 +149,52 @@ unsafe fn emit_backtrace_by_frames(
Ok(())
}

/// Emits a complete crash report using the Unix socket communication protocol.
///
/// This is the main function that orchestrates the emission of all crash data sections
/// to the Unix socket. It writes the structured crash report in the order specified
/// by the protocol, with proper delimiters and flushing for data integrity.
///
/// ## Section Emission Order
///
/// The crash report is written in this specific order:
Contributor

Are we making this contractual? I'd say this is an implementation detail. The key part is that we try to emit more critical bits of info first, and bits that are more likely to crash last. Bits that are both critical and likely to crash are ¯\_(ツ)_/¯

/// 1. **Metadata** - Application context, tags, environment info
/// 2. **Configuration** - Crash tracker settings and endpoint info
/// 3. **Signal Information** - Details from `siginfo_t`
/// 4. **Process Context** - CPU state from `ucontext_t`
/// 5. **Process Information** - Process ID
/// 6. **Counters** - Internal crash tracker metrics
/// 7. **Spans** - Active distributed tracing spans
/// 8. **Additional Tags** - Extra tags collected at crash time
/// 9. **Traces** - Active trace information
/// 10. **Memory Maps** (Linux only) - `/proc/self/maps` content
/// 11. **Stack Trace** - Stack frames with symbol resolution
/// 12. **Completion Marker** - `DD_CRASHTRACK_DONE`
///
/// ## Data Integrity
///
/// Each section is immediately flushed after writing to ensure the receiver
/// can process partial data even if the collector crashes during transmission.
///
/// ## Arguments
///
/// * `pipe` - Write stream (typically Unix socket)
/// * `config` - Crash tracker configuration object
/// * `config_str` - JSON-serialized configuration for receiver
/// * `metadata_string` - JSON-serialized metadata
/// * `sig_info` - Signal information from crash context
/// * `ucontext` - Processor context at crash time
/// * `ppid` - Parent process ID
///
/// ## Returns
///
/// * `Ok(())` - All crash data written successfully
/// * `Err(EmitterError)` - I/O error or data serialization failure
///
/// ## Signal Safety
///
/// This function is designed to be called from signal handler context and uses
/// only async-signal-safe operations where possible.
pub(crate) fn emit_crashreport(
pipe: &mut impl Write,
config: &CrashtrackerConfiguration,
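
The "Data Integrity" note above says per-section flushing lets the receiver process partial data if the collector dies mid-report. A receiver-side sketch of that idea follows. It is an illustration only, not the crate's receiver: `Parsed` and `parse_stream` are hypothetical names, the section names in the usage note are made up, and the parser assumes line-oriented text sections.

```rust
/// Hypothetical result type: the sections that were fully delimited, plus
/// whether the completion marker was seen.
pub struct Parsed {
    pub sections: Vec<(String, String)>,
    pub complete: bool,
}

/// Hypothetical parser: keeps every section that has both delimiters and
/// silently drops a trailing section that was cut off.
pub fn parse_stream(input: &str) -> Parsed {
    let mut sections = Vec::new();
    let mut complete = false;
    // The section currently being read: its name and the lines seen so far.
    let mut current: Option<(String, Vec<&str>)> = None;

    for line in input.lines() {
        if current.is_none() && line == "DD_CRASHTRACK_DONE" {
            complete = true;
            break;
        }
        if let Some(name) = line.strip_prefix("DD_CRASHTRACK_BEGIN_") {
            current = Some((name.to_string(), Vec::new()));
        } else if let Some(name) = line.strip_prefix("DD_CRASHTRACK_END_") {
            // Only accept the section if the end marker matches the open one.
            if let Some((open, body)) = current.take() {
                if open == name {
                    sections.push((open, body.join("\n")));
                }
            }
        } else if let Some((_, body)) = current.as_mut() {
            body.push(line);
        }
    }
    Parsed { sections, complete }
}
```

Given a stream that ends partway through a second section, `parse_stream` returns the first section intact with `complete == false`, so the receiver can still report what arrived. This is also why emitting the more critical sections first matters more than the exact order.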