-
Notifications
You must be signed in to change notification settings - Fork 15
[crashtracking] add doc to outline UNIX pipe communication #1239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,46 @@ | ||
// Copyright 2025-Present Datadog, Inc. https://www.datadoghq.com/ | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
//! Crash data collector process management for Unix socket communication. | ||
//! | ||
//! This module manages the collector process that writes crash data to Unix sockets. | ||
//! The collector runs in a forked child process and is responsible for serializing | ||
//! and transmitting crash information to the receiver process. | ||
//! | ||
//! ## Communication Flow (Collector Side) | ||
//! | ||
//! The collector performs these steps to transmit crash data: | ||
//! | ||
//! 1. **Process Setup**: Forks from crashing process, closes stdio, disables SIGPIPE | ||
//! 2. **Socket Creation**: Creates `UnixStream` from inherited file descriptor | ||
//! 3. **Data Serialization**: Calls [`emit_crashreport()`] to write structured crash data | ||
//! 4. **Graceful Exit**: Flushes data and exits with `libc::_exit(0)` | ||
//! | ||
//! ```text | ||
//! ┌─────────────────────┐ ┌──────────────────────┐ | ||
//! │ Signal Handler │ │ Collector Process │ | ||
//! │ (Original Process) │ │ (Forked Child) │ | ||
//! │ │ │ │ | ||
//! │ 1. Catch crash │────fork()──────────►│ 2. Setup stdio │ | ||
//! │ 2. Fork collector │ │ 3. Create UnixStream │ | ||
//! │ 3. Wait for child │ │ 4. Write crash data │ | ||
//! │ │◄────wait()──────────│ 5. Exit cleanly │ | ||
//! └─────────────────────┘ └──────────────────────┘ | ||
//! ``` | ||
//! | ||
//! ## Signal Safety | ||
//! | ||
//! All collector operations use only async-signal-safe functions since the collector | ||
//! runs in a signal handler context: | ||
//! | ||
//! - No memory allocations | ||
//! - Pre-prepared data structures | ||
//! - Only safe system calls | ||
//! | ||
//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. | ||
//! | ||
//! [`emit_crashreport()`]: crate::collector::emitters::emit_crashreport | ||
|
||
use super::process_handle::ProcessHandle; | ||
use super::receiver_manager::Receiver; | ||
use ddcommon::timeout::TimeoutManager; | ||
|
@@ -25,6 +65,42 @@ pub enum CollectorSpawnError { | |
} | ||
|
||
impl Collector { | ||
/// Spawns a collector process to write crash data to the Unix socket. | ||
/// | ||
/// This method forks a child process that will serialize and transmit crash data | ||
/// to the receiver process via the Unix socket established in the receiver. | ||
/// | ||
/// ## Process Architecture | ||
/// | ||
/// ```text | ||
/// Parent Process (Signal Handler) Child Process (Collector) | ||
/// ┌─────────────────────────────┐ ┌─────────────────────────────┐ | ||
/// │ 1. Catches crash signal │ │ 4. Closes stdio (0,1,2) │ | ||
/// │ 2. Forks collector process │──►│ 5. Disables SIGPIPE │ | ||
/// │ 3. Returns to caller │ │ 6. Creates UnixStream │ | ||
/// │ │ │ 7. Calls emit_crashreport() │ | ||
/// │ │ │ 8. Exits with _exit(0) │ | ||
/// └─────────────────────────────┘ └─────────────────────────────┘ | ||
Comment on lines
+77
to
+83
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The interleaving of steps here is a little confusing. Can we have whitespace to indicate which steps are concurrent? |
||
/// ``` | ||
/// | ||
/// ## Arguments | ||
/// | ||
/// * `receiver` - The receiver process that will read crash data from the Unix socket | ||
/// * `config` - Crash tracker configuration | ||
/// * `config_str` - JSON-serialized configuration string | ||
/// * `metadata_str` - JSON-serialized metadata string | ||
/// * `sig_info` - Signal information from the crash | ||
/// * `ucontext` - Process context at crash time | ||
/// | ||
/// ## Returns | ||
/// | ||
/// * `Ok(Collector)` - Handle to the spawned collector process | ||
/// * `Err(CollectorSpawnError::ForkFailed)` - If the fork operation fails | ||
/// | ||
/// ## Safety | ||
/// | ||
/// This function is called from signal handler context and uses only async-signal-safe operations. | ||
/// The child process performs all potentially unsafe operations after fork. | ||
pub(crate) fn spawn( | ||
receiver: &Receiver, | ||
config: &CrashtrackerConfiguration, | ||
|
@@ -33,8 +109,8 @@ impl Collector { | |
sig_info: *const siginfo_t, | ||
ucontext: *const ucontext_t, | ||
) -> Result<Self, CollectorSpawnError> { | ||
// When we spawn the child, our pid becomes the ppid. | ||
// SAFETY: This function has no safety requirements. | ||
// When we spawn the child, our pid becomes the ppid for process tracking. | ||
// SAFETY: getpid() is async-signal-safe. | ||
let pid = unsafe { libc::getpid() }; | ||
|
||
let fork_result = alt_fork(); | ||
|
@@ -66,6 +142,42 @@ impl Collector { | |
} | ||
} | ||
|
||
/// Collector child process entry point - serializes and transmits crash data via Unix socket. | ||
/// | ||
/// This function runs in the forked collector process and performs the actual crash data | ||
/// transmission. It establishes the Unix socket connection and writes all crash information | ||
/// using the structured protocol. | ||
/// | ||
/// ## Process Flow | ||
/// | ||
/// 1. **Isolate from parent**: Closes stdin, stdout, stderr to prevent interference | ||
/// 2. **Signal handling**: Disables SIGPIPE to handle broken pipe gracefully | ||
/// 3. **Socket setup**: Creates `UnixStream` from inherited file descriptor | ||
/// 4. **Data transmission**: Calls [`emit_crashreport()`] to write structured crash data | ||
/// 5. **Clean exit**: Exits with `_exit(0)` to avoid cleanup issues | ||
/// | ||
/// ## Communication Protocol | ||
/// | ||
/// The crash data is written as a structured stream with delimited sections: | ||
/// - Metadata, Configuration, Signal Info, Process Context | ||
/// - Counters, Spans, Tags, Traces, Memory Maps, Stack Trace | ||
/// - Completion marker | ||
/// | ||
/// For details, see [`crate::shared::unix_socket_communication`]. | ||
/// | ||
/// ## Arguments | ||
/// | ||
/// * `config` - Crash tracker configuration object | ||
/// * `config_str` - JSON-serialized configuration for receiver | ||
/// * `metadata_str` - JSON-serialized metadata for receiver | ||
/// * `sig_info` - Signal information from crash context | ||
/// * `ucontext` - Processor context at crash time | ||
/// * `uds_fd` - Unix socket file descriptor for writing crash data | ||
/// * `ppid` - Parent process ID for identification | ||
/// | ||
/// This function never returns - it always exits via `_exit(0)` or `terminate()`. | ||
/// | ||
/// [`emit_crashreport()`]: crate::collector::emitters::emit_crashreport | ||
pub(crate) fn run_collector_child( | ||
config: &CrashtrackerConfiguration, | ||
config_str: &str, | ||
|
@@ -75,22 +187,24 @@ pub(crate) fn run_collector_child( | |
uds_fd: RawFd, | ||
ppid: libc::pid_t, | ||
) -> ! { | ||
// Close stdio | ||
let _ = unsafe { libc::close(0) }; | ||
let _ = unsafe { libc::close(1) }; | ||
let _ = unsafe { libc::close(2) }; | ||
// Close stdio to isolate from parent process and prevent interference with crash data transmission | ||
let _ = unsafe { libc::close(0) }; // stdin | ||
let _ = unsafe { libc::close(1) }; // stdout | ||
let _ = unsafe { libc::close(2) }; // stderr | ||
|
||
// Disable SIGPIPE | ||
// Disable SIGPIPE - if receiver closes socket early, we want to handle it gracefully | ||
// rather than being killed by SIGPIPE | ||
let _ = unsafe { | ||
signal::sigaction( | ||
signal::SIGPIPE, | ||
&SigAction::new(SigHandler::SigIgn, SaFlags::empty(), SigSet::empty()), | ||
) | ||
}; | ||
|
||
// Emit crashreport | ||
// Create Unix socket stream for crash data transmission | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nice catch |
||
let mut unix_stream = unsafe { UnixStream::from_raw_fd(uds_fd) }; | ||
|
||
// Serialize and transmit all crash data using structured protocol | ||
let report = emit_crashreport( | ||
&mut unix_stream, | ||
config, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,39 @@ | ||
// Copyright 2023-Present Datadog, Inc. https://www.datadoghq.com/ | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
//! Crash data emission and Unix socket protocol serialization. | ||
//! | ||
//! This module implements the collector-side serialization of crash data using the | ||
//! Unix socket communication protocol. It writes structured crash information to | ||
//! Unix domain sockets for consumption by receiver processes. | ||
//! | ||
//! ## Protocol Emission | ||
//! | ||
//! The emitter writes crash data as a series of delimited sections: | ||
//! | ||
//! 1. **Section Delimiters**: Uses constants from [`crate::shared::constants`] to mark boundaries | ||
//! 2. **Structured Data**: Writes JSON, text, or binary data within sections | ||
//! 3. **Immediate Flushing**: Flushes each section to ensure data integrity | ||
//! 4. **Completion Marker**: Ends transmission with `DD_CRASHTRACK_DONE` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I found this a little confusing, because the delimiters wrap the structured data, but the list here makes them seem like they only come before. |
||
//! | ||
//! ## Section Format Implementation | ||
//! | ||
//! Each section follows this pattern: | ||
//! ```text | ||
//! DD_CRASHTRACK_BEGIN_[SECTION] | ||
//! [section data - JSON, text, or binary] | ||
//! DD_CRASHTRACK_END_[SECTION] | ||
//! ``` | ||
//! | ||
//! ### Key Sections | ||
//! | ||
//! - **Stack Trace** (`emit_backtrace_by_frames`): Stack frames with optional symbol resolution | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we still resolve symbols in process? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we can pass a configuration to allow this. |
||
//! - **Signal Info** (`emit_siginfo`): Signal details from `siginfo_t` | ||
//! - **Process Context** (`emit_ucontext`): Processor state from `ucontext_t` | ||
//! - **Memory Maps** (`emit_file`): `/proc/self/maps` for symbol resolution | ||
//! | ||
//! For complete protocol documentation, see [`crate::shared::unix_socket_communication`]. | ||
|
||
use crate::collector::additional_tags::consume_and_emit_additional_tags; | ||
use crate::collector::counters::emit_counters; | ||
use crate::collector::spans::{emit_spans, emit_traces}; | ||
|
@@ -116,6 +149,52 @@ unsafe fn emit_backtrace_by_frames( | |
Ok(()) | ||
} | ||
|
||
/// Emits a complete crash report using the Unix socket communication protocol. | ||
/// | ||
/// This is the main function that orchestrates the emission of all crash data sections | ||
/// to the Unix socket. It writes the structured crash report in the order specified | ||
/// by the protocol, with proper delimiters and flushing for data integrity. | ||
/// | ||
/// ## Section Emission Order | ||
/// | ||
/// The crash report is written in this specific order: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we making this contractural? I'd say this is an implementation detail. The key part is that we try to emit more critical bits of info first, and bits that are more likely to crash last. Bits that are both critical and likely to crash are |
||
/// 1. **Metadata** - Application context, tags, environment info | ||
/// 2. **Configuration** - Crash tracker settings and endpoint info | ||
/// 3. **Signal Information** - Details from `siginfo_t` | ||
/// 4. **Process Context** - CPU state from `ucontext_t` | ||
/// 5. **Process Information** - Process ID | ||
/// 6. **Counters** - Internal crash tracker metrics | ||
/// 7. **Spans** - Active distributed tracing spans | ||
/// 8. **Additional Tags** - Extra tags collected at crash time | ||
/// 9. **Traces** - Active trace information | ||
/// 10. **Memory Maps** (Linux only) - `/proc/self/maps` content | ||
/// 11. **Stack Trace** - Stack frames with symbol resolution | ||
/// 12. **Completion Marker** - `DD_CRASHTRACK_DONE` | ||
/// | ||
/// ## Data Integrity | ||
/// | ||
/// Each section is immediately flushed after writing to ensure the receiver | ||
/// can process partial data even if the collector crashes during transmission. | ||
/// | ||
/// ## Arguments | ||
/// | ||
/// * `pipe` - Write stream (typically Unix socket) | ||
/// * `config` - Crash tracker configuration object | ||
/// * `config_str` - JSON-serialized configuration for receiver | ||
/// * `metadata_string` - JSON-serialized metadata | ||
/// * `sig_info` - Signal information from crash context | ||
/// * `ucontext` - Processor context at crash time | ||
/// * `ppid` - Parent process ID | ||
/// | ||
/// ## Returns | ||
/// | ||
/// * `Ok(())` - All crash data written successfully | ||
/// * `Err(EmitterError)` - I/O error or data serialization failure | ||
/// | ||
/// ## Signal Safety | ||
/// | ||
/// This function is designed to be called from signal handler context and uses | ||
/// only async-signal-safe operations where possible. | ||
pub(crate) fn emit_crashreport( | ||
pipe: &mut impl Write, | ||
config: &CrashtrackerConfiguration, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you line up the fork and wait arrows with the lines of code the represent?