Commit 316dd3e

authored
[crashtracking]: consolidate crashtracker log structure RFCs (#1204)
* New: Create first pass of consolidated crashtracker log structure RFCs * Specify v1.x
1 parent b58b727 commit 316dd3e

File tree

1 file changed

+291
-0
lines changed

@@ -0,0 +1,291 @@
1+
# RFC 0011: v1.X Crashtracker Structured Log Format
2+
3+
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [IETF RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119).
4+
5+
## Summary
6+
7+
This document consolidates and describes the complete evolution of the crashinfo data format from version 1.0 through 1.4. It serves as the authoritative specification for the crashtracker structured log format, replacing RFCs 0005-0009. Future minor version modifications will be included in this revisable document.
8+
9+
## Motivation
10+
11+
The `libdatadog` crashtracker detects program crashes and automatically collects information relevant to characterizing and debugging the crash, including stack-traces, crash-type (e.g. SIGSEGV, SIGBUS, etc), library version, etc. This RFC consolidates the standardized logging format that has evolved through multiple iterations to support enhanced debugging capabilities.
12+
13+
### Why structured json
14+
15+
As a text-based format, json can be written to standard logging endpoints.
16+
It is (somewhat) human readable, so users can directly interpret the crash info off their log if necessary.
17+
As a structured format, it avoids the ambiguity of standard semi-structured stacktrace formats (as used by e.g. Java, .NET, etc).
18+
Due to the use of native extensions, it is possible for a single stack-trace to include frames from multiple languages (e.g. python may call C code, which calls Rust code, etc).
19+
Having a single structured format allows us to work across languages.
20+
21+
## Current Format (Version 1.4)
22+
23+
This section describes the current format (version 1.4), which incorporates all features from versions 1.0 through 1.4. A natural language description of the json format is given here. An example is given in Appendix A, and the schema is given in Appendix B.
24+
25+
Any field not listed as "Required" is optional. Consumers MUST accept json with elided optional fields.
26+
27+
### Extensibility
28+
29+
The data-format has a REQUIRED `data_schema_version` field, which represents the semver version ID of the data.
30+
Following semver, collectors may add additional fields without affecting the major version number.
31+
Parsers SHOULD therefore accept unexpected fields, either by ignoring them, or by displaying them as additional data.
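
As an illustration of the tolerant parsing recommended above, the following minimal Python sketch separates the root-level fields a consumer understands from any it does not, instead of rejecting the report. The `KNOWN_ROOT_FIELDS` set, the helper name `split_known_and_extra`, and the sample values (including `future_field`) are hypothetical and not part of this specification.

```python
import json

# Hypothetical subset of root-level fields this consumer understands.
KNOWN_ROOT_FIELDS = {
    "data_schema_version", "error", "incomplete", "metadata",
    "os_info", "timestamp", "uuid",
}

def split_known_and_extra(raw):
    """Parse a report and separate understood fields from unexpected ones."""
    report = json.loads(raw)
    known = {k: v for k, v in report.items() if k in KNOWN_ROOT_FIELDS}
    # Unexpected fields are kept rather than treated as an error, so they
    # can be ignored or displayed as additional data.
    extra = {k: v for k, v in report.items() if k not in KNOWN_ROOT_FIELDS}
    return known, extra

# Hypothetical input: "future_field" stands in for a field added by a later minor version.
raw = '{"data_schema_version": "1.4", "uuid": "hypothetical-id", "future_field": 42}'
known, extra = split_known_and_extra(raw)
```

Keeping `extra` around, rather than discarding it, leaves the choice between ignoring and displaying unknown data to the caller.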
32+
33+
### Version Compatibility
34+
35+
Consumers of the crash data format SHOULD be designed to handle all versions from 1.0 to 1.4. The version is indicated by the `data_schema_version` field. Key compatibility considerations:
36+
- Version 1.0: Base format
37+
- Version 1.1+: Stacktraces may include an `incomplete` field
38+
- Version 1.2+: Root level may include an `experimental` field
39+
- Version 1.3+: Stackframes may include a `comments` field
40+
- Version 1.4+: Stackframes may include a `mangled_name` field
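
One way a consumer might use `data_schema_version` is to determine which of the optional fields listed above can appear in a report. The Python sketch below assumes versions are written as "major.minor" strings; the table labels and function names are illustrative only.

```python
# Optional fields and the (major, minor) version that introduced them,
# taken from the compatibility list above.
FIELD_INTRODUCED_IN = {
    "incomplete (StackTrace)": (1, 1),
    "experimental (root)": (1, 2),
    "comments (StackFrame)": (1, 3),
    "mangled_name (StackFrame)": (1, 4),
}

def parse_version(version):
    """Turn a string such as "1.4" into the tuple (1, 4)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def fields_available(version):
    """List the optional fields that a report of this version may contain."""
    parsed = parse_version(version)
    if parsed[0] != 1:
        raise ValueError(f"unsupported major version: {version}")
    return [name for name, since in FIELD_INTRODUCED_IN.items() if parsed >= since]
```

For example, `fields_available("1.2")` returns the stacktrace `incomplete` and root `experimental` entries only. Because fields are only ever added within the 1.x series, a consumer can also skip this check and treat every listed field as optional.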
41+
42+
### Fields
43+
44+
- `counters`: **[optional]**
45+
A map of names to integer values.
46+
At present, this is used by the profiler to track which operations were active at the time of the crash.
47+
- `data_schema_version`: **[required]**
48+
A string containing the semver ID of the crashtracker data schema. Current versions: "1.0", "1.1", "1.2", "1.3", "1.4".
49+
- `experimental`: **[optional]** *[Added in v1.2]*
50+
Any valid JSON object can be used as the value here.
51+
Note that the object MUST be valid JSON.
52+
Consumers of the format SHOULD pass this field along unmodified as the report is processed.
53+
This field allows developers to collect experimental data without requiring schema changes.
54+
- `error`: **[required]**
55+
- `threads`: **[optional]**
56+
An array of `Thread` objects.
57+
In a multi-threaded program, the collector SHOULD collect the stacktraces of all active threads, and report them here.
58+
A `Thread` object has the following fields:
59+
- `crashed`: **[required]**
60+
A boolean which tells if the thread crashed.
61+
- `name`: **[required]**
62+
Name of the thread (e.g. 'Thread 0').
63+
- `stack`: **[required]**
64+
The `StackTrace` of the thread.
65+
See below for more details on how stacktraces are formatted.
66+
- `state`: **[optional]**
67+
Platform-specific state of the thread when its state was captured (CPU registers dump for iOS, thread state enum for Android, etc.).
68+
Currently, this is a platform-dependent string.
69+
- `is_crash`: **[required]**
70+
Boolean true if the error was a crash, false otherwise.
71+
- `kind`: **[required]**
72+
The kind of error that occurred.
73+
For example, "Panic", "UnhandledException", "UnixSignal".
74+
- `message`: **[optional]**
75+
A human readable string containing an error message associated with the stack trace.
76+
- `source_type`: **[required]**
77+
The string "Crashtracking".
78+
- `stack`: **[required]**
79+
This represents the stack of the crashing thread.
80+
See below for more details on how stacktraces are formatted.
81+
- `files`: **[optional]**
82+
A `Map<filename, contents>` where `contents` is an array of plain text strings, one per line.
83+
Useful files for triage and debugging, such as `/proc/self/maps` or `/proc/meminfo`.
84+
- `fingerprint`: **[optional]**
85+
A string containing a summary or hash of crash information which can be used for deduplication.
86+
- `incomplete`: **[required]**
87+
Boolean `false` if the crashreport is complete (i.e. contains all intended data), `true` if there is expected missing data.
88+
This can happen because the crashtracker is architected to stream data to an out-of-process receiver, allowing a partial crash report to be emitted even in the case where the crashtracker itself crashed during stack trace collection.
89+
This MUST be set to `true` if any required field is missing.
90+
- `log_messages`: **[optional]**
91+
An array of strings containing log messages generated by the crashtracker.
92+
- `metadata`: **[required]**
93+
Metadata about the system in which the crash occurred:
94+
- `library_name`: **[required]**
95+
e.g. "dd-trace-python".
96+
- `library_version`: **[required]**
97+
e.g. "2.16.0".
98+
- `family`: **[required]**
99+
e.g. "python".
100+
- `tags`: **[optional]**
101+
A set of key:value pairs, representing any tags the crashtracking system wishes to associate with the crash.
102+
Examples would include "hostname", "service", and any configuration information the system wishes to track.
103+
- `os_info`: **[required]**
104+
The OS + processor architecture on which the crash occurred.
105+
Follows the display format of the [os_info crate](https://crates.io/crates/os_info).
106+
- `architecture`: **[required]**
107+
e.g. "arm64"
108+
- `bitness`: **[required]**
109+
e.g. "64-bit".
110+
- `os_type`: **[required]**
111+
e.g. "Mac OS".
112+
- `version`: **[required]**
113+
e.g. "14.7.0".
114+
- `proc_info`: **[optional]**
115+
A place to store information about the crashing process.
116+
In the future, this may have additional optional fields as more data is collected.
117+
- `pid`: **[required]**
118+
The PID of the crashing process.
119+
- `sig_info`: **[optional]**
120+
UNIX signal based collectors only: Useful information from the [siginfo_t](https://man7.org/linux/man-pages/man2/sigaction.2.html) structure.
121+
- `si_addr`: **[optional]**
122+
A hexadecimal string with the memory address at which the fault occurred, e.g. "0xDEADBEEF".
123+
- `si_code`: **[required]**
124+
An integer storing the [UNIX signal code](https://man7.org/linux/man-pages/man7/signal.7.html), e.g. `1` for a `SEGV_MAPERR`.
125+
- `si_code_human_readable`: **[required]**
126+
The signal code expressed as a human readable string, e.g. "SEGV_MAPERR".
127+
Follows the naming convention in [the manpage](https://man7.org/linux/man-pages/man7/signal.7.html).
128+
- `si_signo`: **[required]**
129+
An integer storing the [UNIX signal number](https://man7.org/linux/man-pages/man7/signal.7.html), e.g. `11` for a segmentation violation.
130+
- `si_signo_human_readable`: **[required]**
131+
The signal name, e.g. "SIGSEGV".
132+
Follows the naming convention in [the manpage](https://man7.org/linux/man-pages/man7/signal.7.html).
133+
- `span_ids`: **[optional]**
134+
A vector representing active span ids at the time of program crash.
135+
The collector MAY cap the number of spans that it tracks.
136+
- `id`: **[required]**
137+
A string containing the span id.
138+
- `thread_name`: **[optional]**
139+
A string containing the thread name for the given span.
140+
- `timestamp`: **[required]**
141+
The time at which the crash occurred, in ISO 8601 format.
142+
- `trace_ids`: **[optional]**
143+
A vector representing active trace ids at the time of program crash.
144+
The collector MAY cap the number of traces that it tracks.
145+
- `id`: **[required]**
146+
A string containing the trace id.
147+
- `thread_name`: **[optional]**
148+
A string containing the thread name for the given trace.
149+
- `uuid`: **[required]**
150+
A UUID v4 which uniquely identifies the crash.
151+
This will typically be generated at crash-time, and then associated with the uploaded crash.
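
The rule that `incomplete` MUST be `true` whenever a required field is missing can be sketched as follows. This Python example assumes the root-level required fields are the ones marked **[required]** in the list above; the helper names and the sample report values are hypothetical.

```python
# Root-level fields marked [required] in the list above.
REQUIRED_ROOT_FIELDS = (
    "data_schema_version", "error", "incomplete", "metadata",
    "os_info", "timestamp", "uuid",
)

def missing_required(report):
    """Return required root fields (other than `incomplete`) absent from the report."""
    return [
        name for name in REQUIRED_ROOT_FIELDS
        if name != "incomplete" and name not in report
    ]

def mark_incomplete(report):
    """Return a copy whose `incomplete` flag is true if any required field is missing."""
    out = dict(report)
    out["incomplete"] = bool(missing_required(report)) or bool(report.get("incomplete", False))
    return out

# Hypothetical partial report, e.g. streamed before the collector itself failed.
partial = {
    "data_schema_version": "1.4",
    "timestamp": "2024-01-01T00:00:00Z",  # hypothetical value
    "uuid": "hypothetical-id",            # hypothetical value
}
missing = missing_required(partial)
flagged = mark_incomplete(partial)
```

The sketch never clears a flag that the collector already set to `true`; it only raises it when required data is absent.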
152+
153+
### Stacktraces
154+
155+
Different languages and language runtimes have different representations of a stacktrace.
156+
The representation below attempts to collect as much information as possible.
157+
In addition, not all information may be available at crash-time on a given machine.
158+
For example, some libraries may have been shipped with debug symbols stripped, meaning that the only information available about a given frame may be the instruction pointer (`ip`) address, stored as a hexadecimal string such as "0xDEADBEEF".
159+
This address may be given as an absolute address, or a `NormalizedAddress`, which can be used by backend symbolication.
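
Because addresses are carried as hexadecimal strings, a consumer typically converts them to integers before doing any arithmetic. A minimal Python sketch, assuming the "0x"-prefixed form shown above (the function names are illustrative):

```python
def parse_address(text):
    """Convert a hexadecimal address string such as "0xDEADBEEF" to an integer."""
    # int() with base 16 accepts an optional "0x" prefix.
    return int(text, 16)

def format_address(value):
    """Render an integer address back into the "0x"-prefixed uppercase form."""
    return f"0x{value:X}"

ip = parse_address("0xDEADBEEF")
```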
160+
161+
A stacktrace consists of:
162+
163+
- `format`: **[required]**
164+
An identifier describing the format of the stack trace.
165+
Allows for extensibility to support different stack trace formats.
166+
The format described below is identified using the string "Datadog Crashtracker 1.0".
167+
- `frames`: **[required]**
168+
An array of `StackFrame`, described below.
169+
Note that each inlined function gets its own stack frame in this schema.
170+
- `incomplete`: **[optional]** *[Added in v1.1]*
171+
A boolean denoting whether the stacktrace may be missing frames, either due to intentional truncation, or an inability to fully collect a corrupted stack.
172+
173+
#### StackFrames
174+
175+
- **Absolute Addresses**
176+
The actual in-memory addresses used in the crashing process.
177+
Combined with mapping information, such as from `/proc/self/maps`, and the relevant binaries, this can be used to reconstruct relevant symbols.
178+
These fields follow the scheme used by the [backtrace crate](https://docs.rs/backtrace/latest/backtrace/struct.Frame.html)
179+
- `ip`: **[optional]**
180+
The current instruction pointer of this frame.
181+
This is normally the next instruction to execute in the frame, but not all implementations list this with 100% accuracy (but it’s generally pretty close).
182+
- `sp`: **[optional]**
183+
The current stack pointer of this frame.
184+
- `symbol_address`: **[optional]**
185+
The starting symbol address of the frame of this function.
186+
This will attempt to rewind the instruction pointer returned by ip to the start of the function, returning that value.
187+
In some cases, however, backends will just return ip from this function.
188+
- `module_base_address`: **[optional]**
189+
The base address of the module to which the frame belongs
190+
- **Relative Addresses**
191+
Addresses expressed as an offset into a given library or executable.
192+
Can be used by backend symbolication to generate debug names etc.
193+
Note that tracking this per stack frame can entail significant duplication of information.
194+
Adding a "modules" section and referencing it by index, as in the pprof specification, is future work.
195+
- `build_id`: **[optional]**
196+
A string identifying the build id of the module the address belongs to.
197+
For example, GNU build ids are hex strings "9944168df12b0b9b152113c4ad663bc27797fb15".
198+
Pdb build ids can be stored as a concatenation of the guid and the age (using a well-known separator).
199+
- `build_id_type`: **[required if `build_id` is set, optional otherwise]**
200+
The type of the `build_id`. E.g. "SHA1/GNU/GO/PDB/PE".
201+
- `file_type`: **[required if `relative_address` is set, optional otherwise]**
202+
The file type of the module containing the symbol, e.g. "ELF", "PDB", etc.
203+
- `relative_address`: **[optional]**
204+
The relative offset of the symbol in the base file (e.g. an ELF virtual address), given as a hexadecimal string.
205+
- `path`: **[required if `relative_address` is set, optional otherwise]**
206+
The path to the module containing the symbol.
207+
- **Debug information (e.g. "names")**
208+
Human readable debug information representing the location of the stack frame in the high-level code.
209+
Note that this is a best effort collection: for optimized code, it may be difficult to associate a given instruction back to file, line and column.
210+
- `column`: **[optional]**
211+
The column number in the given file where the symbol was defined.
212+
- `file`: **[optional]**
213+
The file name where this function was defined.
214+
Note that this may be either an absolute or relative path.
215+
- `line`: **[optional]**
216+
The line number in the given file where the symbol was defined.
217+
- `function`: **[optional]**
218+
The name of the function.
219+
This may or may not include module information.
220+
It may or may not be demangled (e.g. `_ZNSt28__atomic_futex_unsigned_base26_M_futex_wait_until_steadyEPjjbNSt6chrono8durationIlSt5ratioILl1ELl1EEEENS2_IlS3_ILl1ELl1000000000EEEE` vs `std::__atomic_futex_unsigned_base::_M_futex_wait_until_steady`).
221+
- `comments`: **[optional]** *[Added in v1.3]*
222+
An array of strings containing comments about the given stackframe.
223+
For example, if a stackframe failed to symbolicate, the crashtracker implementation may record the reason for the failure.
224+
- `mangled_name`: **[optional]** *[Added in v1.4]*
225+
A string containing the original mangled name of the function, if the function name was demangled.
226+
This field is only present when the function name has been demangled and the original mangled name differs from the demangled name.
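
The conditional requirements on a `StackFrame` (`build_id_type` when `build_id` is set, `file_type` and `path` when `relative_address` is set, and `mangled_name` only when it differs from `function`) can be checked with a small helper. The Python sketch below is illustrative only; `frame_problems` and the sample frames are hypothetical, and the authoritative constraints are those in the schema in Appendix B.

```python
def frame_problems(frame):
    """Return human readable descriptions of conditional-field violations in a frame."""
    problems = []
    if "build_id" in frame and "build_id_type" not in frame:
        problems.append("build_id_type is required when build_id is set")
    if "relative_address" in frame:
        for name in ("file_type", "path"):
            if name not in frame:
                problems.append(f"{name} is required when relative_address is set")
    # mangled_name is only expected when demangling changed the name.
    if "mangled_name" in frame and frame["mangled_name"] == frame.get("function"):
        problems.append("mangled_name should differ from function")
    return problems

# Hypothetical frames for illustration.
stripped_frame = {"ip": "0xDEADBEEF"}  # only the instruction pointer is known
bad_frame = {
    "relative_address": "0x1234",
    "build_id": "9944168df12b0b9b152113c4ad663bc27797fb15",
}
```

Here `frame_problems(stripped_frame)` returns an empty list, since every field on a frame is individually optional, while `bad_frame` yields one problem for each of `build_id_type`, `file_type`, and `path`.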
227+
228+
## Version History
229+
230+
This section documents the evolution of the crashtracker structured log format across versions 1.0 through 1.4. The current specification above reflects version 1.4, which includes all features from previous versions.
231+
232+
### Version 1.0 (RFC 0005)
233+
*Initial version*
234+
235+
- Established the base JSON schema for crash reporting
236+
- Defined core fields: `counters`, `data_schema_version`, `error`, `files`, `fingerprint`, `incomplete`, `log_messages`, `metadata`, `os_info`, `proc_info`, `sig_info`, `span_ids`, `timestamp`, `trace_ids`, `uuid`
237+
- Defined stacktrace format with `format` and `frames` fields
238+
- Defined comprehensive stackframe schema with absolute addresses, relative addresses, and debug information
239+
240+
### Version 1.1 (RFC 0006)
241+
*Added incomplete stacktraces*
242+
243+
**Changes from v1.0:**
244+
- Added `incomplete` field to `StackTrace` objects (optional boolean)
245+
- Updated `data_schema_version` to "1.1"
246+
247+
**Motivation:** Some stacktraces may be incomplete due to intentional truncation for performance reasons or unintentional failure (e.g., corrupted stack). The `incomplete` flag allows consumers to know that frames may be missing.
248+
249+
### Version 1.2 (RFC 0007)
250+
*Added experimental field*
251+
252+
**Changes from v1.1:**
253+
- Added `experimental` field at root level (optional JSON object)
254+
- Updated `data_schema_version` to "1.2"
255+
256+
**Motivation:** Developers may wish to collect experimental data without requiring frequent schema changes. The `experimental` field allows ad-hoc data collection that can later be promoted to structured fields once proven valuable.
257+
258+
### Version 1.3 (RFC 0008)
259+
*Added stackframe comments*
260+
261+
**Changes from v1.2:**
262+
- Added `comments` field to `StackFrame` objects (optional array of strings)
263+
- Updated `data_schema_version` to "1.3"
264+
265+
**Motivation:** Crashtracker implementations may have additional information about stackframes that doesn't fit the current schema (e.g., symbolication failure reasons). Comments provide a way to record this information for debugging purposes.
266+
267+
### Version 1.4 (RFC 0009)
268+
*Added stackframe mangled names*
269+
270+
**Changes from v1.3:**
271+
- Added `mangled_name` field to `StackFrame` objects (optional string)
272+
- Updated `data_schema_version` to "1.4"
273+
274+
**Motivation:** When symbol names are demangled for readability, the original mangled names are lost. This makes debugging difficult when mangled names are needed (e.g., comparing against compiler-generated symbols). The `mangled_name` field preserves the original mangled name when demangling occurs.
275+
276+
## Appendix A: Example output
277+
278+
An example crash report in version 1.0 format is [available here](artifacts/0005-crashtracker-example.json).
279+
280+
Note: This example uses version 1.0 format. Version 1.1+ may include additional fields such as `incomplete` in stacktraces, `experimental` at the root level, `comments` in stackframes, and `mangled_name` in stackframes.
281+
282+
## Appendix B: Json Schema
283+
284+
The current JSON schema (version 1.4) is [available here](artifacts/0009-crashtracker-schema.json).
285+
286+
Historical schemas are also available:
287+
- [Version 1.0 schema](artifacts/0005-crashtracker-schema.json)
288+
- [Version 1.1 schema](artifacts/0006-crashtracker-schema.json)
289+
- [Version 1.2 schema](artifacts/0007-crashtracker-schema.json)
290+
- [Version 1.3 schema](artifacts/0008-crashtracker-schema.json)
291+
- [Version 1.4 schema](artifacts/0009-crashtracker-schema.json)
