Commit a505815

philipphofmann, romtsn, constantinius, sentrivana, and shellmayr authored
feat(develop-docs): TelemetryBuffer Process Terminations (#15274)
This PR migrates the logs for crashes [RFC](https://github.com/getsentry/rfcs/blob/main/text/0148-logs-for-crashes.md) to the BatchProcessor.

Co-authored-by: Roman Zavarnitsyn <[email protected]>
Co-authored-by: Fabian Schindler <[email protected]>
Co-authored-by: Ivana Kellyer <[email protected]>
Co-authored-by: Simon Hellmayr <[email protected]>
Co-authored-by: Abhijeet Prasad <[email protected]>
Co-authored-by: Serhii Snitsaruk <[email protected]>
Co-authored-by: Dominik Dorfmeister <[email protected]>
Co-authored-by: Sarah Mischinger <[email protected]>
Co-authored-by: Shannon Anahata <[email protected]>
Co-authored-by: Shannon Anahata <[email protected]>
Co-authored-by: Abdellah Hariti <[email protected]>
Co-authored-by: Alex Krawiec <[email protected]>
1 parent 91e3ca4 commit a505815

2 files changed

+107
-2
lines changed

develop-docs/sdk/telemetry/telemetry-buffer/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ The BatchProcessor MUST forward all spans and logs in memory to the transport to
5757
2. When the user calls `SentrySDK.close()`, the BatchProcessor MUST forward all data in memory to the transport. SDKs SHOULD keep their existing closing behavior.
5858
3. When the application shuts down gracefully, the BatchProcessor SHOULD forward all data in memory to the transport. The transport SHOULD keep its existing behavior, which usually stores the data to disk as an envelope. It is not required to call transport `flush`. This is mostly relevant for mobile SDKs already subscribed to these hooks, such as [applicationWillTerminate](https://developer.apple.com/documentation/uikit/uiapplicationdelegate/applicationwillterminate(_:)) on iOS.
5959
4. When the application moves to the background, the BatchProcessor SHOULD forward all data in memory to the transport and stop the timer. The transport SHOULD keep its existing behavior, which usually stores the data to disk as an envelope. It is not required to call transport `flush`. This is mostly relevant for mobile SDKs.
60-
5. We're working on concept for crashes, and will update the specification when we have more details.
60+
5. Mobile SDKs MUST minimize data loss when sudden process terminations occur. Refer to the [Mobile Telemetry Buffer](/sdk/telemetry/telemetry-buffer/mobile-telemetry-buffer) section for more details.
6161

6262
The detailed specification is written in the [Gherkin syntax](https://cucumber.io/docs/gherkin/reference/). The specification uses spans as an example, but the same applies to logs or any other future telemetry data.
6363

develop-docs/sdk/telemetry/telemetry-buffer/mobile-telemetry-buffer.mdx

Lines changed: 106 additions & 1 deletion
@@ -4,4 +4,109 @@ description: Detailed mobile telemetry buffer design.
44
sidebar_order: 4
55
---
66

7-
To be defined — full spec lives here.
7+
<Alert level="warning">
8+
🚧 This concept is approved but not yet implemented in any SDKs, and it’s still being validated. If something feels unclear, too complex, or doesn’t work as expected, please open an issue or PR and tag @philipphofmann for review. Feedback and improvements are welcome while we confirm the approach makes sense. 🚧
9+
</Alert>
10+
11+
For the common specification for the telemetry buffer, refer to the [Telemetry Buffer](/sdk/telemetry/telemetry-buffer/) page. This page describes the mobile-specific implementation of the telemetry buffer. The most important difference is that the mobile telemetry buffer is designed to minimize data loss when sudden process terminations occur, such as crashes or watchdog terminations.
12+
13+
Each SDK environment is unique, so SDKs have three options to choose from to minimize data loss. The options are ordered by increasing complexity: the first is the simplest, and the last is the most complicated. SDKs SHOULD implement the least complex option that is suitable for their environment.
14+
15+
## 1. Flush All Data
16+
17+
When the SDK detects a sudden process termination, it MUST put all remaining items in the telemetry buffer into one envelope and flush it. If your SDK has an offline cache, it MAY flush the envelope to disk and skip sending it to Sentry, provided it sends the envelope the next time the SDK starts. The telemetry buffer MUST keep its existing logic described in the [Telemetry Buffer Specification](/sdk/telemetry/telemetry-buffer/#specification) section.
18+
19+
Suppose your SDK can't reliably detect sudden process terminations, or it can't reliably flush envelopes to Sentry or disk when a sudden process termination happens. In that case, it SHOULD implement the [FileStream Cache](#2-filestream-cache) or the [DoubleRotatingBuffer](#3-doublerotatingbuffer). It's acceptable to start with the Flush All Data option as a best-effort interim solution before adding one of the more complex options.
20+
21+
## 2. FileStream Cache
22+
23+
SDKs for which blocking the main thread is a no-go, such as Android and Apple, MUST NOT implement this option. They SHOULD implement the [DoubleRotatingBuffer](#3-doublerotatingbuffer) instead.
24+
25+
With this option, the telemetry buffer stores the data directly to disk on the calling thread. The SDK SHOULD store the telemetry buffer files in a folder named `telemetry-buffer` that is a sibling of the `envelopes` or `replay` folder. This folder is scoped per DSN, so SDKs don't mix up data for different DSNs. In the `telemetry-buffer` folder, the SDK MUST store two types of cache files:
26+
27+
- **`cache`** - The file the processor is actively writing to
28+
- **`flushing`** - The file being converted to an envelope and sent to Sentry
29+
30+
When the timeout expires or the cache file hits the size limit, the telemetry buffer renames the `cache` file to `flushing`, creates a new `cache` file for incoming data, converts the data in the `flushing` file to an envelope, sends it to Sentry, and then deletes the `flushing` file. When the SDK starts again, it MUST check if there are any cache files in the cache directory (both `cache` and `flushing`) and if so, it MUST load the data from the files and send it to Sentry.
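The rotation described above can be sketched roughly as follows. This is a minimal illustration, not an SDK implementation: the class name `FileStreamCache`, the `send` callback, the newline-delimited JSON file format, and the `max_bytes` parameter are all assumptions made for the example, and the timer that triggers `flush()` on timeout is omitted.

```python
import json
import os


class FileStreamCache:
    """Hypothetical sketch of the FileStream Cache; all names are illustrative."""

    def __init__(self, folder, send, max_bytes=1024 * 1024):
        self.send = send  # assumed callback that sends a list of items as one envelope
        self.max_bytes = max_bytes
        self.cache = os.path.join(folder, "cache")
        self.flushing = os.path.join(folder, "flushing")
        os.makedirs(folder, exist_ok=True)
        self.recover()

    def add(self, item):
        # Write on the calling thread so the item is on disk before add() returns.
        with open(self.cache, "a", encoding="utf-8") as f:
            f.write(json.dumps(item) + "\n")
            f.flush()
            os.fsync(f.fileno())
        if os.path.getsize(self.cache) >= self.max_bytes:
            self.flush()

    def flush(self):
        # Called when the size limit is hit, or by a timer (not shown).
        if os.path.exists(self.flushing):
            # A previous flush did not finish; send it before rotating again.
            self._send_file(self.flushing)
        if not os.path.exists(self.cache):
            return
        os.replace(self.cache, self.flushing)  # rename `cache` to `flushing`
        # The next add() creates a fresh `cache` file (append mode).
        self._send_file(self.flushing)

    def recover(self):
        # On SDK start, send whatever a previous run left behind.
        for path in (self.flushing, self.cache):
            if os.path.exists(path):
                self._send_file(path)

    def _send_file(self, path):
        items = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    items.append(json.loads(line))
                except json.JSONDecodeError:
                    pass  # skip a blank or partially written line
        if items:
            self.send(items)
        os.remove(path)  # delete only after send() returned without raising
```

Deleting the file only after `send` returns means a failure during sending leaves the `flushing` file in place for the next flush or the next start.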
31+
32+
33+
## 3. DoubleRotatingBuffer
34+
35+
SDKs should only consider implementing this option when options [1](#1-flush-all-data) or [2](#2-filestream-cache) are insufficient to prevent data loss within their ecosystem. We recommend this option only if SDKs are unable to reliably detect sudden process terminations or consistently store envelopes to disk during such terminations.
36+
37+
The telemetry buffer uses two buffers to minimize data loss in the event of an abnormal process termination:
38+
* **Crash-Safe List**: A list stored in a crash-safe space to prevent data loss during detectable abnormal process terminations.
39+
* **Async IO Cache**: When a process terminates without the SDK being able to detect it, the crash-safe list loses all its elements. Therefore, the telemetry buffer uses a second buffer, the async IO cache, that stores elements to disk on a background thread to avoid blocking the calling thread, which ensures minimal data loss when such terminations occur.
40+
41+
As the telemetry buffer MUST prevent data loss during flushing, it uses a double-buffering solution. The crash-safe list has two lists `crash-safe-list-1` and `crash-safe-list-2`, and the async IO cache has two files `async-io-cache-1` and `async-io-cache-2`. When `crash-safe-list-1` is full, the telemetry buffer stores any new incoming items in `crash-safe-list-2` until it successfully stores items from `crash-safe-list-1` to disk as an envelope. Then it can delete items in `crash-safe-list-1`. The same applies to the async IO cache.
42+
43+
### Telemetry Buffer Files
44+
45+
The SDK SHOULD store the telemetry buffer files in a folder named `telemetry-buffer` that is a sibling of the `envelopes` or `replay` folder. This folder is scoped per DSN, so SDKs don't mix up data for different DSNs. The `telemetry-buffer` folder MAY contain the following files:
46+
47+
- `async-io-cache-1` and `async-io-cache-2` - The async IO cache files.
48+
- `detected-termination-x` - The file containing items from the crash-safe list from a previous detected abnormal termination.
49+
- `envelope-x` - The envelope that the telemetry buffer is about to move to the envelopes cache folder, so the SDK can send it to Sentry, where `x` is an increasing index of the file starting from 0.
50+
51+
52+
### Receiving Items
53+
54+
The telemetry buffer has two lists `crash-safe-list-1` and `crash-safe-list-2` and two files `async-io-cache-1` and `async-io-cache-2`. When it receives items, it performs the following steps:
55+
56+
1. Put the item into `crash-safe-list-1` on the calling thread.
57+
2. On a background thread, store the item in the `async-io-cache-1`.
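As a rough sketch of these two steps, the example below appends to an in-memory list on the calling thread and hands the disk write to a background thread through a queue. The class name `ItemReceiver`, the `wait_for_disk` helper, and the newline-delimited JSON format are assumptions for illustration. A plain Python list is only a stand-in here: it is not crash-safe, and a real SDK would need platform-specific storage for the crash-safe list.

```python
import json
import os
import queue
import threading


class ItemReceiver:
    """Hypothetical sketch of receiving items; names are illustrative."""

    def __init__(self, buffer_dir):
        os.makedirs(buffer_dir, exist_ok=True)
        self.cache_path = os.path.join(buffer_dir, "async-io-cache-1")
        self.crash_safe_list = []  # stand-in for `crash-safe-list-1`
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._write_loop, daemon=True)
        worker.start()

    def add(self, item):
        # Step 1: cheap in-memory append on the calling thread.
        self.crash_safe_list.append(item)
        # Step 2: hand the item to the background thread; no disk IO here.
        self._pending.put(item)

    def _write_loop(self):
        # Runs on the background thread and appends items to the cache file.
        while True:
            item = self._pending.get()
            with open(self.cache_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(item) + "\n")
            self._pending.task_done()

    def wait_for_disk(self):
        # Helper for tests and examples: block until queued items are written.
        self._pending.join()
```

Items that are still in the queue when the process dies are the ones an undetectable termination can lose, which is why the window right after receiving an item is the risky one.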
58+
59+
### Flushing
60+
61+
When `crash-safe-list-1` exceeds the [above described](#specification) 1MiB in size or the timeout expires, the telemetry buffer performs the following flushing steps:
62+
63+
1. Store new incoming items to the `crash-safe-list-2` and `async-io-cache-2`.
64+
2. Put the items of `crash-safe-list-1` into an envelope named `envelope-x`.
65+
3. Delete the items in `crash-safe-list-1` and `async-io-cache-1`.
66+
4. Move the `envelope-x` to the envelopes cache folder, in which all the other envelopes are stored, so the SDK can send it to Sentry.
67+
68+
The telemetry buffer doesn't store `envelope-x` directly in the envelopes cache folder because, if an abnormal process termination occurred before it deleted the items in `crash-safe-list-1` and `async-io-cache-1`, the SDK might send duplicate items.
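The four flushing steps can be sketched as below. This is an illustrative sketch under several assumptions: the class name `DoubleRotatingBuffer`, its method names, and the JSON "envelope" format are invented for the example (a real envelope uses the Sentry envelope format), the lists are plain in-memory stand-ins for the crash-safe lists, and the cache write happens inline instead of on a background thread to keep the example short.

```python
import json
import os


class DoubleRotatingBuffer:
    """Hypothetical sketch of the flushing steps; names are illustrative."""

    def __init__(self, buffer_dir, envelopes_dir):
        self.buffer_dir = buffer_dir
        self.envelopes_dir = envelopes_dir
        os.makedirs(buffer_dir, exist_ok=True)
        os.makedirs(envelopes_dir, exist_ok=True)
        self.lists = {1: [], 2: []}  # stand-ins for crash-safe-list-1 and -2
        self.active = 1
        self.envelope_index = 0

    def cache_path(self, n):
        return os.path.join(self.buffer_dir, f"async-io-cache-{n}")

    def add(self, item):
        self.lists[self.active].append(item)
        # A real SDK would perform this write on a background thread.
        with open(self.cache_path(self.active), "a", encoding="utf-8") as f:
            f.write(json.dumps(item) + "\n")

    def flush(self):
        full = self.active
        if not self.lists[full]:
            return None
        # Step 1: route new incoming items to the other list and cache file.
        self.active = 2 if full == 1 else 1
        # Step 2: stage the items as `envelope-x` inside the buffer folder.
        name = f"envelope-{self.envelope_index}"
        self.envelope_index += 1
        staged = os.path.join(self.buffer_dir, name)
        with open(staged, "w", encoding="utf-8") as f:
            json.dump(self.lists[full], f)
        # Step 3: delete the flushed items from the list and its cache file.
        self.lists[full].clear()
        if os.path.exists(self.cache_path(full)):
            os.remove(self.cache_path(full))
        # Step 4: move the envelope to the envelopes cache folder.
        final = os.path.join(self.envelopes_dir, name)
        os.replace(staged, final)
        return final
```

The move in step 4 happens only after the items are deleted in step 3, which mirrors the ordering the steps above rely on to avoid duplicates.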
69+
70+
71+
### Abnormal Process Termination
72+
73+
When SDKs detect an abnormal process termination, they MUST write the items in both `crash-safe-list-1` and `crash-safe-list-2` to the `detected-termination-x` file, where `x` is an increasing index of the file starting from 0.
74+
75+
When the process terminates abnormally and the SDKs can't detect it, the SDKs lose items from the crash-safe lists, which we consider preferable to blocking the calling thread, which could be the main thread. However, the SDKs don't lose items from the async IO cache.
76+
77+
So the SDK MAY lose items due to undetectable abnormal process terminations that occur immediately after receiving an item, but it won't lose items due to detectable abnormal process terminations.
78+
79+
```
80+
// SDKs keep this log
81+
Sentry.logger.info("We might crash now.");
82+
83+
crash();
84+
```
85+
86+
```
87+
// SDKs might lose this log
88+
Sentry.logger.info("We are going to allocate loads of memory now.");
89+
90+
// Something allocating a lot of memory leading to a watchdog termination due to out of memory.
91+
allocateAllMemory();
92+
```
93+
94+
### SDK Initialization
95+
96+
Whenever the SDKs initialize, they MUST check if there is any data in the telemetry buffer folder that needs to be recovered. They MUST perform the following steps when initializing:
97+
98+
1. Load all items from `async-io-cache-1`, `async-io-cache-2` and `detected-termination-x` into memory if they exist. When the application terminates normally, these files don't exist.
99+
2. Deduplicate the items based on the IDs of the items and store the deduplicated items in the `envelope-x` file.
100+
3. Create new `async-io-cache-1` and `async-io-cache-2` files, and delete the `detected-termination-x` file.
101+
4. Now the telemetry buffer can start receiving new items.
102+
5. Move the `envelope-x` to the envelopes cache folder.
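The initialization steps above could look roughly like this. The function name `recover_on_init`, the newline-delimited JSON file format, the `id` field used for deduplication, and the way the next free `envelope-x` index is chosen are assumptions for illustration, not part of the specification.

```python
import glob
import json
import os


def recover_on_init(buffer_dir, envelopes_dir):
    """Hypothetical sketch of recovery on SDK start; returns the envelope path or None."""
    os.makedirs(buffer_dir, exist_ok=True)
    os.makedirs(envelopes_dir, exist_ok=True)
    caches = [os.path.join(buffer_dir, f"async-io-cache-{n}") for n in (1, 2)]
    detected = sorted(glob.glob(os.path.join(buffer_dir, "detected-termination-*")))

    # Step 1: load all items from the files that exist.
    items = []
    for path in caches + detected:
        if not os.path.exists(path):
            continue
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    items.append(json.loads(line))
                except json.JSONDecodeError:
                    pass  # skip a blank or partially written line

    # Step 2: deduplicate by item ID, keeping the first occurrence, and stage the result.
    seen, unique = set(), []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    staged = None
    if unique:
        index = 0
        while os.path.exists(os.path.join(buffer_dir, f"envelope-{index}")):
            index += 1
        staged = os.path.join(buffer_dir, f"envelope-{index}")
        with open(staged, "w", encoding="utf-8") as f:
            json.dump(unique, f)

    # Step 3: create fresh cache files and delete the detected-termination files.
    for path in caches:
        open(path, "w", encoding="utf-8").close()
    for path in detected:
        os.remove(path)

    # Step 4: the telemetry buffer can start receiving new items here.

    # Step 5: move the staged envelope to the envelopes cache folder.
    if staged is None:
        return None
    final = os.path.join(envelopes_dir, os.path.basename(staged))
    os.replace(staged, final)
    return final
```

Deduplication matters because an item can be present in both an async IO cache file and a `detected-termination-x` file after a detected termination.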
103+
104+
### SDK Closes
105+
106+
Whenever the user closes the SDK or the application terminates normally, the telemetry buffer MUST perform the steps described in the [Flushing](#flushing) section, and the SDK MUST delete all items in the `async-io-cache-1` and `async-io-cache-2` files.
107+
108+
## Miscellaneous
109+
110+
The telemetry buffer maintains its logic of batching multiple logs and spans together into a single envelope to avoid multiple HTTP requests.
111+
112+
Hybrid SDKs pass every log and span down to the native SDKs once the logs and spans are ready for sending, meaning after they go through `beforeLog`, integrations, processors, etc. The native SDKs then put every log and span in their telemetry buffer and its cache.
