Commit ff0f896

Apply suggestions from code review
Co-authored-by: Colleen McGinnis <[email protected]>
1 parent 2aaff33 commit ff0f896

1 file changed: +9 -5 lines changed

solutions/observability/apps/transaction-sampling.md

Lines changed: 9 additions & 5 deletions
@@ -135,13 +135,17 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ
 
 ### Tail-based sampling performance and requirements [_tail_based_sampling_performance_and_requirements]
 
-Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made.
+Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded when a sampling decision is made.
 
-In APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed.
+In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](#sampling-tail-storage_limit) is insufficient, sampling will be bypassed.
 
 It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate.
 
-To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load, receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a **10% sample rate in the tail sampling policy**. Please note that these figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace.
+To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load that is receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a **10% sample rate in the tail sampling policy**.
+
+:::{important}
+These figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace.
+:::
 
 Terminology:
 
@@ -166,8 +170,8 @@ Terminology:
 
 When interpreting these numbers, note that:
 
-* The metrics are inter-related. For example, it is reasonable to see a higher memory usage and disk usage when event ingestion rate is higher.
-* Related to the previous point, event ingestion rate and event indexing rate competes for disk IO. It explains the outlier data point where 8.18 with a 32GB NVMe disk shows a higher ingest rate but a slower event indexing rate than in 9.0.
+* The metrics are inter-related. For example, it is reasonable to see higher memory usage and disk usage when the event ingestion rate is higher.
+* The event ingestion rate and event indexing rate competes for disk IO. This is why there is an outlier data point where APM Server version 8.18 with a 32GB NVMe disk shows a higher ingest rate but a slower event indexing rate than in 9.0.
 
 The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation compresses data, as well as cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. In version 8.18, as the database grows larger, the performance slowdown can become disproportionate.
 