
Commit 2fda2cf

Clean up headers
1 parent 8df9fe8 commit 2fda2cf

File tree

1 file changed: +23 -17 lines changed


solutions/observability/apps/transaction-sampling.md

Lines changed: 23 additions & 17 deletions
@@ -135,28 +135,34 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ
### Tail-based sampling performance and requirements [_tail_based_sampling_performance_and_requirements]

- Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made.
+ Tail-based sampling (TBS), by definition, requires temporarily storing events locally so that they can be retrieved and forwarded once a sampling decision is made.

In the APM Server implementation, events are temporarily stored on disk rather than in memory for better scalability. As a result, tail-based sampling requires local disk storage proportional to the APM event ingestion rate, plus additional memory to facilitate disk reads and writes. An insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. (A minimal configuration sketch follows the diff below.)

It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate.

- To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace.
-
- | APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Event indexing rate (throughput from APM Server to Elasticsearch) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB |
- |--------------------|:------------------|:-----------------------------------------------|----------------------------------------------------------------------------|:------------------------------------------------------------------------------|--------------------------------------------|------------------|
- | 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 |
- | 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 |
- | 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 |
- | 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 |
- | 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 |
- | 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 |
- | 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 |
- | 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 50 | 5.24 | 24.3 |
- | 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 820 | 7.19 | 30.6 |
- | 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 |
- | 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 |
- | 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 |
+ To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load, receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a 10% sample rate in the tail sampling policy. Please note that these figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace.
+
+ Terminology:
+
+ * Event Ingestion Rate: The throughput from the APM agent to the APM Server using the Intake v2 protocol (the protocol used by Elastic APM agents), measured in events per second.
+ * Event Indexing Rate: The throughput from the APM Server to Elasticsearch, measured in events per second or documents per second.
+ * Memory Usage: The maximum Resident Set Size (RSS) of the APM Server process observed throughout the benchmark.
+
+ | APM Server version | EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) |
+ |--------------------|:------------------|:-----------------------------------------------|---------------------------------|:-------------------------------|-------------------|-----------------|
+ | 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 |
+ | 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 |
+ | 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 |
+ | 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 |
+ | 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 |
+ | 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 |
+ | 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 |
+ | 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 50 | 5.24 | 24.3 |
+ | 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 820 | 7.19 | 30.6 |
+ | 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 |
+ | 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 |
+ | 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 |

The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks.
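The context lines above reference the tail-based sampling storage limit, and the benchmark uses a 10% sample rate. For orientation, here is a minimal sketch of how those settings might be expressed in a standalone apm-server.yml; the keys correspond to the tail-sampling options documented on the changed page, while the specific values (1-minute decision interval, 5GB storage limit) are illustrative assumptions rather than recommendations.

```yaml
# Illustrative sketch only: standalone APM Server (apm-server.yml).
# Values are assumptions for demonstration, not sizing recommendations.
apm-server:
  sampling:
    tail:
      # Turn on tail-based sampling; events are buffered on local disk
      # until a sampling decision is made for each trace.
      enabled: true
      # How often sampling decisions are made.
      interval: 1m
      # Cap on local disk used for buffered events. A limit that is too
      # small for the ingest rate causes sampling to be bypassed.
      storage_limit: 5GB
      # Catch-all policy sampling 10% of traces, mirroring the rate used
      # in the benchmark above.
      policies:
        - sample_rate: 0.1
```

On Fleet-managed deployments, the equivalent options are typically configured through the APM integration policy rather than in apm-server.yml.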
