From fb34a66b20ba6c82c2e1dacd47d693083011cb07 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Thu, 13 Mar 2025 15:54:53 +0000 Subject: [PATCH 01/30] WIP --- solutions/observability/apps/transaction-sampling.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 4f7861a2da..c68e1f8423 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -133,6 +133,12 @@ Tail-based sampling is implemented entirely in APM Server, and will work with tr Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observability/apps/limitations.md#apm-open-telemetry-tbs) when using [tailsamplingprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor), we recommend using APM Server tail-based sampling instead. +### Tail-based sampling performance [_tail_based_sampling_performance] + +Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. + +It requires disk storage proportional to traffic received by APM, and additional memory to facilitate disk reads and writes. + ## Sampled data and visualizations [_sampled_data_and_visualizations] From fc7657267bd9c84f89a8c2c849ec210beda1f1ca Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Thu, 13 Mar 2025 15:56:32 +0000 Subject: [PATCH 02/30] Add requirements --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index c68e1f8423..aec83e9e51 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -133,7 +133,7 @@ Tail-based sampling is implemented entirely in APM Server, and will work with tr Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observability/apps/limitations.md#apm-open-telemetry-tbs) when using [tailsamplingprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor), we recommend using APM Server tail-based sampling instead. -### Tail-based sampling performance [_tail_based_sampling_performance] +### Tail-based sampling performance and requirements [_tail_based_sampling_performance_and_requirements] Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. From dda1d3eeca6308f8de4ee58a8b9613eafb908297 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Thu, 13 Mar 2025 16:04:18 +0000 Subject: [PATCH 03/30] Add fast disks --- solutions/observability/apps/transaction-sampling.md | 1 + 1 file changed, 1 insertion(+) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index aec83e9e51..73bb77e933 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -139,6 +139,7 @@ Tail-based sampling, by definition, requires storing events locally temporarily, It requires disk storage proportional to traffic received by APM, and additional memory to facilitate disk reads and writes. 
+It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM ingestion as a whole. ## Sampled data and visualizations [_sampled_data_and_visualizations] From 4b716c617069004bf49f311f4e168308eb566835 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Thu, 13 Mar 2025 16:15:18 +0000 Subject: [PATCH 04/30] Add a note about insufficient storage --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 73bb77e933..b7e80060bb 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -137,7 +137,7 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. -It requires disk storage proportional to traffic received by APM, and additional memory to facilitate disk reads and writes. +It requires disk storage proportional to traffic received by APM, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit-{{input-type}}) causes sampling to be bypassed. It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM ingestion as a whole. From fedba8d88b2f660ecd0c401d0df29562375b5bd6 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Thu, 13 Mar 2025 16:20:14 +0000 Subject: [PATCH 05/30] Disk rw requirements --- solutions/observability/apps/transaction-sampling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index b7e80060bb..d671ad3032 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -137,9 +137,9 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. -It requires disk storage proportional to traffic received by APM, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit-{{input-type}}) causes sampling to be bypassed. +It requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit-{{input-type}}) causes sampling to be bypassed. -It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM ingestion as a whole. 
+It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. ## Sampled data and visualizations [_sampled_data_and_visualizations] From 048f807525929aac46ee257d334d54528fabdc3d Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Thu, 13 Mar 2025 19:29:31 +0000 Subject: [PATCH 06/30] Fix link --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index b94d449c26..bf565d4ec6 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -137,7 +137,7 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. -It requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit-{{input-type}}) causes sampling to be bypassed. +It requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. From 4bee1a15bc2245af40011fe863c73bd348c8f27f Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Fri, 14 Mar 2025 14:09:05 +0000 Subject: [PATCH 07/30] Mention disk --- solutions/observability/apps/transaction-sampling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index bf565d4ec6..e5aaf5a5e3 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -135,9 +135,9 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ ### Tail-based sampling performance and requirements [_tail_based_sampling_performance_and_requirements] -Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. +Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. -It requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. 
+In APM Server implementation, the events are stored temporarily on disk instead of memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. From a41a66484ab7bebfd23df72d0dd29528fff421c9 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Fri, 14 Mar 2025 17:40:31 +0000 Subject: [PATCH 08/30] Add table for numbers --- solutions/observability/apps/transaction-sampling.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index e5aaf5a5e3..163b294d52 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -141,6 +141,17 @@ In APM Server implementation, the events are stored temporarily on disk instead It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. +To demonstrate the performance overhead and requirements, here are some numbers from APM Server 9.0 under full load, trace only APM events, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like average event size, average number of events per distributed trace. + +| APM Server instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | +|:------------------------------------|:------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| +| AWS EC2 c6i.2xlarge or c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | +| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | +| .. | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | +| AWS EC2 c6i.4xlarge or c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | +| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | +| .. | TBS enabled, local NVMe SSD from c6id instance | 47370 | 1.73 | 23.6 | + ## Sampled data and visualizations [_sampled_data_and_visualizations] A sampled trace retains all data associated with it. A non-sampled trace drops all [span](../../../solutions/observability/apps/spans.md) and [transaction](../../../solutions/observability/apps/transactions.md) data1. Regardless of the sampling decision, all traces retain [error](../../../solutions/observability/apps/errors.md) data. 
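For reference, the tail sampling policy and the `sampling.tail.storage_limit` setting discussed above belong to the `sampling.tail.*` configuration of a standalone APM Server. A minimal `apm-server.yml` sketch might look like the following; the key names follow those settings, while the interval, sample rates, and storage limit shown are illustrative assumptions to be tuned per deployment:

```yaml
apm-server:
  sampling:
    tail:
      # Turn on tail-based sampling in APM Server.
      enabled: true
      # Synchronization interval for sampling decisions
      # (typically tens of seconds to a few minutes).
      interval: 1m
      # Policies are evaluated in order; the final catch-all policy
      # (no match criteria) is required and sets the default sample rate.
      policies:
        - service.name: checkout   # keep more traces for a critical service (illustrative)
          sample_rate: 0.5
        - sample_rate: 0.1         # catch-all: keep 10% of all remaining traces
      # Cap on local disk space used for temporarily stored events; once the
      # limit is reached, tail-based sampling is bypassed and events are
      # indexed directly.
      storage_limit: 10GB
```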
From 8ec6ad028d4f96f5cd7fa488fbc59b4b6ec90f06 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Fri, 14 Mar 2025 17:43:50 +0000 Subject: [PATCH 09/30] Language --- .../apps/transaction-sampling.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 163b294d52..3f26c87e8b 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -139,18 +139,18 @@ Tail-based sampling, by definition, requires storing events locally temporarily, In APM Server implementation, the events are stored temporarily on disk instead of memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. -It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO will be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. - -To demonstrate the performance overhead and requirements, here are some numbers from APM Server 9.0 under full load, trace only APM events, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like average event size, average number of events per distributed trace. - -| APM Server instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | -|:------------------------------------|:------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| -| AWS EC2 c6i.2xlarge or c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | -| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | -| .. | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | -| AWS EC2 c6i.4xlarge or c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | -| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | -| .. | TBS enabled, local NVMe SSD from c6id instance | 47370 | 1.73 | 23.6 | +It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO may be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. + +To demonstrate the performance overhead and requirements, here are some numbers from standalone APM Server 9.0 deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, average number of events per distributed trace. 
+ +| APM Server EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | +|:-----------------------------|:------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| +| c6i.2xlarge or c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | +| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | +| .. | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | +| c6i.4xlarge or c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | +| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | +| .. | TBS enabled, local NVMe SSD from c6id instance | 47370 | 1.73 | 23.6 | ## Sampled data and visualizations [_sampled_data_and_visualizations] From af28057f043bc83d4fa10cbcd4160fb8d3487f1c Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Fri, 14 Mar 2025 17:55:25 +0000 Subject: [PATCH 10/30] Grammar --- solutions/observability/apps/transaction-sampling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 3f26c87e8b..b19e8f54d3 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -139,9 +139,9 @@ Tail-based sampling, by definition, requires storing events locally temporarily, In APM Server implementation, the events are stored temporarily on disk instead of memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. -It is recommended to use fast disks, for example, NVMe SSDs, when tail-based sampling is enabled, as disk throughput and IO may be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. +It is recommended to use fast disks, for example, NVMe SSDs, when enabling tail-based sampling, as disk throughput and IO may be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. -To demonstrate the performance overhead and requirements, here are some numbers from standalone APM Server 9.0 deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, average number of events per distributed trace. +To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server 9.0 deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. 
They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. | APM Server EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | |:-----------------------------|:------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| From fffe2070fc0d446325b165e7bac21bed0e9c63db Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 15:27:44 +0000 Subject: [PATCH 11/30] Update table --- .../observability/apps/transaction-sampling.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index b19e8f54d3..b0004837d5 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -143,14 +143,14 @@ It is recommended to use fast disks, for example, NVMe SSDs, when enabling tail- To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server 9.0 deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. -| APM Server EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | -|:-----------------------------|:------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| -| c6i.2xlarge or c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | -| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | -| .. | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | -| c6i.4xlarge or c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | -| .. | TBS enabled, gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | -| .. 
| TBS enabled, local NVMe SSD from c6id instance | 47370 | 1.73 | 23.6 | +| APM Server EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | +|:-----------------------------|:----------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| +| c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | +| c6id.2xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | +| c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | +| c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | +| c6id.4xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | +| c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 47370 | 1.73 | 23.6 | ## Sampled data and visualizations [_sampled_data_and_visualizations] From 80121333e66e9ce2446f8a962f34a864f1676ca1 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 15:33:00 +0000 Subject: [PATCH 12/30] Polish --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index b0004837d5..fedd37c9ce 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -139,7 +139,7 @@ Tail-based sampling, by definition, requires storing events locally temporarily, In APM Server implementation, the events are stored temporarily on disk instead of memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. -It is recommended to use fast disks, for example, NVMe SSDs, when enabling tail-based sampling, as disk throughput and IO may be the performance bottleneck to tail-based sampling, and APM event ingestion as a whole. Disk writes are proportional to event ingest rate, and disk reads are proportional to event ingest rate and sampling rate. +It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server 9.0 deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. 
From 4b9f80c0445e97f654bb0a24baa8fb7262a2aa5c Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 18:56:05 +0100 Subject: [PATCH 13/30] Add 8.18 numbers --- .../apps/transaction-sampling.md | 26 ++++++++++++------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index fedd37c9ce..3ae54f2a54 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -141,16 +141,22 @@ In APM Server implementation, the events are stored temporarily on disk instead It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. -To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server 9.0 deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. - -| APM Server EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | -|:-----------------------------|:----------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| -| c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | -| c6id.2xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | -| c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | -| c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | -| c6id.4xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | -| c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 47370 | 1.73 | 23.6 | +To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. 
+ +| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | +|--------------------|:------------------|:----------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| +| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | +| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | +| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | +| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | +| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | +| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 1.73 | 23.6 | +| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 0.98 | 0 | +| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 10960 | 5.24 | 24.3 | +| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 7.19 | 30.6 | +| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 1.14 | 0 | +| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 11990 | 26.57 | 33.6 | +| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 28.76 | 109.6 | ## Sampled data and visualizations [_sampled_data_and_visualizations] From 4ae01e292ed5ff04662375ad4cc3af72d1dd1f49 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 18:57:32 +0100 Subject: [PATCH 14/30] Shorten gp3 description --- .../apps/transaction-sampling.md | 28 +++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 3ae54f2a54..717eccf4bb 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -143,20 +143,20 @@ It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. 
-| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | -|--------------------|:------------------|:----------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| -| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | -| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 21310 | 1.41 | 13.1 | -| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | -| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | -| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 32410 | 1.71 | 19.4 | -| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 1.73 | 23.6 | -| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 0.98 | 0 | -| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 10960 | 5.24 | 24.3 | -| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 7.19 | 30.6 | -| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 1.14 | 0 | -| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with the baseline IOPS of 3000 IOPS | 11990 | 26.57 | 33.6 | -| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 28.76 | 109.6 | +| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | +|--------------------|:------------------|:-----------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| +| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | +| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 1.41 | 13.1 | +| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | +| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | +| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 1.71 | 19.4 | +| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 1.73 | 23.6 | +| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 0.98 | 0 | +| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 5.24 | 24.3 | +| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 7.19 | 30.6 | +| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 1.14 | 0 | +| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 26.57 | 33.6 | +| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 28.76 | 109.6 | ## Sampled data and visualizations [_sampled_data_and_visualizations] From 728fe1afc59d4ef17e11141827fe24531fa4cc7a Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 19:06:48 +0100 Subject: [PATCH 15/30] Add document indexing rate --- .../apps/transaction-sampling.md | 28 +++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 717eccf4bb..be88771e5d 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ 
b/solutions/observability/apps/transaction-sampling.md @@ -143,20 +143,20 @@ It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. -| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | -|--------------------|:------------------|:-----------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------| -| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 | -| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 1.41 | 13.1 | -| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 | -| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 | -| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 1.71 | 19.4 | -| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 1.73 | 23.6 | -| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 0.98 | 0 | -| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 5.24 | 24.3 | -| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 7.19 | 30.6 | -| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 1.14 | 0 | -| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 26.57 | 33.6 | -| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 28.76 | 109.6 | +| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Document indexing rate (throughput from APM Server to Elasticsearch) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | +|--------------------|:------------------|:-----------------------------------------------|----------------------------------------------------------------------------|:---------------------------------------------------------------------------------|--------------------------------------------|------------------| +| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 | +| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | +| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | +| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | +| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | +| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | +| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | +| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 824 | 5.24 | 24.3 | +| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 48 | 7.19 | 30.6 | +| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 
| +| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 532 | 26.57 | 33.6 | +| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | ## Sampled data and visualizations [_sampled_data_and_visualizations] From 7bc36c1e57eafe9c98be237ba1ad7f0f89310d4b Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 19:07:27 +0100 Subject: [PATCH 16/30] Rename --- .../apps/transaction-sampling.md | 28 +++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index be88771e5d..2f1b2f6925 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -143,20 +143,20 @@ It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. -| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Document indexing rate (throughput from APM Server to Elasticsearch) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | -|--------------------|:------------------|:-----------------------------------------------|----------------------------------------------------------------------------|:---------------------------------------------------------------------------------|--------------------------------------------|------------------| -| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 | -| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | -| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | -| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | -| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | -| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | -| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | -| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 824 | 5.24 | 24.3 | -| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 48 | 7.19 | 30.6 | -| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 | -| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 532 | 26.57 | 33.6 | -| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | +| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Event indexing rate (throughput from APM Server to Elasticsearch) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | 
+|--------------------|:------------------|:-----------------------------------------------|----------------------------------------------------------------------------|:------------------------------------------------------------------------------|--------------------------------------------|------------------| +| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 | +| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | +| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | +| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | +| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | +| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | +| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | +| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 824 | 5.24 | 24.3 | +| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 48 | 7.19 | 30.6 | +| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 | +| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 532 | 26.57 | 33.6 | +| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | ## Sampled data and visualizations [_sampled_data_and_visualizations] From c48269ac5b4eafdf2713be81832366cfd488d628 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 19:08:44 +0100 Subject: [PATCH 17/30] Fix numbers --- solutions/observability/apps/transaction-sampling.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 2f1b2f6925..4248482dad 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -152,10 +152,10 @@ To demonstrate the performance overhead and requirements, here are some numbers | 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | | 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | | 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | -| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 824 | 5.24 | 24.3 | -| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 48 | 7.19 | 30.6 | +| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 50 | 5.24 | 24.3 | +| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 820 | 7.19 | 30.6 | | 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 | -| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 532 | 26.57 | 33.6 | +| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | | 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | ## Sampled data and visualizations [_sampled_data_and_visualizations] From 2a551168f8e6f0cff57f94a8772696bc6d771998 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 19:25:06 +0100 Subject: [PATCH 18/30] Explain difference --- solutions/observability/apps/transaction-sampling.md | 4 ++++ 1 file 
changed, 4 insertions(+) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 4248482dad..e3b921efa2 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -158,6 +158,10 @@ To demonstrate the performance overhead and requirements, here are some numbers | 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | | 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | +9.0 tail-based sampling implementation is significantly better in performance in general than in 8.18, as the storage layer is rewritten. It cleans up expired data more reliably, which is also easier on disk, memory and compute, highlighted by the difference in event indexing rate on slow disks between versions. +In 8.18, when the database gets large, the slowdown can be disproportionate. +The one outlier data point where 8.18 32GB NVMe is faster in ingest rate than 9.0 can be explained by the slower event indexing rate, as the balance between disk read and write changed. + ## Sampled data and visualizations [_sampled_data_and_visualizations] A sampled trace retains all data associated with it. A non-sampled trace drops all [span](../../../solutions/observability/apps/spans.md) and [transaction](../../../solutions/observability/apps/transactions.md) data1. Regardless of the sampling decision, all traces retain [error](../../../solutions/observability/apps/errors.md) data. From 8df9fe8028bdd066855e2b4300ec08c067db40d5 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 19:35:40 +0100 Subject: [PATCH 19/30] polish --- solutions/observability/apps/transaction-sampling.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 32939fb868..727f74ed07 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -158,10 +158,9 @@ To demonstrate the performance overhead and requirements, here are some numbers | 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | | 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | -9.0 tail-based sampling implementation is significantly better in performance in general than in 8.18, as the storage layer is rewritten. It cleans up expired data more reliably, which is also easier on disk, memory and compute, highlighted by the difference in event indexing rate on slow disks between versions. -In 8.18, when the database gets large, the slowdown can be disproportionate. -The one outlier data point where 8.18 32GB NVMe is faster in ingest rate than 9.0 can be explained by the slower event indexing rate, as the balance between disk read and write changed. +The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. +In version 8.18, as the database grows larger, the performance slowdown can become disproportionate. 
The one outlier data point where 8.18 with a 32GB NVMe disk shows a higher ingest rate than 9.0 can be attributed to the change in the balance between disk read and write operations, which results in a slower event indexing rate. ## Sampled data and visualizations [_sampled_data_and_visualizations] A sampled trace retains all data associated with it. A non-sampled trace drops all [span](../../../solutions/observability/apps/spans.md) and [transaction](../../../solutions/observability/apps/transactions.md) data1. Regardless of the sampling decision, all traces retain [error](../../../solutions/observability/apps/errors.md) data. From 2fda2cfbf04c952aea8a44d5209e4205823620e5 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 19:47:13 +0100 Subject: [PATCH 20/30] Clean up headers --- .../apps/transaction-sampling.md | 40 +++++++++++-------- 1 file changed, 23 insertions(+), 17 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 727f74ed07..6c8d70e8e9 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -135,28 +135,34 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ ### Tail-based sampling performance and requirements [_tail_based_sampling_performance_and_requirements] -Tail-based sampling, by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. +Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. In APM Server implementation, the events are stored temporarily on disk instead of memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. -To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server deployed on AWS EC2, under full load receiving APM events containing only traces, assuming no backpressure from Elasticsearch, and 10% sample rate in tail sampling policy. They are for reference only, and may vary depending on factors like sampling rate, average event size, and average number of events per distributed trace. 
- -| APM Server version | EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Event indexing rate (throughput from APM Server to Elasticsearch) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB | -|--------------------|:------------------|:-----------------------------------------------|----------------------------------------------------------------------------|:------------------------------------------------------------------------------|--------------------------------------------|------------------| -| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 | -| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | -| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | -| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | -| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | -| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | -| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | -| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 50 | 5.24 | 24.3 | -| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 820 | 7.19 | 30.6 | -| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 | -| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | -| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | +To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load, receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a 10% sample rate in the tail sampling policy. Please note that these figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace. + +Terminology: + +* Event Ingestion Rate: The throughput from the APM agent to the APM Server using the Intake v2 protocol (the protocol used by Elastic APM agents), measured in events per second. +* Event Indexing Rate: The throughput from the APM Server to Elasticsearch, measured in events per second or documents per second. +* Memory Usage: The maximum Resident Set Size (RSS) of APM Server process observed throughout the benchmark. 
+ +| APM Server version | EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) | +|--------------------|:------------------|:-----------------------------------------------|---------------------------------|:-------------------------------|-------------------|-----------------| +| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 | +| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | +| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | +| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | +| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | +| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | +| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | +| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 50 | 5.24 | 24.3 | +| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 820 | 7.19 | 30.6 | +| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 | +| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | +| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. From 09490624e8b15026f920341a16aa18cef1570d8f Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 19:48:12 +0100 Subject: [PATCH 21/30] Fix align --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 6c8d70e8e9..37e333a75e 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -150,7 +150,7 @@ Terminology: * Memory Usage: The maximum Resident Set Size (RSS) of APM Server process observed throughout the benchmark. 
| APM Server version | EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) | -|--------------------|:------------------|:-----------------------------------------------|---------------------------------|:-------------------------------|-------------------|-----------------| +|--------------------|-------------------|------------------------------------------------|---------------------------------|--------------------------------|-------------------|-----------------| | 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 | | 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | | 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | From 86f0266d5c44ca98e1bf3380eed6d8f91b1419f8 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Mon, 17 Mar 2025 20:00:27 +0100 Subject: [PATCH 22/30] Grammar --- solutions/observability/apps/transaction-sampling.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 37e333a75e..7bc21c5cb4 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -137,7 +137,7 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. -In APM Server implementation, the events are stored temporarily on disk instead of memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. +In APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. @@ -167,6 +167,7 @@ Terminology: The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. In version 8.18, as the database grows larger, the performance slowdown can become disproportionate. 
The one outlier data point where 8.18 with a 32GB NVMe disk shows a higher ingest rate than 9.0 can be attributed to the change in the balance between disk read and write operations, which results in a slower event indexing rate. + ## Sampled data and visualizations [_sampled_data_and_visualizations] A sampled trace retains all data associated with it. A non-sampled trace drops all [span](../../../solutions/observability/apps/spans.md) and [transaction](../../../solutions/observability/apps/transactions.md) data1. Regardless of the sampling decision, all traces retain [error](../../../solutions/observability/apps/errors.md) data. From b513a8725de249bec2a2380cb7d4064346704b2c Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 10:20:25 +0100 Subject: [PATCH 23/30] Fix incorrect number --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 7bc21c5cb4..350bcb135f 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -151,7 +151,7 @@ Terminology: | APM Server version | EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) | |--------------------|-------------------|------------------------------------------------|---------------------------------|--------------------------------|-------------------|-----------------| -| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 4720 (100% sampling) | 0.98 | 0 | +| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 47220 (100% sampling) | 0.98 | 0 | | 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | | 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | | 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | From bc5ca17be79c5382f6c65bb76cc37c5d89e8ee07 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 09:21:14 +0000 Subject: [PATCH 24/30] Update solutions/observability/apps/transaction-sampling.md Co-authored-by: Marc Lopez Rubio --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 350bcb135f..55b0457e20 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -141,7 +141,7 @@ In APM Server implementation, the events are stored temporarily on disk instead It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. -To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load, receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a 10% sample rate in the tail sampling policy. 
Please note that these figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace. +To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load, receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a **10% sample rate in the tail sampling policy**. Please note that these figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace. Terminology: From d7b6dfa72dcf768e48887c5736ddec6383dc579c Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 10:32:39 +0100 Subject: [PATCH 25/30] Add note on how to interpret numbers --- solutions/observability/apps/transaction-sampling.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 55b0457e20..e67c78562e 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -164,9 +164,12 @@ Terminology: | 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | | 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | -The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. +When interpreting these numbers, note that: -In version 8.18, as the database grows larger, the performance slowdown can become disproportionate. The one outlier data point where 8.18 with a 32GB NVMe disk shows a higher ingest rate than 9.0 can be attributed to the change in the balance between disk read and write operations, which results in a slower event indexing rate. +* The metrics are inter-related. For example, it is reasonable to see a higher memory usage and disk usage when event ingestion rate is higher. +* Related to the previous point, event ingestion rate and event indexing rate competes for disk IO. It explains the outlier data point where 8.18 with a 32GB NVMe disk shows a higher ingest rate but a slower event indexing rate than in 9.0. + +The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation compresses data, as well as cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. In version 8.18, as the database grows larger, the performance slowdown can become disproportionate. 
## Sampled data and visualizations [_sampled_data_and_visualizations] From 2aaff337d7d7f3fcc325ecbc518714d44be150d2 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 12:51:02 +0100 Subject: [PATCH 26/30] Add note about event indexing rate --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index e67c78562e..2b38c447b1 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -146,7 +146,7 @@ To demonstrate the performance overhead and requirements, here are some referenc Terminology: * Event Ingestion Rate: The throughput from the APM agent to the APM Server using the Intake v2 protocol (the protocol used by Elastic APM agents), measured in events per second. -* Event Indexing Rate: The throughput from the APM Server to Elasticsearch, measured in events per second or documents per second. +* Event Indexing Rate: The throughput from the APM Server to Elasticsearch, measured in events per second or documents per second. Note that it should roughly be equal to Event Ingestion Rate * Sampling Rate. * Memory Usage: The maximum Resident Set Size (RSS) of APM Server process observed throughout the benchmark. | APM Server version | EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) | From ff0f896c345d01c75d02e354df6a478ead895e80 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 14:39:39 +0000 Subject: [PATCH 27/30] Apply suggestions from code review Co-authored-by: Colleen McGinnis --- .../observability/apps/transaction-sampling.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 2b38c447b1..9ba872d852 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -135,13 +135,17 @@ Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observ ### Tail-based sampling performance and requirements [_tail_based_sampling_performance_and_requirements] -Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded once sampling decision is made. +Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded when a sampling decision is made. -In APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to APM event ingestion rate, and additional memory to facilitate disk reads and writes. Insufficient [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) causes sampling to be bypassed. +In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](#sampling-tail-storage_limit) is insufficient, sampling will be bypassed. 
It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. -To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load, receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a **10% sample rate in the tail sampling policy**. Please note that these figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace. +To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load that is receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a **10% sample rate in the tail sampling policy**. + +:::{important} +These figures are for reference only and may vary depending on factors such as sampling rate, average event size, and the average number of events per distributed trace. +::: Terminology: @@ -166,8 +170,8 @@ Terminology: When interpreting these numbers, note that: -* The metrics are inter-related. For example, it is reasonable to see a higher memory usage and disk usage when event ingestion rate is higher. -* Related to the previous point, event ingestion rate and event indexing rate competes for disk IO. It explains the outlier data point where 8.18 with a 32GB NVMe disk shows a higher ingest rate but a slower event indexing rate than in 9.0. +* The metrics are inter-related. For example, it is reasonable to see higher memory usage and disk usage when the event ingestion rate is higher. +* The event ingestion rate and event indexing rate competes for disk IO. This is why there is an outlier data point where APM Server version 8.18 with a 32GB NVMe disk shows a higher ingest rate but a slower event indexing rate than in 9.0. The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation compresses data, as well as cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. In version 8.18, as the database grows larger, the performance slowdown can become disproportionate. From 6345c968c333ff762100ecb2e44791a4c462aea5 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 15:45:08 +0100 Subject: [PATCH 28/30] Spell out SSD --- solutions/observability/apps/transaction-sampling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 9ba872d852..9e61a4d5d1 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -139,7 +139,7 @@ Tail-based sampling (TBS), by definition, requires storing events locally tempor In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. 
Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](#sampling-tail-storage_limit) is insufficient, sampling will be bypassed. -It is recommended to use fast disks, such as NVMe SSDs, when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. +It is recommended to use fast disks, such as Solid State Drives (SSD), when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load that is receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a **10% sample rate in the tail sampling policy**. @@ -171,7 +171,7 @@ Terminology: When interpreting these numbers, note that: * The metrics are inter-related. For example, it is reasonable to see higher memory usage and disk usage when the event ingestion rate is higher. -* The event ingestion rate and event indexing rate competes for disk IO. This is why there is an outlier data point where APM Server version 8.18 with a 32GB NVMe disk shows a higher ingest rate but a slower event indexing rate than in 9.0. +* The event ingestion rate and event indexing rate competes for disk IO. This is why there is an outlier data point where APM Server version 8.18 with a 32GB NVMe SSD shows a higher ingest rate but a slower event indexing rate than in 9.0. The tail-based sampling implementation in version 9.0 offers significantly better performance compared to version 8.18, primarily due to a rewritten storage layer. This new implementation compresses data, as well as cleans up expired data more reliably, resulting in reduced load on disk, memory, and compute resources. This improvement is particularly evident in the event indexing rate on slower disks. In version 8.18, as the database grows larger, the performance slowdown can become disproportionate. From 1fdedefdff8aabdc5d0dd653b327e802b8ee2881 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 15:49:30 +0100 Subject: [PATCH 29/30] SSD with high IOPS --- solutions/observability/apps/transaction-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 9e61a4d5d1..2977601b03 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -139,7 +139,7 @@ Tail-based sampling (TBS), by definition, requires storing events locally tempor In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](#sampling-tail-storage_limit) is insufficient, sampling will be bypassed. 
-It is recommended to use fast disks, such as Solid State Drives (SSD), when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. +It is recommended to use fast disks, ideally Solid State Drives (SSD) with high I/O per second (IOPS), when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate. To demonstrate the performance overhead and requirements, here are some reference numbers from a standalone APM Server deployed on AWS EC2 under full load that is receiving APM events containing only traces. These numbers assume no backpressure from Elasticsearch and a **10% sample rate in the tail sampling policy**. From 12ada46c76c532930f6dda1d77ad1fe4083860d0 Mon Sep 17 00:00:00 2001 From: Carson Ip Date: Tue, 18 Mar 2025 16:00:07 +0100 Subject: [PATCH 30/30] Split version to header --- .../apps/transaction-sampling.md | 35 +++++++++++-------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/solutions/observability/apps/transaction-sampling.md b/solutions/observability/apps/transaction-sampling.md index 2977601b03..1582f47bc9 100644 --- a/solutions/observability/apps/transaction-sampling.md +++ b/solutions/observability/apps/transaction-sampling.md @@ -153,20 +153,27 @@ Terminology: * Event Indexing Rate: The throughput from the APM Server to Elasticsearch, measured in events per second or documents per second. Note that it should roughly be equal to Event Ingestion Rate * Sampling Rate. * Memory Usage: The maximum Resident Set Size (RSS) of APM Server process observed throughout the benchmark. 
-| APM Server version | EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) | -|--------------------|-------------------|------------------------------------------------|---------------------------------|--------------------------------|-------------------|-----------------| -| 9.0 | c6id.2xlarge | TBS disabled | 47220 | 47220 (100% sampling) | 0.98 | 0 | -| 9.0 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | -| 9.0 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | -| 9.0 | c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | -| 9.0 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | -| 9.0 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | -| 8.18 | c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | -| 8.18 | c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 50 | 5.24 | 24.3 | -| 8.18 | c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 820 | 7.19 | 30.6 | -| 8.18 | c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 | -| 8.18 | c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | -| 8.18 | c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | +#### APM Server 9.0 + +| EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) | +|-------------------|------------------------------------------------|---------------------------------|--------------------------------|-------------------|-----------------| +| c6id.2xlarge | TBS disabled | 47220 | 47220 (100% sampling) | 0.98 | 0 | +| c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 21310 | 2360 | 1.41 | 13.1 | +| c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 21210 | 2460 | 1.34 | 12.9 | +| c6id.4xlarge | TBS disabled | 142200 | 142200 (100% sampling) | 1.12 | 0 | +| c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 32410 | 3710 | 1.71 | 19.4 | +| c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 37040 | 4110 | 1.73 | 23.6 | + +#### APM Server 8.18 + +| EC2 instance size | TBS and disk configuration | Event ingestion rate (events/s) | Event indexing rate (events/s) | Memory usage (GB) | Disk usage (GB) | +|-------------------|------------------------------------------------|---------------------------------|--------------------------------|-------------------|-----------------| +| c6id.2xlarge | TBS disabled | 50260 | 50270 (100% sampling) | 0.98 | 0 | +| c6id.2xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 10960 | 50 | 5.24 | 24.3 | +| c6id.2xlarge | TBS enabled, local NVMe SSD from c6id instance | 11450 | 820 | 7.19 | 30.6 | +| c6id.4xlarge | TBS disabled | 149200 | 149200 (100% sampling) | 1.14 | 0 | +| c6id.4xlarge | TBS enabled, EBS gp3 volume with 3000 IOPS | 11990 | 530 | 26.57 | 33.6 | +| c6id.4xlarge | TBS enabled, local NVMe SSD from c6id instance | 43550 | 2940 | 28.76 | 109.6 | When interpreting these numbers, note that:
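
For reference, the tail sampling policy and the storage limit discussed in this section are settings on the APM Server itself. The snippet below is a minimal sketch of a standalone `apm-server.yml`, not a definitive configuration: the `apm-server.sampling.tail.*` keys are the documented tail-based sampling options, but the values (`1m`, `0.1`, `3GB`) are illustrative only, with `sample_rate: 0.1` mirroring the 10% policy assumed in the benchmark, and should be checked against the configuration reference for the version in use.

```yaml
apm-server:
  sampling:
    tail:
      # Enable tail-based sampling; trace events are buffered on local disk
      # until a sampling decision is made for the whole trace.
      enabled: true
      # How often sampling decisions are made and published (illustrative value).
      interval: 1m
      # Policies are evaluated in order; a final catch-all policy with only a
      # sample_rate is required. 0.1 keeps roughly 10% of traces per trace group.
      policies:
        - sample_rate: 0.1
      # Upper bound for the local storage used to buffer events (illustrative value).
      # If the limit is reached, sampling is bypassed, as noted above.
      storage_limit: 3GB
```

With a single 10% policy, the event indexing rate should land at roughly one tenth of the event ingestion rate, which is consistent with the TBS-enabled benchmark rows above (for example, an ingestion rate of about 21,000 events/s pairs with an indexing rate in the low thousands of events/s).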