From 94160189ed6d51433050ea1ff1e47c11801b41a1 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Wed, 30 Jul 2025 16:06:48 +0100
Subject: [PATCH 01/11] WIP TBS FAQ

---
 solutions/observability/apm/tail-based-sampling.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index af3b1c2217..6cecfd5671 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -189,3 +189,17 @@ This metric can also be used to get an estimate of the storage requirements for
 ### `apm-server.sampling.tail.storage.value_log_size` [sampling-tail-monitoring-storage-value-log-size-ref]

 This metric tracks the storage size for value log files used by the previous implementation of a tail-based sampler. This metric was deprecated in 9.0.0 and should always report `0`.
+
+## Frequently Asked Questions (FAQ) [sampling-tail-faq-ref]
+
+:::{dropdown} Q: Why does the sampling rate shown in storage explorer not match the configured tail sampling rate?
+
+WIP
+
+::::
+
+::::{dropdown} Q: Why does a transaction disappear after enabling tail-based sampling?
+
+WIP
+
+::::

From a0b53ee7485717ac3ef8f0b5c72562151c532fdf Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Wed, 30 Jul 2025 16:44:44 +0100
Subject: [PATCH 02/11] Storage explorer

---
 solutions/observability/apm/tail-based-sampling.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index 6cecfd5671..99f3f27acf 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -194,7 +194,11 @@ This metric tracks the storage size for value log files used by the previous imp

 :::{dropdown} Q: Why does the sampling rate shown in storage explorer not match the configured tail sampling rate?

-WIP
+In APM Server, the configured tail sampling policy applied to a distributed trace is determined by the root transaction, i.e. the transaction without a parent.
+
+However, the APM UI storage explorer calculates the effective average sampling rate for every service in a completely different way, which has to consider head-based sampling and has no concept of root transactions. Therefore, the sampling rate shown in storage explorer can be different from the configured tail sampling rate, and create a false impression that tail-based sampling is not functioning properly.
+
+See [Kibana issue](https://github.com/elastic/kibana/issues/226600).

 ::::

From cada310a035c4545347037e4ab90b2f98855a2a2 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Wed, 30 Jul 2025 17:07:59 +0100
Subject: [PATCH 03/11] Add transaction disappear

---
 solutions/observability/apm/tail-based-sampling.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index 99f3f27acf..843920d13d 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -204,6 +204,12 @@ See [Kibana issue](https://github.com/elastic/kibana/issues/226600).

 ::::{dropdown} Q: Why does a transaction disappear after enabling tail-based sampling?

-WIP
+If you have configured a non-zero sampling rate for a transaction, but it is always not sampled after enabling tail-based sampling, please double check your instrumentation setup for missing root transactions, i.e. the transaction without a parent.
+
+APM Server makes a sampling decision based on the configured policies when a distributed trace ends, which is when the root transaction ends. If the root transaction of a trace is not received by APM Server, APM Server will not be able to make a sampling decision, and will silently drop all the trace events associated with this trace.
+
+TODO: describe common causes
+
+TODO: add ESQL to find traces with missing parent

 ::::

From bc7c1097cc94a098c7bc6b0d9520a78feef46813 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Wed, 30 Jul 2025 17:50:06 +0100
Subject: [PATCH 04/11] Add common cause

---
 solutions/observability/apm/tail-based-sampling.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index 843920d13d..5394ba6112 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -208,7 +208,7 @@ If you have configured a non-zero sampling rate for a transaction, but it is alw

 APM Server makes a sampling decision based on the configured policies when a distributed trace ends, which is when the root transaction ends. If the root transaction of a trace is not received by APM Server, APM Server will not be able to make a sampling decision, and will silently drop all the trace events associated with this trace.

-TODO: describe common causes
+A common cause for this issue is, for example, assuming that service A always produces the root transaction while in reality there can be a service B before service A. However, service B is not instrumented or it is instrumented to send to a separate APM Server cluster. To resolve this issue, either fix service B's instrumentation to send to the same APM Server cluster as service A, or adjust service A's trace continuation strategy.

 TODO: add ESQL to find traces with missing parent

From 391307f1af3d2e0f1236004e5e53239899f93af3 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Wed, 30 Jul 2025 17:58:04 +0100
Subject: [PATCH 05/11] Add ESQL

---
 solutions/observability/apm/tail-based-sampling.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index 5394ba6112..59f13ed962 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -210,6 +210,12 @@ APM Server makes a sampling decision based on the configured policies when a dis

 A common cause for this issue is, for example, assuming that service A always produces the root transaction while in reality there can be a service B before service A. However, service B is not instrumented or it is instrumented to send to a separate APM Server cluster. To resolve this issue, either fix service B's instrumentation to send to the same APM Server cluster as service A, or adjust service A's trace continuation strategy.

-TODO: add ESQL to find traces with missing parent
+To identify traces missing a root transaction, use the following ESQL in a time range when tail-based sampling is disabled. Query with a short time range to avoid too many results in response.
+```
+FROM "traces-apm-*"
+| STATS total_docs = COUNT(*), total_child_docs = COUNT(parent.id) BY trace.id, transaction.id
+| WHERE total_docs == total_child_docs
+| KEEP trace.id, transaction.id
+```

 ::::

From a7fa5ee54f8856690e9171a72d85b511f9d67e56 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Thu, 31 Jul 2025 17:57:18 +0100
Subject: [PATCH 06/11] Storage limit

---
 solutions/observability/apm/tail-based-sampling.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index 59f13ed962..eb241f893f 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -219,3 +219,13 @@ FROM "traces-apm-*"
 ```

 ::::
+
+:::{dropdown} Q: What happens if the storage limit is reached?
+
+When the storage limit for tail-based sampling is reached, APM Server can no longer store new trace events for sampling. By default, when this occurs, traces bypass sampling and are always indexed (sampling rate becomes 100%). This sudden increase in indexing can overload Elasticsearch, as it must now handle all incoming traces instead of just the sampled subset.
+
+To prevent Elasticsearch from being overloaded in this scenario, you can enable the [`discard_on_write_failure`](#sampling-tail-discard-on-write-failure-ref) setting. When set to `true`, APM Server will discard traces that cannot be written due to storage or indexing failures, rather than indexing them all. This helps protect Elasticsearch from excessive load, but enabling this option can result in data loss and broken traces, so it should be used with caution and only when system stability is a priority.
+
+For more details, see the [Discard On Write Failure](#sampling-tail-discard-on-write-failure-ref) section above.
+
+:::

From a48c34c653d7d1f1a65f27bda65ef64b4e7b5964 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Fri, 1 Aug 2025 13:34:34 +0100
Subject: [PATCH 07/11] Polish

---
 .../observability/apm/tail-based-sampling.md | 34 ++++++++-----------
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index eb241f893f..24fae577d3 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -192,40 +192,34 @@ This metric tracks the storage size for value log files used by the previous imp

 ## Frequently Asked Questions (FAQ) [sampling-tail-faq-ref]

-:::{dropdown} Q: Why does the sampling rate shown in storage explorer not match the configured tail sampling rate?
+:::{dropdown} Why does the sampling rate shown in Storage Explorer not match the configured tail sampling rate?

-In APM Server, the configured tail sampling policy applied to a distributed trace is determined by the root transaction, i.e. the transaction without a parent.
+In APM Server, the tail sampling policy applied to a distributed trace is determined by evaluating the configured policies in order against the root transaction (the transaction without a parent) and using the first policy that matches. In contrast, the APM UI Storage Explorer calculates the effective average sampling rate for each service using a different method. It considers both head-based and tail-based sampling, but does not account for root transactions. As a result, the sampling rate displayed in Storage Explorer may differ from the configured tail sampling rate, which can give the false impression that tail-based sampling is not functioning correctly.

-However, the APM UI storage explorer calculates the effective average sampling rate for every service in a completely different way, which has to consider head-based sampling and has no concept of root transactions. Therefore, the sampling rate shown in storage explorer can be different from the configured tail sampling rate, and create a false impression that tail-based sampling is not functioning properly.
-
-See [Kibana issue](https://github.com/elastic/kibana/issues/226600).
-
-::::
+For more information, see the related [Kibana issue](https://github.com/elastic/kibana/issues/226600).
+:::

-::::{dropdown} Q: Why does a transaction disappear after enabling tail-based sampling?
+:::{dropdown} Why do transactions disappear after enabling tail-based sampling?

-If you have configured a non-zero sampling rate for a transaction, but it is always not sampled after enabling tail-based sampling, please double check your instrumentation setup for missing root transactions, i.e. the transaction without a parent.
+If a transaction is consistently not sampled after enabling tail-based sampling, verify that your instrumentation is not missing root transactions (transactions without a parent). APM Server makes sampling decisions when a distributed trace ends, which occurs when the root transaction ends. If the root transaction is not received by APM Server, it cannot make a sampling decision and will silently drop all associated trace events.

-APM Server makes a sampling decision based on the configured policies when a distributed trace ends, which is when the root transaction ends. If the root transaction of a trace is not received by APM Server, APM Server will not be able to make a sampling decision, and will silently drop all the trace events associated with this trace.
+This issue often arises when it is assumed that a particular service (e.g., service A) always produces the root transaction, but in reality, another service (e.g., service B) may precede it. If service B is not instrumented or sends data to a different APM Server cluster, the root transaction will be missing. To resolve this, ensure that all relevant services are instrumented and send data to the same APM Server cluster, or adjust the trace continuation strategy accordingly.

-A common cause for this issue is, for example, assuming that service A always produces the root transaction while in reality there can be a service B before service A. However, service B is not instrumented or it is instrumented to send to a separate APM Server cluster. To resolve this issue, either fix service B's instrumentation to send to the same APM Server cluster as service A, or adjust service A's trace continuation strategy.
+To identify traces missing a root transaction, run the following ESQL query during a period when tail-based sampling is disabled. Use a short time range to limit the number of results:

-To identify traces missing a root transaction, use the following ESQL in a time range when tail-based sampling is disabled. Query with a short time range to avoid too many results in response.
 ```
 FROM "traces-apm-*"
-| STATS total_docs = COUNT(*), total_child_docs = COUNT(parent.id) BY trace.id, transaction.id
+| STATS total_docs = COUNT(*), total_child_docs = COUNT(parent.id) BY trace.id, transaction.id
 | WHERE total_docs == total_child_docs
 | KEEP trace.id, transaction.id
 ```
+:::

-::::
-
-:::{dropdown} Q: What happens if the storage limit is reached?
-
-When the storage limit for tail-based sampling is reached, APM Server can no longer store new trace events for sampling. By default, when this occurs, traces bypass sampling and are always indexed (sampling rate becomes 100%). This sudden increase in indexing can overload Elasticsearch, as it must now handle all incoming traces instead of just the sampled subset.
+:::{dropdown} What happens if the storage limit is reached?

-To prevent Elasticsearch from being overloaded in this scenario, you can enable the [`discard_on_write_failure`](#sampling-tail-discard-on-write-failure-ref) setting. When set to `true`, APM Server will discard traces that cannot be written due to storage or indexing failures, rather than indexing them all. This helps protect Elasticsearch from excessive load, but enabling this option can result in data loss and broken traces, so it should be used with caution and only when system stability is a priority.
+When the storage limit for tail-based sampling is reached, APM Server cannot store new trace events for sampling. By default, traces bypass sampling and are always indexed (sampling rate becomes 100%). This can cause a sudden increase in indexing load, potentially overloading Elasticsearch, as it must process all incoming traces instead of only the sampled subset.

-For more details, see the [Discard On Write Failure](#sampling-tail-discard-on-write-failure-ref) section above.
+To mitigate this risk, enable the [`discard_on_write_failure`](#sampling-tail-discard-on-write-failure-ref) setting. When set to `true`, APM Server discards traces that cannot be written due to storage or indexing failures, rather than indexing them all. This helps protect Elasticsearch from excessive load. Note that enabling this option can result in data loss and broken traces, so it should be used with caution and only when system stability is a priority.

+For more information, see the [Discard On Write Failure](#sampling-tail-discard-on-write-failure-ref) section.
 :::

From ed45c5fa22fb11087c8aa17cb4d170856d4d0359 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Fri, 1 Aug 2025 13:43:02 +0100
Subject: [PATCH 08/11] Trace always sampled

---
 solutions/observability/apm/tail-based-sampling.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index 24fae577d3..ab27a7d7a8 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -215,9 +215,9 @@ FROM "traces-apm-*"
 ```
 :::

-:::{dropdown} What happens if the storage limit is reached?
+:::{dropdown} Why is configured tail sampling rate ignored and trace always sampled, causing unexpected load to Elasticsearch?

-When the storage limit for tail-based sampling is reached, APM Server cannot store new trace events for sampling. By default, traces bypass sampling and are always indexed (sampling rate becomes 100%). This can cause a sudden increase in indexing load, potentially overloading Elasticsearch, as it must process all incoming traces instead of only the sampled subset.
+When the storage limit for tail-based sampling is reached, APM Server will log "configured limit reached" (or "configured storage limit reached" in version 8) as it cannot store new trace events for sampling. By default, traces bypass sampling and are always indexed (sampling rate becomes 100%). This can cause a sudden increase in indexing load, potentially overloading Elasticsearch, as it must process all incoming traces instead of only the sampled subset.

 To mitigate this risk, enable the [`discard_on_write_failure`](#sampling-tail-discard-on-write-failure-ref) setting. When set to `true`, APM Server discards traces that cannot be written due to storage or indexing failures, rather than indexing them all. This helps protect Elasticsearch from excessive load. Note that enabling this option can result in data loss and broken traces, so it should be used with caution and only when system stability is a priority.

From 5f8bc1a7a2c20a0f5d8e1e62281739a6944796ee Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Fri, 1 Aug 2025 14:15:36 +0100
Subject: [PATCH 09/11] Apply suggestions from code review

Co-authored-by: florent-leborgne
---
 solutions/observability/apm/tail-based-sampling.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index ab27a7d7a8..54107efe25 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -192,11 +192,11 @@ This metric tracks the storage size for value log files used by the previous imp

 ## Frequently Asked Questions (FAQ) [sampling-tail-faq-ref]

-:::{dropdown} Why does the sampling rate shown in Storage Explorer not match the configured tail sampling rate?
+:::{dropdown} Why doesn't the sampling rate shown in Storage Explorer match the configured tail sampling rate?

 In APM Server, the tail sampling policy applied to a distributed trace is determined by evaluating the configured policies in order against the root transaction (the transaction without a parent) and using the first policy that matches. In contrast, the APM UI Storage Explorer calculates the effective average sampling rate for each service using a different method. It considers both head-based and tail-based sampling, but does not account for root transactions. As a result, the sampling rate displayed in Storage Explorer may differ from the configured tail sampling rate, which can give the false impression that tail-based sampling is not functioning correctly.

-For more information, see the related [Kibana issue](https://github.com/elastic/kibana/issues/226600).
+For more information, check the related [Kibana issue](https://github.com/elastic/kibana/issues/226600).
 :::

 :::{dropdown} Why do transactions disappear after enabling tail-based sampling?
@@ -205,7 +205,7 @@ If a transaction is consistently not sampled after enabling tail-based sampling,

 This issue often arises when it is assumed that a particular service (e.g., service A) always produces the root transaction, but in reality, another service (e.g., service B) may precede it. If service B is not instrumented or sends data to a different APM Server cluster, the root transaction will be missing. To resolve this, ensure that all relevant services are instrumented and send data to the same APM Server cluster, or adjust the trace continuation strategy accordingly.

-To identify traces missing a root transaction, run the following ESQL query during a period when tail-based sampling is disabled. Use a short time range to limit the number of results:
+To identify traces missing a root transaction, run the following {{esql}} query during a period when tail-based sampling is disabled. Use a short time range to limit the number of results:

 ```
 FROM "traces-apm-*"
@@ -215,11 +215,11 @@ FROM "traces-apm-*"
 ```
 :::

-:::{dropdown} Why is configured tail sampling rate ignored and trace always sampled, causing unexpected load to Elasticsearch?
+:::{dropdown} Why is the configured tail sampling rate ignored and why are traces always sampled, causing unexpected load to Elasticsearch?

 When the storage limit for tail-based sampling is reached, APM Server will log "configured limit reached" (or "configured storage limit reached" in version 8) as it cannot store new trace events for sampling. By default, traces bypass sampling and are always indexed (sampling rate becomes 100%). This can cause a sudden increase in indexing load, potentially overloading Elasticsearch, as it must process all incoming traces instead of only the sampled subset.

 To mitigate this risk, enable the [`discard_on_write_failure`](#sampling-tail-discard-on-write-failure-ref) setting. When set to `true`, APM Server discards traces that cannot be written due to storage or indexing failures, rather than indexing them all. This helps protect Elasticsearch from excessive load. Note that enabling this option can result in data loss and broken traces, so it should be used with caution and only when system stability is a priority.

-For more information, see the [Discard On Write Failure](#sampling-tail-discard-on-write-failure-ref) section.
+For more information, refer to the [Discard On Write Failure](#sampling-tail-discard-on-write-failure-ref) section.
 :::

From a1ec18990c7bfe33a58a8bf8489a86e39c29bdb5 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Mon, 4 Aug 2025 12:07:02 +0100
Subject: [PATCH 10/11] Link to examples

---
 solutions/observability/apm/tail-based-sampling.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index 54107efe25..bdcd1c38a3 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -194,7 +194,9 @@ This metric tracks the storage size for value log files used by the previous imp

 :::{dropdown} Why doesn't the sampling rate shown in Storage Explorer match the configured tail sampling rate?

-In APM Server, the tail sampling policy applied to a distributed trace is determined by evaluating the configured policies in order against the root transaction (the transaction without a parent) and using the first policy that matches. In contrast, the APM UI Storage Explorer calculates the effective average sampling rate for each service using a different method. It considers both head-based and tail-based sampling, but does not account for root transactions. As a result, the sampling rate displayed in Storage Explorer may differ from the configured tail sampling rate, which can give the false impression that tail-based sampling is not functioning correctly.
+In APM Server, the tail sampling policy applied to a distributed trace is determined by evaluating the configured policies in order against the root transaction (the transaction without a parent). To learn more about how tail sampling policies are applied, see the examples in [Configure Tail-based sampling](/solutions/observability/apm/transaction-sampling#apm-configure-tail-based-sampling).
+
+In contrast, the APM UI Storage Explorer calculates the effective average sampling rate for each service using a different method. It considers both head-based and tail-based sampling, but does not account for root transactions. As a result, the sampling rate displayed in Storage Explorer may differ from the configured tail sampling rate, which can give the false impression that tail-based sampling is not functioning correctly.

 For more information, check the related [Kibana issue](https://github.com/elastic/kibana/issues/226600).
 :::

From 8dc793b23fa866a3cd326f26a1e53ed69ea68a67 Mon Sep 17 00:00:00 2001
From: Carson Ip
Date: Mon, 4 Aug 2025 13:44:42 +0100
Subject: [PATCH 11/11] Missing .md

---
 solutions/observability/apm/tail-based-sampling.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
index bdcd1c38a3..37ada36ece 100644
--- a/solutions/observability/apm/tail-based-sampling.md
+++ b/solutions/observability/apm/tail-based-sampling.md
@@ -194,7 +194,7 @@ This metric tracks the storage size for value log files used by the previous imp

 :::{dropdown} Why doesn't the sampling rate shown in Storage Explorer match the configured tail sampling rate?

-In APM Server, the tail sampling policy applied to a distributed trace is determined by evaluating the configured policies in order against the root transaction (the transaction without a parent). To learn more about how tail sampling policies are applied, see the examples in [Configure Tail-based sampling](/solutions/observability/apm/transaction-sampling#apm-configure-tail-based-sampling).
+In APM Server, the tail sampling policy applied to a distributed trace is determined by evaluating the configured policies in order against the root transaction (the transaction without a parent). To learn more about how tail sampling policies are applied, see the examples in [Configure Tail-based sampling](/solutions/observability/apm/transaction-sampling.md#apm-configure-tail-based-sampling).

 In contrast, the APM UI Storage Explorer calculates the effective average sampling rate for each service using a different method. It considers both head-based and tail-based sampling, but does not account for root transactions. As a result, the sampling rate displayed in Storage Explorer may differ from the configured tail sampling rate, which can give the false impression that tail-based sampling is not functioning correctly.
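
The FAQ text in this series says policies are evaluated in order against the root transaction. The following minimal Python sketch illustrates that idea; it is not APM Server's implementation. It assumes first-match semantics, as PATCH 07 worded it ("using the first policy that matches"). The function name `pick_sample_rate`, the flattened dictionary keys, and the example policies are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical, simplified policy shape: every match key is optional, and a
# policy that sets none of them matches any root transaction (a catch-all).
MATCH_KEYS = ("service.name", "service.environment", "trace.name", "trace.outcome")


def pick_sample_rate(policies: list[dict], root: dict) -> Optional[float]:
    """Return the sample_rate of the first policy whose match keys all equal
    the corresponding attributes of the root transaction, or None."""
    for policy in policies:
        if all(policy[key] == root.get(key) for key in MATCH_KEYS if key in policy):
            return policy["sample_rate"]
    return None  # no policy matched (assumed behaviour for this sketch)


# Example policies, ordered from most specific to least specific.
policies = [
    {"service.name": "checkout", "trace.outcome": "failure", "sample_rate": 1.0},
    {"service.name": "checkout", "sample_rate": 0.5},
    {"sample_rate": 0.1},  # catch-all
]
```

Under these assumptions, only the attributes of the root transaction decide the rate. A per-service average, such as the one Storage Explorer displays, can therefore differ from any single configured `sample_rate` when a service takes part in traces rooted in other services.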