5 changes: 5 additions & 0 deletions reference/apm/cloud/apm-settings.md
@@ -53,6 +53,11 @@ If a setting is not supported by {{ech}}, you will get an error message when you
Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
::::

### Version 9.1+ [ec_version_9_1]
Contributor Author

This config also applies to 8.19+, but I left it out based on @carsonip's comment in another PR. Let me know if I should add 8.19+.

Contributor Author

@florent-leborgne @colleenmcginnis My initial plan was to backport this PR to 8.X branch for the 8.19 release (and change the versions from 9.1 to 8.19). But I just realized 8.19 is being released before 9.1.

Should I create a separate PR for 8.X? Or do you have any other suggestions? Thanks

Contributor

Hey @isaacaflores2. Thanks for this PR. You would need a different PR anyway for the 8.19 docs because:

  • the content is likely in a different repository (https://github.com/elastic/observability-docs)
  • 8.x docs, including 8.19, are still powered by the asciidoc-based system, while 9.0 docs and above like this PR are markdown-based.

I am happy to help if you need it.

Contributor Author

@isaacaflores2 isaacaflores2 Jun 27, 2025

Got it, thanks for sharing. I will start a PR for the 8.19 docs in the other repo. I'll reach out on Slack if I need any help.

This {{stack}} version adds support for the following settings:

`apm-server.sampling.tail.discard_on_write_failure`
: Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss, which can result in broken traces. The default is `false`.

### Version 8.0+ [ec_version_8_0_3]

6 changes: 6 additions & 0 deletions solutions/observability/apm/configure-apm-server.md
@@ -77,6 +77,12 @@ If a setting is not supported on {{ecloud}}, you will get an error message when
Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
::::

### Version 9.1+ [ec_version_9_1]
This {{stack}} version adds support for the following settings:

`apm-server.sampling.tail.discard_on_write_failure`
: Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss, which can result in broken traces. The default is `false`.

### Version 8.0+ [ec_version_8_0_3]

This stack version removes support for some previously supported settings. These are all of the supported settings for this version:
14 changes: 13 additions & 1 deletion solutions/observability/apm/tail-based-sampling.md
@@ -85,6 +85,18 @@ Policies map trace events to a sample rate. Each policy must specify a sample ra
| APM Server binary | `sampling.tail.policies` |
| Fleet-managed | `Policies` |

### Discard On Write Failure [sampling-tail-discard-on-write-failure-ref]

Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss, which can result in broken traces.

Default: `false`. (bool)

| | |
|------------------------------|------------------------------------------|
| APM Server binary | `sampling.tail.discard_on_write_failure` |
| Fleet-managed (version 9.1+) | `Discard On Write Failure` |
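
The behavior described above can be modeled with a small Python sketch. This is only an illustration of the documented decision, not APM Server's implementation: the class `TailSamplingModel`, the exception `StorageLimitReached`, and the event-count limit `storage_limit_events` are all hypothetical names introduced here (the real storage limit is expressed as a size such as `GB`).

```python
from dataclasses import dataclass, field
from typing import List


class StorageLimitReached(Exception):
    """Hypothetical error: the local tail-sampling store rejected a write."""


@dataclass
class TailSamplingModel:
    # Mirrors the documented default of `false`.
    discard_on_write_failure: bool = False
    # Assumption for illustration: the limit is counted in events, not bytes.
    storage_limit_events: int = 2
    stored: List[str] = field(default_factory=list)
    indexed: List[str] = field(default_factory=list)
    discarded: List[str] = field(default_factory=list)

    def _write(self, event: str) -> None:
        # Fail the write once the (hypothetical) limit is reached.
        if len(self.stored) >= self.storage_limit_events:
            raise StorageLimitReached("configured limit reached")
        self.stored.append(event)

    def handle(self, event: str) -> str:
        """Return what happens to a trace event: stored, indexed, or discarded."""
        try:
            self._write(event)
            return "stored"  # waits locally for a sampling decision
        except StorageLimitReached:
            if self.discard_on_write_failure:
                self.discarded.append(event)  # data loss, possibly broken traces
                return "discarded"
            self.indexed.append(event)  # bypasses sampling, adds indexing load
            return "indexed"
```

With the default setting, an event that cannot be stored ends up in `indexed`; with `discard_on_write_failure=True`, the same event ends up in `discarded`.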


### Storage limit [sampling-tail-storage_limit-ref]

The amount of storage space allocated for trace events matching tail sampling policies. Caution: Setting this limit higher than the allowed space may cause APM Server to become unhealthy.
@@ -93,7 +105,7 @@ A value of `0GB` (or equivalent) does not set a concrete limit, but rather allow

If this is not desired, a concrete `GB` value can be set for the maximum amount of disk used for tail-based sampling.

If the configured storage limit is insufficient, it logs "configured limit reached". The event will bypass sampling and will always be indexed when storage limit is reached.
If the configured storage limit is insufficient, APM Server logs "configured limit reached". When the storage limit is reached, the event is indexed or discarded based on the [Discard On Write Failure](#sampling-tail-discard-on-write-failure-ref) configuration.

Default: `0GB`. (text)

2 changes: 1 addition & 1 deletion solutions/observability/apm/transaction-sampling.md
@@ -146,7 +146,7 @@ Due to [OpenTelemetry tail-based sampling limitations](/solutions/observability/

Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded when a sampling decision is made.

In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, sampling will be bypassed.
In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events are indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.

It is recommended to use fast disks, ideally Solid State Drives (SSD) with high I/O per second (IOPS), when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate.
