|[Configure an automatic {{agent}} upgrade](#agent-policy-automatic-agent-upgrade) {applies_to}`stack: ga 9.1.0`|||
|[Change the output of a policy](#change-policy-output)|||
|[Add a {{fleet-server}} to a policy](#add-fleet-server-to-policy)|||
|[Configure secret values in a policy](#agent-policy-secret-values)|||

You can set a rate limit for the action handler for diagnostics requests coming from {{fleet}}.

This setting configures retries for the file upload client handling diagnostics requests coming from {{fleet}}. The setting affects only {{fleet}}-managed {{agents}}. By default, a maximum of `10` retries are allowed with an initial duration of `1s` and a backoff duration of `1m`. The client may retry failed requests with exponential backoff.
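
As a sketch of how such retry settings might be expressed in an agent YAML configuration (the key names below are assumptions inferred from the defaults described above, not documented {{agent}} settings, so verify them against the {{agent}} configuration reference before use):

```yaml
# Illustrative sketch only: the key names are assumptions, not confirmed
# Elastic Agent settings. The values mirror the documented defaults above.
agent.diagnostics:
  uploader:
    max_retries: 10 # stop retrying a failed diagnostics upload after 10 attempts
    init_dur: 1s    # initial retry delay
    max_dur: 1m     # cap for the exponentially growing retry delay
```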
## Configure an automatic {{agent}} upgrade [agent-policy-automatic-agent-upgrade]
```{applies_to}
stack: ga 9.1.0
```
For a high-scale deployment of {{fleet}}, you can configure an automatic, gradual rollout of a new minor or patch version to a percentage of the {{agents}} in your policy. For more information, refer to [Auto-upgrade agents enrolled in a policy](/reference/fleet/upgrade-elastic-agent.md#auto-upgrade-agents).

::::{note}
This feature is only available for certain subscription levels. For more information, refer to [{{stack}} subscriptions](https://www.elastic.co/subscriptions).
::::
## Change the output of a policy [change-policy-output]
Assuming your [{{stack}} subscription level](https://www.elastic.co/subscriptions) supports per-policy outputs, you can change the output of a policy to send data to a different output.

`reference/fleet/upgrade-elastic-agent.md`

These restrictions apply whether you are upgrading {{agents}} individually or in bulk.
## Upgrading {{agent}} [upgrade-agent]
To upgrade your {{agents}}, go to **Management** → **{{fleet}}** → **Agents** in {{kib}}. You can perform the following upgrade-related actions:

| User action | Result |
| --- | --- |
|[Restart an upgrade for a single agent](#restart-upgrade-single)| Restart an upgrade process that has stalled for a single agent. |
|[Restart an upgrade for multiple agents](#restart-upgrade-multiple)| Do a bulk restart of the upgrade process for a set of agents. |

With the right [subscription level](https://www.elastic.co/subscriptions), you can also configure an automatic, gradual upgrade of a percentage of the {{agents}} enrolled in an {{agent}} policy. For more information, refer to [Auto-upgrade agents enrolled in a policy](#auto-upgrade-agents). {applies_to}`stack: ga 9.1.0`
## Upgrade a single {{agent}} [upgrade-an-agent]
## Do a rolling upgrade of multiple {{agents}} [rolling-agent-upgrade]

You can do rolling upgrades to avoid exhausting network resources when updating a large number of {{agents}}.
## Restart an upgrade for a single agent [restart-upgrade-single]
An {{agent}} upgrade process may sometimes stall. This can happen for various reasons, such as network connectivity issues or a delayed shutdown.
## Restart an upgrade for multiple agents [restart-upgrade-multiple]

When the upgrade process for multiple agents has been detected to have stalled:

5. Restart the upgrades.
## Auto-upgrade agents enrolled in a policy [auto-upgrade-agents]
```{applies_to}
stack: ga 9.1.0
```
::::{note}
This feature is only available for certain subscription levels. For more information, refer to [{{stack}} subscriptions](https://www.elastic.co/subscriptions).
::::
To configure an automatic rollout of a new minor or patch version to a percentage of the agents enrolled in your {{agent}} policy, follow these steps:
1. In {{kib}}, go to **Management** → **{{fleet}}** → **Agent policies**.
2. Select the agent policy for which you want to configure an automatic agent upgrade.
3. On the agent policy's details page, find **Auto-upgrade agents**, and select **Manage** next to it.
4. In the **Manage auto-upgrade agents** window, click **Add target version**.
5. From the **Target agent version** dropdown, select the minor or patch version to which you want to upgrade a percentage of your agents.
6. In the **% of agents to upgrade** field, enter the percentage of active agents you want to upgrade to this target version.

   Note that:

   - Unenrolling, unenrolled, inactive, and uninstalled agents are not included in the count. For example, if you set the target upgrade percentage to 50% for a policy with 10 active agents and 10 inactive agents, the target is met when 5 active agents are upgraded.
   - Rounding is applied, and the actual percentage of upgraded agents may vary slightly. For example, if you set the target upgrade percentage to 30% for a policy with 25 active agents, the target is met when 8 active agents are upgraded (32%).
7. You can then add a different target version and specify the percentage of agents you want upgraded to that version. The total percentage of agents to be upgraded cannot exceed 100%.
8. Click **Save**.

Once the configuration is saved, an asynchronous task runs every 30 minutes, gradually upgrading the agents in the policy to the specified target version.

Failed upgrades are retried with an exponential backoff mechanism until the upgrade succeeds or the maximum number of retries is reached. Note that the maximum number of retries equals the number of [configured retry delays](#auto-upgrade-settings).

::::{note}
Only active agents enrolled in the policy are considered for the automatic upgrade.

If new agents are assigned to the policy, the number of {{agents}} to be upgraded is adjusted according to the set percentages.
::::
### Configure the auto-upgrade settings [auto-upgrade-settings]
On self-managed and cloud deployments of {{stack}}, you can configure the default task interval and the retry delays of the automatic upgrade in the [{{kib}} {{fleet}} settings](kibana://reference/configuration-reference/fleet-settings.md). For example:
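
As a sketch, the relevant `kibana.yml` entries could look like the following; the `xpack.fleet.autoUpgrades.*` key names are assumptions, so confirm the exact keys in the {{fleet}} settings reference:

```yaml
# Assumed key names; verify them in the Kibana Fleet settings reference.
xpack.fleet.autoUpgrades.taskInterval: 30m <1>
xpack.fleet.autoUpgrades.retryDelays: ['30m', '1h', '2h', '4h', '8h', '16h', '24h'] <2>
```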
1. The time interval at which the auto-upgrade task should run. Defaults to `30m`.
2. Array indicating how much time should pass before a failed auto-upgrade is retried. The array's length indicates the maximum number of retries. Defaults to `['30m', '1h', '2h', '4h', '8h', '16h', '24h']`.

For more information, refer to the [Kibana configuration reference](kibana://reference/configuration-reference.md).
### View the status of the automatic upgrade [auto-upgrade-view-status]
You can view the status of the automatic upgrade in the following ways:
- On the agent policy's details page, find **Auto-upgrade agents**, and select **Manage** to open the **Manage auto-upgrade agents** window.

  The status of the upgrade is displayed next to the specified target version and percentage, and includes the percentage of agents that have already been upgraded.

  To view any failed upgrades, hover over the **Upgrade failed** status, then click **Go to upgrade**.
- On the **{{fleet}}** → **Agents** page, click **Agent activity** to open a flyout showing logs of the {{agent}} activity and the progress of the automatic agent upgrade.
## Upgrade RPM and DEB system packages [upgrade-system-packages]
If you have installed and enrolled {{agent}} using either a DEB (for a Debian-based Linux distribution) or RPM (for a RedHat-based Linux distribution) install package, the upgrade cannot be managed by {{fleet}}. Instead, you can perform the upgrade using these steps.

`solutions/observability/apm/tail-based-sampling.md`

Policies map trace events to a sample rate. Each policy must specify a sample rate.

| APM Server binary | `apm-server.sampling.tail.policies` |
| Fleet-managed | `Policies` |
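
For the APM Server binary, a minimal policy set in `apm-server.yml` might look like the following sketch; the service name and sample rates are illustrative values, not recommendations:

```yaml
apm-server:
  sampling:
    tail:
      enabled: true
      policies:
        # Keep 10% of traces whose root transaction comes from this service.
        - service.name: my-frontend
          sample_rate: 0.1
        # Catch-all policy for traces that match none of the policies above.
        - sample_rate: 0.01
```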
### Discard on write failure [sampling-tail-discard-on-write-failure-ref]
Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss, which can result in broken traces. The default is `false`.
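
Assuming the APM Server binary exposes this option under the same `sampling.tail` namespace as the other options in this section (treat the exact key name as an assumption and confirm it in the APM Server reference), a sketch:

```yaml
apm-server:
  sampling:
    tail:
      # Assumed key name: discard, rather than index, trace events that fail
      # to be written to local storage (for example, at the storage limit).
      discard_on_write_failure: true
```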
### Storage limit [sampling-tail-storage_limit-ref]

The amount of storage space allocated for trace events matching tail sampling policies. Caution: Setting this limit higher than the allowed space may cause APM Server to become unhealthy.

A value of `0GB` (or equivalent) does not set a concrete limit. If this is not desired, a concrete `GB` value can be set for the maximum amount of disk used for tail-based sampling.
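
For example, a concrete cap for the APM Server binary might be expressed as in this sketch, where `storage_limit` follows the `sampling.tail` naming pattern used above:

```yaml
apm-server:
  sampling:
    tail:
      # Cap local tail-sampling storage at a concrete size instead of `0GB`.
      storage_limit: 5GB
```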
95
107
96
-
If the configured storage limit is insufficient, it logs "configured limit reached". The event will bypass sampling and will always be indexed when storage limit is reached.
108
+
If the configured storage limit is insufficient, APM Server logs "configured limit reached". When the storage limit is reached, the event is indexed or discarded based on the [Discard on write failure](#sampling-tail-discard-on-write-failure-ref) configuration.

`solutions/observability/apm/transaction-sampling.md`

Tail-based sampling (TBS), by definition, requires temporarily storing events locally so that they can be retrieved and forwarded once a sampling decision is made.

In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events are indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.

It is recommended to use fast disks, ideally solid state drives (SSDs) with high I/O operations per second (IOPS), when enabling tail-based sampling. Disk throughput and I/O may become performance bottlenecks for tail-based sampling and APM event ingestion overall. Disk writes are proportional to the event ingest rate, while disk reads are proportional to both the event ingest rate and the sampling rate.
0 commit comments