Add list of agent OOB alert rules with descriptions #3608
Conversation
🔍 Preview links for changed docs
reference/fleet/alert-templates.md
Outdated
| -------- | -------- |
| [Elastic Agent] CPU usage spike | Checks if {{agent}} or any of its processes were pegged at high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br>- Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled |
| [Elastic Agent] Dropped events | Checks if the percentage of dropped events relative to acked events from the pipeline is greater than or equal to 5%. Rows are distinct by agent ID and component ID. |
| [Elastic Agent] Excessive memory usage | Checks if {{agent}} or any of its processes have high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
80%
- Default: Enabled (perhaps the threshold should be higher if this is on by default)
What did we decide?
I believe the threshold here is currently set to 50% - @MichelLosier is that correct?
Correct, 50% is the new threshold.
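As a purely illustrative aside (not part of this PR): a minimal Python sketch of the kind of per-process threshold check the memory rule describes, using the 50% threshold agreed above. The sample document shape and field names are assumptions; the real rule evaluates agent metrics data, not an in-memory list.

```python
# Hypothetical metric samples; in the real rule these come from the agent
# metrics data stream, not from an in-memory list.
samples = [
    {"agent.id": "a1", "process": "filebeat", "system.process.memory.rss.pct": 0.62},
    {"agent.id": "a2", "process": "metricbeat", "system.process.memory.rss.pct": 0.31},
]

MEMORY_THRESHOLD = 0.50  # 50%, the threshold confirmed in the discussion above


def memory_alerts(docs, threshold=MEMORY_THRESHOLD):
    """Return the samples whose RSS memory percentage exceeds the threshold."""
    return [d for d in docs if d["system.process.memory.rss.pct"] > threshold]


for hit in memory_alerts(samples):
    print(f'{hit["agent.id"]}/{hit["process"]}: rss at {hit["system.process.memory.rss.pct"]:.0%}')
```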
reference/fleet/alert-templates.md
Outdated
| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host, which require further investigation. Some of these restarts could have a business impact, and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
Alert on (not sure) > 10 times in a 5 minute window
What did we decide?
This is correct. Currently set to greater than 10 restarts in the 5-minute window.
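Again for illustration only (not from the PR): a small Python sketch of the "more than 10 restarts in a 5-minute window" condition. The restart-event shape is an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical restart events as (agent id, restart timestamp) pairs.
restart_events = [
    ("agent-1", datetime(2025, 1, 1, 12, 0, 5)),
    ("agent-1", datetime(2025, 1, 1, 12, 1, 10)),
    ("agent-2", datetime(2025, 1, 1, 11, 50, 0)),
]

WINDOW = timedelta(minutes=5)
MAX_RESTARTS = 10  # alert when the count in the window is greater than 10


def excessive_restarts(events, now, window=WINDOW, max_restarts=MAX_RESTARTS):
    """Return agent IDs with more than `max_restarts` restarts inside the window."""
    counts = {}
    for agent_id, ts in events:
        if now - ts <= window:
            counts[agent_id] = counts.get(agent_id, 0) + 1
    return [agent for agent, n in counts.items() if n > max_restarts]


print(excessive_restarts(restart_events, now=datetime(2025, 1, 1, 12, 5, 0)))
```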
@nimarezainia @MichelLosier I've added the alert rules and descriptions, and asked some questions about final decisions inline. Please take a look and let me know what you think. If there's a later "source of truth" for available alerts and descriptions, please let me know and I'll update the PR accordingly. We still have some questions to answer (see inline comments), but I'm marking this as "Ready for review."
@MichelLosier I don't see the "Agent Unhealthy" rule in staging. Are we shipping the agent status changes as well?
I'm not very familiar with the subject matter but left some minor editorial suggestions for your consideration!
Co-authored-by: Benjamin Ironside Goldstein <[email protected]>
@nimarezainia @MichelLosier, let's wrap this one up and get it merged for our users. Please review this content with an eye for what's available to users now. We can update docs as new rules become available. Just open a docs issue!
@MichelLosier could you please review these changes? We can always change the thresholds as we learn more. (I left a few comments for you above.) Thanks.
| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host which require further investigation. Some restarts can have a business impact, and getting alerts for them can enable timely mitigation.<br>- Condition: Alert on more than 10 restarts in a 5-minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks if the max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinguished by agent ID and component ID. |
| [Elastic Agent] Output errors | Checks if errors per minute from an agent component exceed 5. Rows are distinguished by agent ID and component ID. |
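For illustration (not part of the PR): a Python sketch of the high pipeline queue condition, taking the max of `beat.stats.libbeat.pipeline.queue.filled.pct` per agent ID and component ID and alerting above 90%. The sample shape is an assumption.

```python
from collections import defaultdict

# Hypothetical monitoring samples keyed by agent ID and component ID.
samples = [
    {"agent.id": "a1", "component.id": "filestream", "beat.stats.libbeat.pipeline.queue.filled.pct": 0.95},
    {"agent.id": "a1", "component.id": "filestream", "beat.stats.libbeat.pipeline.queue.filled.pct": 0.40},
    {"agent.id": "a2", "component.id": "system/metrics", "beat.stats.libbeat.pipeline.queue.filled.pct": 0.20},
]

QUEUE_THRESHOLD = 0.90  # 90%


def high_pipeline_queue(docs, threshold=QUEUE_THRESHOLD):
    """Return (agent ID, component ID) pairs whose max queue fill exceeds the threshold."""
    max_pct = defaultdict(float)
    for d in docs:
        key = (d["agent.id"], d["component.id"])
        max_pct[key] = max(max_pct[key], d["beat.stats.libbeat.pipeline.queue.filled.pct"])
    return [key for key, pct in max_pct.items() if pct > threshold]


print(high_pipeline_queue(samples))
```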
We may want to add the [Elastic Agent] Unhealthy status alerting rule here. We can describe it as "Checks if an agent has transitioned to an 'unhealthy' status, which can indicate errors or degraded functionality of the agent."
I checked staging and didn't see it, so I removed it.
Thanks for clarifying. I'll add it back.
MichelLosier
left a comment
Looks great! Just the adjustment for the percentage threshold of the high memory rule, and adding a description of the Unhealthy status rule.
@MichelLosier, please give it a look.
MichelLosier
left a comment
Looks great!
I've re-requested a review from
colleenmcginnis
left a comment
Some thoughts and questions below.
| [Elastic Agent] Excessive memory usage | Checks if {{agent}} or any of its processes have high memory usage or memory usage that is trending up. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` more than 50%<br>- Default: Enabled |
| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host. Some restarts can have a business impact, and getting alerts for them can enable timely mitigation.<br>- Condition: Alert on 11 or more restarts in a 5-minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks the fill percentage of the pipeline queue. Rows are distinguished by agent ID and component ID.<br>- Condition: Alert on the max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeding 90%<br>- Default: Enabled |
| [Elastic Agent] Output errors | Checks errors per minute from an agent component. Rows are distinguished by agent ID and component ID.<br>- Condition: Alert on 6 or more errors per minute<br>- Default: Enabled |
Originally was "greater than 5." I changed it to "6 or more."
Please keep me honest here, @MichelLosier
That's still correct. The rule is defined with > 5, but if "6 or more" is a clearer read, that's totally fine.
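To make that threshold wording concrete, an illustrative Python sketch (not from the PR) of the output-errors condition: count errors per agent ID and component ID over the last minute and alert when the count is greater than 5, that is, 6 or more. The event shape is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical error events as (agent ID, component ID, timestamp) tuples.
error_events = [
    ("agent-1", "elasticsearch-output", datetime(2025, 1, 1, 12, 0, 10)),
    ("agent-1", "elasticsearch-output", datetime(2025, 1, 1, 12, 0, 40)),
]

ERROR_LIMIT = 5  # alert when the per-minute count is greater than 5 (6 or more)


def output_error_alerts(events, now, limit=ERROR_LIMIT):
    """Return (agent ID, component ID) pairs with more than `limit` errors in the last minute."""
    window_start = now - timedelta(minutes=1)
    counts = Counter(
        (agent, component) for agent, component, ts in events if ts >= window_start
    )
    return [key for key, n in counts.items() if n > limit]


print(output_error_alerts(error_events, now=datetime(2025, 1, 1, 12, 1, 0)))
```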
Published content: https://www.elastic.co/docs/reference/fleet/alert-templates#ea-alert-rules
Related: #2760
Follow-up to: #3537
Fixes: #3815