Skip to content

Conversation

@karenzone
Copy link
Contributor

Related: #2760
Follow-up to: #3537

@github-actions
Copy link

github-actions bot commented Oct 22, 2025

🔍 Preview links for changed docs

| -------- | -------- |
| [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled |
| [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. |
| [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
Copy link
Contributor Author

@karenzone karenzone Oct 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

80%
- Default: Enabled (perhaps the threshold should be higher if this is on by default)

What did we decide?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the threshold here is currently set to 50% - @MichelLosier is that correct?

| [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled |
| [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. |
| [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
| [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alert on (not sure) > 10 times in a 5 minute window

What did we decide?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is correct. currently set to greater than 10 restarts in the5 min window


You can find these rules in **Stack Management** > **Alerts and Insights** > **Rules**.

### Available rules [available-alert-rules]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Available rules [available-alert-rules]
### Available alert rules [available-alert-rules]

@karenzone
Copy link
Contributor Author

karenzone commented Oct 22, 2025

@nimarezainia @MichelLosier I've added the alert rules and descriptions, and asked some questions about final decisions inline. Please take a look and let me know what you think. If there's a later "source of truth" for available alerts and descriptions, please let me know and I'll update the PR accordingly.

We still have some questions to answer (see inline comments," but I'm marking this as "Ready for review."

@karenzone karenzone marked this pull request as ready for review October 22, 2025 20:40
@karenzone karenzone requested a review from a team as a code owner October 22, 2025 20:40
@karenzone karenzone requested a review from nchaulet October 22, 2025 20:48
@nimarezainia
Copy link

@MichelLosier I don't see the "Agent Unhealthy" rule in staging? are we shipping the agent status changes as well?

Copy link
Contributor

@benironside benironside left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with the subject matter but left some minor editorial suggestions for your consideration!

| Alert | Description |
| -------- | -------- |
| [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled |
| [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. |
| [Elastic Agent] Dropped events | Checks if the percentage of events dropped to acked events from the pipeline is greater than or equal to 5%. Rows are distinguished by agent ID and component ID. |

IDK what "events dropped to acked events from the pipeline are" but if we're talking about "the percentage", we want an "is" not an "are" :)

| [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled |
| [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. |
| [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
| [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
| [Elastic Agent] Excessive restarts| Checks for excessive restarts on a host which require further investigation. Some restarts can have business impacts, and getting alerts for them can enable timely mitigation efforts.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |

| [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. |
| [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
| [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. |
| [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinguished by agent ID and component ID. |

Should we specify the exact names of the agent ID and component ID fields? e.g. agent_ID (not sure that's accurate just an example)

| [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
| [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. |
| [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinct by agent id and component id. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinct by agent id and component id. |
| [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinguished by agent ID and component ID. |

| [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. |
| [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinct by agent id and component id. |
| [Elastic Agent] Unhealthy status | Checks for log occurrence of an agent status change to `error` using the new elastic_agent.status_change datastreams. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [Elastic Agent] Unhealthy status | Checks for log occurrence of an agent status change to `error` using the new elastic_agent.status_change datastreams. |
| [Elastic Agent] Unhealthy status | Checks logs for an agent status change to `error` using the new `elastic_agent.status_change` datastreams. |

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants