Conversation

@karenzone karenzone (Contributor) commented Oct 22, 2025

Related: #2760
Follow-up to: #3537
Fixes: #3815

@karenzone karenzone self-assigned this Oct 22, 2025
github-actions bot commented Oct 22, 2025

🔍 Preview links for changed docs

@karenzone karenzone force-pushed the 2760-alert-assets-v2 branch from 71c44c1 to e6daf96 on October 22, 2025 20:19
@karenzone karenzone force-pushed the 2760-alert-assets-v2 branch from e6daf96 to 326a28c on October 22, 2025 20:24
| -------- | -------- |
| [Elastic Agent] CPU usage spike | Checks if {{agent}} or any of its processes were pegged at high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled |
| [Elastic Agent] Dropped events | Checks if the percentage of events dropped to acked events from the pipeline is greater than or equal to 5%. Rows are distinct by agent ID and component ID. |
| [Elastic Agent] Excessive memory usage | Checks if {{agent}} or any of its processes have high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
@karenzone karenzone (Contributor, Author) Oct 22, 2025

> 80%
> - Default: Enabled (perhaps the threshold should be higher if this is on by default)

What did we decide?

I believe the threshold here is currently set to 50% - @MichelLosier is that correct?

Correct, 50% is the new threshold
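
To make the condition above concrete: a minimal Python sketch of the memory check described in the table, using the 50% threshold confirmed in this thread. The document shape and the `memory_alerts` helper are illustrative assumptions; the shipped rule is a Kibana alerting rule evaluated against agent monitoring data, not this code.

```python
# Illustrative sketch only (assumed data shape, not the Kibana rule itself):
# flag agent processes whose system.process.memory.rss.pct exceeds 50%.
MEMORY_PCT_THRESHOLD = 0.5  # 50%, per the thread above

def memory_alerts(docs):
    """Return (agent_id, process_name, pct) for documents over the threshold."""
    alerts = []
    for doc in docs:
        pct = (
            doc.get("system", {})
            .get("process", {})
            .get("memory", {})
            .get("rss", {})
            .get("pct")
        )
        if pct is not None and pct > MEMORY_PCT_THRESHOLD:
            alerts.append(
                (doc.get("agent", {}).get("id"), doc.get("process", {}).get("name"), pct)
            )
    return alerts

# Example: one process over the threshold, one under.
sample = [
    {"agent": {"id": "agent-1"}, "process": {"name": "filebeat"},
     "system": {"process": {"memory": {"rss": {"pct": 0.62}}}}},
    {"agent": {"id": "agent-2"}, "process": {"name": "metricbeat"},
     "system": {"process": {"memory": {"rss": {"pct": 0.31}}}}},
]
print(memory_alerts(sample))  # [('agent-1', 'filebeat', 0.62)]
```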

| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host that require further investigation. Some of these restarts could have a business impact, and getting an alert for them would allow us to act quickly to mitigate them.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
@karenzone karenzone (Contributor, Author)

> Alert on (not sure) > 10 times in a 5 minute window

What did we decide?

This is correct. Currently set to greater than 10 restarts in the 5 min window.
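
For readers who want the restart condition spelled out: a minimal sketch of "more than 10 restarts per host in a 5 minute window", assuming restart events are available as (host, timestamp) pairs. This is not the shipped rule; it only illustrates the counting logic.

```python
# Illustrative sketch only: count Elastic Agent restarts per host inside a
# 5-minute window and flag hosts with more than 10 (the threshold above).
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
RESTART_LIMIT = 10  # alert when restarts > 10 within the window

def hosts_with_excessive_restarts(restart_events, window_end):
    """restart_events: iterable of (host, timestamp) pairs."""
    window_start = window_end - WINDOW
    counts = defaultdict(int)
    for host, ts in restart_events:
        if window_start <= ts <= window_end:
            counts[host] += 1
    return [host for host, n in counts.items() if n > RESTART_LIMIT]

# Example: host-a restarts 12 times in the window, host-b once.
now = datetime(2025, 10, 22, 20, 0, 0)
events = [("host-a", now - timedelta(seconds=20 * i)) for i in range(12)]
events.append(("host-b", now - timedelta(minutes=1)))
print(hosts_with_excessive_restarts(events, now))  # ['host-a']
```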

@karenzone karenzone (Contributor, Author) commented Oct 22, 2025

@nimarezainia @MichelLosier I've added the alert rules and descriptions, and asked some questions about final decisions inline. Please take a look and let me know what you think. If there's a later "source of truth" for available alerts and descriptions, please let me know and I'll update the PR accordingly.

We still have some questions to answer (see inline comments), but I'm marking this as "Ready for review."

@karenzone karenzone marked this pull request as ready for review October 22, 2025 20:40
@karenzone karenzone requested a review from a team as a code owner October 22, 2025 20:40
@karenzone karenzone requested a review from nchaulet October 22, 2025 20:48
@nimarezainia

@MichelLosier I don't see the "Agent Unhealthy" rule in staging? Are we shipping the agent status changes as well?

@benironside benironside (Contributor) left a comment

I'm not very familiar with the subject matter but left some minor editorial suggestions for your consideration!

@karenzone karenzone (Contributor, Author)

@nimarezainia @MichelLosier, let's wrap this one up and get it merged for our users.
A lot of information and implementation ideas were shared in Slack, some of them forward-looking.

Please review this content with an eye for what's available to users now. We can update docs as new rules become available. Just open a docs issue!

@nimarezainia

@MichelLosier could you please review these changes? We can always change the thresholds as we learn more. (I left a few comments for you above). thx

| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host that require further investigation. Some restarts can have a business impact, and getting alerts for them can enable timely mitigation.<br>- Condition: Alert on > 10 restarts in a 5 minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks if the max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinguished by agent ID and component ID. |
| [Elastic Agent] Output errors | Checks if the number of errors per minute from an agent component is greater than 5. Rows are distinguished by agent ID and component ID. |


We may want to add here the [Elastic Agent] Unhealthy status alerting rule. We can describe it as "Checks if an agent has transitioned to an 'unhealthy' status, which can indicate errors or degraded functionality of the agent."

@karenzone karenzone (Contributor, Author)

#3608 (comment)

I checked staging and didn't see it, so I removed it.
Thanks for clarifying. I'll add it back.
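
On the "High pipeline queue" row quoted in the hunk above: a minimal sketch of grouping samples by (agent ID, component ID) and alerting when the max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. The flat `(agent_id, component_id, pct)` sample shape is an assumption for illustration; the real rule queries monitoring data through Kibana alerting.

```python
# Illustrative sketch only: per (agent ID, component ID) row, take the max
# queue fill percentage and flag rows above 90%.
from collections import defaultdict

QUEUE_FILLED_THRESHOLD = 0.9  # beat.stats.libbeat.pipeline.queue.filled.pct > 90%

def high_queue_rows(samples):
    """samples: iterable of (agent_id, component_id, filled_pct)."""
    max_by_row = defaultdict(float)
    for agent_id, component_id, pct in samples:
        key = (agent_id, component_id)
        max_by_row[key] = max(max_by_row[key], pct)
    return {row: pct for row, pct in max_by_row.items() if pct > QUEUE_FILLED_THRESHOLD}

# Example: one row peaks at 95%, another stays at 55%.
samples = [
    ("agent-1", "filestream-default", 0.95),
    ("agent-1", "filestream-default", 0.40),
    ("agent-2", "system/metrics-default", 0.55),
]
print(high_queue_rows(samples))  # {('agent-1', 'filestream-default'): 0.95}
```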

@MichelLosier MichelLosier left a comment

Looks great! Just the adjustment for the percentage threshold of the high memory rule, and adding a description of the Unhealthy status rule.

@karenzone karenzone (Contributor, Author)

@MichelLosier, please give it a look.

@MichelLosier MichelLosier left a comment

Looks great! :shipit:

@karenzone karenzone (Contributor, Author) commented Nov 11, 2025

I've re-requested a review from ingest-docs as code-owner. Stay tuned!

@colleenmcginnis colleenmcginnis (Contributor) left a comment

Some thoughts and questions below.

| [Elastic Agent] Excessive memory usage | Checks if {{agent}} or any of its processes have high memory usage or memory usage that is trending up. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` more than 50%<br>- Default: Enabled |
| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host. Some restarts can have a business impact, and getting alerts for them can enable timely mitigation.<br>- Condition: Alert on 11 or more restarts in a 5-minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks the fill percentage of the pipeline queue. Rows are distinguished by agent ID and component ID. <br> - Condition: Alert on max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeding 90% <br>- Default: Enabled |
| [Elastic Agent] Output errors | Checks errors per minute from an agent component. Rows are distinguished by agent ID and component ID. <br> - Condition: Alert on 6 or more errors per minute <br>- Default: Enabled |
@karenzone karenzone (Contributor, Author) Nov 11, 2025

Originally was "greater than 5." I changed it to "6 or more."
Please keep me honest here, @MichelLosier

That's still correct. The rule is defined with > 5, but if "6 or more" is a clearer read that's totally fine.
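
To pin down the "> 5" versus "6 or more" wording discussed here: a minimal sketch of the per-minute error count per (agent ID, component ID) row. The event shape and helper name are assumptions made for illustration; it only shows that the two phrasings describe the same integer threshold.

```python
# Illustrative sketch only: within one minute of data, flag
# (agent ID, component ID) rows with more than 5 output errors (i.e. 6 or more).
from collections import Counter

ERRORS_PER_MINUTE_LIMIT = 5  # rule condition: count > 5

def rows_with_output_errors(error_events):
    """error_events: iterable of (agent_id, component_id) tuples for one minute."""
    counts = Counter(error_events)
    return [row for row, n in counts.items() if n > ERRORS_PER_MINUTE_LIMIT]

# Example: agent-1 logs 6 output errors in the minute, agent-2 logs 3.
minute_of_errors = (
    [("agent-1", "elasticsearch-output")] * 6
    + [("agent-2", "elasticsearch-output")] * 3
)
print(rows_with_output_errors(minute_of_errors))  # [('agent-1', 'elasticsearch-output')]
```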

@karenzone karenzone merged commit a15fff0 into elastic:main Nov 12, 2025
7 checks passed
@karenzone karenzone deleted the 2760-alert-assets-v2 branch November 12, 2025 20:01

Development

Successfully merging this pull request may close these issues.

[Fleet] Doc: Expand Elastic Agent OOB rules content with rules and descriptions
