Add list of agent OOB alert rules with descriptions #3608
Conversation
🔍 Preview links for changed docs
reference/fleet/alert-templates.md
Outdated
| -------- | -------- |
| [Elastic Agent] CPU usage spike | Checks if {{agent}} or any of its processes were pegged at high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br>- Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled |
| [Elastic Agent] Dropped events | Checks if the percentage of dropped events relative to acked events from the pipeline is greater than or equal to 5%. Rows are distinct by agent ID and component ID. |
| [Elastic Agent] Excessive memory usage | Checks if {{agent}} or any of its processes have high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) |
80%
- Default: Enabled (perhaps the threshold should be higher if this is on by default)
What did we decide?
I believe the threshold here is currently set to 50% - @MichelLosier is that correct?
Correct, 50% is the new threshold.
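As a purely illustrative aside (not part of this PR): a minimal Python sketch of the kind of per-process threshold check the memory rule describes, using the 50% threshold agreed above. The sample document shape and field names are assumptions; the real rule evaluates agent metrics data, not an in-memory list.

```python
# Hypothetical metric samples; in the real rule these come from the agent
# metrics data stream, not from an in-memory list.
samples = [
    {"agent.id": "a1", "process": "filebeat", "system.process.memory.rss.pct": 0.62},
    {"agent.id": "a2", "process": "metricbeat", "system.process.memory.rss.pct": 0.31},
]

MEMORY_THRESHOLD = 0.50  # 50%, the threshold confirmed in the discussion above


def memory_alerts(docs, threshold=MEMORY_THRESHOLD):
    """Return the samples whose RSS memory percentage exceeds the threshold."""
    return [d for d in docs if d["system.process.memory.rss.pct"] > threshold]


for hit in memory_alerts(samples):
    print(f'{hit["agent.id"]}/{hit["process"]}: rss at {hit["system.process.memory.rss.pct"]:.0%}')
```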
reference/fleet/alert-templates.md
Outdated
| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host, which require further investigation. Some of these restarts could have a business impact, and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled |
Alert on (not sure) > 10 times in a 5 minute window
What did we decide?
This is correct. Currently set to greater than 10 restarts in the 5-minute window.
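Again for illustration only (not from the PR): a small Python sketch of the "more than 10 restarts in a 5-minute window" condition. The restart-event shape is an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical restart events as (agent id, restart timestamp) pairs.
restart_events = [
    ("agent-1", datetime(2025, 1, 1, 12, 0, 5)),
    ("agent-1", datetime(2025, 1, 1, 12, 1, 10)),
    ("agent-2", datetime(2025, 1, 1, 11, 50, 0)),
]

WINDOW = timedelta(minutes=5)
MAX_RESTARTS = 10  # alert when the count in the window is greater than 10


def excessive_restarts(events, now, window=WINDOW, max_restarts=MAX_RESTARTS):
    """Return agent IDs with more than `max_restarts` restarts inside the window."""
    counts = {}
    for agent_id, ts in events:
        if now - ts <= window:
            counts[agent_id] = counts.get(agent_id, 0) + 1
    return [agent for agent, n in counts.items() if n > max_restarts]


print(excessive_restarts(restart_events, now=datetime(2025, 1, 1, 12, 5, 0)))
```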
@nimarezainia @MichelLosier I've added the alert rules and descriptions, and asked some questions about final decisions inline. Please take a look and let me know what you think. If there's a later "source of truth" for available alerts and descriptions, please let me know and I'll update the PR accordingly. We still have some questions to answer (see inline comments), but I'm marking this as "Ready for review."
@MichelLosier I don't see the "Agent Unhealthy" rule in staging. Are we shipping the agent status changes as well?
I'm not very familiar with the subject matter but left some minor editorial suggestions for your consideration!
Co-authored-by: Benjamin Ironside Goldstein <[email protected]>
@nimarezainia @MichelLosier, let's wrap this one up and get it merged for our users. Please review this content with an eye for what's available to users now. We can update docs as new rules become available. Just open a docs issue!
@MichelLosier could you please review these changes? We can always change the thresholds as we learn more. (I left a few comments for you above.) Thanks.
| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host which require further investigation. Some restarts can have a business impact, and getting alerts for them can enable timely mitigation.<br>- Condition: Alert on more than 10 restarts in a 5-minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks if the max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinguished by agent ID and component ID. |
| [Elastic Agent] Output errors | Checks if errors per minute from an agent component exceed 5. Rows are distinguished by agent ID and component ID. |
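For illustration (not part of the PR): a Python sketch of the high pipeline queue condition, taking the max of `beat.stats.libbeat.pipeline.queue.filled.pct` per agent ID and component ID and alerting above 90%. The sample shape is an assumption.

```python
from collections import defaultdict

# Hypothetical monitoring samples keyed by agent ID and component ID.
samples = [
    {"agent.id": "a1", "component.id": "filestream", "beat.stats.libbeat.pipeline.queue.filled.pct": 0.95},
    {"agent.id": "a1", "component.id": "filestream", "beat.stats.libbeat.pipeline.queue.filled.pct": 0.40},
    {"agent.id": "a2", "component.id": "system/metrics", "beat.stats.libbeat.pipeline.queue.filled.pct": 0.20},
]

QUEUE_THRESHOLD = 0.90  # 90%


def high_pipeline_queue(docs, threshold=QUEUE_THRESHOLD):
    """Return (agent ID, component ID) pairs whose max queue fill exceeds the threshold."""
    max_pct = defaultdict(float)
    for d in docs:
        key = (d["agent.id"], d["component.id"])
        max_pct[key] = max(max_pct[key], d["beat.stats.libbeat.pipeline.queue.filled.pct"])
    return [key for key, pct in max_pct.items() if pct > threshold]


print(high_pipeline_queue(samples))
```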
We may want to add the [Elastic Agent] Unhealthy status alerting rule here. We can describe it as "Checks if an agent has transitioned to an 'unhealthy' status, which can indicate errors or degraded functionality of the agent."
I checked staging and didn't see it, so I removed it.
Thanks for clarifying. I'll add it back.
MichelLosier
left a comment
Looks great! Just the adjustment for the percentage threshold of the high memory rule, and adding a description of the Unhealthy status rule.
@MichelLosier, please give it a look.
MichelLosier
left a comment
Looks great!
I've re-requested a review from
colleenmcginnis
left a comment
Some thoughts and questions below.
| [Elastic Agent] Excessive memory usage | Checks if {{agent}} or any of its processes have high memory usage or memory usage that is trending up. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` more than 50%<br>- Default: Enabled |
| [Elastic Agent] Excessive restarts | Checks for excessive restarts on a host. Some restarts can have a business impact, and getting alerts for them can enable timely mitigation.<br>- Condition: Alert on 11 or more restarts in a 5-minute window<br>- Default: Enabled |
| [Elastic Agent] High pipeline queue | Checks the fill percentage of the pipeline queue. Rows are distinguished by agent ID and component ID.<br>- Condition: Alert on the max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeding 90%<br>- Default: Enabled |
| [Elastic Agent] Output errors | Checks errors per minute from an agent component. Rows are distinguished by agent ID and component ID.<br>- Condition: Alert on 6 or more errors per minute<br>- Default: Enabled |
Originally was "greater than 5." I changed it to "6 or more."
Please keep me honest here, @MichelLosier
That's still correct. The rule is defined with > 5, but if "6 or more" is a clearer read, that's totally fine.
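To make that threshold wording concrete, an illustrative Python sketch (not from the PR) of the output-errors condition: count errors per agent ID and component ID over the last minute and alert when the count is greater than 5, that is, 6 or more. The event shape is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical error events as (agent ID, component ID, timestamp) tuples.
error_events = [
    ("agent-1", "elasticsearch-output", datetime(2025, 1, 1, 12, 0, 10)),
    ("agent-1", "elasticsearch-output", datetime(2025, 1, 1, 12, 0, 40)),
]

ERROR_LIMIT = 5  # alert when the per-minute count is greater than 5 (6 or more)


def output_error_alerts(events, now, limit=ERROR_LIMIT):
    """Return (agent ID, component ID) pairs with more than `limit` errors in the last minute."""
    window_start = now - timedelta(minutes=1)
    counts = Counter(
        (agent, component) for agent, component, ts in events if ts >= window_start
    )
    return [key for key, n in counts.items() if n > limit]


print(output_error_alerts(error_events, now=datetime(2025, 1, 1, 12, 1, 0)))
```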
Published content: https://www.elastic.co/docs/reference/fleet/alert-templates#ea-alert-rules
Related: #2760
Follow-up to: #3537
Fixes: #3815