-
Notifications
You must be signed in to change notification settings - Fork 165
Add list of agent OOB alert rules with descriptions #3608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔍 Preview links for changed docs |
71c44c1 to
e6daf96
Compare
e6daf96 to
326a28c
Compare
| | -------- | -------- | | ||
| | [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled | | ||
| | [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. | | ||
| | [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
80%
- Default: Enabled (perhaps the threshold should be higher if this is on by default)
What did we decide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the threshold here is currently set to 50% - @MichelLosier is that correct?
| | [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled | | ||
| | [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. | | ||
| | [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) | | ||
| | [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alert on (not sure) > 10 times in a 5 minute window
What did we decide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is correct. currently set to greater than 10 restarts in the5 min window
|
|
||
| You can find these rules in **Stack Management** > **Alerts and Insights** > **Rules**. | ||
|
|
||
| ### Available rules [available-alert-rules] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ### Available rules [available-alert-rules] | |
| ### Available alert rules [available-alert-rules] |
|
@nimarezainia @MichelLosier I've added the alert rules and descriptions, and asked some questions about final decisions inline. Please take a look and let me know what you think. If there's a later "source of truth" for available alerts and descriptions, please let me know and I'll update the PR accordingly. We still have some questions to answer (see inline comments," but I'm marking this as "Ready for review." |
|
@MichelLosier I don't see the "Agent Unhealthy" rule in staging? are we shipping the agent status changes as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very familiar with the subject matter but left some minor editorial suggestions for your consideration!
| | Alert | Description | | ||
| | -------- | -------- | | ||
| | [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled | | ||
| | [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| | [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. | | |
| | [Elastic Agent] Dropped events | Checks if the percentage of events dropped to acked events from the pipeline is greater than or equal to 5%. Rows are distinguished by agent ID and component ID. | |
IDK what "events dropped to acked events from the pipeline are" but if we're talking about "the percentage", we want an "is" not an "are" :)
| | [Elastic Agent] CPU usage spike| Checks if {{agent}} or any of its processes were pegged at a high CPU for a specified window of time. This could signal a bug in an application and warrant further investigation.<br> - Condition: `system.process.cpu.total.time.ms` > 80% for 5 minutes<br>- Default: Enabled | | ||
| | [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. | | ||
| | [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) | | ||
| | [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| | [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled | | |
| | [Elastic Agent] Excessive restarts| Checks for excessive restarts on a host which require further investigation. Some restarts can have business impacts, and getting alerts for them can enable timely mitigation efforts.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled | |
| | [Elastic Agent] Dropped events | Checks if percentage of events dropped to acked events from the pipeline are greater than or equal to 5%. Rows are distinct by agent id and component id. | | ||
| | [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) | | ||
| | [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled | | ||
| | [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| | [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. | | |
| | [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinguished by agent ID and component ID. | |
Should we specify the exact names of the agent ID and component ID fields? e.g. agent_ID (not sure that's accurate just an example)
| | [Elastic Agent] Excessive memory usage| Checks if {{agent}} or any of its processes have a high memory usage or memory usage that is trending higher. This could signal a memory leak in an application and warrant further investigation.<br>- Condition: Alert on `system.process.memory.rss.pct` > 80%<br>- Default: Enabled (perhaps the threshold should be higher if this is on by default) | | ||
| | [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled | | ||
| | [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. | | ||
| | [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinct by agent id and component id. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| | [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinct by agent id and component id. | | |
| | [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinguished by agent ID and component ID. | |
| | [Elastic Agent] Excessive restarts| Checks if excessive restarts on a host which require further investigation. Some of these restarts could have a business impact and getting an alert for them would allow us to act quickly to mitigate.<br>- Condition: Alert on (not sure) > 10 times in a 5 minute window<br>- Default: Enabled | | ||
| | [Elastic Agent] High pipeline queue | Checks if max of `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id. | | ||
| | [Elastic Agent] Output errors | Checks if the errors per minute from an agent component is greater than 5. Rows are distinct by agent id and component id. | | ||
| | [Elastic Agent] Unhealthy status | Checks for log occurrence of an agent status change to `error` using the new elastic_agent.status_change datastreams. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| | [Elastic Agent] Unhealthy status | Checks for log occurrence of an agent status change to `error` using the new elastic_agent.status_change datastreams. | | |
| | [Elastic Agent] Unhealthy status | Checks logs for an agent status change to `error` using the new `elastic_agent.status_change` datastreams. | |
Related: #2760
Follow-up to: #3537