[Observability] Alarm on clustermgtd not running #7209
Conversation
| "Cpu": self._cw_metric_head_node("AWS/EC2", "CPUUtilization"), | ||
| "Mem": self._cw_metric_head_node("CWAgent", "mem_used_percent"), | ||
| "Disk": self._cw_metric_head_node("CWAgent", "disk_used_percent", extra_dimensions={"path": "/"}), | ||
| "Health": { |
NOTE FOR THE REVIEWER
The refactoring of how alarms are defined here was required because the clustermgtd heartbeat introduces an alarm with different needs than the other alarms (threshold, missing-data handling, observation period), so the logic had to be changed to accommodate more flexibility.
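The flexible structure described above can be sketched as per-alarm overrides merged over shared defaults. This is an illustrative sketch, not the actual ParallelCluster code: all names, thresholds, and the `resolve_alarm_config` helper are assumptions.

```python
# Illustrative sketch only: per-alarm overrides merged over shared defaults.
# Names and values are assumptions, not the actual ParallelCluster code.
ALARM_DEFAULTS = {
    "threshold": 90,
    "comparison_operator": "GREATER_THAN_THRESHOLD",
    "datapoints_to_alarm": 3,
    "evaluation_periods": 3,
    "treat_missing_data": "missing",
}

ALARM_CONFIGS = {
    "Cpu": {"metric": ("AWS/EC2", "CPUUtilization")},
    "Mem": {"metric": ("CWAgent", "mem_used_percent")},
    # The heartbeat alarm needs different settings: it must go red when
    # the metric is *missing*, over a longer observation window.
    "ClustermgtdHeartbeat": {
        "metric": ("ParallelCluster", "ClustermgtdHeartbeat"),
        "threshold": 0,
        "comparison_operator": "LESS_THAN_THRESHOLD",
        "datapoints_to_alarm": 10,
        "evaluation_periods": 10,
        "treat_missing_data": "breaching",
    },
}

def resolve_alarm_config(name: str) -> dict:
    """Return the full settings for one alarm: shared defaults plus overrides."""
    return {**ALARM_DEFAULTS, **ALARM_CONFIGS[name]}
```

With this shape, alarms that only need the defaults stay one-line entries, while the heartbeat alarm can override exactly the fields where it differs.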
datapoints_to_alarm=alarm_details["datapoints_to_alarm"],
treat_missing_data=alarm_details["treat_missing_data"],
)
alarm.node.add_dependency(self.wait_condition)
NOTES FOR THE REVIEWER
We introduced here a dependency of the head node alarms on the head node wait condition. This is needed because the clustermgtd alarm is meant to go red on missing heartbeats; however, during cluster creation no heartbeat is expected yet, so it makes no sense to start alarming before the head node configuration has completed.
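The effect of that dependency is purely an ordering constraint: CloudFormation creates the alarms only after the wait condition has been signaled. A toy sketch of the ordering (hypothetical `Resource` class and helper, not the CDK API):

```python
# Toy sketch of the dependency ordering; the Resource class and
# creation_order helper are hypothetical, not the CDK API.
class Resource:
    def __init__(self, name):
        self.name = name
        self.dependencies = []

    def add_dependency(self, other):
        self.dependencies.append(other)

def creation_order(resources):
    """Topologically sort resources so that dependencies are created first."""
    ordered, seen = [], set()

    def visit(resource):
        if resource.name in seen:
            return
        seen.add(resource.name)
        for dep in resource.dependencies:
            visit(dep)
        ordered.append(resource.name)

    for resource in resources:
        visit(resource)
    return ordered

wait_condition = Resource("HeadNodeWaitCondition")
alarms = [Resource(f"{n}Alarm") for n in ("Cpu", "Mem", "ClustermgtdHeartbeat")]
for alarm in alarms:
    # Mirrors alarm.node.add_dependency(self.wait_condition) in the PR.
    alarm.add_dependency(wait_condition)
```

Because every alarm depends on the wait condition, none of them can exist (and therefore none can fire) before head node setup has completed.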
I understand the alarm dependency for the new ClustermgtdHeartbeat alarm, but why add the dependency to all the alarms?
This is a good point.
My reasoning is:
- value: there is no reason to alarm on the head node until the head node completes its setup.
- simplicity: it is easier to define a dependency that applies to all the head node alarms than to have different dependency rules.
So, unless there is real value in creating head node alarms before the head node completes its setup, there is no reason to have different behaviors.
Do you agree with that?
| "comparison_operator", cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD | ||
| ), | ||
| datapoints_to_alarm=alarm_config.get("datapoints_to_alarm", CW_ALARM_DATAPOINTS_TO_ALARM_DEFAULT), | ||
| treat_missing_data=alarm_config.get("treat_missing_data", cloudwatch.TreatMissingData.MISSING), |
NOTES FOR THE REVIEWER
Before this change we used to leave treat_missing_data unspecified. The default value when unspecified is MISSING; I set it explicitly here to make the expected behavior clearer.
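The behavioral difference between the two settings can be illustrated with a tiny evaluator. This is a deliberate simplification of CloudWatch semantics (single-datapoint evaluation, hypothetical function), not its actual implementation:

```python
def evaluate(datapoints, breaching_when, treat_missing_data="missing"):
    """Simplified, illustrative alarm evaluation (not real CloudWatch logic).

    datapoints: list of values, with None marking a missing period.
    breaching_when: predicate deciding whether a single value breaches.
    With "missing", a None datapoint is ignored and the previous state is
    kept; with "breaching", a None datapoint counts as a breach.
    """
    state = "OK"
    for value in datapoints:
        if value is None:
            if treat_missing_data == "breaching":
                state = "ALARM"
            # "missing": keep the previous state
        else:
            state = "ALARM" if breaching_when(value) else "OK"
    return state
```

Under MISSING (the default, now set explicitly), a gap in the data never flips the alarm; under BREACHING, a gap is what drives the heartbeat alarm red.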
Yup, I already checked the default.
Description of changes
- Added cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics.
- Added the alarm $ClusterName-HeadNode-ClustermgtdHeartbeat, which goes into alarm when the metric ClustermgtdHeartbeat is missing for more than 10 minutes, the worst-case time that clustermgtd is expected to be stopped during a cluster update.
- Extended test_monitoring to verify that clustermgtd metrics are collected.
- All the changes are applied only when Slurm is used (not AWS Batch) because at this point we are only pushing metrics from clustermgtd, which is deployed only with the Slurm scheduler.
This PR is coupled with aws/aws-parallelcluster-node#685. In particular:
- It requires the metric ClustermgtdHeartbeat to be emitted by clustermgtd.
UX
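The 10-minute staleness rule in the description above can be sketched as a freshness check (an illustrative helper, not the actual alarm implementation, which is driven by CloudWatch missing-data evaluation):

```python
from datetime import datetime, timedelta, timezone

# Worst-case time clustermgtd is expected to be stopped during a cluster update.
HEARTBEAT_STALE_AFTER = timedelta(minutes=10)

def heartbeat_state(last_heartbeat, now):
    """Return the expected alarm state for a ClustermgtdHeartbeat timestamp.

    Illustrative sketch: the real alarm is implemented in CloudWatch via
    treat-missing-data, not by comparing timestamps in code.
    """
    if last_heartbeat is None or now - last_heartbeat > HEARTBEAT_STALE_AFTER:
        return "ALARM"
    return "OK"
```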
Q&A
Q: Did you extend test_monitoring.py to cover the new alarm details and dashboard?
A: I extended the integ test to verify that the metric is collected. However, there is no point in extending the assertions around the alarm and dashboard details because they are already covered by unit tests. On the contrary, we should remove the assertions around alarm settings and dashboard settings from the integ test (will do in a follow-up PR).

A: No. The name is $clusterName-HeadNode-Clustermgtd. The max length for an alarm name is 255 chars, the max length for a cluster name is 60 chars, and the suffix -HeadNode-Clustermgtd is 21 chars, so we are far away from the limit.

A: In this PR we are introducing the alarm for clustermgtd, which runs only when the scheduler is Slurm (not AWS Batch).
A: I agree that the alarm should recover faster; ideally even 3 good datapoints would be enough to recover. However, there is a technical blocker: CloudWatch does not support asymmetric thresholds within a single alarm. If you need 10 missing datapoints to go red, you need 10 good datapoints to recover. To implement an asymmetric threshold we would need to create a composite alarm made of one alarm driving the red state and another driving the recovery, which would overcomplicate the solution and the user experience.
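The composite-alarm alternative the author decided against can be pictured as a small latch between two simple alarms. This is a toy sketch of the idea only (hypothetical function, not CloudWatch's composite-alarm evaluation):

```python
# Toy sketch (not CloudWatch's actual composite-alarm logic) of why
# asymmetric behavior needs two alarms: one drives entry into ALARM,
# the other drives recovery, and a latch combines them.
def composite_state(prev, entry_breaching, recovery_ok):
    """prev: current composite state ("OK" or "ALARM").
    entry_breaching: True when the 'go red' alarm fires
    (e.g. 10 missing datapoints).
    recovery_ok: True when the 'recover' alarm is OK
    (e.g. 3 good datapoints observed)."""
    if prev == "OK":
        return "ALARM" if entry_breaching else "OK"
    return "OK" if recovery_ok else "ALARM"
```

Even in this toy form, the extra alarm and latching rule illustrate the added moving parts that motivated keeping a single symmetric alarm instead.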
Tests
- test_monitoring

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.