[Observability] Alarm on clustermgtd not running #7209

Merged: gmarciani merged 4 commits into aws:develop from gmarciani:wip/mgiacomo/3150/clustermgtd-metrics-0123-1 on Jan 28, 2026
@gmarciani (Contributor) commented Jan 26, 2026

Description of changes

  1. Add the permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics.
  2. Add the alarm $ClusterName-HeadNode-ClustermgtdHeartbeat, which goes into alarm when the metric ClustermgtdHeartbeat has been missing for more than 10 minutes. That is the worst-case time that clustermgtd is expected to be stopped during a cluster update.
  3. Add the alarm above to the Head Node composite alarm.
  4. Surface the alarm above in the cluster dashboard, within the existing section "Head Node Alarms".
  5. Surface the clustermgtd heartbeat in the cluster dashboard widget "Daemons Heartbeats".
  6. Extend the integ test test_monitoring to verify that clustermgtd metrics are collected.

All the changes are applied only when Slurm is used (not AWS Batch) because at this point we are only pushing metrics from clustermgtd, which is deployed only with Slurm scheduler.
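For readers unfamiliar with the permission in item 1, here is a minimal sketch of what such a policy statement could look like. The function name, the Sid, and the optional namespace restriction are assumptions for illustration; the actual statement added by this PR may differ. cloudwatch:PutMetricData does not support resource-level permissions, which is why the resource is a wildcard.

```python
import json


def put_metric_data_statement(namespace=None):
    """Build an IAM statement allowing cloudwatch:PutMetricData.

    The Sid and the optional namespace condition are illustrative assumptions,
    not necessarily what the head node policy in this PR contains.
    """
    statement = {
        "Sid": "CloudWatchPutMetricData",  # hypothetical Sid
        "Effect": "Allow",
        "Action": "cloudwatch:PutMetricData",
        "Resource": "*",  # PutMetricData has no resource-level permissions
    }
    if namespace is not None:
        # Optionally restrict publishing to a single metric namespace.
        statement["Condition"] = {"StringEquals": {"cloudwatch:namespace": namespace}}
    return statement


if __name__ == "__main__":
    # "ParallelCluster" is an assumed namespace, used only for the example.
    print(json.dumps(put_metric_data_statement("ParallelCluster"), indent=2))
```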

This PR is coupled with aws/aws-parallelcluster-node#685. In particular:

  1. the current PR adds the permission required by clustermgtd to emit the metric.
  2. the current PR adds an alarm that requires the metric ClustermgtdHeartbeat to be emitted by clustermgtd
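To make the coupling concrete, here is a rough sketch of how a daemon could emit a heartbeat datapoint through the CloudWatch PutMetricData API. This is not the code from the node PR: the namespace, the dimension name, the function names, and the value of 1 per heartbeat are all assumptions. Only the metric name ClustermgtdHeartbeat comes from this PR.

```python
from datetime import datetime, timezone


def build_heartbeat_datum(cluster_name, now=None):
    """Build one MetricData entry for a heartbeat.

    The "ClusterName" dimension and the value/unit are assumptions made for
    this sketch; only the metric name is taken from the PR description.
    """
    timestamp = now or datetime.now(timezone.utc)
    return {
        "MetricName": "ClustermgtdHeartbeat",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Timestamp": timestamp,
        "Value": 1,
        "Unit": "Count",
    }


def emit_heartbeat(cloudwatch_client, cluster_name, namespace="ParallelCluster"):
    """Publish one heartbeat datapoint.

    cloudwatch_client is any object exposing put_metric_data(Namespace=..., MetricData=[...]),
    for example a boto3 CloudWatch client. The default namespace is hypothetical.
    """
    cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=[build_heartbeat_datum(cluster_name)],
    )
```

With boto3 installed, the client would typically come from boto3.client("cloudwatch"), and the call only succeeds if the instance role carries the cloudwatch:PutMetricData permission added by this PR.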

UX

[Two screenshots, taken 2026-01-27 at 1:15:11 PM and 1:15:34 PM]

Q&A

  1. Why did you not extend the integ test test_monitoring.py to cover the new alarm details and dashboard?
    I extended the integ test to verify that the metric is collected. However, there is no point in extending the assertions around the alarm and dashboard details because they are already covered by unit tests. On the contrary, we should remove the assertions around alarm settings and dashboard settings from the integ test (will do in a follow-up PR).
  2. Why not use a metric filter on clustermgtd logs to signal that it is running, rather than alarming on an explicit metric posted by clustermgtd?
    • Fewer points of failure: with an explicit put_metric we rely only on CW Metrics, whereas with metric filters we would also rely on CW Logs.
    • Robustness: metric filters rely on log formatting, which is ephemeral by definition.
    • Cluster Config: in pcluster you can toggle logs and alarms independently. If we went with metric filters we would couple the two concepts, and a user who disabled logs would end up disabling alarms as well.
  3. Isn't the alarm name too long?
    No. The name is $clusterName-HeadNode-Clustermgtd.
    The max length for an alarm name is 255 chars.
    The max length for a cluster name is 60 chars.
    The suffix -HeadNode-Clustermgtd is 21 chars, so we are far away from the limit.
  4. Why only the Slurm scheduler?
    In this PR we are introducing the alarm for clustermgtd, which runs only when the scheduler is Slurm (not AWS Batch).
  5. With this configuration the alarm goes red after 10 minutes of clustermgtd not running, which is ok. However, it also requires 10 minutes of clustermgtd running to get back to green. Why not get back to green at the first positive datapoint?
    I agree that the alarm should recover faster. Ideally even 3 good datapoints would be enough to recover. However, there is a technical blocker: CW does not support asymmetric thresholds within a single alarm. If you need 10 missing datapoints to go red, you need 10 good datapoints to recover. To implement asymmetric thresholds we would need to create a composite alarm made of two alarms, one driving the red state and the other driving the recovery. This would overcomplicate the solution and the user experience.
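The arithmetic behind answer 3 can be checked with a short sketch. The limits (255 and 60) are the ones quoted in the answer. The helper name is made up for illustration. Since the description spells the alarm name with a ClustermgtdHeartbeat suffix while the answer uses Clustermgtd, the sketch checks both spellings; either way the worst case stays well under the limit.

```python
MAX_ALARM_NAME_LENGTH = 255   # limit quoted in the answer above
MAX_CLUSTER_NAME_LENGTH = 60  # limit quoted in the answer above


def worst_case_alarm_name_length(suffix):
    """Length of the alarm name when the cluster name is as long as allowed."""
    return MAX_CLUSTER_NAME_LENGTH + len(suffix)


if __name__ == "__main__":
    for suffix in ("-HeadNode-Clustermgtd", "-HeadNode-ClustermgtdHeartbeat"):
        length = worst_case_alarm_name_length(suffix)
        headroom = MAX_ALARM_NAME_LENGTH - length
        print(f"{suffix}: suffix={len(suffix)} worst_case={length} headroom={headroom}")
```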

Tests

  • Unit tests (updated to cover the changes in this PR)
  • Manual validation: created a cluster and verified that clustermgtd is able to emit the metric and that the cluster dashboard shows both the alarm and the metric as expected. Verified that, after stopping clustermgtd, the alarm goes red after the expected 10 minutes.
  • [SUCCESS] Integration test test_monitoring

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

"Cpu": self._cw_metric_head_node("AWS/EC2", "CPUUtilization"),
"Mem": self._cw_metric_head_node("CWAgent", "mem_used_percent"),
"Disk": self._cw_metric_head_node("CWAgent", "disk_used_percent", extra_dimensions={"path": "/"}),
"Health": {
Contributor Author

NOTE FOR THE REVIEWER
The refactoring of how alarms are defined here was required because the clustermgtd heartbeat introduces an alarm that has different needs than the other alarms (threshold, missing data, observation period), so the logic had to change to allow more flexibility.

datapoints_to_alarm=alarm_details["datapoints_to_alarm"],
treat_missing_data=alarm_details["treat_missing_data"],
)
alarm.node.add_dependency(self.wait_condition)
Contributor Author

NOTES FOR THE REVIEWER
Here we introduced a dependency of the head node alarms on the head node wait condition. This is needed because the clustermgtd alarm is meant to go red on missing heartbeats. However, during cluster creation the heartbeat is expected to be absent, so it makes no sense to start alarming before the head node configuration has completed.

Contributor

I understand the alarm dependency for the new ClustermgtdHeartbeat alarm, but why add the dependency for all the alarms?

Contributor Author

This is a good point.

My reasoning is:

  1. value: there is no reason to alarm on the head node until the head node completes its setup.
  2. simplicity: it is easier to define a dependency that applies to all the head node alarms than having different dependency rules.

So, unless there is real value in creating the head node alarms before the head node completes its setup, there is no reason to have different behaviors.

Do you agree with that?

Contributor

Yes agreed!

@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from 33bf487 to c393063 on January 27, 2026 17:36
@gmarciani changed the title from "[Observability] Add permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics." to "[Observability] Add permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics" on Jan 27, 2026
@gmarciani changed the title from "[Observability] Add permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics" to "[Observability] Alarm on clustermgtd not running" on Jan 27, 2026
@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch 2 times, most recently from c9401d5 to c9a775f, on January 27, 2026 18:18
… node policy so that clustermgtd is able to emit metrics.
@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from c9a775f to 2e44ac9 on January 27, 2026 20:04
@gmarciani enabled auto-merge (rebase) on January 27, 2026 20:05
@gmarciani disabled auto-merge on January 27, 2026 20:05
@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from 2e44ac9 to dd3fdc9 on January 27, 2026 21:24
@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from dd3fdc9 to ec6ad08 on January 27, 2026 21:45
@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from ec6ad08 to 91d71d1 on January 27, 2026 21:59
@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from cf560ed to 0891999 on January 27, 2026 22:50
"comparison_operator", cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD
),
datapoints_to_alarm=alarm_config.get("datapoints_to_alarm", CW_ALARM_DATAPOINTS_TO_ALARM_DEFAULT),
treat_missing_data=alarm_config.get("treat_missing_data", cloudwatch.TreatMissingData.MISSING),
Contributor Author

NOTES FOR THE REVIEWER
Before this change we left treat_missing_data unspecified.
The default value when unspecified is MISSING. I set it explicitly here to make the expected behavior clearer.
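The pattern in the diff above (per-alarm settings falling back to explicit defaults) can be sketched in plain Python as follows. Strings stand in for the CDK enums so the example runs without aws_cdk. The value of CW_ALARM_DATAPOINTS_TO_ALARM_DEFAULT, the helper name, and the heartbeat overrides are assumptions for illustration, not the values used in this PR.

```python
# Assumed value; the real constant is defined elsewhere in the code base.
CW_ALARM_DATAPOINTS_TO_ALARM_DEFAULT = 1

# Explicit defaults, mirroring the .get(...) fallbacks in the diff.
# Strings stand in for cloudwatch.ComparisonOperator / cloudwatch.TreatMissingData.
ALARM_DEFAULTS = {
    "comparison_operator": "GREATER_THAN_THRESHOLD",
    "datapoints_to_alarm": CW_ALARM_DATAPOINTS_TO_ALARM_DEFAULT,
    "treat_missing_data": "MISSING",
}


def resolve_alarm_settings(alarm_config):
    """Return the effective settings: per-alarm values override the defaults."""
    return {**ALARM_DEFAULTS, **alarm_config}


# An alarm with no overrides gets every default, including MISSING.
cpu_settings = resolve_alarm_settings({})

# Hypothetical overrides for a heartbeat-style alarm that must fire on absent data.
heartbeat_settings = resolve_alarm_settings(
    {
        "comparison_operator": "LESS_THAN_THRESHOLD",
        "datapoints_to_alarm": 10,
        "treat_missing_data": "BREACHING",
    }
)
```

Spelling out the MISSING default, rather than leaving it implicit, keeps the behavior of the existing alarms unchanged while making the contrast with an alarm that overrides it easy to see.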

Contributor

Yup, I already checked the default.

@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch 2 times, most recently from 39a655e to b181ccf, on January 28, 2026 17:11
himani2411 previously approved these changes Jan 28, 2026
himani2411 previously approved these changes Jan 28, 2026
@gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from 46dde5f to d3884ea on January 28, 2026 19:11
@gmarciani merged commit ea1061f into aws:develop on Jan 28, 2026
24 checks passed
@gmarciani deleted the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch on January 28, 2026 19:22