[Observability] Alarm on clustermgtd not running by gmarciani · Pull Request #7209 · aws/aws-parallelcluster

gmarciani · 2026-01-26T23:00:06Z

Description of changes

Add the permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics.
Add the alarm $ClusterName-HeadNode-ClustermgtdHeartbeat, which goes into alarm when the metric ClustermgtdHeartbeat is missing for more than 10 minutes. This is the worst-case time that clustermgtd is expected to be stopped during a cluster update.
Add the alarm above to the Head Node composite alarm.
Surface the alarm above in the cluster dashboard, within the existing section "Head Node Alarms".
Surface the clustermgtd heartbeat in the cluster dashboard widget "Daemons Heartbeats".
Extend the integ test test_monitoring to verify that clustermgtd metrics are collected.

All the changes are applied only when Slurm is used (not AWS Batch) because at this point we are only pushing metrics from clustermgtd, which is deployed only with Slurm scheduler.
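
As a rough illustration of the alarm described above, the sketch below builds alarm properties as a plain dictionary, using the property names of the CloudFormation resource `AWS::CloudWatch::Alarm`. The function `heartbeat_alarm_properties` is hypothetical, and the namespace, statistic, period, threshold, comparison operator, and the choice to treat missing data as breaching are assumptions made for the example, not values taken from this PR.

```python
from typing import Optional


def heartbeat_alarm_properties(cluster_name: str, scheduler: str) -> Optional[dict]:
    """Return illustrative alarm properties, or None when the scheduler is not Slurm."""
    # clustermgtd is deployed only with Slurm, so no alarm is defined otherwise.
    if scheduler != "slurm":
        return None
    return {
        "AlarmName": f"{cluster_name}-HeadNode-ClustermgtdHeartbeat",
        "Namespace": "ExampleNamespace",  # assumption: the real namespace is not shown here
        "MetricName": "ClustermgtdHeartbeat",
        "Statistic": "Sum",  # assumption
        "Period": 60,  # assumption: one datapoint per minute
        "EvaluationPeriods": 10,  # 10 periods of 60 seconds cover the 10-minute window
        "DatapointsToAlarm": 10,
        "Threshold": 1,  # assumption
        "ComparisonOperator": "LessThanThreshold",  # assumption
        "TreatMissingData": "breaching",  # assumption: a missing heartbeat counts as a breach
    }
```

With these assumed values, `Period * EvaluationPeriods` gives the 600 seconds mentioned in the description, and a non-Slurm scheduler such as AWS Batch yields no alarm at all.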

This PR is coupled with aws/aws-parallelcluster-node#685. In particular:

The current PR adds the permission required by clustermgtd to emit the metric.
The current PR adds an alarm that requires the metric ClustermgtdHeartbeat to be emitted by clustermgtd.
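
To make the coupling between the two PRs more concrete, here is a minimal sketch of the two pieces involved: an IAM policy statement allowing `cloudwatch:PutMetricData`, and the request parameters a daemon could pass to that API. Both helper functions are hypothetical, and the namespace, the `ClusterName` dimension, the metric value, and the unit are assumptions; the actual emission code lives in aws/aws-parallelcluster-node#685 and may differ.

```python
def put_metric_data_statement() -> dict:
    """Illustrative IAM statement granting permission to publish metrics."""
    return {
        "Effect": "Allow",
        "Action": "cloudwatch:PutMetricData",
        "Resource": "*",  # assumption: not scoped further in this sketch
    }


def heartbeat_request(cluster_name: str, namespace: str = "ExampleNamespace") -> dict:
    """Illustrative PutMetricData parameters for a single heartbeat datapoint."""
    return {
        "Namespace": namespace,  # assumption: hypothetical namespace
        "MetricData": [
            {
                "MetricName": "ClustermgtdHeartbeat",
                # assumption: the dimension name is chosen for the example only
                "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
                "Value": 1,
                "Unit": "Count",
            }
        ],
    }
```

With boto3 available, the request could be sent as `boto3.client("cloudwatch").put_metric_data(**heartbeat_request("my-cluster"))`. Without the policy statement, that call would be rejected, and the alarm would then see only missing data.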

UX

Q&A

Why did you not extend the integ test test_monitoring.py to cover the new alarm details and dashboard?
I extended the integ test to verify that the metric is collected. However, there is no point in extending the assertions around the alarm and dashboard details because they are already covered by unit tests. On the contrary, we should remove the assertions around alarm settings and dashboard settings from the integ test (will do in a follow-up PR).
Why not use a metric filter on clustermgtd logs to signal that it is running, rather than alarming on an explicit metric posted by clustermgtd?
- Fewer points of failure: with an explicit put_metric we rely only on CW Metrics, whereas with metric filters we also rely on CW Logs.
- Robustness: metric filters rely on log formatting, which is ephemeral by definition.
- Cluster config: in pcluster you can toggle logs and alarms independently. If we went with metric filters we would couple the two concepts, and a user who disabled logs would end up disabling alarms as well.
Isn't the alarm name too long?
No. The name is $clusterName-HeadNode-Clustermgtd.
The max length for an alarm name is 255 chars.
The max length for cluster name is 60 chars.
The suffix -HeadNode-Clustermgtd is 21 chars, so we are far away from the limit.
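
The arithmetic behind this answer can be checked with a few lines of Python. The limits and the suffix are the ones quoted in the answer; the constant and function names are hypothetical.

```python
MAX_ALARM_NAME_LENGTH = 255  # limit quoted in the answer above
MAX_CLUSTER_NAME_LENGTH = 60  # limit quoted in the answer above
SUFFIX = "-HeadNode-Clustermgtd"


def worst_case_alarm_name_length(suffix: str = SUFFIX) -> int:
    """Length of the alarm name when the cluster name is as long as allowed."""
    return MAX_CLUSTER_NAME_LENGTH + len(suffix)


def headroom(suffix: str = SUFFIX) -> int:
    """Characters left before hitting the alarm name limit."""
    return MAX_ALARM_NAME_LENGTH - worst_case_alarm_name_length(suffix)
```

The same helpers can be called with a different suffix to check any other alarm name against the limit.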
Why only Slurm scheduler?
In this PR we are introducing the alarm for clustermgtd, which runs only when the scheduler is Slurm (not AWS Batch).
With this configuration the alarm goes red after 10 minutes of clustermgtd not running, which is OK. However, it also requires 10 minutes of clustermgtd running to get back to green. Why not get back to green at the first positive datapoint?
I agree that the alarm should recover faster. Ideally, even 3 good datapoints would be enough to recover. However, there is a technical blocker: CW does not support asymmetric thresholds within a single alarm. If you need 10 missing datapoints to go red, you need 10 good datapoints to recover. To implement asymmetric thresholds we would need to create a composite alarm made of two alarms, one driving the red state and the other driving the recovery. This would overcomplicate the solution and the user experience.

Tests

Unit tests (updated to cover the changes in this PR)
Manual validation: created a cluster and verified that clustermgtd is able to emit the metric and that the cluster dashboard shows both the alarm and the metrics as expected. Verified that, after stopping clustermgtd, the alarm goes red after the expected 10 minutes.
[SUCCESS] Integration test test_monitoring

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gmarciani · 2026-01-27T14:23:01Z

cli/src/pcluster/templates/cluster_stack.py

-            "Cpu": self._cw_metric_head_node("AWS/EC2", "CPUUtilization"),
-            "Mem": self._cw_metric_head_node("CWAgent", "mem_used_percent"),
-            "Disk": self._cw_metric_head_node("CWAgent", "disk_used_percent", extra_dimensions={"path": "/"}),
+            "Health": {


NOTE FOR THE REVIEWER
The refactoring of how alarms are defined here was required because the clustermgtd heartbeat introduces an alarm that has different needs than the other alarms (threshold, missing data, observation period), so the logic must be changed to allow more flexibility.
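
One way to picture this kind of flexibility is a per-alarm configuration that overrides shared defaults. The sketch below is an assumption about the general pattern, not the code in cluster_stack.py: the function `resolve_alarm_settings`, the default values, and the example overrides are hypothetical, and plain strings stand in for the CDK enums.

```python
# Shared defaults applied to every alarm unless overridden (values are assumptions).
ALARM_DEFAULTS = {
    "threshold": 0,
    "evaluation_periods": 1,
    "datapoints_to_alarm": 1,
    "comparison_operator": "GREATER_THAN_THRESHOLD",
    "treat_missing_data": "MISSING",
}


def resolve_alarm_settings(alarm_config: dict) -> dict:
    """Merge per-alarm overrides on top of the shared defaults."""
    unknown = set(alarm_config) - set(ALARM_DEFAULTS)
    if unknown:
        # Fail early on typos rather than silently ignoring a setting.
        raise KeyError(f"Unknown alarm settings: {sorted(unknown)}")
    return {**ALARM_DEFAULTS, **alarm_config}


# Hypothetical overrides for an alarm that fires on missing heartbeats.
heartbeat_settings = resolve_alarm_settings(
    {
        "threshold": 1,
        "evaluation_periods": 10,
        "datapoints_to_alarm": 10,
        "comparison_operator": "LESS_THAN_THRESHOLD",
        "treat_missing_data": "BREACHING",
    }
)
```

Alarms that pass an empty configuration keep the shared defaults, so existing alarms would not need to change when a new alarm with different needs is added.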

gmarciani · 2026-01-27T14:24:35Z

cli/src/pcluster/templates/cluster_stack.py

+                datapoints_to_alarm=alarm_details["datapoints_to_alarm"],
+                treat_missing_data=alarm_details["treat_missing_data"],
            )
+            alarm.node.add_dependency(self.wait_condition)


NOTES FOR THE REVIEWER
We introduced here a dependency of the head node alarms on the head node wait condition. This is needed because the clustermgtd alarm is meant to go red on missing heartbeats. However, during cluster creation the heartbeat is expected to be absent, so it makes no sense to start alarming before the head node configuration has completed.
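
In CDK, a dependency added with `node.add_dependency` is typically rendered as a `DependsOn` attribute in the synthesized CloudFormation template. The sketch below mimics that effect on a plain template dictionary. The function `add_dependency_to_alarms` and the logical IDs in the usage note are hypothetical and serve only to illustrate the idea.

```python
import copy


def add_dependency_to_alarms(template: dict, dependency_id: str) -> dict:
    """Return a copy of the template where every alarm depends on dependency_id."""
    result = copy.deepcopy(template)
    for resource in result.get("Resources", {}).values():
        if resource.get("Type") != "AWS::CloudWatch::Alarm":
            continue
        existing = resource.get("DependsOn", [])
        # DependsOn may be a single string or a list of logical IDs.
        if isinstance(existing, str):
            existing = [existing]
        if dependency_id not in existing:
            existing = existing + [dependency_id]
        resource["DependsOn"] = existing
    return result
```

For example, given a template with one alarm and one instance, calling `add_dependency_to_alarms(template, "HeadNodeWaitCondition")` would add the dependency to the alarm only, so the alarm is created after the wait condition completes.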

I understand the Alarm dependency for the new ClustermgtdHeartbeat Alarm but why add dependency for all the alarms?

This is a good point.

My reasoning is:

value: there is no reason to alarm on the head node until the head node completes its setup.

simplicity: it is easier to define a dependency that applies to all the head node alarms than having different dependency rules.

So, unless there is real value in creating head node alarms before the head node completes its setup, there is no reason to have different behaviors.

Do you agree with that?

Yes agreed!


gmarciani · 2026-01-28T16:06:59Z

cli/src/pcluster/templates/cluster_stack.py

+                    "comparison_operator", cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD
+                ),
+                datapoints_to_alarm=alarm_config.get("datapoints_to_alarm", CW_ALARM_DATAPOINTS_TO_ALARM_DEFAULT),
+                treat_missing_data=alarm_config.get("treat_missing_data", cloudwatch.TreatMissingData.MISSING),


NOTES FOR THE REVIEWER
Before this change we left treat_missing_data unspecified.
The default value when unspecified is MISSING. I set it explicitly here to make the expected behavior clearer.

Yup I checked already the default

gmarciani added the 3.x label Jan 26, 2026

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from 2b788d8 to 33bf487 Compare January 26, 2026 23:03

gmarciani mentioned this pull request Jan 27, 2026

[Observability] Emit metric ClustermgtdHeartbeat to signal clustermgtd heartbeat. aws/aws-parallelcluster-node#685

Merged

gmarciani marked this pull request as ready for review January 27, 2026 13:49

gmarciani requested review from a team as code owners January 27, 2026 13:49

gmarciani added Observability Security enhancement labels Jan 27, 2026

gmarciani commented Jan 27, 2026

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from 33bf487 to c393063 Compare January 27, 2026 17:36

gmarciani changed the title ~~[Observability] Add permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics.~~ [Observability] Add permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics Jan 27, 2026

gmarciani changed the title ~~[Observability] Add permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics~~ [Observability] Alarm on clustermgtd not running Jan 27, 2026

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch 2 times, most recently from c9401d5 to c9a775f Compare January 27, 2026 18:18

gmarciani mentioned this pull request Jan 27, 2026

[Test] Add unit test to verify that cluster alarms have the expected settings. #7212

Merged

[Observability] Add permission cloudwatch:PutMetricData to the head node policy so that clustermgtd is able to emit metrics.

4df799d

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from c9a775f to 2e44ac9 Compare January 27, 2026 20:04

gmarciani enabled auto-merge (rebase) January 27, 2026 20:05

gmarciani disabled auto-merge January 27, 2026 20:05

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from 2e44ac9 to dd3fdc9 Compare January 27, 2026 21:24

himani2411 reviewed Jan 27, 2026

cli/src/pcluster/templates/cluster_stack.py (outdated, resolved)

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from dd3fdc9 to ec6ad08 Compare January 27, 2026 21:45

himani2411 reviewed Jan 27, 2026

cli/tests/pcluster/templates/test_cluster_stack.py (resolved)

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from ec6ad08 to 91d71d1 Compare January 27, 2026 21:59

[Observability] Add alarm on missing clustermgtd heartbeat.

76054aa

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from cf560ed to 0891999 Compare January 27, 2026 22:50

gmarciani commented Jan 28, 2026

[Observability] Alarm on clustermgtd heartbeat only for the slurm scheduler.

2edb607

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch 2 times, most recently from 39a655e to b181ccf Compare January 28, 2026 17:11

himani2411 previously approved these changes Jan 28, 2026

gmarciani mentioned this pull request Jan 28, 2026

[Update] Reduce the risk of clustermgtd being stopped on cluster/fleet update aws/aws-parallelcluster-cookbook#3102

Open

gmarciani dismissed himani2411’s stale review via 4fc0190 January 28, 2026 18:48

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch 2 times, most recently from 4fc0190 to 46dde5f Compare January 28, 2026 19:05

gmarciani enabled auto-merge (rebase) January 28, 2026 19:05

himani2411 previously approved these changes Jan 28, 2026

[Test] Extend integ test test_monitoring to verify that the metric `ClustermgtdHeartbeat` is collected.

d3884ea

gmarciani dismissed himani2411’s stale review via d3884ea January 28, 2026 19:11

gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch from 46dde5f to d3884ea Compare January 28, 2026 19:11

himani2411 approved these changes Jan 28, 2026

gmarciani merged commit ea1061f into aws:develop Jan 28, 2026
24 checks passed

gmarciani deleted the wip/mgiacomo/3150/clustermgtd-metrics-0123-1 branch January 28, 2026 19:22

gmarciani mentioned this pull request Jan 28, 2026

[Update] Always restart clustermgtd on update failure aws/aws-parallelcluster-cookbook#3104

Closed

gmarciani mentioned this pull request Feb 4, 2026

[Observability] Use metric filter to generate the clustermgtd heartbeat metric. #7219

Draft

gmarciani merged 4 commits into aws:develop from gmarciani:wip/mgiacomo/3150/clustermgtd-metrics-0123-1

2 participants