Commit f2584e4

Added a CloudFormation template to auto-add-cw-alarms.
1 parent 3511e27 commit f2584e4

6 files changed, +892 -23 lines changed
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+---
+# Copyright (c) NetApp, Inc.
+# SPDX-License-Identifier: Apache-2.0
+
+name: "Update Cloudformation Template"
+
+on:
+  pull_request:
+    paths:
+      - 'Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py'
+  push:
+    paths:
+      - 'Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py'
+    branches:
+      - main
+
+jobs:
+  update-Cloudformation-Template:
+    runs-on: ubuntu-latest
+    permissions:
+      # Give the default GITHUB_TOKEN write permission to commit and push the
+      # added or changed files to the repository.
+      contents: write
+
+    steps:
+      - name: Checkout pull request
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.ref }}
+
+      - name: Update the Cloudformation Template
+        shell: bash
+        working-directory: Monitoring/auto-add-cw-alarms
+        run: ./update-auto-add-cw-alarms-CF-Template
+
+      - name: Commit the changes
+        uses: stefanzweifel/git-auto-commit-action@v5

Monitoring/auto-add-cw-alarms/README.md

Lines changed: 35 additions & 14 deletions
@@ -10,19 +10,32 @@ to monitor the CPU utilization of the file system. And if a volume or file syste
 
 To implement this, you might think to just create EventTail filters to trigger on the creation or deletion of an FSx Volume.
 This would kind of work, but since you have command line access to the FSx for ONTAP file system, you can create
-and delete volumes without creating CloudTrail events. So, this method would not be reliable. Therefore, instead
+and delete volumes without generating any CloudTrail events. So, this method would not be reliable. Therefore, instead
 of relying on those events, this script will scan all the file systems and volumes in all the regions then create and delete alarms as needed.
 
 ## Invocation
-There are two ways you can invoke this script (Python program). Either from a computer that has Python installed, or you could upload it
-as a Lambda function.
+There are two ways you can invoke this script (Python program). Either from a computer that has Python installed, or you could install it
+as a Lambda function. If you want to run it as a Lambda function, a CloudFormation template is included in the repo that will:
+- Create a role that will allow the Lambda function to:
+  - List AWS regions. So it can scan all regions for FSx for ONTAP file systems and volumes.
+  - List the FSx for ONTAP file systems.
+  - List the FSx volumes.
+  - List the CloudWatch alarms.
+  - List tags for the resources. This is so you can customize the thresholds for the alarms.
+  - Create CloudWatch alarms.
+  - Delete CloudWatch alarms that it has created (based on alarm names).
+- Create a Lambda function with the Python program.
+- Create an EventBridge schedule that will run the Lambda function on a user-defined basis.
+- Create a role that will allow the EventBridge schedule to trigger the Lambda function.
 
 ### Configuring the program
 Before you can run the program you will need to configure it. You can configure it a few ways:
 * By editing the top part of the program itself where there are the following variable definitions.
-* By setting environment variables.
+* By setting environment variables with the same names as the variables in the program.
 * If running it as a standalone program, via some command line options.
 
+:bulb: **NOTE:** The CloudFormation template will prompt for these values when you create the stack and will set the appropriate environment variables for you.
+
 Here is the list of variables, and what they define:
 
 | Variable | Description |Command Line Option|
@@ -78,19 +91,20 @@ You can run the program in "Dry Run" mode by specifying the `-d` (or `--dryRun`)
 messages showing what it would have done, and not really create or delete any CloudWatch alarms.
 
 ### Running as a Lambda function
-If you run the program as a Lambda function, you will want to set the timeout to at least two minutes since some of the API calls
+A CloudFormation template is included in the repo that will do the steps below. Otherwise, here are the steps required to install the program as a Lambda function.
+Create a Lambda function and upload the program as the function code. Set the timeout to at least five minutes since some of the API calls
 can take a significant amount of "clock time" to run, especially in distant regions.
 
 Once you have installed the Lambda function it is recommended to set up a scheduled type EventBridge rule so the function will run on a regular basis.
 
 The appropriate permissions will need to be assigned to the Lambda function in order for it to run correctly.
 It doesn't need many permissions. It just needs to be able to:
-* List the FSx for ONTAP file systems
-* List the FSx volume names
-* List the CloudWatch alarms
-* Create CloudWatch alarms
-* Delete CloudWatch alarms
-* Create CloudWatch Log Groups and Log Streams in case you need to diagnose an issue
+* List the FSx for ONTAP file systems.
+* List the FSx volume names.
+* List the CloudWatch alarms.
+* Create CloudWatch alarms.
+* Delete CloudWatch alarms. You can set resource to "arn:aws:cloudwatch:*:${AWS::AccountId}:alarm:FSx-ONTAP-Auto*" to limit the deletion to only the alarms that it created.
+* Create CloudWatch Log Groups and Log Streams in case you need to diagnose an issue.
 
 The following permissions are required to run the script (although you could narrow the "Resource" specification to suit your needs.)
 ```JSON
@@ -105,7 +119,6 @@ The following permissions are required to run the script (although you could nar
                 "fsx:ListTagsForResource",
                 "fsx:DescribeVolumes",
                 "fsx:DescribeFilesystems",
-                "cloudwatch:DeleteAlarms",
                 "cloudwatch:DescribeAlarmsForMetric",
                 "ec2:DescribeRegions",
                 "cloudwatch:DescribeAlarms"
@@ -115,14 +128,22 @@ The following permissions are required to run the script (although you could nar
         {
             "Sid": "VisualEditor1",
             "Effect": "Allow",
+            "Action": [
+                "cloudwatch:DeleteAlarms"
+            ],
+            "Resource": "arn:aws:cloudwatch:*:*:alarm:FSx-ONTAP-Auto*"
+        },
+        {
+            "Sid": "VisualEditor2",
+            "Effect": "Allow",
             "Action": [
                 "logs:CreateLogStream",
                 "logs:PutLogEvents"
             ],
             "Resource": "arn:aws:logs:*:*:log-group:*:log-stream:*"
         },
         {
-            "Sid": "VisualEditor2",
+            "Sid": "VisualEditor3",
             "Effect": "Allow",
             "Action": "logs:CreateLogGroup",
             "Resource": "arn:aws:logs:*:*:log-group:*"
@@ -133,7 +154,7 @@ The following permissions are required to run the script (although you could nar
 
 ### Expected Action
 Once the script has been configured and invoked, it will:
-* Scan for every FSx for ONTAP file systems in every region. For every file system it finds it will:
+* Scan for every FSx for ONTAP file systems in every region. For every file system that it finds it will:
   * Create a CPU utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm.
   * Create a SSD utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm.
 * Scan for every FSx for ONTAP volume in every region. For every volume it finds it will:
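
The README changes above describe two behaviours that a short sketch may make concrete: settings can come from environment variables named after the program's variables, and a threshold of 100 means the corresponding alarm is not created. The following is only an illustration under stated assumptions, not the code in `auto_add_cw_alarms.py`. The helper names `resolve_setting` and `should_create_alarm` and the variable name `defaultCPUThreshold` are hypothetical, and how the real program orders environment variables against command line options is not shown in this diff.

```python
import os


def resolve_setting(name, default, environ=None):
    """Return the environment value for `name` if it is set and non-empty,
    otherwise return `default`.

    `environ` defaults to os.environ; passing a dict makes the lookup
    easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    return default if value in (None, "") else value


def should_create_alarm(threshold):
    """Per the README, a threshold of 100 disables that specific alarm."""
    return int(threshold) != 100


# Hypothetical variable name, used here only for illustration.
cpu_threshold = resolve_setting("defaultCPUThreshold", 80, {"defaultCPUThreshold": "100"})
print(cpu_threshold, should_create_alarm(cpu_threshold))
```

Environment values arrive as strings, which is why `should_create_alarm` converts with `int()` before comparing.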

Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py

Lines changed: 20 additions & 8 deletions
@@ -4,15 +4,18 @@
 # ONTAP volumes, that don't already have one, that will trigger when the
 # utilization of the volume gets above the threshold defined below. It will
 # also create an alarm that will trigger when the file system reaches
-# an average CPU utilization greater than what is specified below.
+# an average CPU utilization greater than what is specified below as well
+# as an alarm that will trigger when the SSD utilization is greater than what
+# is specified below.
 #
 # It can either be run as a standalone script, or uploaded as a Lambda
 # function with the thought being that you will create an EventBridge schedule
 # to invoke it periodically.
 #
-# It will scan all regions looking for FSxN volumes, and since CloudWatch
-# can't send SNS messages across regions, it assumes that the specified
-# SNS topic exist in each region for the specified account ID.
+# It will scan all regions looking for FSxN volumes and file systems
+# and since CloudWatch can't send SNS messages across regions, it assumes
+# that the specified SNS topic exists in each region for the specified
+# account ID.
 #
 # Finally, a default volume threshold is defined below. It sets the volume
 # utilization threshold that will cause CloudWatch to send the alarm event
@@ -24,6 +27,9 @@
 # Lastly, you can create an override for the SSD alarm, by creating a tag
 # with the name "SSD_Alarm_Threshold" on the file system resource.
 #
+# Version: %%VERSION%%
+# Date: %%DATE%%
+#
 ################################################################################
 #
 # The following variables affect the behavior of the script. They can be
@@ -64,14 +70,20 @@
 # what you are doing.
 ################################################################################
 #
+# The following is put in front of all alarms so an IAM policy can be created
+# that will allow this script to only be able to delete the alarms it creates.
+# If you change this, you must also change the IAM policy. Note that the
+# CloudFormation template also assumes the value of this variable.
+basePrefix="FSx-ONTAP-Auto"
+#
 # Define the prefix for the volume utilization alarm name for the CloudWatch alarms.
-alarmPrefixVolume="Volume_Utilization_for_volume_"
+alarmPrefixVolume=f"{basePrefix}-Volume_Utilization_for_volume_"
 #
 # Define the prefix for the CPU utilization alarm name for the CloudWatch alarms.
-alarmPrefixCPU="CPU_Utilization_for_fs_"
+alarmPrefixCPU=f"{basePrefix}-CPU_Utilization_for_fs_"
 #
 # Define the prefix for the SSD utilization alarm name for the CloudWatch alarms.
-alarmPrefixSSD="SSD_Utilization_for_fs_"
+alarmPrefixSSD=f"{basePrefix}-SSD_Utilization_for_fs_"
 
 ################################################################################
 # You shouldn't have to modify anything below here.
@@ -531,7 +543,7 @@ def lambda_handler(event, context):
 # This function is used to print out the usage of the script.
 ################################################################################
 def usage():
-    print('Usage: add_cw_alarm [-h|--help] [-d|--dryRun] [[-c|--customerID customerID] [[-a|--accountID aws_account_id] [[-s|--SNSTopic SNS_Topic_Name] [[-r|--region region] [[-C|--CPUThreshold threshold] [[-S|--SSDThreshold threshold] [[-V|--VolumeThreshold threshold] [-F|--FileSystemID FileSystemID]')
+    print('Usage: auto_add_cw_alarms [-h|--help] [-d|--dryRun] [[-c|--customerID customerID] [[-a|--accountID aws_account_id] [[-s|--SNSTopic SNS_Topic_Name] [[-r|--region region] [[-C|--CPUThreshold threshold] [[-S|--SSDThreshold threshold] [[-V|--VolumeThreshold threshold] [-F|--FileSystemID FileSystemID]')
 
 ################################################################################
 # Main logic starts here.
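
The `basePrefix` change exists so that the `cloudwatch:DeleteAlarms` statement in the README can be limited to `arn:aws:cloudwatch:*:*:alarm:FSx-ONTAP-Auto*`. A minimal sketch of why the old and new alarm names behave differently under that resource pattern follows. It approximates IAM's `*` wildcard with `fnmatch.fnmatchcase`, which is an assumption made for illustration and not how IAM evaluates policies. The helpers `alarm_arn` and `deletable`, the region, the account ID, and the file system ID are hypothetical, and appending the file system ID to the prefix is an assumption about how the script builds names.

```python
from fnmatch import fnmatchcase

# Values copied from the diff above.
basePrefix = "FSx-ONTAP-Auto"
alarmPrefixCPU = f"{basePrefix}-CPU_Utilization_for_fs_"

# Resource pattern from the README's cloudwatch:DeleteAlarms statement.
DELETE_RESOURCE_PATTERN = "arn:aws:cloudwatch:*:*:alarm:FSx-ONTAP-Auto*"


def alarm_arn(region, account_id, alarm_name):
    """Build a CloudWatch alarm ARN from its parts."""
    return f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{alarm_name}"


def deletable(arn, pattern=DELETE_RESOURCE_PATTERN):
    """Approximate the IAM resource wildcard match with a glob match."""
    return fnmatchcase(arn, pattern)


# Hypothetical identifiers, for illustration only.
fs_id = "fs-0123456789abcdef0"
new_name = alarmPrefixCPU + fs_id
old_name = "CPU_Utilization_for_fs_" + fs_id

print(deletable(alarm_arn("us-east-1", "123456789012", new_name)))
print(deletable(alarm_arn("us-east-1", "123456789012", old_name)))
```

Under this approximation the name built from the new prefix matches the pattern, while a name built from the pre-commit prefix does not, so alarms created before this commit would fall outside the narrowed delete permission.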
