Commit a82d59c

Added the ability to limit the scanning to specified regions.
1 parent e46565e commit a82d59c

3 files changed

+111
-31
lines changed

Monitoring/auto-add-cw-alarms/README.md

Lines changed: 96 additions & 29 deletions
@@ -8,14 +8,16 @@ delete alarms. This can be tedious, and error prone. This script will automate t
88
AWS CloudWatch alarms that monitor the utilization of the file system and its volumes. It will also create alarms
99
to monitor the CPU utilization of the file system. And if a volume or file system is removed, it will remove the associated alarms.
1010

11-
To implement this, you might think to just create EventTail filters to trigger on the creation or deletion of an FSx Volume.
11+
To implement this, you might think to just create EventBridge rules to trigger on the creation or deletion of an FSx Volume.
1212
This would kind of work, but since you have command line access to the FSx for ONTAP file system, you can create
1313
and delete volumes without generating any CloudTrail events. So, this method would not be reliable. Therefore, instead
1414
of relying on those events, this script will scan all the file systems and volumes in all the regions then create and delete alarms as needed.
1515

1616
## Invocation
17-
There are two ways you can invoke this script (Python program). Either from a computer that has Python installed, or you could install it
18-
as a Lambda function. If you want to run it as a Lambda function, a CloudFormation template is included in the repo that will:
17+
The preferred way to run this script is as a Lambda function. That is because it is very inexpensive to run without having
18+
to maintain compute resources. You can use an `EventBridge Schedule` to run it on a regular basis to
19+
ensure that all the CloudWatch alarms are kept up to date. Since there are several steps involved in setting up a Lambda function
20+
a CloudFormation template is included in the repo, named `cloudformation.yaml`, that will do the following steps for you:
1921
- Create a role that will allow the Lambda function to:
2022
- List AWS regions. This is so it can scan all regions for FSx for ONTAP file systems and volumes.
2123
- List the FSx for ONTAP file systems.
@@ -28,8 +30,32 @@ as a Lambda function. If you want to run it as a Lambda function, a CloudFormati
2830
- Create an EventBridge schedule that will run the Lambda function on a user-defined basis.
2931
- Create a role that will allow the EventBridge schedule to trigger the Lambda function.
3032

33+
To use the CloudFormation template, perform the following steps:
34+
35+
1. Download the `cloudformation.yaml` file from this repo.
36+
2. Go to the `CloudFormation` services page in the AWS console and select `Create Stack -> With new resources (standard)`.
37+
3. Select `Choose an existing template` and `Upload a template file`.
38+
4. Click `Choose file` and select the `cloudformation.yaml` file you downloaded in step 1.
39+
5. Click `Next` and fill in the parameters presented on the next page. The parameters are:
40+
- `Stack name` - The name of the CloudFormation stack. Note this name is also used as a base name for some of the resources that are created, to make them unique, so you must keep this string under 25 characters so the resource names don't exceed their name length limit.
41+
- `SNStopic` - The name of the SNS topic where CloudWatch will send alerts. Note that it is assumed that an SNS topic with the same name exists in all the regions where alarms are to be created. Neither this CloudFormation template nor the Lambda function will create these SNS topics for you.
42+
- `accountId` - The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic set above.
43+
- `customerId` - This is optional. If provided the string entered is included in the description of every alarm created.
44+
- `defaultCPUThreshold` - This will define a default CPU utilization threshold. You can override the default by having a specific tag associated with the file system (see below).
45+
- `defaultSSDThreshold` - This will define a default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system (see below).
46+
- `defaultVolumeThreshold` - This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume (see below).
47+
- `checkInterval` - This is the interval in minutes that the program will run.
48+
- `alarmPrefixString` - This defines the string that will be prepended to every CloudWatch alarm name that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.
49+
- `regions` - This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan all regions that support an FSx for ONTAP file system. Note that no checking is performed to ensure that the regions you provide are valid.
50+
6. Click `Next`. There aren't any recommended changes to make on any of the following pages, so just click `Next` again.
51+
7. On the final page, check the box that says `I acknowledge that AWS CloudFormation might create IAM resources with custom names.` and click `Submit`.
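
The console steps above could also be scripted. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names (`build_stack_parameters`, `check_stack_name`, `deploy`) are hypothetical and not part of this repo. The parameter keys are the ones listed in step 5, and `CAPABILITY_NAMED_IAM` is the API equivalent of the acknowledgment check box in step 7.

```python
def build_stack_parameters(values):
    """Convert a plain dict into the Parameters list that CloudFormation expects."""
    return [{"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in values.items()]


def check_stack_name(stack_name, limit=25):
    """Apply the advice above to keep the stack name under 25 characters."""
    if len(stack_name) >= limit:
        raise ValueError(f"stack name must be under {limit} characters")
    return stack_name


def deploy(stack_name, template_path, values, region):
    """Create the stack from a local copy of cloudformation.yaml (hypothetical helper)."""
    import boto3  # imported lazily so the helpers above work without boto3

    with open(template_path) as template_file:
        template_body = template_file.read()
    cfn = boto3.client("cloudformation", region_name=region)
    return cfn.create_stack(
        StackName=check_stack_name(stack_name),
        TemplateBody=template_body,
        Parameters=build_stack_parameters(values),
        # Needed because the template creates IAM roles with custom names.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
```

For example, `deploy("fsx-alarms", "cloudformation.yaml", {"SNStopic": "my-topic", "accountId": "123456789012", "checkInterval": 15}, "us-west-2")` would submit the stack; the values shown are placeholders.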
52+
53+
If you prefer, you can run this Python program on any UNIX-based computer that has Python installed. See the "Running on a computer" section below for more information.
54+
3155
### Configuring the program
32-
Before you can run the program you will need to configure it. You can configure it a few ways:
56+
If you use the CloudFormation template to deploy the program, it will create the appropriate environment variables for you.
57+
However, if you didn't use the CloudFormation template, you will need to configure the program yourself. Here are the
58+
various ways you can do so:
3359
* By editing the top part of the program itself where there are the following variable definitions.
3460
* By setting environment variables with the same names as the variables in the program.
3561
* If running it as a standalone program, via some command line options.
@@ -40,20 +66,20 @@ Here is the list of variables, and what they define:
4066

4167
| Variable | Description |Command Line Option|
4268
|:---------|:------------|:--------------------------------|
43-
|SNStopic | The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created.|-s SNS_Topic_Name|
44-
|accountId | The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic.|-a Account_number|
45-
|customerId| This is really just a comment that will be added to the alarm description.|-c Customer_String|
69+
|SNStopic | The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created.|-s SNS\_Topic\_Name|
70+
|accountId | The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic.|-a Account\_number|
71+
|customerId| This is really just a comment that will be added to the alarm description.|-c Customer\_String|
4672
|defaultCPUThreshold | This will define the default CPU utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-C number|
4773
|defaultSSDThreshold | This will define the default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-S number|
4874
|defaultVolumeThreshold | This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume. See below for more information.|-V number|
49-
|alarmPrefixCPU | This defines the string that will be put in front of the name of every CPU utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A|
50-
|alarmPrefixSSD | This defines the string that will be put in front of the name of every SSD utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A|
75+
|alarmPrefixCPU | This defines the string that will be put in front of the name of every CPU utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A|
76+
|alarmPrefixSSD | This defines the string that will be put in front of the name of every SSD utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A|
5177
|alarmPrefixVolume | This defines the string that will be put in front of the name of every volume utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A|
78+
|regions | This is a comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan all regions that support an FSx for ONTAP file system. Note that no checking is performed to ensure that the regions you provide are valid.|-r region -r region ...|
5279

5380
There are a few command line options that don't have corresponding variables:
5481
|Option|Description|
5582
|:-----|:----------|
56-
|-r region| This option can be specified multiple times to limit the regions that the program will act on. If not specified, the program will act on all regions.|
5783
|-d| This option will cause the program to run in "Dry Run" mode. In this mode, the program will only display messages showing what it would have done, and not really create or delete any CloudWatch alarms.|
5884
|-F filesystem\_ID| This option will cause the program to only add or remove alarms that are associated with the filesystem\_ID.|
5985

@@ -81,8 +107,9 @@ Once you have Python and boto3 installed, you can run the program by executing t
81107
python3 auto_add_cw_alarms.py
82108
```
83109
This will run the program based on all the variables set at the top. If you want to change the behavior without
84-
having to edit the program, you can use the Command Line Option specified in the table above. Note that you can give a `-h` (or `--help`)
85-
option and the program will display a list of all the available options.
110+
having to edit the program, you can either use the Command Line Option specified in the table above or you can
111+
set the appropriate environment variable. Note that you can give a `-h` (or `--help`) command line option
112+
and the program will display a list of all the available options.
86113

87114
You can limit the regions that the program will act on by using the `-r region` option. You can specify that option
88115
multiple times to act on multiple regions.
@@ -91,23 +118,39 @@ You can run the program in "Dry Run" mode by specifying the `-d` (or `--dryRun`)
91118
messages showing what it would have done, and not really create or delete any CloudWatch alarms.
92119

93120
### Running as a Lambda function
94-
A CloudFormation template is included in the repo that will do the steps below. Otherwise, here are the steps required to install the program as a Lambda function.
95-
96-
Create a Lambda function and upload the program as the function code. Set the timeout to at least five minutes since some of the API calls
97-
can take a significant amount of "clock time" to run, especially in distant regions.
98-
99-
Once you have installed the Lambda function it is recommended to set up a scheduled type EventBridge rule so the function will run on a regular basis.
100-
101-
The appropriate permissions will need to be assigned to the Lambda function in order for it to run correctly.
102-
It doesn't need many permissions. It just needs to be able to:
121+
A CloudFormation template is included in the repo that will do the steps below. If you don't want to use that, here are
122+
the detailed steps required to install the program as a Lambda function.
123+
124+
#### Create a Lambda function
125+
1. Download the `auto_add_cw_alarms.py` file from this repo.
126+
2. Create a new Lambda function in the AWS console by going to the Lambda services page and clicking on the `Create function` button.
127+
3. Choose `Author from scratch` and give the function a name. For example `auto_add_cw_alarms`.
128+
4. Choose the latest version of Python (currently Python 3.11) as the runtime and click on `Create function`.
129+
5. In the function code section, copy and paste the contents of the `auto_add_cw_alarms.py` file into the code editor.
130+
6. Click on the `Deploy` button to save the function.
131+
7. Click on the "Configuration" tab and then the "General configuration" sub tab and set the "Timeout" to be at least 3 minutes.
132+
8. Click on the "Environment variables" tab and add the following environment variables:
133+
- `SNStopic` - The SNS Topic name where CloudWatch will send alerts to.
134+
- `accountId` - The AWS account ID associated with the SNStopic.
135+
- `customerId` - This is optional. If provided the string entered is included in the description of every alarm created.
136+
- `defaultCPUThreshold` - This will define a default CPU utilization threshold.
137+
- `defaultSSDThreshold` - This will define a default SSD (aggregate) utilization threshold.
138+
- `defaultVolumeThreshold` - This will define the default Volume utilization threshold.
139+
- `alarmPrefixString` - This defines the string that will be prepended to every CloudWatch alarm name that the program creates.
140+
- `regions` - This is an optional comma separated list of AWS region names (e.g. us-east-1) that the program will act on. If not specified, the program will scan all regions that support an FSx for ONTAP file system.
141+
142+
You will also need to set up the appropriate permissions for the Lambda function to run. It doesn't need many permissions. It just needs to be able to:
103143
* List the FSx for ONTAP file systems.
104144
* List the FSx volume names.
145+
* List tags associated with an FSx file system or volume.
105146
* List the CloudWatch alarms.
147+
* List all the AWS regions.
106148
* Create CloudWatch alarms.
107-
* Delete CloudWatch alarms. You can set resource to "arn:aws:cloudwatch:*:${AWS::AccountId}:alarm:FSx-ONTAP-Auto*" to limit the deletion to only the alarms that it created.
149+
* Delete CloudWatch alarms. You can set resource to `arn:aws:cloudwatch:*:`*AccountId*`:alarm:`*alarmPrefixString*`*` to limit the deletion to only the alarms that it creates.
108150
* Create CloudWatch Log Groups and Log Streams in case you need to diagnose an issue.
109151

110-
The following permissions are required to run the script (although you could narrow the "Resource" specification to suit your needs.)
152+
The following is an example AWS policy that has all the required permissions to run the script (although you could narrow the "Resource" specification to suit your needs).
153+
Note it assumes that the alarmPrefixString is set to "FSx-ONTAP-Auto".
111154
```JSON
112155
{
113156
"Version": "2012-10-17",
@@ -116,13 +159,13 @@ The following permissions are required to run the script (although you could nar
116159
"Sid": "VisualEditor0",
117160
"Effect": "Allow",
118161
"Action": [
119-
"cloudwatch:PutMetricAlarm",
120-
"fsx:ListTagsForResource",
121-
"fsx:DescribeVolumes",
122162
"fsx:DescribeFilesystems",
163+
"fsx:DescribeVolumes",
164+
"fsx:ListTagsForResource",
165+
"cloudwatch:DescribeAlarms",
123166
"cloudwatch:DescribeAlarmsForMetric",
124167
"ec2:DescribeRegions",
125-
"cloudwatch:DescribeAlarms"
168+
"cloudwatch:PutMetricAlarm"
126169
],
127170
"Resource": "*"
128171
},
@@ -153,15 +196,39 @@ The following permissions are required to run the script (although you could nar
153196
}
154197
```
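
The resource pattern suggested for the delete permission depends on your account ID and alarm prefix. A minimal sketch of how it could be generated is shown here; the function names are hypothetical, and the statement is only an illustration of the pattern described above, not the exact statement used by the CloudFormation template.

```python
import json


def alarm_arn_pattern(account_id, alarm_prefix):
    """Build the ARN pattern that matches only alarms starting with the prefix."""
    return f"arn:aws:cloudwatch:*:{account_id}:alarm:{alarm_prefix}*"


def delete_alarms_statement(account_id, alarm_prefix):
    """Return an IAM statement limiting deletion to alarms the program created."""
    return {
        "Effect": "Allow",
        "Action": "cloudwatch:DeleteAlarms",
        "Resource": alarm_arn_pattern(account_id, alarm_prefix),
    }


if __name__ == "__main__":
    # The account ID below is a placeholder.
    print(json.dumps(delete_alarms_statement("123456789012", "FSx-ONTAP-Auto"), indent=2))
```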
155198

199+
Once you have deployed the Lambda function it is recommended to set up a schedule to run it on a regular basis.
200+
The easiest way to do that is:
201+
1. Click on the `Add trigger` button from the Lambda function page.
202+
2. Select `EventBridge (CloudWatch Events)` as the trigger type.
203+
3. Click on the `Create a new rule` button.
204+
4. Give the rule a name and a description.
205+
5. Set the `Schedule expression` to be the interval you want the function to run. For example, if you want it to run every 15 minutes, you would set the expression to `rate(15 minutes)`.
206+
6. Click on the `Add` button.
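
If you generate the schedule expression from a number of minutes (as the `checkInterval` parameter suggests), note that EventBridge rate expressions use the singular unit when the value is 1. A small sketch, with a hypothetical function name:

```python
def rate_expression(minutes):
    """Return an EventBridge rate expression for an interval given in minutes."""
    if minutes < 1:
        raise ValueError("interval must be at least 1 minute")
    # EventBridge expects 'minute' for a value of 1 and 'minutes' otherwise.
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


if __name__ == "__main__":
    print(rate_expression(15))
```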
207+
156208
### Expected Action
157209
Once the script has been configured and invoked, it will:
158-
* Scan for every FSx for ONTAP file systems in every region. For every file system that it finds it will:
210+
* Scan for every FSx for ONTAP file system in every region, unless you have specified a list of regions to scan. For every file system that it finds it will:
159211
* Create a CPU utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm.
160212
* Create an SSD utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm.
161-
* Scan for every FSx for ONTAP volume in every region. For every volume it finds it will:
213+
* Scan for every FSx for ONTAP volume in every region, unless you have specified a list of regions to scan. For every volume it finds it will:
162214
* Create a Volume Utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm.
163215
* Scan for the CloudWatch alarms and remove any alarms whose associated resource no longer exists.
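
The behavior described above amounts to reconciling the alarms that should exist against the alarms that do exist. The sketch below illustrates that idea only; it is not the program's actual code, and the function name and alarm name layout are assumptions made for the example.

```python
def plan_alarm_changes(resource_thresholds, existing_alarm_names, prefix):
    """Decide which alarms to create and which to delete.

    resource_thresholds maps the alarm name for each current resource to its
    threshold; existing_alarm_names lists the alarms found in CloudWatch.
    """
    # Only alarms carrying the known prefix are considered managed by the program.
    managed = {name for name in existing_alarm_names if name.startswith(prefix)}
    # A threshold of 100 means "do not create an alarm for this resource".
    to_create = sorted(name for name, threshold in resource_thresholds.items()
                       if threshold != 100 and name not in managed)
    # Managed alarms whose resource is gone are removed.
    to_delete = sorted(name for name in managed if name not in resource_thresholds)
    return to_create, to_delete


if __name__ == "__main__":
    thresholds = {"FSx-ONTAP-Auto-CPU-fs-1": 80, "FSx-ONTAP-Auto-CPU-fs-2": 100}
    existing = ["FSx-ONTAP-Auto-CPU-fs-9", "manual-alarm"]
    print(plan_alarm_changes(thresholds, existing, "FSx-ONTAP-Auto"))
```

Alarms without the prefix (such as `manual-alarm` in the example) are never touched, which is why a known prefix matters.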
164216

217+
### Cleaning up
218+
If you decide you don't want to use this program anymore, you can delete the CloudFormation stack that you created.
219+
This will remove the Lambda function, the EventBridge schedule, and the roles that were created for you. If you did
220+
not use the CloudFormation template, you will have to do these steps yourself.
221+
222+
Once you have removed the program, you can remove all the CloudWatch alarms that were created by the program by running
223+
the following command:
224+
225+
```bash
226+
region=us-west-2
227+
aws cloudwatch describe-alarms --region=$region --alarm-name-prefix "FSx-ONTAP-Auto" --query "MetricAlarms[*].AlarmName" --output text | xargs -n 50 aws cloudwatch delete-alarms --region $region --alarm-names
228+
```
229+
This command will remove all the alarms that have an alarm name that starts with "FSx-ONTAP-Auto" in the us-west-2 region.
230+
Make sure to adjust the alarm-name-prefix to match the `alarmPrefixString` you set when you deployed the program.
231+
You will also need to adjust the region variable and run the `aws` command again for each region where you have alarms.
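
If you have alarms in several regions, the same cleanup could be done in one pass with boto3. This is a sketch under the assumption that boto3 is installed and credentials are configured; `chunked` and `delete_prefixed_alarms` are hypothetical helpers, and the batch size of 50 simply mirrors the `xargs -n 50` in the command above. It defaults to a dry run so nothing is deleted until you pass `dry_run=False`.

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_prefixed_alarms(regions, prefix, dry_run=True):
    """Delete every metric alarm whose name starts with `prefix` in each region."""
    import boto3  # imported lazily so chunked() works without boto3

    for region in regions:
        cloudwatch = boto3.client("cloudwatch", region_name=region)
        names = []
        paginator = cloudwatch.get_paginator("describe_alarms")
        for page in paginator.paginate(AlarmNamePrefix=prefix):
            names.extend(alarm["AlarmName"] for alarm in page["MetricAlarms"])
        for batch in chunked(names, 50):
            if dry_run:
                print(f"{region}: would delete {batch}")
            else:
                cloudwatch.delete_alarms(AlarmNames=batch)
```

For example, `delete_prefixed_alarms(["us-west-2", "us-east-1"], "FSx-ONTAP-Auto")` would list what would be removed in both regions.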
165232

166233
## Author Information
167234

Monitoring/auto-add-cw-alarms/auto_add_cw_alarms.py

Lines changed: 3 additions & 0 deletions
@@ -563,6 +563,9 @@ def usage():
563563
defaultCPUThreshold = int(os.environ.get('defaultCPUThreshold', defaultCPUThreshold))
564564
defaultSSDThreshold = int(os.environ.get('defaultSSDThreshold', defaultSSDThreshold))
565565
defaultVolumeThreshold = int(os.environ.get('defaultVolumeThreshold', defaultVolumeThreshold))
566+
regionsEnv = os.environ.get('regions', '')
567+
if regionsEnv != '':
568+
    regions = regionsEnv.split(',')
566569
#
567570
# Check to see if we are being run from a command line or a Lambda function.
568571
if os.environ.get('AWS_LAMBDA_FUNCTION_NAME') == None:
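
The region handling added in this commit splits the `regions` environment variable on commas as is, so a value such as `us-east-1, us-west-2` would yield a second entry with a leading space. A possible hardening is sketched below as a standalone function; the function name is hypothetical, and the whitespace stripping and empty-entry filtering are additions that are not in the commit.

```python
import os


def parse_regions(value):
    """Split a comma separated region list, dropping blanks and stray spaces."""
    return [region.strip() for region in value.split(",") if region.strip()]


if __name__ == "__main__":
    # An empty result means no restriction, i.e. scan all regions.
    regions = parse_regions(os.environ.get("regions", ""))
    print(regions)
```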
