|
| 1 | +# Automatically Add CloudWatch Alarms to Monitor Aggregate, Volume, and CPU Utilization |
| 2 | + |
| 3 | +## Introduction |
| 4 | +There are times when you want to be notified when an FSx for ONTAP file system, or one of its volumes, is reaching |
| 5 | +its capacity. AWS CloudWatch has metrics that can give you this information. The only problem is that they are |
| 6 | +on a per-instance basis. This means that as you add and delete file systems and/or volumes, you have to add and |
| 7 | +delete alarms. This can be tedious and error-prone. This script automates the creation of |
| 8 | +AWS CloudWatch alarms that monitor the utilization of the file system and its volumes. It also creates alarms |
| 9 | +to monitor the CPU utilization of the file system. If a volume or file system is removed, it removes the associated alarms. |
| 10 | + |
| 11 | +To implement this, you might think to just create CloudTrail event filters that trigger on the creation or deletion of an FSx volume. |
| 12 | +This would partly work, but since you have command line access to the FSx for ONTAP file system, you can create |
| 13 | +and delete volumes without generating CloudTrail events, so this method would not be reliable. Therefore, instead |
| 14 | +of relying on those events, this script scans all the file systems and volumes in all the regions, then creates and deletes alarms as needed. |
| 15 | + |
| 16 | +## Invocation |
| 17 | +There are two ways to invoke this script (a Python program): run it from a computer that has Python installed, or upload it |
| 18 | +as a Lambda function. |
| 19 | + |
| 20 | +### Configuring the program |
| 21 | +Before you can run the program, you need to configure it. There are two ways to do so: edit the variable definitions at the top of the program itself, |
| 22 | +or, if you are running it as a standalone program, pass the corresponding command line options. |
| 23 | +Here is the list of variables and what they define: |
| 24 | + |
| 25 | +| Variable | Description |Command Line Option| |
| 26 | +|:---------|:------------|:--------------------------------| |
| 27 | +|SNStopic | The SNS Topic name where CloudWatch will send alerts to. Note that it is assumed that the SNS topic, with the same name, will exist in all the regions where alarms are to be created.|-s SNS_Topic_Name| |
| 28 | +|accountId | The AWS account ID associated with the SNStopic. This is only used to compute the ARN to the SNS Topic.|-a Account_number| |
| 29 | +|customerId| This is really just a comment that will be added to the alarm description.|-c Customer_String| |
| 30 | +|defaultCPUThreshold | This will define the default CPU utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-C number| |
| 31 | +|defaultSSDThreshold | This will define the default SSD (aggregate) utilization threshold. You can override the default by having a specific tag associated with the file system. See below for more information.|-S number| |
| 32 | +|defaultVolumeThreshold | This will define the default Volume utilization threshold. You can override the default by having a specific tag associated with the volume. See below for more information.|-V number| |
| 33 | +|alarmPrefixCPU | This defines the string that will be put in front of the name of every CPU utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| |
| 34 | +|alarmPrefixSSD | This defines the string that will be put in front of the name of every SSD utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| |
| 35 | +|alarmPrefixVolume | This defines the string that will be put in front of the name of every volume utilization CloudWatch alarm that the program creates. Having a known prefix is how it knows it is the one maintaining the alarm.|N/A| |
| 36 | + |
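The table notes that `accountId` is only used to compute the ARN of the SNS topic. As a minimal sketch of that computation (not the script's actual code), the function below assembles an ARN in the standard `aws` partition; the function name `sns_topic_arn` and the sample values are hypothetical.

```python
def sns_topic_arn(region: str, account_id: str, topic_name: str) -> str:
    """Build an SNS topic ARN for the standard (commercial) AWS partition.

    Assumption: the same topic name exists in every region where alarms
    are created, as the SNStopic description requires.
    """
    return f"arn:aws:sns:{region}:{account_id}:{topic_name}"


# Hypothetical values, for illustration only.
print(sns_topic_arn("us-west-2", "123456789012", "fsx-alerts"))
```

Because the region is part of the ARN, the same topic name yields a different ARN in each region, which is why the topic has to exist in all of them.
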
| 37 | +As mentioned with the threshold variables, you can create a tag on the specific resource to override the default value set by the associated threshold |
| 38 | +variable. Here is the list of tags and where they should be located: |
| 39 | + |
| 40 | +|Tag|Description|Location| |
| 41 | +|:---|:------|:---| |
| 42 | +|alarm_threshold | Sets the volume utilization threshold. | Volume | |
| 43 | +|cpu_alarm_threshold| Sets the CPU utilization threshold. | File System | |
| 44 | +|ssd_alarm_threshold| Sets the SSD utilization threshold. | File System | |
| 45 | + |
| 46 | +:bulb: **NOTE:** When the alarm threshold is set to 100, the alarm will not be created. So, if you set the default to 100, then you can selectively add alarms by setting the appropriate tag. |
| 47 | + |
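The override rule described above (a tag on the resource takes precedence over the default, and a threshold of 100 means no alarm) could be sketched as follows. This is an illustration under assumptions, not the script's implementation: the function name `resolve_threshold` is hypothetical, tags are assumed to arrive as a list of `{"Key": ..., "Value": ...}` dictionaries, and falling back to the default on a non-numeric tag value is a choice made here.

```python
def resolve_threshold(tags, tag_key, default):
    """Return the threshold to alarm on, or None if no alarm should exist.

    tags    -- list of {"Key": ..., "Value": ...} dicts (may be None or empty)
    tag_key -- e.g. "alarm_threshold", "cpu_alarm_threshold", "ssd_alarm_threshold"
    default -- the default threshold from the configuration variables
    """
    threshold = default
    for tag in tags or []:
        if tag.get("Key") == tag_key:
            try:
                threshold = int(tag.get("Value"))
            except (TypeError, ValueError):
                # Assumption: an unparsable tag value leaves the default in place.
                pass
            break
    # A threshold of 100 means "do not create an alarm".
    return None if threshold == 100 else threshold


# Default of 100 disables alarms; a tag selectively enables one.
print(resolve_threshold([{"Key": "alarm_threshold", "Value": "90"}], "alarm_threshold", 100))
print(resolve_threshold([], "alarm_threshold", 100))
```
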
| 48 | +### Running on a computer |
| 49 | +To run the program on a computer, you must have Python installed. You will also need to install the boto3 library. |
| 50 | +You can do that by running the following command: |
| 51 | + |
| 52 | +```bash |
| 53 | +pip install boto3 |
| 54 | +``` |
| 55 | +Once you have Python and boto3 installed, you can run the program by executing the following command: |
| 56 | + |
| 57 | +```bash |
| 58 | +python3 auto_add_cw_alarms.py |
| 59 | +``` |
| 60 | +This will run the program based on all the variables set at the top. If you want to change the behavior without |
| 61 | +having to edit the program, you can use the command line options listed in the table above. Note that if you give the `-h` (or `--help`) |
| 62 | +option, the program will display a list of all the available options. |
| 63 | + |
| 64 | +You can limit the regions that the program will act on by using the `-r region` option. You can specify that option |
| 65 | +multiple times to act on multiple regions. |
| 66 | + |
| 67 | +You can run the program in "Dry Run" mode by specifying the `-d` (or `--dryRun`) option. This will cause the program to only display |
| 68 | +messages showing what it would have done, and not really create or delete any CloudWatch alarms. |
| 69 | + |
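To show how the options described above fit together, here is a minimal `argparse` sketch. It is not the script's actual parser: only the short flags, `--dryRun`, and `--help` come from this document, while the `dest` names and the `build_parser` function are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (dest names are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Create and delete CloudWatch alarms for FSx for ONTAP resources."
    )
    parser.add_argument("-s", dest="sns_topic", help="SNS topic name")
    parser.add_argument("-a", dest="account_id", help="AWS account ID")
    parser.add_argument("-c", dest="customer_id", help="string added to the alarm description")
    parser.add_argument("-C", dest="cpu_threshold", type=int, help="default CPU threshold")
    parser.add_argument("-S", dest="ssd_threshold", type=int, help="default SSD threshold")
    parser.add_argument("-V", dest="volume_threshold", type=int, help="default volume threshold")
    # "append" lets -r be given several times; None means "all regions" in this sketch.
    parser.add_argument("-r", dest="regions", action="append", help="region to act on")
    parser.add_argument("-d", "--dryRun", dest="dry_run", action="store_true",
                        help="only report what would be done")
    return parser


args = build_parser().parse_args(["-r", "us-east-1", "-r", "eu-west-1", "-d", "-V", "85"])
print(args.regions, args.dry_run, args.volume_threshold)
```

Options that are not given stay `None` here, so a caller could fall back to the variables defined at the top of the program.
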
| 70 | +### Running as a Lambda function |
| 71 | +If you run the program as a Lambda function, you will want to set the timeout to at least two minutes since some of the API calls |
| 72 | +can take a significant amount of "clock time" to run, especially in distant regions. |
| 73 | + |
| 74 | +Once you have installed the Lambda function, it is recommended that you set up a scheduled EventBridge rule so the function runs on a regular basis. |
| 75 | + |
| 76 | +The appropriate permissions will need to be assigned to the Lambda function in order for it to run correctly. |
| 77 | +It doesn't need many permissions. It just needs to be able to: |
| 78 | +* List the FSx for ONTAP file systems |
| 79 | +* List the FSx volume names |
| 80 | +* List the CloudWatch alarms |
| 81 | +* Create CloudWatch alarms |
| 82 | +* Delete CloudWatch alarms |
| 83 | +* Create CloudWatch Log Groups and Log Streams in case you need to diagnose an issue |
| 84 | + |
| 85 | +The following permissions are required to run the script, although you could narrow the "Resource" specification to suit your needs: |
| 86 | +```JSON |
| 87 | +{ |
| 88 | + "Version": "2012-10-17", |
| 89 | + "Statement": [ |
| 90 | + { |
| 91 | + "Sid": "VisualEditor0", |
| 92 | + "Effect": "Allow", |
| 93 | + "Action": [ |
| 94 | + "cloudwatch:PutMetricAlarm", |
| 95 | + "fsx:ListTagsForResource", |
| 96 | + "fsx:DescribeVolumes", |
| 97 | + "fsx:DescribeFileSystems", |
| 98 | + "cloudwatch:DeleteAlarms", |
| 99 | + "cloudwatch:DescribeAlarmsForMetric", |
| 100 | + "ec2:DescribeRegions", |
| 101 | + "cloudwatch:DescribeAlarms" |
| 102 | + ], |
| 103 | + "Resource": "*" |
| 104 | + }, |
| 105 | + { |
| 106 | + "Sid": "VisualEditor1", |
| 107 | + "Effect": "Allow", |
| 108 | + "Action": [ |
| 109 | + "logs:CreateLogStream", |
| 110 | + "logs:PutLogEvents" |
| 111 | + ], |
| 112 | + "Resource": "arn:aws:logs:*:*:log-group:*:log-stream:*" |
| 113 | + }, |
| 114 | + { |
| 115 | + "Sid": "VisualEditor2", |
| 116 | + "Effect": "Allow", |
| 117 | + "Action": "logs:CreateLogGroup", |
| 118 | + "Resource": "arn:aws:logs:*:*:log-group:*" |
| 119 | + } |
| 120 | + ] |
| 121 | +} |
| 122 | +``` |
| 123 | + |
| 124 | +### Expected Action |
| 125 | +Once the script has been configured and invoked, it will: |
| 126 | +* Scan for every FSx for ONTAP file system in every region. For every file system it finds, it will: |
| 127 | + * Create a CPU utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. |
| 128 | + * Create an SSD utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. |
| 129 | +* Scan for every FSx for ONTAP volume in every region. For every volume it finds, it will: |
| 130 | + * Create a volume utilization CloudWatch alarm, unless the threshold value is set to 100 for the specific alarm. |
| 131 | +* Scan the CloudWatch alarms and remove any alarm whose associated resource no longer exists. |
| 132 | + |
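The create-and-clean-up behavior listed above amounts to reconciling the set of alarms that should exist against the set that does. The sketch below illustrates that idea only and is not taken from the script: it assumes an alarm name is the configured prefix followed by the resource ID, and the function name `plan_alarm_changes`, the prefix, and the IDs are all hypothetical.

```python
def plan_alarm_changes(resource_ids, existing_alarm_names, prefix):
    """Return (alarms_to_create, alarms_to_delete) for one alarm type.

    Assumption: an alarm is named prefix + resource ID. Alarms that do not
    start with the prefix are not managed here and are never touched.
    """
    wanted = {prefix + resource_id for resource_id in resource_ids}
    managed = {name for name in existing_alarm_names if name.startswith(prefix)}
    to_create = sorted(wanted - managed)   # resource exists, alarm missing
    to_delete = sorted(managed - wanted)   # alarm exists, resource gone
    return to_create, to_delete


# Hypothetical volume IDs and alarm names, for illustration only.
create, delete = plan_alarm_changes(
    ["fsvol-1", "fsvol-2"],
    ["VolAlarm-fsvol-2", "VolAlarm-fsvol-9", "Unrelated-alarm"],
    "VolAlarm-",
)
print(create, delete)
```

Filtering on the prefix is what keeps alarms created by other tools out of the delete list, which matches the role the `alarmPrefix*` variables play in the configuration table.
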
| 133 | + |
| 134 | +## Author Information |
| 135 | + |
| 136 | +This repository is maintained by the contributors listed on [GitHub](https://github.com/NetApp/FSx-ONTAP-samples-scripts/graphs/contributors). |
| 137 | + |
| 138 | +## License |
| 139 | + |
| 140 | +Licensed under the Apache License, Version 2.0 (the "License"). |
| 141 | + |
| 142 | +You may obtain a copy of the License at [apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0). |
| 143 | + |
| 144 | +Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an _"AS IS"_ basis, without WARRANTIES or conditions of any kind, either express or implied. |
| 145 | + |
| 146 | +See the License for the specific language governing permissions and limitations under the License. |
| 147 | + |
| 148 | +© 2024 NetApp, Inc. All Rights Reserved. |