Commit 3705467

Added CloudWatch as a destination; added a volume inode utilization alert; added an EMS filter, as well as exceptions and specific matches for volume utilization alerts.
1 parent 63be888 commit 3705467

4 files changed

+489
-153
lines changed

Monitoring/monitor-ontap-services/README.md

Lines changed: 33 additions & 12 deletions
@@ -6,12 +6,14 @@ Here is an itemized list of the services that this program can monitor:
66
- If the file system is available.
77
- If the underlying Data ONTAP version has changed.
88
- If the file system is running off its partner node (i.e. is running in failover mode).
9+
- If any of the network interfaces are down.
910
- Any EMS message, with filtering to allow you to only be alerted on the ones you care about.
1011
- If a SnapMirror relationship hasn't been updated in a specified amount of time.
1112
- If a SnapMirror update has stalled.
1213
- If a SnapMirror relationship is in a "non-healthy" state.
1314
- If the aggregate is over a certain percentage full. User can set two thresholds (Warning and Critical).
1415
- If a volume is over a certain percentage full. User can set two thresholds (Warning and Critical).
16+
- If a volume is using more than a specified percentage of its inodes. User can set two thresholds (Warning and Critical).
1517
- If any quotas are over a certain percentage full. User can follow both soft and hard limits.
1618

1719
## Architecture
@@ -57,16 +59,20 @@ To install the program using the CloudFormation template, you will need to do th
5759
|Stackname|The name you want to assign to the CloudFormation stack. Note that this name is used as a base name for the resources it creates, so please keep it **under 25 characters**. Also, since it is used as part of the s3 bucket name that it creates to keep event information in, it **must be in all lower case letters**.|
5860
|OntapAdminServer|The DNS name, or IP address, of the management endpoint of the FSxN file system you wish to monitor.|
5961
|SubnetIds|The subnet IDs that the Lambda function will be attached to. Must have connectivity to the FSxN file system you wish to monitor.|
60-
|SecurityGroupIds|The security group IDs that the Lambda function will be attached to. The security group most allow outbound traffic over port 443 to the SNS, Secrets Manager and S3 endpoints, as well as the FSxN file system you want to monitor.|
62+
|SecurityGroupIds|The security group IDs that the Lambda function will be attached to. The security group must allow outbound traffic over port 443 to the SNS, Secrets Manager and S3 endpoints, as well as the FSxN file system you want to monitor.|
6163
|SnsTopicArn|The ARN of the SNS topic you want the program to publish alert messages to.|
64+
|CloudWatchLogGroupName|The name of **an existing CloudWatch log group** that the Lambda function will write its logs to. If left blank, alerts will not be sent to CloudWatch.|
6265
|SecretArn|The ARN of the secret within the AWS Secrets Manager that holds the FSxN file system credentials. **NOTE:** The secret must be in the same region as the FSxN file system.|
63-
|SecretUsernameKey|The key name within the secret that holds the username portion of the FSxN file system credentials.|
64-
|SecretPasswordKey|The key name within the secret that holds the password portion of the FSxN file system credentials.|
66+
|SecretUsernameKey|The name of the key within the secret that holds the username portion of the FSxN file system credentials.|
67+
|SecretPasswordKey|The name of the key within the secret that holds the password portion of the FSxN file system credentials.|
68+
|LambdaRoleArn|The ARN of the role that the Lambda function will use. This role must have the permissions listed in the [Create an AWS Role](#create-an-aws-role) section below. If left blank, a role will be created.|
69+
|SchedulerRoleArn|The ARN of the role that the EventBridge schedule will use to trigger the Lambda function. It only needs permission to invoke the Lambda function. If left blank, a role will be created.|
6570
|CheckInterval|The interval, in minutes, that the EventBridge schedule will trigger the Lambda function. The default is 15 minutes.|
6671
|CreateCloudWatchAlarm|Set to "true" if you want to create a CloudWatch alarm that will alert you if the Lambda function fails.|
6772
|CreateSNSEndpoint|Set to "true" if you want to create an SNS endpoint. **NOTE:** If an SNS endpoint already exists for the specified subnet, the creation will fail, causing the entire CloudFormation script to fail. Since the Lambda function will be running within your VPC, it will most likely not have access to the Internet, so an endpoint will need to be created if you don't already have one. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) section for more information.|
6873
|CreateSecretsManagerEndpoint|Set to "true" if you want to create a Secrets Manager endpoint. **NOTE:** If a Secrets Manager endpoint already exists for the specified subnet, the creation will fail, causing the entire CloudFormation script to fail. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) section for more information.|
6974
|CreateS3Endpoint|Set to "true" if you want to create an S3 endpoint. **NOTE:** If an S3 Gateway endpoint already exists for the specified VPC, the creation will fail, causing the entire CloudFormation script to fail. Note that this will be a "Gateway" type endpoint, since they are free to use. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) section for more information.|
75+
|CreateCWEndpoint|Set to "true" if you want to create a CloudWatch endpoint. **NOTE:** If a CloudWatch endpoint already exists for the specified subnet, the creation will fail, causing the entire CloudFormation script to fail. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) section for more information.|
7076
|RoutetableIds|The route table IDs to update to use the S3 endpoint. Since the S3 endpoint is of type 'Gateway', route tables have to be updated to use it. This parameter is only needed if CreateS3Endpoint is set to 'true'.|
7177
|VpcId|The VPC ID where the FSxN file system is located. This is only needed if you are creating an endpoint.|
7278
|EndpointSecurityGroupIds|The security group IDs that the endpoint will be attached to. The security group must allow traffic over TCP port 443 from the Lambda function. This is only needed if you are creating an SNS or SecretsManager endpoint.|
@@ -100,12 +106,12 @@ help you.
100106
If you want more control over the installation then you can install it manually by following the steps below. Note that these
101107
instructions assume you have familiarity with how to create the various AWS services mentioned below. If you do not,
102108
I would recommend using the CloudFormation method of deploying the program. Afterwards, if you need to change things, make the required
103-
modifications then.
109+
modifications then using the instructions found below.
104110

105111
#### Create an AWS Role
106112
This program doesn't need many permissions. It just needs to be able to read the FSxN file system credentials stored in a Secrets Manager secret,
107-
read and write objects in an s3 bucket, and be able to publish SNS messages. Below is the specific list of permissions
108-
needed. The easiest way to give the Lambda function the permissions it needs is by creating a role with these
113+
read and write objects in an S3 bucket, publish SNS messages, and optionally create CloudWatch log streams and put log events.
114+
Below is the specific list of permissions needed. The easiest way to give the Lambda function the permissions it needs is by creating a role with these
109115
permissions and assigning it to the Lambda function.
110116

111117
| Permission | Reason |
@@ -115,7 +121,10 @@ permissions and assigning it to the Lambda function.
115121
|s3:PutObject | The program stores its state information in various s3 objects.|
116122
|s3:GetObject | The program reads previous state information, as well as configuration from various s3 objects. |
117123
|s3:ListBucket | To allow the program to know if an object exists or not. |
118-
|ec2:CreateNetworkInterface | Since the program runs as a Lambda function within your VPC, it needs to be able to create a network interface in your VPC. you can read more about that [here](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html). |
124+
|logs:CreateLogStream | If you want the program to send its logs to CloudWatch, it needs to be able to create a log stream. |
125+
|logs:PutLogEvents | If you want the program to send its logs to CloudWatch, it needs to be able to put log events into the log stream. |
126+
|logs:DescribeLogStreams | If you want the program to send its logs to CloudWatch, it needs to be able to see if a log stream already exists before creating one. |
127+
|ec2:CreateNetworkInterface | Since the program runs as a Lambda function within your VPC, it needs to be able to create a network interface in your VPC. You can read more about that [here](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html). |
119128
|ec2:DeleteNetworkInterface | Since it created a network interface, it needs to be able to delete it when not needed anymore. |
120129
|ec2:DescribeNetworkInterfaces | So it can check to see if a network interface already exists. |
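
As a rough illustration, the permissions in the rows shown above could be assembled into an IAM policy document as follows. This is only a sketch: `build_lambda_policy`, the bucket name, and the log group ARN are hypothetical, the resource scoping is an assumption rather than what the CloudFormation template does, and rows of the table not shown in this hunk are omitted.

```python
import json


def build_lambda_policy(bucket_name, log_group_arn=None):
    """Build a partial IAM policy document (hypothetical helper).

    Covers only the S3, EC2 and optional CloudWatch Logs permissions
    listed in the rows above; other permissions are not included.
    """
    statements = [
        {   # Object-level access to the state and configuration objects.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        },
        {   # s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket_name}",
        },
        {   # Network interface management for a VPC-attached Lambda function.
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:DeleteNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
            ],
            "Resource": "*",
        },
    ]
    if log_group_arn:
        # Only needed when alerts should also be sent to CloudWatch.
        statements.append({
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams",
            ],
            # Assumption: scope to the log group and the streams inside it.
            "Resource": [log_group_arn, f"{log_group_arn}:*"],
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = build_lambda_policy(
        "my-monitoring-bucket",
        "arn:aws:logs:us-east-1:123456789012:log-group:my-log-group",
    )
    print(json.dumps(policy, indent=2))
```

Leaving `log_group_arn` unset drops the `logs:*` statement, mirroring the fact that the CloudWatch permissions are optional.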
121130

@@ -198,6 +207,7 @@ filename, then set the configFilename environment variable to the name of your c
198207
| quotaEventsFilename | No | No | OntapAdminServer + "-quotaEvents" | Set to the filename (S3 object) that you want the program to store the Quota alerts into. This file will be created as necessary. |
199208
| systemStatusFilename | No | No | OntapAdminServer + "-systemStatus" | Set to the filename (S3 object) that you want the program to store the overall system status information into. This file will be created as necessary. |
200209
| snsTopicArn | Yes | No | None | Set to the ARN of the SNS topic you want the program to publish alert messages to. |
210+
| cloudWatchLogGroupName | No | No | None | The name of **an existing CloudWatch log group** that the Lambda function will write its logs to. If left blank, alerts will not be sent to CloudWatch.|
201211
| conditionsFilename | Yes | No | OntapAdminServer + "-conditions" | Set to the filename (S3 object) where you want the program to read the matching condition information from. |
202212
| secretArn | Yes | No | None | Set to the ARN of the secret within the AWS Secrets Manager that holds the FSxN credentials. |
203213
| secretUsernameKey | Yes | No | None | Set to the key name within the secretName that holds the username portion of the FSxN credentials. |
@@ -211,10 +221,9 @@ The Matching Conditions file allows you to specify which events you want to be a
211221
file is JSON. JSON is basically a series of "key" : "value" pairs, where the value can itself be an object that also has
212222
"key" : "value" pairs. For more information about the format of a JSON file, please refer to this [page](https://www.json.org/json-en.html).
213223
The JSON schema in this file is made up of an array of objects with a key name of "services". Each element of the "services" array
214-
is an object with two keys. The first key is “name" which specifies the name of the service it is going to provide
224+
is an object with at least two keys. The first key is "name", which specifies the name of the service it is going to provide
215225
matching conditions (rules) for. The second key is "rules", which is an array of objects that provide the specific
216-
matching condition. Note that each service's rules has its own unique schema. The following is the unique schema
217-
for each of the service's rules.
226+
matching condition. Note that each service's rules have their own unique schema. The following is the definition of the schema for each service.
218227

219228
###### Matching condition schema for System Health (systemHealth)
220229
Each rule should be an object with one, or more, of the following keys:
@@ -233,6 +242,7 @@ Each rule should be an object with three keys:
233242
|name|String|Which will match on the EMS event name.|
234243
|message|String|Which will match on the EMS event message text.|
235244
|severity|String|Which will match on the severity of the EMS event (debug, informational, notice, error, alert or emergency).|
245+
|filter|String|If an event's message text matches this filter, the EMS event will be skipped. Try to be as specific as possible to avoid unintentional matches.|
236246

237247
Note that the value of each key is used as a regular expression against the associated EMS component. So, for
238248
example, if you want to match on any event message text that starts with “snapmirror” then you would put “\^snapmirror”.
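
To make the interaction between the match keys and `filter` concrete, here is a minimal sketch of how one EMS rule might be evaluated against one event. The function name `ems_rule_matches`, the event layout, and the choices of `re.search` with case-sensitive matching are assumptions for illustration; the program's actual matching logic may differ.

```python
import re


def ems_rule_matches(rule, event):
    """Return True if `event` should raise an alert under `rule` (hypothetical helper)."""
    # Assumption: an absent or empty pattern places no restriction on that field.
    for key in ("name", "message", "severity"):
        pattern = rule.get(key)
        if pattern and not re.search(pattern, event.get(key, "")):
            return False
    # The filter is applied to the message text; a hit means "skip this event".
    skip_pattern = rule.get("filter")
    if skip_pattern and re.search(skip_pattern, event.get("message", "")):
        return False
    return True


if __name__ == "__main__":
    # Example rule and event; all values are made up for illustration.
    rule = {
        "name": "",
        "message": "^snapmirror",
        "severity": "error|alert",
        "filter": "test-volume",
    }
    event = {
        "name": "sm.update.fail",
        "message": "snapmirror update failed for vol1",
        "severity": "error",
    }
    print(ems_rule_matches(rule, event))
```

With this rule, an otherwise matching event whose message mentions `test-volume` is skipped, which is why the filter should be as specific as possible.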
@@ -248,17 +258,28 @@ Each rule should be an object with one, or more, of the following keys:
248258
|---|---|---|
249259
|maxLagTime|Integer|Specifies the maximum allowable time, in seconds, since the last successful SnapMirror update before an alert will be sent.|
250260
|stalledTransferSeconds|Integer|Specifies the minimum number of seconds that have to transpire before a SnapMirror transfer will be considered stalled.|
251-
|health|Boolean|If true will alert with the relationship is health. If false will alert with the relationship is unhealthy.|
261+
|healthy|Boolean|If true, will alert when the relationship is healthy. If false, will alert when the relationship is unhealthy.|
252262

253263
###### Matching condition schema for Storage (storage)
254-
Each rule should be an object with one, or more, of the following keys:
264+
The storage schema has two additional keys that can be included before the rules:
265+
|Key Name|Value Type|Notes|
266+
|---|---|---|
267+
|exceptions|Array of objects|Each entry in this array specifies a cluster name, SVM Name, and Volume Name combination that should be ignored for the rules specified within its block. The format of the object is:
268+
`{ "cluster": "string", "svm": "string", "name": "string" }`|
269+
|matches|Array of objects|Each entry in this array specifies a cluster name, SVM Name, and Volume Name combination that must be matched for the rules specified within its block to be applied. The format of the object is:
270+
`{ "cluster": "string", "svm": "string", "name": "string" }`|
255271

272+
The exceptions and matches keys are optional. If they are not specified, then the rules will be applied to all clusters, SVMs and volumes.
273+
274+
Each rule should be an object with one, or more, of the following keys:
256275
|Key Name|Value Type|Notes|
257276
|---|---|---|
258277
|aggrWarnPercentUsed|Integer|Specifies the maximum allowable physical storage (aggregate) utilization (between 0 and 100) before a warning alert is sent.|
259278
|aggrCriticalPercentUsed|Integer|Specifies the maximum allowable physical storage (aggregate) utilization (between 0 and 100) before a critical alert is sent.|
260279
|volumeWarnPercentUsed|Integer|Specifies the maximum allowable volume utilization (between 0 and 100) before a warning alert is sent.|
261280
|volumeCriticalPercentUsed|Integer|Specifies the maximum allowable volume utilization (between 0 and 100) before a critical alert is sent.|
281+
|volumeWarnFilesPercentUsed|Integer|Specifies the maximum allowable volume files (inodes) utilization (between 0 and 100) before a warning alert is sent.|
282+
|volumeCriticalFilesPercentUsed|Integer|Specifies the maximum allowable volume files (inodes) utilization (between 0 and 100) before a critical alert is sent.|
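
The sketch below shows one way the `exceptions`, `matches` and inode threshold keys could fit together. It assumes that `exceptions` and `matches` sit next to `name` and `rules` in the service object, that entries are compared by exact string equality, and that "over" a threshold means strictly greater than; the helper names and sample values are hypothetical and not taken from the program.

```python
def volume_in_scope(service, volume):
    """Decide whether the service's rules apply to `volume` (hypothetical helper)."""
    def same(entry):
        # Assumption: all three fields must be equal for an entry to apply.
        return all(entry.get(k) == volume.get(k) for k in ("cluster", "svm", "name"))

    # Volumes listed under "exceptions" are ignored.
    if any(same(entry) for entry in service.get("exceptions", [])):
        return False
    # If "matches" is present, only the listed volumes are checked.
    matches = service.get("matches")
    if matches and not any(same(entry) for entry in matches):
        return False
    # Neither key specified: the rules apply to every volume.
    return True


def files_alert_level(rule, files_percent_used):
    """Return "critical", "warning" or None for an inode utilization percentage."""
    critical = rule.get("volumeCriticalFilesPercentUsed")
    warning = rule.get("volumeWarnFilesPercentUsed")
    if critical is not None and files_percent_used > critical:
        return "critical"
    if warning is not None and files_percent_used > warning:
        return "warning"
    return None


if __name__ == "__main__":
    # Example service entry; cluster, SVM and volume names are placeholders.
    service = {
        "name": "storage",
        "exceptions": [{"cluster": "fsx01", "svm": "svm1", "name": "scratch"}],
        "rules": [{"volumeWarnFilesPercentUsed": 80, "volumeCriticalFilesPercentUsed": 90}],
    }
    volume = {"cluster": "fsx01", "svm": "svm1", "name": "data"}
    if volume_in_scope(service, volume):
        for rule in service["rules"]:
            print(files_alert_level(rule, 85))
```

Checking the critical threshold first ensures a volume that exceeds both thresholds is reported once, at the higher severity.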
262283

263284
###### Matching condition schema for Quota (quota)
264285
Each rule should be an object with one, or more, of the following keys:
