Monitoring/monitor-ontap-services/README.md
Lines changed: 31 additions & 28 deletions
@@ -1,22 +1,13 @@
1
-
# Introduction
2
-
Currently there is some functionality within an FSx for NetApp ONTAP (FSxN) file system for which there is no corresponding
3
-
CloudWatch metrics. For example, there is no CloudWatch metrics for a SnapMirror relationship, so there is no way to
4
-
alert on when an update has stalled, or it is simply not considered Healthy by Data ONTAP. The purpose of this blog
5
-
is to show how a relatively small Python program, that can be run as a Lambda function, can leverage the ONTAP APIs
6
-
to obtain the required information to detect certain conditions, and when found, send SNS messages to alert someone.
7
-
8
-
This program was initially created to forward EMS messages to an AWS service outside of the FSxN file system since
9
-
there was no way to do that from the FSxN file system itself (i.e. the syslog forwarding didn't work at the time). As it turns out this is
10
-
no longer the case, in that as of Data ONTAP 9.13.1 you can now forward EMS messages to a 'syslog' server. However, once this program was created,
11
-
other functionality was added to monitor other Data ONTAP services that AWS didn't provide a way to trigger an alert when
12
-
something was outside of an expected realm. For example, if the lag time between SnapMirror synchronization were more
13
-
than a specified amount of time. Or, if a SnapMirror update was stalled. This program can alert on all these things and more.
1
+
# Monitoring ONTAP Services
2
+
3
+
## Introduction
4
+
This program monitors various services of a NetApp ONTAP file system. It uses the ONTAP APIs to obtain the information required to determine whether any of the monitored conditions have been met. If they have, the program sends an SNS message to the specified SNS topic. The program also sends a syslog message to a syslog server if the syslogIP parameter is set. The program stores the event information in an S3 bucket so that new events can be compared against it before a second message is sent for the same event. The configuration file is also kept in the S3 bucket for easy access.
14
5
Here is an itemized list of the services that this program can monitor:
15
6
- If the file system is available.
16
7
- If the underlying Data ONTAP version has changed.
17
-
- If the file system is running off its partner node (i.e. a failover has occurred).
8
+
- If the file system is running off its partner node (i.e. is running in failover mode).
18
9
- Any EMS message, with filtering to allow you to only be alerted on the ones you care about.
19
-
- If a SnapMirror relationship hasn't been updated in a user specified amount of time.
10
+
- If a SnapMirror relationship hasn't been updated in a specified amount of time.
20
11
- If a SnapMirror update has stalled.
21
12
- If a SnapMirror relationship is in a "non-healthy" state.
22
13
- If the aggregate is over a certain percentage full. You can set two thresholds (Warning and Critical).
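
The introduction says event information is kept in S3 so that a second message is not sent for the same event. A minimal sketch of that idea is shown below. It is not the program's actual code: the function names, the `"id"` field, and the JSON layout of the stored object are all assumptions made for illustration.

```python
import json


def filter_new_events(current_events, stored_events):
    """Return only the events that have not been reported before.

    Both arguments are lists of dicts that carry a unique "id" key.
    The key name is an assumption made for this sketch.
    """
    seen = {event["id"] for event in stored_events}
    return [event for event in current_events if event["id"] not in seen]


def load_stored_events(s3_client, bucket, key):
    """Read previously reported events from S3; return [] if the object is missing."""
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
    except s3_client.exceptions.NoSuchKey:
        return []
    return json.loads(response["Body"].read())


def save_stored_events(s3_client, bucket, key, events):
    """Write the list of reported events back to S3 as JSON."""
    body = json.dumps(events).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
```

Here `s3_client` would be a `boto3.client("s3")` object. Only the events returned by `filter_new_events` would be published to SNS, and the stored list would then be updated so the next run skips them.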
@@ -33,9 +24,17 @@ a second message for the same event. The configuration files is also kept in the
- An FSx for NetApp ONTAP file system you want to monitor.
29
+
- The security group associated with the FSx for ONTAP file system must allow inbound traffic from the Lambda function over TCP port 443.
30
+
- An SNS topic to send the alerts to.
31
+
- An AWS Secrets Manager secret that holds the FSx for ONTAP file system credentials. There should be two keys in the secret, one for the username and one for the password.
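
As an illustration of the two-key secret described above, the following sketch reads a secret and extracts the username and password. It assumes the secret is stored as a JSON string. The function names are hypothetical, and the key names are whatever you chose when creating the secret.

```python
import json


def parse_credentials(secret_string, username_key, password_key):
    """Extract the username and password from a JSON secret string."""
    secret = json.loads(secret_string)
    return secret[username_key], secret[password_key]


def get_credentials(secret_arn, username_key, password_key, region):
    """Fetch the secret from AWS Secrets Manager and return (username, password)."""
    # Imported here so parse_credentials can be used without boto3 installed.
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_arn)
    return parse_credentials(response["SecretString"], username_key, password_key)
```

A `KeyError` from `parse_credentials` means the key names passed in do not match the keys stored in the secret.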
32
+
36
33
## Installation
37
34
There are two ways to install this program. You can either perform all the steps shown in the [Manual Installation](#manual-installation) section below, or run
38
-
the CloudFormation template that is provided in this repository.
35
+
the CloudFormation template that is provided in this repository. The manual installation is more involved, but it gives you more control and allows you to
36
+
make changes to settings that aren't available in the CloudFormation template. The CloudFormation template is easier to use, but it doesn't allow for as much
37
+
customization.
39
38
40
39
### Installation using the CloudFormation template
41
40
The CloudFormation template will do the following:
@@ -44,29 +43,33 @@ The CloudFormation template will do the following:
44
43
- Create an S3 bucket in which the Lambda function stores the matching conditions file and the event information.
45
44
- Create an EventBridge Schedule to trigger the Lambda function every 15 minutes. If you want the function to run more or less frequently, you can change that after the CloudFormation stack has been created.
46
45
- Create a role that allows the EventBridge schedule to trigger the Lambda function.
46
+
- Optionally create a CloudWatch alarm that will alert you if the Lambda function fails.
47
+
- Optionally create VPC endpoints for the SNS, Secrets Manager, and/or S3 services.
47
48
48
49
To install the program using the CloudFormation template, you will need to do the following:
49
-
1. Download the CloudFormation template from this repository. The name of the file is 'cloudformation.yaml'.
50
+
1. Download the CloudFormation template from this repository. You can do that by clicking on the 'cloudformation.yaml' file in the repository, then clicking on the download icon next to the "Raw" button at the top right of the page. That should cause your browser to download the file to your local computer.
50
51
2. Go to the CloudFormation service in the AWS console and click on "Create stack (with new resources)".
51
52
3. Choose the "Upload a template file" option and select the CloudFormation template you downloaded in step 1.
52
-
4. This should bring up a new window with several of parameters to provide values to. Most have defaults, but some do require values to be provided.
53
+
4. This should bring up a new window with several parameters that need values. Most have defaults, but some require you to provide a value. See the list below for what each parameter is for.
53
54
54
55
|Parameter Name | Notes|
55
56
|---|---|
56
57
|Stackname|The name you want to assign to the CloudFormation stack. Note that this name is used as a base name for the resources it creates, so please keep it under 25 characters.|
57
58
|OntapAdminServer|The DNS name, or IP address, of the management endpoint of the FSxN file system you wish to monitor.|
58
59
|SubnetIds|The subnet IDs that the Lambda function will be attached to. Must have connectivity to the FSxN file system you wish to monitor.|
59
-
|SecurityGroupIds|The security group IDs that the Lambda function will be attached to.|
60
+
|SecurityGroupIds|The security group IDs that the Lambda function will be attached to. The security group must allow outbound traffic over port 443 to the SNS, Secrets Manager and S3 endpoints, as well as the FSxN file system you want to monitor.|
60
61
|SnsTopicArn|The ARN of the SNS topic you want the program to publish alert messages to.|
61
62
|SecretArn|The ARN of the secret within the AWS Secrets Manager that holds the FSxN file system credentials. **NOTE:** The secret must be in the same region as the FSxN file system.|
62
63
|SecretUsernameKey|The key name within the secret that holds the username portion of the FSxN file system credentials.|
63
64
|SecretPasswordKey|The key name within the secret that holds the password portion of the FSxN file system credentials.|
64
-
|CreateSNSEndpoint|Set to "true" if you want to create an SNS endpoint. Since the Lambda function will be running within your VPC it will most likely not have access to the Internet, therefore a endpoint will need to be created if you don't already have one. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) for more information.|
65
-
|CreateSecretsManagerEndpoint|Set to "true" if you want create a Secrets Manager endpoint. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) for more information.|
66
-
|CreateS3Endpoint|Set to "true" if you want create an S3 endpoint. Note that this will be a "Gateway" type endpoint, since they are free to use. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) for more information.|
65
+
|CheckInterval|The interval, in minutes, that the EventBridge schedule will trigger the Lambda function. The default is 15 minutes.|
66
+
|CreateCloudWatchAlarm|Set to "true" if you want to create a CloudWatch alarm that will alert you if the Lambda function fails.|
67
+
|CreateSNSEndpoint|Set to "true" if you want to create an SNS endpoint. **NOTE:** If an SNS endpoint already exists for the specified subnet, the creation will fail, causing the entire CloudFormation script to fail. Since the Lambda function will be running within your VPC, it will most likely not have access to the Internet; therefore, an endpoint will need to be created if you don't already have one. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) section for more information.|
68
+
|CreateSecretsManagerEndpoint|Set to "true" if you want to create a Secrets Manager endpoint. **NOTE:** If a Secrets Manager endpoint already exists for the specified subnet, the creation will fail, causing the entire CloudFormation script to fail. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) section for more information.|
69
+
|CreateS3Endpoint|Set to "true" if you want to create an S3 endpoint. **NOTE:** If an S3 Gateway endpoint already exists for the specified VPC, the creation will fail, causing the entire CloudFormation script to fail. Note that this will be a "Gateway" type endpoint, since they are free to use. Please read the [Endpoints for AWS services](#endpoints-for-aws-services) section for more information.|
67
70
|RoutetableIds|The route table IDs to update to use the S3 endpoint. Since the S3 endpoint is of type 'Gateway', route tables have to be updated to use it. This parameter is only needed if CreateS3Endpoint is set to 'true'.|
68
71
|VpcId|The VPC ID where the FSxN file system is located. This is only needed if you are creating an endpoint.|
69
-
|CheckInterval|The interval, in minutes, that the EventBridge schedule will trigger the Lambda function. The default is 15 minutes.|
72
+
|EndpointSecurityGroupIds|The security group IDs that the endpoint will be attached to. The security group must allow traffic over TCP port 443 from the Lambda function. This is only needed if you are creating an SNS or SecretsManager endpoint.|
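
If you would rather script the stack creation than use the console, the parameters in the table above can be passed with boto3. The sketch below rests on several assumptions: the function names are hypothetical, the template is small enough to be passed inline as `TemplateBody`, and the template needs the `CAPABILITY_NAMED_IAM` capability because it creates IAM roles. Check the template before relying on it.

```python
def build_parameters(values):
    """Convert a plain dict into the list-of-dicts format CloudFormation expects.

    List-type parameters (for example SubnetIds) should be given as a single
    comma-separated string, and booleans as the strings "true" or "false".
    """
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


def create_monitoring_stack(stack_name, template_path, values):
    """Create the CloudFormation stack from a local copy of the template."""
    # Imported here so build_parameters can be used without boto3 installed.
    import boto3

    with open(template_path) as template_file:
        template_body = template_file.read()

    cloudformation = boto3.client("cloudformation")
    return cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=build_parameters(values),
        # Assumption: the template creates IAM roles, which requires a capability.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
```

Parameters that are left out of `values` fall back to the defaults defined in the template.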
70
73
71
74
The remaining parameters are used to create the matching conditions file, which specifies when the program will send an SNS alert.
72
75
You can read more about it in the [Matching Conditions File](#matching-conditions-file) section below. All these parameters have default values
@@ -82,11 +85,11 @@ created, you can go to the CloudFormation service in the AWS console, click on t
82
85
After the stack has been created, I would recommend checking the status of the Lambda function to make sure it is
83
86
not in an error state. To find the Lambda function go to the Resources tab of the CloudFormation
84
87
stack and click on the "Physical ID" of the Lambda function. This should bring you to the Lambda service in the AWS
85
-
console. Once there, you can click on the "Monitoring" tab to see if the function has been invoked. Locate the
86
-
"Error count and success rate(%)" chart, which is usually found at the top right corner of the monitoring dashboard.
88
+
console. Once there, you can click on the "Monitor" tab to see if the function has been invoked. Locate the
89
+
"Error count and success rate(%)" chart, which is usually found at the top right corner of the "Monitor" dashboard.
87
90
Within the "CheckInterval" number of minutes there should be at least one dot on that chart. Note that sometimes
88
91
the chart is initially slow to reflect any status so you might have to be patient, and continue to press the "refresh"
89
-
button (the icon with a circle on it) to see an status. Once you see a dot on the chart, when you hover you mouse
92
+
button (the icon with a circle on it) to see a status. Once you see a dot on the chart, when you hover your mouse
90
93
over it, you should see the "success rate" and "number of errors." The success rate should be 100% and the number
91
94
of errors should be 0. If it is not, then scroll down to the CloudWatch Logs section and click on the most recent
92
95
log stream. This will show you the output of the Lambda function. If there are any errors, they will be displayed
@@ -112,8 +115,8 @@ permissions and assigning it to the Lambda function.
112
115
|s3:PutObject| The program stores its state information in various S3 objects.|
113
116
|s3:GetObject| The program reads previous state information, as well as configuration, from various S3 objects. |
114
117
|s3:ListBucket| To allow the program to know if an object exists or not. |
115
-
|ec2:CreateNetworkInterface| Since the program runs as a Lambda function within your VPC, it needs to be able to create a network interface in your VPC. |
116
-
|ec2:DeleteNetworkInterfaces| Since it created a network interface, it needs to be able to delete it when not needed anymore. |
118
+
|ec2:CreateNetworkInterface| Since the program runs as a Lambda function within your VPC, it needs to be able to create a network interface in your VPC. You can read more about that [here](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html). |
119
+
|ec2:DeleteNetworkInterface| Since it created a network interface, it needs to be able to delete it when not needed anymore. |
117
120
|ec2:DescribeNetworkInterfaces| So it can check to see if a network interface already exists. |
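
For reference, the S3 and EC2 permissions in the rows above could be expressed as an IAM policy document along the following lines. This is a sketch only. It covers just the permissions visible in these rows, not the full set the program needs, and the function name and bucket name are placeholders.

```python
import json


def build_policy(bucket_name):
    """Build an IAM policy document for the S3 and EC2 permissions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions apply to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Network interface actions used by a VPC-attached Lambda function.
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateNetworkInterface",
                    "ec2:DeleteNetworkInterface",
                    "ec2:DescribeNetworkInterfaces",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # "my-monitoring-bucket" is a placeholder bucket name.
    print(json.dumps(build_policy("my-monitoring-bucket"), indent=2))
```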