Commit 0db507e

Updated the CloudFormation template and README file to be clearer.
1 parent e3b32cd commit 0db507e

3 files changed, 109 additions and 28 deletions

Monitoring/monitor-ontap-services/README.md

Lines changed: 11 additions & 11 deletions
@@ -95,7 +95,7 @@ To install the program using the CloudFormation template, you will need to do th
9595
|SubnetIds|The subnet IDs that the Lambda function will be attached to. They must have connectivity to the FSxN file system management endpoint that you wish to monitor. It is recommended to select at least two.|
9696
|SecurityGroupIds|The security group IDs that the Lambda function will be attached to. The security group must allow outbound traffic over port 443 to the SNS, Secrets Manager, CloudWatch, and S3 AWS service endpoints, as well as the FSxN file system you want to monitor.|
9797
|SnsTopicArn|The ARN of the SNS topic you want the program to publish alert messages to.|
98-
|CloudWatchLogGroupName|The name of **an existing** CloudWatch Log Group that the Lambda function can send event messages to. It will create a new Log Stream within the Log Group every day that is unique to this file system so you can use the same Log Group for multiple instances of this program. If this field is left blank, alerts will not be sent to CloudWatch.|
98+
|CloudWatchLogGroupARN|The ARN of **an existing** CloudWatch Log Group that the Lambda function can send event messages to. It will create a new Log Stream within the Log Group every day that is unique to this file system so you can use the same Log Group for multiple instances of this program. If this field is left blank, alerts will not be sent to CloudWatch.|
9999
|SecretArn|The ARN of the secret within the AWS Secrets Manager that holds the FSxN file system credentials.|
100100
|SecretUsernameKey|The name of the key within the secret that holds the username portion of the FSxN file system credentials. The default is 'username'.|
101101
|SecretPasswordKey|The name of the key within the secret that holds the password portion of the FSxN file system credentials. The default is 'password'.|
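
Because the parameter now takes an ARN rather than a name, code that calls CloudWatch Logs APIs expecting a log group name has to extract the name from the ARN. The diff does not show how the Lambda function handles this, so the following is only a minimal sketch; the helper name `log_group_name_from_arn` is hypothetical. It relies on the documented ARN layout `arn:partition:logs:region:account-id:log-group:name`, optionally followed by `:*`.

```python
def log_group_name_from_arn(arn: str) -> str:
    """Return the log group name embedded in a CloudWatch Logs log group ARN.

    Hypothetical helper for illustration; not taken from the program itself.
    """
    # Split into at most 7 fields so the name (field 7) keeps any ":*" suffix.
    parts = arn.split(":", 6)
    if (
        len(parts) < 7
        or parts[0] != "arn"
        or parts[2] != "logs"
        or parts[5] != "log-group"
    ):
        raise ValueError(f"not a CloudWatch log group ARN: {arn!r}")

    name = parts[6]
    # Some AWS APIs return the ARN with a trailing ":*"; drop it if present.
    if name.endswith(":*"):
        name = name[:-2]
    if not name:
        raise ValueError(f"log group name is empty in ARN: {arn!r}")
    return name


if __name__ == "__main__":
    example = "arn:aws:logs:us-east-1:123456789012:log-group:/fsxn/monitor:*"
    print(log_group_name_from_arn(example))
```

Validating the ARN up front gives a clear error when a plain log group name is supplied by mistake, which is a likely slip for anyone upgrading from the previous `CloudWatchLogGroupName` parameter.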
@@ -124,16 +124,16 @@ set for the OntapAdminServer parameter.
124124
After the stack has been created, check the status of the Lambda function to make sure it is
125125
not in an error state. To find the Lambda function go to the Resources tab of the CloudFormation
126126
stack and click on the "Physical ID" of the Lambda function. This should bring you to the Lambda service in the AWS
127-
console. Once there, click on the "Monitor" tab to see if the function has been invoked. Locate the
127+
console. Once there, click on the "Monitor" tab to see if the function has been invoked. Note that it will take
128+
at least the configured iteration time before the function is invoked for the first time. Locate the
128129
"Error count and success rate(%)" chart, which is usually found at the top right corner of the "Monitor" dashboard.
129-
Within the "CheckInterval" number of minutes there should be at least one dot on that chart. Note that initially
130-
the chart is slow to reflect any status so you might have to be patient. Continue to press the "refresh"
131-
button (the icon with a circle on it) every minute or so to update the status. Once you see a dot on the chart, when you hover your mouse
132-
over it, you should see the "success rate" and "number of errors." The success rate should be 100% and the number
133-
of errors should be 0. If it is not, then scroll down to the CloudWatch Logs section and click on the most recent
134-
log stream. This will show you the output of the Lambda function. If there are any errors, they will be displayed
135-
there. If you can't figure out what is causing an error, then please create an issue in this repository and someone
136-
will help you.
130+
After the "CheckInterval" number of minutes there should be at least one dot on that chart.
131+
Hover your mouse over the dot and you should see the "success rate" and "number of errors."
132+
The success rate should be 100% and the number of errors should be 0. If it is not, then scroll up a little bit and
133+
click on the "View CloudWatch Logs" link. Once on this page, click on the first LogStream and review any output.
134+
If there are any errors, they will be displayed there. If you can't figure out what is causing an error,
135+
please create an issue in the [Issues](https://github.com/NetApp/FSx-ONTAP-samples-scripts/issues) section
136+
of this repository and someone will help you.
137137

138138
---
139139

@@ -324,7 +324,7 @@ Each rule should be an object with one, or more, of the following keys:
324324
|failover|Boolean|If 'true' the program will send an alert if the FSxN cluster is running on its standby node. If it is set to `false`, it will not report on failover status.|
325325
|networkInterfaces|Boolean|If 'true' the program will send an alert if any of the network interfaces are down. If it is set to `false`, it will not report on any network interfaces that are down.|
326326

327-
###### Matching condition schema for EMS Messages (ems)
327+
###### Matching condition schema for EMS Events (ems)
328328
Each rule should be an object with three keys, with an optional 4th key:
329329

330330
|Key Name|Value Type|Notes|

Monitoring/monitor-ontap-services/cloudformation.yaml

Lines changed: 97 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,9 @@ Metadata:
2121
- implementWatchdogAsLambda
2222
- watchdogRoleArn
2323
- LambdaRoleArn
24+
- Label:
25+
default: "AWS Endpoint Options"
26+
Parameters:
2427
- createSecretsManagerEndpoint
2528
- createSNSEndpoint
2629
- createCloudWatchLogsEndpoint
@@ -52,6 +55,90 @@ Metadata:
5255
- vserverNFSProtocolStateAlert
5356
- vserverCIFSProtocolStateAlert
5457

58+
ParameterLabels:
59+
OntapAdminSever:
60+
default: "FSxN Management IP or Hostname"
61+
s3BucketName:
62+
default: "S3 Bucket Name"
63+
subNetIds:
64+
default: "Subnet IDs"
65+
securityGroupIds:
66+
default: "Security Group IDs"
67+
snsTopicArn:
68+
default: "SNS Topic ARN"
69+
cloudWatchLogGroupArn:
70+
default: "CloudWatch Log Group ARN"
71+
secretArn:
72+
default: "Secrets Manager Secret ARN"
73+
secretUsernameKey:
74+
default: "Secret Username Key"
75+
secretPasswordKey:
76+
default: "Secret Password Key"
77+
checkInterval:
78+
default: "Check Interval (minutes)"
79+
createWatchdogAlarm:
80+
default: "Create Watchdog Alarm"
81+
implementWatchdogAsLambda:
82+
default: "Implement Watchdog as Lambda Function"
83+
watchdogRoleArn:
84+
default: "Watchdog Role ARN"
85+
LambdaRoleArn:
86+
default: "Lambda Role ARN"
87+
createSecretsManagerEndpoint:
88+
default: "Create Secrets Manager Endpoint"
89+
createSNSEndpoint:
90+
default: "Create SNS Endpoint"
91+
createCloudWatchLogsEndpoint:
92+
default: "Create CloudWatch Logs Endpoint"
93+
createS3Endpoint:
94+
default: "Create S3 Endpoint"
95+
routeTableIds:
96+
default: "Route Table IDs for S3 Endpoint"
97+
vpcId:
98+
default: "VPC ID for Endpoints"
99+
endpointSecurityGroupIds:
100+
default: "Endpoint Security Group IDs"
101+
versionChangeAlert:
102+
default: "Version Change Alert"
103+
failoverAlert:
104+
default: "Failover Alert"
105+
emsEventsAlert:
106+
default: "EMS Events Alert"
107+
snapMirrorLagTimeAlert:
108+
default: "SnapMirror Maximum Lag Time Alert (seconds)"
109+
snapMirrorLagTimePercentAlert:
110+
default: "SnapMirror Maximum Lag Time Percent Alert (%)"
111+
snapMirrorStalledAlert:
112+
default: "SnapMirror Stalled Transfer Alert (seconds)"
113+
snapMirrorHealthAlert:
114+
default: "SnapMirror Health Alert"
115+
fileSystemUtilizationWarnAlert:
116+
default: "File System (aggregate) Utilization Warning Alert (%)"
117+
fileSystemUtilizationCriticalAlert:
118+
default: "File System (aggregate) Utilization Critical Alert (%)"
119+
volumeUtilizationWarnAlert:
120+
default: "Volume Utilization Warning Alert (%)"
121+
volumeUtilizationCriticalAlert:
122+
default: "Volume Utilization Critical Alert (%)"
123+
volumeFileUtilizationWarnAlert:
124+
default: "Volume File (inode) Utilization Warning Alert (%)"
125+
volumeFileUtilizationCriticalAlert:
126+
default: "Volume File (inode) Utilization Critical Alert (%)"
127+
volumeOfflineAlert:
128+
default: "Volume Offline Alert"
129+
softQuotaUtilizationAlert:
130+
default: "Soft Quota Utilization Alert (%)"
131+
hardQuotaUtilizationAlert:
132+
default: "Hard Quota Utilization Alert (%)"
133+
inodesQuotaUtilizationAlert:
134+
default: "Inodes Quota Utilization Alert (%)"
135+
vserverStateAlert:
136+
default: "Vserver State Alert"
137+
vserverNFSProtocolStateAlert:
138+
default: "Vserver NFS Protocol State Alert"
139+
vserverCIFSProtocolStateAlert:
140+
default: "Vserver CIFS Protocol State Alert"
141+
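
Each key under `ParameterLabels` has to match a declared parameter name exactly, including case and spelling (the template spells the first parameter `OntapAdminSever`, and the label key follows suit). With roughly forty labels added in this hunk, a quick cross-check can catch a mismatched key. The sketch below is an illustration only: it works on plain dictionaries, the function name `unmatched_labels` is hypothetical, and the sample data is a small hand-copied excerpt rather than the parsed template.

```python
def unmatched_labels(parameter_labels: dict, parameters: dict) -> list:
    """Return the label keys that have no parameter of the same name."""
    return sorted(set(parameter_labels) - set(parameters))


if __name__ == "__main__":
    # Hand-copied excerpt of the template's Parameters section (illustrative).
    parameters = {
        "OntapAdminSever": {"Type": "String"},
        "subNetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        "checkInterval": {"Type": "Number"},
    }
    # Labels as they would appear under ParameterLabels; the second key is
    # deliberately "corrected" to show how a mismatch is reported.
    labels = {
        "subNetIds": {"default": "Subnet IDs"},
        "OntapAdminServer": {"default": "FSxN Management IP or Hostname"},
    }
    print(unmatched_labels(labels, parameters))
```

Running the check against the full template would require loading the YAML with a parser that understands CloudFormation's short-form intrinsic tags, which is outside the scope of this sketch.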
55142
Parameters:
56143
OntapAdminSever:
57144
Description: "The DNS name, or IP address, of the management endpoint of the FSxN file system to be monitored."
@@ -62,7 +149,7 @@ Parameters:
62149
Type: String
63150

64151
subNetIds:
65-
Description: "The subnet IDs where you want to deploy the Lambda function. Must have connectivity to the FSxN file system to be monitored."
152+
Description: "The subnet IDs where you want to deploy the Lambda function. Must have connectivity to the FSxN file system to be monitored. Recommended to have at least two. Also recommended to be in a private subnet."
66153
Type: "List<AWS::EC2::Subnet::Id>"
67154

68155
securityGroupIds:
@@ -100,7 +187,7 @@ Parameters:
100187
AllowedValues: ["true", "false"]
101188

102189
implementWatchdogAsLambda:
103-
Description: "Use a Lambda function to publish to the SNS topic so it can reside in a different region. Only needed if you are creating the CloudWatch alarm and the SNS topic is in a different region."
190+
Description: "Use a Lambda function to publish to the SNS topic so the topic can reside in a different region. Only needed if you are creating the CloudWatch alarm and the SNS topic is in a different region."
104191
Type: String
105192
Default: "false"
106193
AllowedValues: ["true", "false"]
@@ -111,25 +198,25 @@ Parameters:
111198
Default: ""
112199

113200
createSecretsManagerEndpoint:
114-
Description: "Create a Secrets Manager endpoint."
201+
Description: "Set to 'true' if you want to create a Secrets Manager endpoint."
115202
Type: String
116203
Default: "false"
117204
AllowedValues: ["true", "false"]
118205

119206
createSNSEndpoint:
120-
Description: "Create an SNS endpoint."
207+
Description: "Set to 'true' if you want to create an SNS endpoint."
121208
Type: String
122209
Default: "false"
123210
AllowedValues: ["true", "false"]
124211

125212
createCloudWatchLogsEndpoint:
126-
Description: "Create a CloudWatch logs endpoint."
213+
Description: "Set to 'true' if you want to create a CloudWatch logs endpoint."
127214
Type: String
128215
Default: "false"
129216
AllowedValues: ["true", "false"]
130217

131218
createS3Endpoint:
132-
Description: "Create an S3 endpoint."
219+
Description: "Set to 'true' if you want to create an S3 endpoint."
133220
Type: String
134221
Default: "false"
135222
AllowedValues: ["true", "false"]
@@ -145,7 +232,7 @@ Parameters:
145232
Default: ""
146233

147234
endpointSecurityGroupIds:
148-
Description: "The security group IDs, comma separated list, to associate with the SNS, SecretsManager and/or CloudWatch Logs endpoints. Must allow traffic from from the Lambda function over TCP port 443. This parameter is only needed if you are creating the SNS, SecretsManager, or CloudWatch Logs endpoint."
235+
Description: "The security group IDs, comma separated list, to associate with the SNS, SecretsManager and/or CloudWatch Logs endpoints. Must allow inbound traffic from the Lambda function over TCP port 443. This parameter is only needed if you are creating the SNS, SecretsManager, or CloudWatch Logs endpoint."
149236
Type: CommaDelimitedList
150237
Default: ""
151238

@@ -626,8 +713,8 @@ Resources:
626713
# "matching conditions." It is intended to be run as a Lambda function, but
627714
# can be run as a standalone program.
628715
#
629-
# Version: v2.19
630-
# Date: 2025-05-27-13:28:30
716+
# Version: v2.20
717+
# Date: 2025-06-03-16:56:55
631718
################################################################################
632719
633720
import json
@@ -1900,10 +1987,7 @@ Resources:
19001987
conditions["services"][getServiceIndex("systemHealth", conditions)]["rules"].append({"networkInterfaces": False})
19011988
elif name == "initialEmsEventsAlert":
19021989
if value == "true":
1903-
if os.environ.get("initialEmsExtendedAlerts") == "true":
1904-
conditions["services"][getServiceIndex("ems", conditions)]["rules"].append({"name": "", "severity": "informational|notice|error|alert|emergency", "message": ""})
1905-
else:
1906-
conditions["services"][getServiceIndex("ems", conditions)]["rules"].append({"name": "", "severity": "error|alert|emergency", "message": ""})
1990+
conditions["services"][getServiceIndex("ems", conditions)]["rules"].append({"name": "", "severity": "error|alert|emergency", "message": "", "filter": ""})
19071991
elif name == "initialSnapMirrorHealthAlert":
19081992
if value == "true":
19091993
conditions["services"][getServiceIndex("snapmirror", conditions)]["rules"].append({"Healthy": False}) # This is what it matches on, so it is interesting when the health is false.

Monitoring/monitor-ontap-services/monitor_ontap_services.py

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1292,10 +1292,7 @@ def buildDefaultMatchingConditions():
12921292
conditions["services"][getServiceIndex("systemHealth", conditions)]["rules"].append({"networkInterfaces": False})
12931293
elif name == "initialEmsEventsAlert":
12941294
if value == "true":
1295-
if os.environ.get("initialEmsExtendedAlerts") == "true":
1296-
conditions["services"][getServiceIndex("ems", conditions)]["rules"].append({"name": "", "severity": "informational|notice|error|alert|emergency", "message": ""})
1297-
else:
1298-
conditions["services"][getServiceIndex("ems", conditions)]["rules"].append({"name": "", "severity": "error|alert|emergency", "message": ""})
1295+
conditions["services"][getServiceIndex("ems", conditions)]["rules"].append({"name": "", "severity": "error|alert|emergency", "message": "", "filter": ""})
12991296
elif name == "initialSnapMirrorHealthAlert":
13001297
if value == "true":
13011298
conditions["services"][getServiceIndex("snapmirror", conditions)]["rules"].append({"Healthy": False}) # This is what it matches on, so it is interesting when the health is false.
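
The default EMS rule now carries a fourth key, `"filter"`, alongside `"name"`, `"severity"`, and `"message"`. The diff does not show how the program evaluates a rule, so the following is only a sketch under stated assumptions: each key is treated as a regular expression, an event is reported when the first three all match, and a non-empty `"filter"` that matches the message text suppresses the event. The function `ems_rule_matches` and the flat event dictionary with `name`, `severity`, and `message` fields are hypothetical.

```python
import re


def ems_rule_matches(rule: dict, event: dict) -> bool:
    """Decide whether an EMS event should be reported for a rule.

    Assumed semantics (not confirmed by the diff): every key is a regular
    expression; an empty pattern matches anything; a non-empty "filter"
    that matches the event message excludes the event.
    """
    exclude = rule.get("filter", "")
    # Assumption: an empty or missing filter disables filtering entirely.
    if exclude and re.search(exclude, event["message"]):
        return False
    return all(
        re.search(rule[key], event[key]) is not None
        for key in ("name", "severity", "message")
    )


if __name__ == "__main__":
    # The default rule appended by the changed line in this hunk.
    default_rule = {
        "name": "",
        "severity": "error|alert|emergency",
        "message": "",
        "filter": "",
    }
    event = {
        "name": "example.event",
        "severity": "error",
        "message": "example message text",
    }
    print(ems_rule_matches(default_rule, event))
```

Under these assumptions the default rule reports any event whose severity is `error`, `alert`, or `emergency`, and a more specific `"filter"` pattern can be set later to silence known-noisy messages without touching the severity list.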
