Commit 40e8438

Merge pull request #223 from NetApp/add_multiple_vserver_support

Added support for multiple vservers

2 parents 85ef120 + d368a14

File tree: 2 files changed, +91 -55 lines

Monitoring/ingest_nas_audit_logs_into_cloudwatch/README.md

Lines changed: 9 additions & 10 deletions
@@ -3,21 +3,21 @@
 ## Overview
 This sample demonstrates a way to ingest the NAS audit logs from an FSx for Data ONTAP file system into a CloudWatch log group
 without having to NFS or CIFS mount a volume to access them.
-It will attempt to gather the audit logs from all the FSx for Data ONTAP file systems that are within a specified region.
+It will attempt to gather the audit logs from all the SVMs within all the FSx for Data ONTAP file systems that are within a specified region.
 It will skip any file systems where the credentials aren't provided in the supplied AWS Secrets Manager secret, or that do not have
 the appropriate NAS auditing configuration enabled.
 It will maintain a "stats" file in an S3 bucket that will keep track of the last time it successfully ingested audit logs from each
-file system to try to ensure it doesn't process an audit file more than once.
+SVM to try to ensure it doesn't process an audit file more than once.
 You can run this script as a standalone program or as a Lambda function. These directions assume you are going to run it as a Lambda function.
 
 ## Prerequisites
 - An FSx for Data ONTAP file system.
 - An S3 bucket to store the "stats" file. The "stats" file is used to keep track of the last time the Lambda function successfully
-ingested audit logs from each file system. Its size will be small (i.e. less than a few megabytes).
-- Have NAS auditing configured and enabled on the FSx for Data ONTAP file system. **Ensure you have selected the XML format for the audit logs.** Also,
+ingested audit logs from each SVM. Its size will be small (i.e. less than a few megabytes).
+- Have NAS auditing configured and enabled on the SVM within an FSx for Data ONTAP file system. **Ensure you have selected the XML format for the audit logs.** Also,
 ensure you have set up a rotation schedule. The program will only act on audit log files that have been finalized, and not the "active" one. You can read this
 [knowledge base article](https://kb.netapp.com/on-prem/ontap/da/NAS/NAS-KBs/How_to_set_up_NAS_auditing_in_ONTAP_9) for instructions on how to set up NAS auditing.
-- Have the NAS auditing configured to store the audit logs in a volume with the same name on all FSx for Data ONTAP file
+- Have the NAS auditing configured to store the audit logs in a volume of the same name in all SVMs on all the FSx for Data ONTAP file
 systems that you want to ingest the audit logs from.
 - A CloudWatch log group.
 - An AWS Secrets Manager secret that contains the passwords for the fsxadmin account for all the FSx for Data ONTAP file systems you want to gather audit logs from.
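
With tracking now per SVM, the shape of the "stats" file changes from one timestamp per file system to a per-SVM map. A minimal sketch of the two layouts implied by the migration code in ingest_audit_log.py (the endpoint key and SVM names here are illustrative):

```python
# Old format: one epoch timestamp (a float) per file-system management endpoint.
old_stats = {
    "management.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com": 1727182803.0
}

# New format: the timestamp moves into a dictionary keyed by SVM (vserver) name,
# so each SVM's ingestion progress is tracked independently.
new_stats = {
    "management.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com": {
        "fsx": 1727182803.0,
        "svm2": 1727269203.0
    }
}
```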
@@ -29,8 +29,8 @@ systems that you want to ingest the audit logs from.
   }
   ```
 - You have applied the necessary SACLs to the files you want to audit. The knowledge base article linked above provides guidance on how to do this.
-- Since the Lambda function runs within your VPC it will not have access to the Internet, even if you can access the Internet from the Subnet it run from.
-Therefore, there needs to be a VPC endpoint for all the AWS services that the Lambda function uses. Specifically, the Lambda function needs to be able to access the following services:
+- Since the Lambda function runs within your VPC it will not have access to the Internet, even if you can access the Internet from the Subnet it runs from.
+Therefore, there needs to be a VPC endpoint for all the AWS services that the Lambda function uses. Specifically, the Lambda function needs to be able to access the following AWS services:
   - FSx.
   - Secrets Manager.
   - CloudWatch Logs.
@@ -82,7 +82,7 @@ and `DeleteNetworkInterface` actions. The correct resource line is `arn:aws:ec2:
 file system management endpoints that you want to gather audit logs from. Also, select a Security Group that allows TCP port 443 outbound.
 Inbound rules don't matter since the Lambda function is not accessible from a network.
 1. Click `Create Function` and on the next page, under the `Code` tab, select `Upload From -> .zip file.` Provide the .zip file created by the steps above.
-1. From the `Configuration -> General` tab set the timeout to at least 30 seconds. You may need to increase that if it has to process a lot of audit entries and/or process a lot of FSx for ONTAP file systems.
+1. From the `Configuration -> General` tab set the timeout to at least 30 seconds. You may need to increase that if it has to process a lot of audit entries and/or process a lot of SVMs.
 
 3. Configure the Lambda function by setting the following environment variables. For a Lambda function you do this by clicking on the `Configuration` tab and then the `Environment variables` sub tab.
 
@@ -96,13 +96,12 @@ Inbound rules don't matter since the Lambda function is not accessible from a ne
 | statsName | The name you want to use as the stats file. |
 | logGroupName | The name of the CloudWatch log group to ingest the audit logs into. |
 | volumeName | The name of the volume, on all the FSx for ONTAP file systems, where the audit logs are stored. |
-| vserverName | The name of the vserver, on all the FSx for ONTAP file systems, where the audit logs are stored. |
 
 4. Test the Lambda function by clicking on the `Test` tab and then clicking on the `Test` button. You should see "Executing function: succeeded".
 If not, click on the "Details" button to see what errors there are.
 
 5. After you have tested that the Lambda function is running correctly, add an EventBridge trigger to have it run periodically.
-You can do this by clicking on the `Add Trigger` button within the AWS console and selecting `EventBridge (CloudWatch Events)`
+You can do this by clicking on the `Add Trigger` button within the AWS console on the Lambda page and selecting `EventBridge (CloudWatch Events)`
 from the dropdown. You can then configure the schedule to run as often as you want. How often depends on how often you have
 set up your FSx for ONTAP file systems to generate audit logs, and how up-to-date you want the CloudWatch logs to be.
 
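Step 5's console flow simply creates an EventBridge rule targeting the Lambda function. For reference, a hedged boto3 sketch of the equivalent wiring; the rule name, schedule, and function ARN are placeholders, and the console additionally grants EventBridge permission to invoke the function:

```python
import boto3

events = boto3.client('events')

# Rate-based schedule; align it with your audit-log rotation interval.
events.put_rule(
    Name='ingest-nas-audit-logs',       # placeholder rule name
    ScheduleExpression='rate(1 hour)',
)
events.put_targets(
    Rule='ingest-nas-audit-logs',
    Targets=[{
        'Id': 'ingest-audit-logs-target',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:ingest_audit_logs',  # placeholder ARN
    }],
)
```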
Monitoring/ingest_nas_audit_logs_into_cloudwatch/ingest_audit_log.py

Lines changed: 82 additions & 45 deletions
@@ -12,12 +12,10 @@
 # system that doesn't have the specified volume.
 #
 # It assumes:
-# - That there is only one data vserver per FSxN file system and that it
-#   is named 'fsx'.
 # - That the administrator username is 'fsxadmin'.
 # - That the audit log files will be named in the following format:
-#     audit_fsx_D2024-09-24-T13-00-03_0000000000.xml
-#   Where 'fsx' is the vserver name.
+#     audit_vserver_D2024-09-24-T13-00-03_0000000000.xml
+#   Where 'vserver' is the vserver name.
 #
 ################################################################################
 #
@@ -54,8 +52,12 @@
 # all FSxNs.
 #volumeName = "audit_logs"
 #
-# The name of the vserver that holds the audit logs. Assumed to be the same on
+# The name of the vserver that holds the audit logs. Assumed to be the same on
 # all FSxNs.
+# *NOTE*: The program has been updated to loop on all the vservers within an FSxN
+#         filesystem and not just the one set here. This variable is now used
+#         so it can update the lastFileRead stats file to conform to the new format
+#         that includes the vserver as part of the structure.
 #vserverName = "fsx"
 #
 # The CloudWatch log group to store the audit logs in.
@@ -118,7 +120,7 @@ def processFile(ontapAdminServer, headers, volumeUUID, filePath):
             else:
                 f.write(part.content)
         else:
-            print(f'API call to {endpoint} failed. HTTP status code: {response.status}.')
+            print(f'Warning: API call to {endpoint} failed. HTTP status code: {response.status}.')
             break
 
     f.close()
@@ -204,7 +206,7 @@ def ingestAuditFile(auditLogPath, auditLogName):
     dictData = xmltodict.parse(data)
 
     if dictData.get('Events') == None or dictData['Events'].get('Event') == None:
-        print(f"No events found in {auditLogName}")
+        print(f"Info: No events found in {auditLogName}.")
         return
     #
     # Ensure the logstream exists.
@@ -214,7 +216,7 @@ def ingestAuditFile(auditLogPath, auditLogName):
         #
         # This really shouldn't happen, since we should only be processing
         # each file once, but during testing it happens all the time.
-        print(f"Log stream {auditLogName} already exists")
+        print(f"Info: Log stream {auditLogName} already exists.")
     #
     # If there is only one event, then the dict['Events']['Event'] will be a
     # dictionary, otherwise it will be a list of dictionaries.
@@ -223,25 +225,25 @@ def ingestAuditFile(auditLogPath, auditLogName):
         for event in dictData['Events']['Event']:
             cwEvents.append(createCWEvent(event))
             if len(cwEvents) == 5000: # The real maximum is 10000 events, but there is also a size limit, so we will use 5000.
-                print("Putting 5000 events")
+                print("Info: Putting 5000 events")
                 response = cwLogsClient.put_log_events(logGroupName=config['logGroupName'], logStreamName=auditLogName, logEvents=cwEvents)
                 if response.get('rejectedLogEventsInfo') != None:
-                    if response['rejectedLogEventsInfo'].get('tooNewLogEventStartIndex') > 0:
+                    if response['rejectedLogEventsInfo'].get('tooNewLogEventStartIndex') is not None:
                         print(f"Warning: Too new log event start index: {response['rejectedLogEventsInfo']['tooNewLogEventStartIndex']}")
-                    if response['rejectedLogEventsInfo'].get('tooOldLogEventStartIndex') > 0:
-                        print(f"Warning: Too old log event start index: {response['rejectedLogEventsInfo']['tooOldLogEventStartIndex']}")
+                    if response['rejectedLogEventsInfo'].get('tooOldLogEventEndIndex') is not None:
+                        print(f"Warning: Too old log event end index: {response['rejectedLogEventsInfo']['tooOldLogEventEndIndex']}")
                 cwEvents = []
     else:
         cwEvents = [createCWEvent(dictData['Events']['Event'])]
 
     if len(cwEvents) > 0:
-        print(f"Putting {len(cwEvents)} events")
+        print(f"Info: Putting {len(cwEvents)} events")
         response = cwLogsClient.put_log_events(logGroupName=config['logGroupName'], logStreamName=auditLogName, logEvents=cwEvents)
         if response.get('rejectedLogEventsInfo') != None:
-            if response['rejectedLogEventsInfo'].get('tooNewLogEventStartIndex') > 0:
+            if response['rejectedLogEventsInfo'].get('tooNewLogEventStartIndex') is not None:
                 print(f"Warning: Too new log event start index: {response['rejectedLogEventsInfo']['tooNewLogEventStartIndex']}")
-            if response['rejectedLogEventsInfo'].get('tooOldLogEventStartIndex') > 0:
-                print(f"Warning: Too old log event start index: {response['rejectedLogEventsInfo']['tooOldLogEventStartIndex']}")
+            if response['rejectedLogEventsInfo'].get('tooOldLogEventEndIndex') is not None:
+                print(f"Warning: Too old log event end index: {response['rejectedLogEventsInfo']['tooOldLogEventEndIndex']}")
 
 ################################################################################
 # This function checks that all the required configuration variables are set.
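
A note on the `> 0` to `is not None` change in the hunk above: `put_log_events` typically omits `rejectedLogEventsInfo` (and its individual indexes) when nothing was rejected, and a returned index of 0 (the first event) is legitimate. Comparing with `> 0` therefore both misses index 0 and raises a `TypeError` in Python 3 when the key is absent, since `.get()` returns `None` and `None > 0` is not allowed. A minimal sketch of the corrected check against a hypothetical response:

```python
# Hypothetical put_log_events response; the key is absent when nothing was rejected.
response = {'rejectedLogEventsInfo': {'tooNewLogEventStartIndex': 0}}

info = response.get('rejectedLogEventsInfo')
if info is not None:
    # 'is not None' accepts index 0 and never compares None with an int.
    if info.get('tooNewLogEventStartIndex') is not None:
        print(f"Warning: Too new log event start index: {info['tooNewLogEventStartIndex']}")
    if info.get('tooOldLogEventEndIndex') is not None:
        print(f"Warning: Too old log event end index: {info['tooOldLogEventEndIndex']}")
```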
@@ -257,15 +259,17 @@ def checkConfig():
         'secretArn': secretArn if 'secretArn' in globals() else None, # pylint: disable=E0602
         's3BucketRegion': s3BucketRegion if 's3BucketRegion' in globals() else None, # pylint: disable=E0602
         's3BucketName': s3BucketName if 's3BucketName' in globals() else None, # pylint: disable=E0602
-        'statsName': statsName if 'statsName' in globals() else None, # pylint: disable=E0602
-        'vserverName': vserverName if 'vserverName' in globals() else None # pylint: disable=E0602
+        'statsName': statsName if 'statsName' in globals() else None # pylint: disable=E0602
     }
 
     for item in config:
         if config[item] == None:
             config[item] = os.environ.get(item)
         if config[item] == None:
             raise Exception(f"{item} is not set.")
+    #
+    # To be backwards compatible, load the vserverName.
+    config['vserverName'] = vserverName if 'vserverName' in globals() else os.environ.get('vserverName') # pylint: disable=E0602
 
 ################################################################################
 # This is the main function that checks that everything is configured correctly
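
The pattern in `checkConfig()` resolves each setting from a module-level variable first, then from the matching environment variable; `vserverName` now follows the same lookup but is allowed to be unset, since it is only needed to migrate old stats files. A compact sketch of that lookup order (the `resolve` helper is illustrative, not part of the script):

```python
import os

def resolve(name, required=True):
    # A module-level assignment wins; otherwise fall back to the environment.
    value = globals().get(name)
    if value is None:
        value = os.environ.get(name)
    if required and value is None:
        raise Exception(f"{name} is not set.")
    return value

volumeName = resolve('volumeName')                    # required, as before
vserverName = resolve('vserverName', required=False)  # optional: only used for stats-file migration
```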
@@ -330,6 +334,11 @@ def lambda_handler(event, context): # pylint: disable=W0613
     for fsxn in fsxNs:
         fsId = fsxn.split('.')[1]
         #
+        # Since the format of the lastFileRead structure has changed, we need to update it.
+        if lastFileRead.get(fsxn) is not None and config['vserverName'] is not None:
+            if type(lastFileRead[fsxn]) is float: # Old format
+                lastFileRead[fsxn] = {config['vserverName']: lastFileRead[fsxn]} # New format
+        #
         # Get the password
         password = secrets.get(fsId)
         if password == None:
@@ -341,39 +350,67 @@ def lambda_handler(event, context): # pylint: disable=W0613
         headersDownload = { **auth, 'Accept': 'multipart/form-data' }
         headersQuery = { **auth }
         #
-        # Get the volume UUID for the audit_logs volume.
-        volumeUUID = None
-        endpoint = f"https://{fsxn}/api/storage/volumes?name={config['volumeName']}&svm={config['vserverName']}"
+        # Get the list of SVMs on the FSxN.
+        endpoint = f"https://{fsxn}/api/svm/svms?return_timeout=4"
         response = http.request('GET', endpoint, headers=headersQuery, timeout=5.0)
         if response.status == 200:
-            data = json.loads(response.data.decode('utf-8'))
-            if data['num_records'] > 0:
-                volumeUUID = data['records'][0]['uuid'] # Since we specified the volume, and vserver name, there should only be one record.
+            svmsData = json.loads(response.data.decode('utf-8'))
+            numSvms = svmsData['num_records']
+            #
+            # Loop over all the SVMs.
+            while numSvms > 0:
+                for record in svmsData['records']:
+                    vserverName = record['name']
+                    #
+                    # Get the volume UUID for the audit_logs volume.
+                    volumeUUID = None
+                    endpoint = f"https://{fsxn}/api/storage/volumes?name={config['volumeName']}&svm={vserverName}"
+                    response = http.request('GET', endpoint, headers=headersQuery, timeout=5.0)
+                    if response.status == 200:
+                        data = json.loads(response.data.decode('utf-8'))
+                        if data['num_records'] > 0:
+                            volumeUUID = data['records'][0]['uuid'] # Since we specified the volume, and vserver name, there should only be one record.
 
-            if volumeUUID == None:
-                print(f"Warning: Volume {config['volumeName']} not found for {fsId} under SVM: {config['vserverName']}.")
-                continue
-            #
-            # Get all the files in the volume that match the audit file pattern.
-            endpoint = f"https://{fsxn}/api/storage/volumes/{volumeUUID}/files?name=audit_{config['vserverName']}_D*.xml&order_by=name%20asc&fields=name"
-            response = http.request('GET', endpoint, headers=headersQuery, timeout=5.0)
-            data = json.loads(response.data.decode('utf-8'))
-            if data.get('num_records') == 0:
-                print(f"Warning: No XML audit log files found on FsID: {fsId}; SvmID: {config['vserverName']}; Volume: {config['volumeName']}.")
-                continue
+                    if volumeUUID == None:
+                        print(f"Warning: Volume {config['volumeName']} not found for {fsId} under SVM: {vserverName}.")
+                        continue
+                    #
+                    # Get all the files in the volume that match the audit file pattern.
+                    endpoint = f"https://{fsxn}/api/storage/volumes/{volumeUUID}/files?name=audit_{vserverName}_D*.xml&order_by=name%20asc&fields=name"
+                    response = http.request('GET', endpoint, headers=headersQuery, timeout=5.0)
+                    data = json.loads(response.data.decode('utf-8'))
+                    if data.get('num_records') == 0:
+                        print(f"Warning: No XML audit log files found on FsID: {fsId}; SvmID: {vserverName}; Volume: {config['volumeName']}.")
+                        continue
 
-            for file in data['records']:
-                filePath = file['name']
-                if lastFileRead.get(fsxn) == None or getEpoch(filePath) > lastFileRead[fsxn]:
+                    for file in data['records']:
+                        filePath = file['name']
+                        if lastFileRead.get(fsxn) is None or lastFileRead[fsxn].get(vserverName) is None or getEpoch(filePath) > lastFileRead[fsxn][vserverName]:
+                            #
+                            # Process the file.
+                            processFile(fsxn, headersDownload, volumeUUID, filePath)
+                            if lastFileRead.get(fsxn) is None:
+                                lastFileRead[fsxn] = {vserverName: getEpoch(filePath)}
+                            else:
+                                lastFileRead[fsxn][vserverName] = getEpoch(filePath)
+                            s3Client.put_object(Key=config['statsName'], Bucket=config['s3BucketName'], Body=json.dumps(lastFileRead).encode('UTF-8'))
                 #
-                    # Process the file.
-                    processFile(fsxn, headersDownload, volumeUUID, filePath)
-                    lastFileRead[fsxn] = getEpoch(filePath)
-                    s3Client.put_object(Key=config['statsName'], Bucket=config['s3BucketName'], Body=json.dumps(lastFileRead).encode('UTF-8'))
+                # Get the next set of SVMs.
+                if svmsData['_links'].get('next') != None:
+                    endpoint = f"https://{fsxn}{svmsData['_links']['next']['href']}"
+                    response = http.request('GET', endpoint, headers=headersQuery, timeout=5.0)
+                    if response.status == 200:
+                        svmsData = json.loads(response.data.decode('utf-8'))
+                        numSvms = svmsData['num_records']
+                    else:
+                        print(f"Warning: API call to {endpoint} failed. HTTP status code: {response.status}.")
+                        break # Break out of the for all SVMs loop. Maybe the call to the next FSxN will work.
+                else:
+                    numSvms = 0
+        else:
+            print(f"Warning: API call to {endpoint} failed. HTTP status code: {response.status}.")
+            break # Break out of the for all FSxNs loop.
 #
 # If this script is not running as a Lambda function, then call the lambda_handler function.
 if os.environ.get('AWS_LAMBDA_FUNCTION_NAME') == None:
-    lambdaFunction = False
     lambda_handler(None, None)
-else:
-    lambdaFunction = True
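
The new SVM loop pages through the ONTAP REST API by following the `_links.next.href` entry that each response carries. A standalone sketch of that pattern under the same API (host and credentials are placeholders; certificate handling is omitted):

```python
import json
import urllib3

http = urllib3.PoolManager()
host = "management.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"  # placeholder endpoint
headers = urllib3.make_headers(basic_auth="fsxadmin:password")        # placeholder credentials

endpoint = f"https://{host}/api/svm/svms?return_timeout=4"
while endpoint is not None:
    response = http.request('GET', endpoint, headers=headers, timeout=5.0)
    if response.status != 200:
        print(f"Warning: API call to {endpoint} failed. HTTP status code: {response.status}.")
        break
    page = json.loads(response.data.decode('utf-8'))
    for record in page['records']:
        print(record['name'])  # one SVM per record
    # ONTAP includes '_links.next.href' only when more records remain.
    nextLink = page.get('_links', {}).get('next')
    endpoint = f"https://{host}{nextLink['href']}" if nextLink else None
```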
