nv-morpheus
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 10 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 2 additions & 0 deletions
diff --git a/anomalous-auth-detection/README.md b/anomalous-auth-detection/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/cyber-foundation/README.md b/cyber-foundation/README.md
Lines changed: 329 additions & 0 deletions
diff --git a/cyber-foundation/dataset/README.md b/cyber-foundation/dataset/README.md
Lines changed: 15 additions & 0 deletions
diff --git a/cyber-foundation/dataset/azure-ad-logs-sample-training-data.json b/cyber-foundation/dataset/azure-ad-logs-sample-training-data.json
Lines changed: 1 addition & 0 deletions
diff --git a/cyber-foundation/model/cyber-foundation-model.pt b/cyber-foundation/model/cyber-foundation-model.pt
Lines changed: 3 additions & 0 deletions
diff --git a/cyber-foundation/model/meta.pkl b/cyber-foundation/model/meta.pkl
873 Bytes
diff --git a/cyber-foundation/model/train.bin b/cyber-foundation/model/train.bin
15.8 MB
diff --git a/cyber-foundation/model/val.bin b/cyber-foundation/model/val.bin
1.75 MB
@@ -1,5 +1,5 @@
 <!--
-SPDX-FileCopyrightText: Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
 
 Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,6 +15,15 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
 
+# morpheus-experimental 24.06.00 (02 Jul 2024)
+## 🚀 New Features
+
+- [REVIEW] cyber foundation notebook and scripts ([#77](https://github.com/nv-morpheus/morpheus-experimental/pull/77)) [@gbatmaz](https://github.com/gbatmaz)
+
+## 🛠️ Improvements
+
+- Fix nltk.bigrams & remove nltk dependency alert ([#80](https://github.com/nv-morpheus/morpheus-experimental/pull/80)) [@tzemicheal](https://github.com/tzemicheal)
+
 # morpheus-experimental 24.03.00 (27 Mar 2024)
 
 ## 🛠️ Improvements
 
@@ -82,6 +82,8 @@ This model is an XGBoost classifier that predicts each event on a power system b
 ## [Intrusion Detection System using LODA algorithm](/ids-detection)
 The model is a Loda anomaly detector for detecting an intrusion attack in the form of bots in a network using a netflow dataset. 
 
+## [Cyber Foundation Models](/cyber-foundation)
+This model is a GPT that generates realistic synthetic raw Azure AD logs.
 
 # Repo Structure
 Each prototype has its own directory that contains everything belonging to the specific prototype. Directories can include the following subfolders and documentation:
 
@@ -179,7 +179,7 @@ or "fraudulent" authentication.<br>
 ### What training is recommended for developers working with this model?  If none, please state "none."
 * None
 ### Link the relevant end user license agreement 
-* [Apache 2.0](https://github.com/nv-morpheus/Morpheus/blob/branch-24.03/LICENSE)
+* [Apache 2.0](https://github.com/nv-morpheus/Morpheus/blob/branch-24.06/LICENSE)
 
 ## Model Card ++ Saftey & Security Subcard
 
 
@@ -0,0 +1,329 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+
+# Cyber Foundation
+
+# Model Overview
+
+## Description:
+* This is a GPT model trained to generate synthetic Azure AD logs. The generated logs are intended to be realistic enough for downstream tasks, e.g. producing baseline training data or generating attack behavior to test detectors.  <br>
+
+## Requirements:
+
+* To run this example, additional requirements must be installed into your environment. A supplementary requirements file has been provided in this example directory.
+
+`pip install -r requirements.txt`
+
+## Reference(s): <br>
+
+* https://github.com/karpathy/nanoGPT <br> 
+
+## Model Architecture: <br>
+
+**Architecture Type:** <br>
+
+* Transformer <br>
+
+**Network Architecture:** <br>
+
+* GPT <br>
+
+## Input: (Enter "None" As Needed) <br>
+
+**Input Format:** <br>
+
+* JSON <br>
+
+**Input Parameters:** <br>
+
+* Azure AD Logs <br>
+
+**Other Properties Related to Input:** <br>
+
+* N/A <br>
+
+## Output: (Enter "None" As Needed) <br>
+
+**Output Format:** <br>
+
+* Text file with synthetic logs <br>
+
+**Output Parameters:** <br>
+
+* N/A <br>
+
+**Other Properties Related to Output:**
+
+* N/A <br> 
+
+## Software Integration:<br>
+
+**Runtime(s):** <br>
+
+* Morpheus  <br>
+
+**Supported Hardware Platform(s):** <br>
+
+* Ampere/Turing <br>
+
+**Supported Operating System(s):** <br>
+
+* Linux <br>
+
+## Model Version(s): 
+
+* v1  <br>
+
+# Training & Evaluation: 
+
+## Training Dataset:
+
+**Link:** <br>
+
+* https://github.com/nv-morpheus/Morpheus/blob/main/models/datasets/training-data/azure/azure-ad-logs-sample-training-data.json  <br>
+
+**Properties (Quantity, Dataset Descriptions, Sensor(s)):** <br>
+
+* 3239 Azure AD logs <br>
+
+**Dataset License:** <br>
+
+* [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) <br>
+
+## Evaluation Dataset: <br>
+
+**Link:** <br>
+
+* N/A <br>
+
+**Properties (Quantity, Dataset Descriptions, Sensor(s)):**
+
+* N/A <br>
+
+**Dataset License:** <br>
+
+* N/A <br>
+
+## Inference: <br>
+
+**Engine:** <br>
+
+* N/A <br>
+
+**Test Hardware:** <br>
+
+* A100  <br>
+
+# Subcards
+
+## Model Card ++ Bias Subcard
+
+### What is the gender balance of the model validation data?  
+
+* Not Applicable
+
+### What is the racial/ethnicity balance of the model validation data?
+
+* Not Applicable
+
+### What is the age balance of the model validation data?
+
+* Not Applicable
+
+### What is the language balance of the model validation data?
+
+* English: 100%
+
+### What is the geographic origin language balance of the model validation data?
+
+* Not Applicable
+
+### What is the educational background balance of the model validation data?
+
+* Not Applicable
+
+### What is the accent balance of the model validation data?
+
+* Not Applicable
+
+### What is the face/key point balance of the model validation data? 
+
+* Not Applicable
+
+### What is the skin/tone balance of the model validation data?
+
+* Not Applicable
+
+### What is the religion balance of the model validation data?
+
+* Not Applicable
+
+### Individuals from the following adversely impacted (protected classes) groups participate in model design and testing.
+
+* Not Applicable
+
+### Describe measures taken to mitigate against unwanted bias.
+
+* Not Applicable
+
+## Model Card ++ Explainability Subcard
+
+### Name example applications and use cases for this model. 
+
+* The model is primarily designed for testing purposes and serves as a small pre-trained model used to generate Azure AD logs. 
+
+### Fill in the blank for the model technique.
+
+* This model is intended for developers who want to build a GPT-based synthetic log generator.
+
+### Name who is intended to benefit from this model. 
+
+* The intended beneficiaries of this model are developers who aim to generate synthetic Azure logs.
+
+### Describe the model output. 
+
+* The model outputs synthetic Azure AD logs. 
+
+### List the steps explaining how this model works.
+
+* This model is an example of a GPT model. It requires raw log messages as input for training and a prompt for inference. Training follows the steps in the training notebook. During inference, the trained model is prompted with the first key of the log type and generates synthetic logs from that prompt.
+
+### Name the adversely impacted groups (protected classes) this has been tested to deliver comparable outcomes regardless of:
+
+* Not Applicable
+
+### List the technical limitations of the model. 
+
+* This model is trained on synthetic logs for demonstration purposes. Separate training is needed for other log types. 
+
+### What performance metrics were used to affirm the model's performance?
+
+* Intact raw logs
+
+### What are the potential known risks to users and stakeholders?
+
+* N/A
+
+### What training is recommended for developers working with this model?
+
+* None
+
+### Link the relevant end user license agreement 
+
+* [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)<br>
+
+
+## Model Card ++ Safety & Security Subcard
+
+### Link the location of the training dataset's repository.
+
+* https://github.com/nv-morpheus/Morpheus/blob/main/models/datasets/training-data/azure/azure-ad-logs-sample-training-data.json
+
+### Is the model used in an application with physical safety impact?
+
+* No
+
+### Describe physical safety impact (if present).
+
+* N/A
+
+### Was model and dataset assessed for vulnerability for potential form of attack?
+
+* No
+
+### Name applications for the model.
+
+* This model is provided as an example of synthetic log generation. Users can create their own models for their use cases and downstream tasks.
+
+### Name use case restrictions for the model.
+
+* It has been trained on a small dataset, mainly for demonstration purposes.  
+
+### Has this been verified to have met prescribed quality standards?
+
+* No
+
+### Name target quality Key Performance Indicators (KPIs) for which this has been tested.  
+
+* N/A
+
+### Technical robustness and model security validated?
+
+* No
+
+### Is the model and dataset compliant with National Classification Management Society (NCMS)?
+
+* No
+
+### Are there explicit model and dataset restrictions?
+
+* No
+
+### Are there access restrictions to systems, model, and data?
+
+* No
+
+### Is there a digital signature?
+
+* No
+
+## Model Card ++ Privacy Subcard
+
+### Generatable or reverse engineerable personally-identifiable information (PII)?
+
+* Neither
+
+### Was consent obtained for any PII used?
+
+* N/A
+
+### Protected classes used to create this model? (The following were used in the model's training:)
+
+* N/A
+
+### How often is dataset reviewed?
+
+* The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for any changes.
+
+### Is a mechanism in place to honor data subject right of access or deletion of personal data?
+
+* N/A
+
+### If PII collected for the development of this AI model, was it minimized to only what was required? 
+
+* N/A
+
+### Is data in dataset traceable?
+
+* N/A
+
+### Scanned for malware?
+
+* No
+
+### Are we able to identify and trace source of dataset?
+
+* Yes
+
+### Does data labeling (annotation, metadata) comply with privacy laws?
+
+* N/A
+
+### Is data compliant with data subject requests for data correction or removal, if such a request was made?
+
+* N/A
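Reviewer note on the "List the steps explaining how this model works" answer above: the train-then-prompt flow can be illustrated with a minimal sketch. It assumes character-level tokenization in the style of nanoGPT's character-level data preparation, which the file names in this PR (`meta.pkl`, `train.bin`, `val.bin`) resemble. The PR's actual tokenizer is not shown in this diff, so the helper names, the 90/10 split, the sample log line, and the `{"time"` prompt are all hypothetical.

```python
# Minimal sketch of character-level encoding for log text (assumed, not
# confirmed by the diff). All names below are hypothetical helpers.

def build_vocab(text):
    """Map every distinct character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


def split_train_val(ids, train_fraction=0.9):
    """Split token ids into a training part and a validation part."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical raw log text: one JSON record per line.
sample_logs = '{"time": "2022-08-01T00:00:00Z", "category": "SignInLogs"}\n' * 10

stoi, itos = build_vocab(sample_logs)
ids = encode(sample_logs, stoi)
train_ids, val_ids = split_train_val(ids)

# At inference time the model would be prompted with the start of a record,
# here the (hypothetical) first key of the log type.
prompt_ids = encode('{"time"', stoi)
```

A trained model would extend `prompt_ids` token by token, and `decode` would turn the result back into raw log text. Any character absent from the training text cannot be encoded under this scheme, which is one reason separate training would be needed for other log types.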
@@ -0,0 +1,15 @@
+## Cyber Foundation Model Data
+
+### Cyber Foundation Raw Azure AD Logs
+This is a synthetic dataset of Azure AD logs with activities of 20 accounts (85 applications involved, 3567 records in total). The activities are split into a training set and an inference set. An anomaly is included in the inference set for model validation. The data was generated using the Python [faker](https://faker.readthedocs.io/en/master/#) package. Any resemblance to real individuals is purely coincidental.
+
+#### Sample Training Data
+- 3239 records in total
+- Time range: 2022/08/01 - 2022/08/29
+- Users' log distribution:
+    - 5 high volume (>= 300) users
+    - 15 medium volume (~100) users
+    - 5 light volume (~10) users
+
+- [./azure-ad-logs-sample-training-data.json](./azure-ad-logs-sample-training-data.json)
+- [Original location of the training data](https://github.com/nv-morpheus/Morpheus/blob/main/models/datasets/training-data/azure/azure-ad-logs-sample-training-data.json)
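Reviewer note: the per-user log distribution described in this dataset README could be checked with a short script like the one below. This is a sketch under assumptions: the nesting of the user name under `properties.userPrincipalName` and the file layout (a JSON array or one JSON object per line) are not confirmed by this diff. The README gives `>= 300` for high-volume users but only approximate figures for the other two groups, so the cutoff of 50 between "medium" and "light" is an arbitrary illustrative choice.

```python
import json
from collections import Counter


def load_records(path):
    """Load records from a JSON array file or a JSON-lines file (assumed layouts)."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_logs_per_user(records, user_key="userPrincipalName"):
    """Count records per user; the 'properties' nesting is an assumption."""
    counts = Counter()
    for record in records:
        props = record.get("properties", {})
        counts[props.get(user_key, "unknown")] += 1
    return counts


def volume_bucket(n, high=300, medium=50):
    """Bucket a log count; 'high' follows the README, 'medium' is illustrative."""
    if n >= high:
        return "high"
    if n >= medium:
        return "medium"
    return "light"


# Hypothetical in-memory records standing in for the real dataset.
demo_records = (
    [{"properties": {"userPrincipalName": "alice@example.com"}}] * 300
    + [{"properties": {"userPrincipalName": "bob@example.com"}}] * 100
    + [{"properties": {"userPrincipalName": "carol@example.com"}}] * 10
)
counts = count_logs_per_user(demo_records)
buckets = {user: volume_bucket(n) for user, n in counts.items()}
```

Against the real file, `load_records("azure-ad-logs-sample-training-data.json")` would replace `demo_records`, provided the field names match.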
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d13d546cf3b50bff4bd0f611c2be1259b55bde116ddd27c5fbc016e535cef50b
+size 1039180855
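Reviewer note: the three added lines above are a Git LFS pointer, not the model weights themselves. The pointer records the spec version, a SHA-256 object id, and the size in bytes of the real file stored in LFS. The sketch below parses the pointer text with a hypothetical helper, `parse_lfs_pointer`, which handles only the three keys shown in this diff.

```python
def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:d13d546cf3b50bff4bd0f611c2be1259b55bde116ddd27c5fbc016e535cef50b
size 1039180855
"""

info = parse_lfs_pointer(pointer_text)
size_gb = info["size_bytes"] / 1e9  # decimal gigabytes
```

Here `size_gb` comes out at roughly 1.04, so the checkpoint is about 1 GB.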