Commit deced44

Merge pull request #82 from nv-morpheus/branch-24.06
[RELEASE] morpheus-experimental v24.06.00
2 parents 29c8e37 + e959004 commit deced44

File tree

22 files changed

+2673
-11
lines changed

CHANGELOG.md

Lines changed: 10 additions & 1 deletion
@@ -1,5 +1,5 @@
11
<!--
2-
SPDX-FileCopyrightText: Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
33
SPDX-License-Identifier: Apache-2.0
44
55
Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,6 +15,15 @@ See the License for the specific language governing permissions and
1515
limitations under the License.
1616
-->
1717

18+
# morpheus-experimental 24.06.00 (02 Jul 2024)
19+
## 🚀 New Features
20+
21+
- [REVIEW] cyber foundation notebook and scripts ([#77](https://github.com/nv-morpheus/morpheus-experimental/pull/77)) [@gbatmaz](https://github.com/gbatmaz)
22+
23+
## 🛠️ Improvements
24+
25+
- Fix nltk.bigrams & remove nltk dependency alert ([#80](https://github.com/nv-morpheus/morpheus-experimental/pull/80)) [@tzemicheal](https://github.com/tzemicheal)
26+
1827
# morpheus-experimental 24.03.00 (27 Mar 2024)
1928

2029
## 🛠️ Improvements

README.md

Lines changed: 2 additions & 0 deletions
@@ -82,6 +82,8 @@ This model is an XGBoost classifier that predicts each event on a power system b
8282
## [Intrusion Detection System using LODA algorithm](/ids-detection)
8383
The model is a Loda anomaly detector for detecting an intrusion attack in the form of bots in a network using a netflow dataset.
8484

85+
## [Cyber Foundation Models](/cyber-foundation)
86+
This model is a GPT that generates realistic synthetic raw Azure AD logs.
8587

8688
# Repo Structure
8789
Each prototype has its own directory that contains everything belonging to the specific prototype. Directories can include the following subfolders and documentation:

anomalous-auth-detection/README.md

Lines changed: 1 addition & 1 deletion
@@ -179,7 +179,7 @@ or "fraudulent" authentication.<br>
179179
### What training is recommended for developers working with this model? If none, please state "none."
180180
* None
181181
### Link the relevant end user license agreement
182-
* [Apache 2.0](https://github.com/nv-morpheus/Morpheus/blob/branch-24.03/LICENSE)
182+
* [Apache 2.0](https://github.com/nv-morpheus/Morpheus/blob/branch-24.06/LICENSE)
183183

184184
## Model Card ++ Safety & Security Subcard
185185

cyber-foundation/README.md

Lines changed: 329 additions & 0 deletions
@@ -0,0 +1,329 @@
1+
<!--
2+
SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
SPDX-License-Identifier: Apache-2.0
4+
5+
Licensed under the Apache License, Version 2.0 (the "License");
6+
you may not use this file except in compliance with the License.
7+
You may obtain a copy of the License at
8+
9+
http://www.apache.org/licenses/LICENSE-2.0
10+
11+
Unless required by applicable law or agreed to in writing, software
12+
distributed under the License is distributed on an "AS IS" BASIS,
13+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
See the License for the specific language governing permissions and
15+
limitations under the License.
16+
-->
17+
18+
19+
# Cyber Foundation
20+
21+
# Model Overview
22+
23+
## Description:
24+
* This is a GPT model trained to generate synthetic Azure AD logs. The approach can produce realistic logs for downstream tasks, e.g. generating baseline training data or generating attack behavior to test detectors. <br>
25+
26+
## Requirements:
27+
28+
* To run this example, additional requirements must be installed into your environment. A supplementary requirements file has been provided in this example directory.
29+
30+
`pip install -r requirements.txt`
31+
32+
## Reference(s): <br>
33+
34+
* https://github.com/karpathy/nanoGPT <br>
35+
36+
## Model Architecture: <br>
37+
38+
**Architecture Type:** <br>
39+
40+
* Transformer <br>
41+
42+
**Network Architecture:** <br>
43+
44+
* GPT <br>
45+
46+
## Input: (Enter "None" As Needed) <br>
47+
48+
**Input Format:** <br>
49+
50+
* JSON <br>
51+
52+
**Input Parameters:** <br>
53+
54+
* Azure AD Logs <br>
55+
56+
**Other Properties Related to Input:** <br>
57+
58+
* N/A <br>
59+
60+
## Output: (Enter "None" As Needed) <br>
61+
62+
**Output Format:** <br>
63+
64+
* Text file with synthetic logs <br>
65+
66+
**Output Parameters:** <br>
67+
68+
* N/A <br>
69+
70+
**Other Properties Related to Output:**
71+
72+
* N/A <br>
73+
74+
## Software Integration:<br>
75+
76+
**Runtime(s):** <br>
77+
78+
* Morpheus <br>
79+
80+
**Supported Hardware Platform(s):** <br>
81+
82+
* Ampere/Turing <br>
83+
84+
**Supported Operating System(s):** <br>
85+
86+
* Linux <br>
87+
88+
## Model Version(s):
89+
90+
* v1 <br>
91+
92+
# Training & Evaluation:
93+
94+
## Training Dataset:
95+
96+
**Link:** <br>
97+
98+
* https://github.com/nv-morpheus/Morpheus/blob/main/models/datasets/training-data/azure/azure-ad-logs-sample-training-data.json <br>
99+
100+
**Properties (Quantity, Dataset Descriptions, Sensor(s)):** <br>
101+
102+
* 3239 Azure AD logs <br>
103+
104+
**Dataset License:** <br>
105+
106+
* [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) <br>
107+
108+
## Evaluation Dataset: <br>
109+
110+
**Link:** <br>
111+
112+
* N/A <br>
113+
114+
**Properties (Quantity, Dataset Descriptions, Sensor(s)):**
115+
116+
* N/A <br>
117+
118+
**Dataset License:** <br>
119+
120+
* N/A <br>
121+
122+
## Inference: <br>
123+
124+
**Engine:** <br>
125+
126+
* N/A <br>
127+
128+
**Test Hardware:** <br>
129+
130+
* A100 <br>
131+
132+
# Subcards
133+
134+
## Model Card ++ Bias Subcard
135+
136+
### What is the gender balance of the model validation data?
137+
138+
* Not Applicable
139+
140+
### What is the racial/ethnicity balance of the model validation data?
141+
142+
* Not Applicable
143+
144+
### What is the age balance of the model validation data?
145+
146+
* Not Applicable
147+
148+
### What is the language balance of the model validation data?
149+
150+
* English: 100%
151+
152+
### What is the geographic origin language balance of the model validation data?
153+
154+
* Not Applicable
155+
156+
### What is the educational background balance of the model validation data?
157+
158+
* Not Applicable
159+
160+
### What is the accent balance of the model validation data?
161+
162+
* Not Applicable
163+
164+
### What is the face/key point balance of the model validation data?
165+
166+
* Not Applicable
167+
168+
### What is the skin/tone balance of the model validation data?
169+
170+
* Not Applicable
171+
172+
### What is the religion balance of the model validation data?
173+
174+
* Not Applicable
175+
176+
### Individuals from the following adversely impacted (protected classes) groups participate in model design and testing.
177+
178+
* Not Applicable
179+
180+
### Describe measures taken to mitigate against unwanted bias.
181+
182+
* Not Applicable
183+
184+
## Model Card ++ Explainability Subcard
185+
186+
### Name example applications and use cases for this model.
187+
188+
* The model is primarily designed for testing purposes and serves as a small pre-trained model used to generate Azure AD logs.
189+
190+
### Fill in the blank for the model technique.
191+
192+
* This model is intended for developers who want to build a GPT-based synthetic log generator.
193+
194+
### Name who is intended to benefit from this model.
195+
196+
* The intended beneficiaries of this model are developers who aim to generate synthetic Azure logs.
197+
198+
### Describe the model output.
199+
200+
* The model output is synthetic Azure AD logs.
201+
202+
### List the steps explaining how this model works.
203+
204+
* This model is an example of a GPT model. It requires raw log messages as input for training and a prompt for inference. The model is trained as shown in the training notebook. During inference, the trained model is prompted with the first key of the log type and generates synthetic logs.
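
The inference step described above (prompt with the first key, then extend the text one token at a time) can be sketched as follows. This is a minimal illustration and not the repository's code: a character-level bigram table stands in for the trained GPT, and `train_bigram`, `generate`, and the sample log strings are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count next-character frequencies; a toy stand-in for GPT training."""
    counts = defaultdict(Counter)
    for line in corpus:
        for cur, nxt in zip(line, line[1:]):
            counts[cur][nxt] += 1
    return counts


def generate(counts, prompt, max_new_chars=200, stop="}", seed=0):
    """Extend a non-empty prompt one character at a time by sampling."""
    rng = random.Random(seed)
    out = prompt
    for _ in range(max_new_chars):
        options = counts.get(out[-1])
        if not options:  # no continuation was seen in training
            break
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights, k=1)[0]
        if out.endswith(stop):  # treat a closing brace as end of record
            break
    return out


# Hypothetical raw log lines; real Azure AD records are far larger.
corpus = [
    '{"time": "2022-08-01T00:03:56Z", "operationName": "Sign-in activity"}',
    '{"time": "2022-08-01T01:12:09Z", "operationName": "Sign-in activity"}',
]
model = train_bigram(corpus)
sample = generate(model, prompt='{"time"')
```

A bigram table only looks one character back, so its output will rarely be a well-formed record; the point of the sketch is the prompt-then-sample loop, which has the same shape when the next-token distribution comes from a GPT.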
205+
206+
### Name the adversely impacted groups (protected classes) this has been tested to deliver comparable outcomes regardless of:
207+
208+
* Not Applicable
209+
210+
### List the technical limitations of the model.
211+
212+
* This model is trained with synthetic logs for demonstration purposes. Separate training is needed for other log types.
213+
214+
### What performance metrics were used to affirm the model's performance?
215+
216+
* Intact raw logs
217+
218+
### What are the potential known risks to users and stakeholders?
219+
220+
* N/A
221+
222+
### What training is recommended for developers working with this model?
223+
224+
* None
225+
226+
### Link the relevant end user license agreement
227+
228+
* [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)<br>
229+
230+
231+
## Model Card ++ Safety & Security Subcard
232+
233+
### Link the location of the training dataset's repository.
234+
235+
* https://github.com/nv-morpheus/Morpheus/blob/main/models/datasets/training-data/azure/azure-ad-logs-sample-training-data.json
236+
237+
### Is the model used in an application with physical safety impact?
238+
239+
* No
240+
241+
### Describe physical safety impact (if present).
242+
243+
* N/A
244+
245+
### Were the model and dataset assessed for vulnerability to potential forms of attack?
246+
247+
* No
248+
249+
### Name applications for the model.
250+
251+
* This model is provided as an example of synthetic log generation. Users can create their own models for their use cases and downstream tasks.
252+
253+
### Name use case restrictions for the model.
254+
255+
* It has been trained with a small dataset, mainly for demonstration purposes.
256+
257+
### Has this been verified to have met prescribed quality standards?
258+
259+
* No
260+
261+
### Name target quality Key Performance Indicators (KPIs) for which this has been tested.
262+
263+
* N/A
264+
265+
### Technical robustness and model security validated?
266+
267+
* No
268+
269+
### Is the model and dataset compliant with National Classification Management Society (NCMS)?
270+
271+
* No
272+
273+
### Are there explicit model and dataset restrictions?
274+
275+
* No
276+
277+
### Are there access restrictions to systems, model, and data?
278+
279+
* No
280+
281+
### Is there a digital signature?
282+
283+
* No
284+
285+
## Model Card ++ Privacy Subcard
286+
287+
### Generatable or reverse engineerable personally-identifiable information (PII)?
288+
289+
* Neither
290+
291+
### Was consent obtained for any PII used?
292+
293+
* N/A
294+
295+
### Protected classes used to create this model? (The following were used in the model's training:)
296+
297+
* N/A
298+
299+
### How often is dataset reviewed?
300+
301+
* The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for any changes.
302+
303+
### Is a mechanism in place to honor data subject right of access or deletion of personal data?
304+
305+
* N/A
306+
307+
### If PII collected for the development of this AI model, was it minimized to only what was required?
308+
309+
* N/A
310+
311+
### Is data in dataset traceable?
312+
313+
* N/A
314+
315+
### Scanned for malware?
316+
317+
* No
318+
319+
### Are we able to identify and trace source of dataset?
320+
321+
* Yes
322+
323+
### Does data labeling (annotation, metadata) comply with privacy laws?
324+
325+
* N/A
326+
327+
### Is data compliant with data subject requests for data correction or removal, if such a request was made?
328+
329+
* N/A

cyber-foundation/dataset/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
## Cyber Foundation Model Data
2+
3+
### Cyber Foundation Raw Azure AD Logs
4+
This is a synthetic dataset of Azure AD logs with activities of 20 accounts (85 applications involved, 3567 records in total). The activities are split into a training set and an inference set. An anomaly is included in the inference set for model validation. The data was generated using the Python [faker](https://faker.readthedocs.io/en/master/#) package. Any resemblance to real individuals is purely coincidental.
5+
6+
#### Sample Training Data
7+
- 3239 records in total
8+
- Time range: 2022/08/01 - 2022/08/29
9+
- Users' log distribution:
10+
  - 5 high volume (>= 300) users
11+
  - 15 medium volume (~100) users
12+
  - 5 light volume (~10) users
13+
14+
- [./azure-ad-logs-sample-training-data.json](./azure-ad-logs-sample-training-data.json)
15+
- [Original location of the training data](https://github.com/nv-morpheus/Morpheus/blob/main/models/datasets/training-data/azure/azure-ad-logs-sample-training-data.json)
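
The per-user distribution listed above could be recomputed with a sketch like the one below. It rests on several assumptions that the README does not confirm: that the file is a single JSON array of records, that the user is stored at `properties.userPrincipalName`, and that a cutoff of 50 separates the "medium" and "light" groups (the README only gives approximate sizes). The helper names are hypothetical.

```python
import json
from collections import Counter


def user_of(record):
    # Assumed field path; adjust to the actual schema of the logs.
    return record.get("properties", {}).get("userPrincipalName", "<unknown>")


def volume_buckets(records, high=300, medium=50):
    """Group users by record count.

    `high` follows the README (>= 300); `medium` is an arbitrary cutoff
    chosen to separate the ~100 and ~10 groups.
    """
    per_user = Counter(user_of(r) for r in records)
    buckets = {"high": [], "medium": [], "light": []}
    for user, n in per_user.items():
        if n >= high:
            buckets["high"].append(user)
        elif n >= medium:
            buckets["medium"].append(user)
        else:
            buckets["light"].append(user)
    return per_user, buckets


def load_records(path):
    """Read the log file, assuming it holds one JSON array of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Tiny in-memory demo with made-up users instead of the real file.
demo = (
    [{"properties": {"userPrincipalName": "a@example.com"}}] * 300
    + [{"properties": {"userPrincipalName": "b@example.com"}}] * 100
    + [{"properties": {"userPrincipalName": "c@example.com"}}] * 10
)
per_user, buckets = volume_buckets(demo)
```

With the real data, `volume_buckets(load_records("azure-ad-logs-sample-training-data.json"))` would be the equivalent call, provided the assumptions about the file layout hold.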

cyber-foundation/dataset/azure-ad-logs-sample-training-data.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
version https://git-lfs.github.com/spec/v1
2+
oid sha256:d13d546cf3b50bff4bd0f611c2be1259b55bde116ddd27c5fbc016e535cef50b
3+
size 1039180855
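
The three lines above are a Git LFS pointer: the repository stores this small text file, and the actual content (about 1 GB here) lives in LFS storage. A minimal sketch for parsing such a pointer and checking a downloaded file against it follows; the function names are hypothetical, and the path of the real file is not shown on this page, so none is assumed.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and hash."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Hash in chunks so a ~1 GB file is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:d13d546cf3b50bff4bd0f611c2be1259b55bde116ddd27c5fbc016e535cef50b\n"
    "size 1039180855\n"
)
```

Checking the size first is a cheap way to reject a truncated download before paying for the full hash.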

cyber-foundation/model/meta.pkl

873 Bytes
Binary file not shown.

cyber-foundation/model/train.bin

15.8 MB
Binary file not shown.

cyber-foundation/model/val.bin

1.75 MB
Binary file not shown.
