Commit 56548ac

Merge pull request #40 from nv-morpheus/branch-23.03
[RELEASE] morpheus-experimental v23.03
2 parents: 5786817 + 362dc07

File tree

17 files changed: +2906 −0 lines changed


CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,10 @@

# morpheus-experimental 23.03.00 (29 Mar 2023)

## 🚀 New Features

- Log sequence ad usecase ([#37](https://github.com/nv-morpheus/morpheus-experimental/pull/37)) [@tzemicheal](https://github.com/tzemicheal)
- operational-technology-use-case ([#35](https://github.com/nv-morpheus/morpheus-experimental/pull/35)) [@gbatmaz](https://github.com/gbatmaz)

# morpheus-experimental 23.01.00 (27 Jan 2023)

## 🚀 New Features

README.md

Lines changed: 7 additions & 0 deletions
@@ -72,6 +72,13 @@ This model shows an application of a graph neural network for anomalous authenti

## [Asset Clustering using Windows Event Logs](/asset-clustering)
This model is a clustering algorithm that assigns each host present in the dataset to a cluster, based on aggregated and derived features from the Windows Event Logs of that particular host.

## [Log Sequence Anomaly Detector](/log-sequence-ad)
This model is a sequence binary classifier trained on vector representations of log messages. The task is to identify abnormal sequences of alert logs among sequences of normally generated logs.

## [Industrial Control System (ICS) Cyber Attack Detection](/operational-technology)
This model is an XGBoost classifier that predicts the class of each event in a power system based on dataset features.

# Repo Structure
Each prototype has its own directory that contains everything belonging to the specific prototype. Directories can include the following subfolders and documentation:

log-sequence-ad/README.md

Lines changed: 77 additions & 0 deletions

@@ -0,0 +1,77 @@

## Log Sequence Anomaly Detection

### Use Case
Identify anomalous log sequences in a dataset of generated logs.

### Version
1.0

### Model Overview
The model is a sequence binary classifier trained on vector representations of log sequences from the BGL dataset. The task is to identify abnormal sequences of alert logs among sequences of normally generated logs. This work is based on the model developed in [[2](https://ieeexplore.ieee.org/document/9671642), [3](https://github.com/hanxiao0607/InterpretableSAD)]; for further detail, refer to the paper and the associated code at the reference links.

### Model Architecture
An LSTM binary classifier with word2vec embeddings as input.

### Requirements
Requirements can be installed with
```
pip install -r requirements.txt
```

### Training

#### Training data
The example dataset comes from the BlueGene/L supercomputer system (BGL). The [BGL dataset](https://zenodo.org/record/3227177/files/BGL.tar.gz?download=1) contains 4,747,963 log messages collected from a BlueGene/L supercomputer at Lawrence Livermore National Labs. The log messages can be categorized into alert and non-alert messages. Log messages are parsed with the [Drain](https://github.com/logpai/logparser) parser for preprocessing. The model is trained and evaluated on 1 million rows of preprocessed logs. To run the workflow, a smaller preprocessed subset of BGL is available from https://github.com/LogIntelligence/LogPPT
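The preprocessing step above can be sketched as follows. This is a minimal illustration, not the repository's code: the column names `EventId` and `Label` are assumptions based on typical Drain structured output for BGL (where `-` marks a normal line), and the fixed-window grouping is one common way to form sequences.

```python
import pandas as pd

def build_sequences(csv_path, window=20):
    """Group Drain-parsed log events into fixed-length sequences.

    Assumes the structured CSV has `EventId` and `Label` columns
    (typical Drain output for BGL, where "-" marks a normal line).
    A sequence is labeled abnormal if it contains any alert line.
    """
    df = pd.read_csv(csv_path)
    sequences, labels = [], []
    for start in range(0, len(df) - window + 1, window):
        chunk = df.iloc[start:start + window]
        sequences.append(chunk["EventId"].tolist())
        labels.append(int((chunk["Label"] != "-").any()))
    return sequences, labels
```

Each sequence of event IDs would then be mapped through the word2vec vocabulary before being fed to the classifier.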

#### Training parameters

For word2vec, the gensim model is used with size=8.
Parameters for the LSTM model:
```
output_dim = 2
emb_dim = 8
hidden_dim = 128
n_layers = 1
dropout = 0.0
batch_size = 32
n_epoch = 10
```
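The listed hyperparameters map onto a model along these lines. This is a minimal PyTorch sketch under stated assumptions, not the repository's exact implementation: the class name is invented, and the embedding layer stands in for the pretrained word2vec vectors (size=8), which in practice would be copied into its weights.

```python
import torch
import torch.nn as nn

class LogSeqClassifier(nn.Module):
    """LSTM binary classifier over embedded log-event sequences."""

    def __init__(self, vocab_size, emb_dim=8, hidden_dim=128,
                 n_layers=1, dropout=0.0, output_dim=2):
        super().__init__()
        # Placeholder for pretrained word2vec vectors (size=8).
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, n_layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (batch, seq_len) of integer event ids
        out, _ = self.lstm(self.emb(x))
        # Classify the sequence from the last hidden state.
        return self.fc(out[:, -1])

model = LogSeqClassifier(vocab_size=500)
logits = model(torch.randint(0, 500, (32, 20)))  # batch_size=32
```

Training would minimize cross-entropy over the two output classes for `n_epoch = 10` epochs.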

#### GPU Model
Tesla V100-SXM2

#### Model accuracy
The label distribution in the dataset is imbalanced; the F1 score over the 1 million row dataset is 0.97.

#### Training script
To train the model, run the code in the notebook. This saves the trained model under the `model` directory.

### Inference
To run inference with the trained model:
```bash
python inference.py --model_name model/model_BGL.pt --input_data dataset/BGL_2k.log_structured.csv
```
This produces `result.csv`, which contains the binary predictions of the model.
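The predictions can then be inspected with pandas. A small sketch, assuming only what the Output section states: the binary prediction is appended as the last column of the CSV (the helper name and any column names are otherwise hypothetical).

```python
import pandas as pd

def summarize_predictions(path="result.csv"):
    """Count abnormal sequences in the inference output.

    Assumes the binary prediction (1 = abnormal) is the last
    column of the CSV, per the Output section of this README.
    """
    preds = pd.read_csv(path).iloc[:, -1]
    return int((preds == 1).sum()), len(preds)
```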

### How To Use This Model
This model is an example of a sequence binary classifier. It requires parsed log messages as input for training and inference. The model and the word2vec embedding are trained as shown in the training notebook. During inference, the trained model is loaded from the `model` directory and an input file of parsed logs is expected; the model outputs a prediction for each sequence of log messages.

### Input
The input is the output of parsed system log messages, represented as a CSV file.

### Output
A binary classifier output is assigned to each sequence of log messages in the input file. The predicted output is appended as the last column of the input sequence.

#### Out-of-scope use cases
N/A

### Ethical considerations
N/A

### Reference
1. https://arxiv.org/pdf/2202.04301.pdf
2. https://ieeexplore.ieee.org/document/9671642
3. https://github.com/hanxiao0607/InterpretableSAD

log-sequence-ad/model/model_BGL.pt

563 KB
Binary file not shown.

log-sequence-ad/requirements.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@

gensim==3.8.0
nltk==3.8
numpy==1.23.5
pandas==1.5.2
torch==1.12
