Commit af2993b

Documentation
1 parent 2e1f69a commit af2993b

1 file changed: +135 -0 lines changed
documentation.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# Documentation
## Overview
Loosely based on the research paper **A Novel Statistical Analysis and Autoencoder Driven Intelligent Intrusion Detection Approach**:
[https://doi.org/10.1016/j.neucom.2019.11.016](https://doi.org/10.1016/j.neucom.2019.11.016)

## Datasets
- **bin_data.csv** - CSV dataset file for binary classification
- **multi_data.csv** - CSV dataset file for multi-class classification
- **KDDTrain+.txt** - Original downloaded dataset

The NSL-KDD dataset from the Canadian Institute for Cybersecurity, an updated version of the original KDD Cup 1999 data (KDD99): [https://www.unb.ca/cic/datasets/nsl.html](https://www.unb.ca/cic/datasets/nsl.html)

## Machine Learning Models

- Linear Support Vector Machine
- Quadratic Support Vector Machine
- K-Nearest-Neighbor
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Multi Layer Perceptron
- Long Short-Term Memory
- Autoencoder

## Data Preprocessing
- The dataset had 43 attributes; the **'difficulty_level'** attribute was dropped.

- ### Data Normalization
    - The 38 numeric columns of the DataFrame are scaled using **Standard Scaler**.

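The dropping and scaling steps above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `StandardScaler` is the "Standard Scaler" referred to; the toy DataFrame and its values are made up and stand in for the real NSL-KDD data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the NSL-KDD DataFrame (values are invented for illustration).
df = pd.DataFrame({
    "duration": [0.0, 2.0, 10.0, 4.0],
    "src_bytes": [181.0, 239.0, 0.0, 1032.0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "label": ["normal", "normal", "neptune", "ipsweep"],
    "difficulty_level": [20, 15, 19, 21],
})

# Drop the attribute that is not used for classification.
df = df.drop(columns=["difficulty_level"])

# Scale only the numeric columns to zero mean and unit variance;
# categorical columns and the label are left untouched.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df[numeric_cols].mean().round(6).tolist())
```

`StandardScaler` uses the population standard deviation, so each scaled column has a standard deviation of 1 when computed with `ddof=0`.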
- ### One-hot-encoding
    - The categorical columns **'protocol_type'**, **'service'**, and **'flag'** are one-hot-encoded using **pd.get_dummies()**.
    - The **'categorical'** DataFrame had 84 attributes after one-hot-encoding.

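A small sketch of the one-hot-encoding step with `pd.get_dummies()`. The four rows below are invented, so the result has far fewer columns than the 84 reported for the full dataset.

```python
import pandas as pd

# Toy rows; the real dataset has many more distinct services and flags.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "domain_u", "private", "eco_i"],
    "flag": ["SF", "SF", "S0", "SF"],
})

# Each distinct value becomes its own indicator column,
# named '<original column>_<value>'.
categorical = pd.get_dummies(df[["protocol_type", "service", "flag"]])

print(categorical.shape)
print(sorted(categorical.columns))
```

Here 3 protocol types, 4 services, and 2 flags yield 9 indicator columns.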
- ### Binary Classification
    - A copy of the DataFrame is created for binary classification.
    - The attack label (**'label'** attribute) is classified into two categories: **'normal'** and **'abnormal'**.
    - **'label'** is encoded using **LabelEncoder()**, and the encoded labels are saved in **'intrusion'**.
    - **'label'** is also one-hot-encoded.

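The binary labelling steps above might look like the sketch below. The toy labels and the use of `DataFrame.join` to attach the one-hot columns are assumptions for illustration, not the repository's actual code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels; 'neptune' and 'ipsweep' are attack names from NSL-KDD.
bin_data = pd.DataFrame({"label": ["normal", "neptune", "ipsweep", "normal"]})

# Collapse every attack name into a single 'abnormal' class.
bin_data["label"] = bin_data["label"].map(
    lambda name: "normal" if name == "normal" else "abnormal"
)

# Integer-encode the two classes. LabelEncoder sorts class names,
# so 'abnormal' becomes 0 and 'normal' becomes 1.
encoder = LabelEncoder()
bin_data["intrusion"] = encoder.fit_transform(bin_data["label"])

# One-hot-encode the label while keeping the original 'label' column.
bin_data = bin_data.join(pd.get_dummies(bin_data["label"]))

print(bin_data.columns.tolist())
```

This leaves four label-related columns (original, encoded, and two one-hot columns), which matches the 97 - 93 = 4 attributes excluded later when the input features are selected.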
- ### Multi-class Classification
    - A copy of the DataFrame is created for multi-class classification.
    - The attack label (**'label'** attribute) is classified into five categories: **'normal'**, **'U2R'**, **'R2L'**, **'Probe'**, and **'Dos'**.
    - **'label'** is encoded using **LabelEncoder()**, and the encoded labels are saved in **'intrusion'**.
    - **'label'** is also one-hot-encoded.

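For the multi-class case, the raw attack names first have to be grouped into the five categories. The sketch below uses a partial, illustrative mapping (the dictionary name `ATTACK_CATEGORY` is hypothetical, and the full dataset contains more attack names than are listed here).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Partial, illustrative mapping from raw attack names to the five categories.
ATTACK_CATEGORY = {
    "normal": "normal",
    "neptune": "Dos", "smurf": "Dos", "back": "Dos",
    "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe", "satan": "Probe",
    "guess_passwd": "R2L", "warezclient": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R",
}

multi_data = pd.DataFrame(
    {"label": ["normal", "neptune", "ipsweep", "guess_passwd", "rootkit"]}
)

# Replace each attack name with its category.
multi_data["label"] = multi_data["label"].map(ATTACK_CATEGORY)

# LabelEncoder sorts class names by code point, so capitalised names
# come before 'normal': Dos=0, Probe=1, R2L=2, U2R=3, normal=4.
encoder = LabelEncoder()
multi_data["intrusion"] = encoder.fit_transform(multi_data["label"])

# One-hot-encode the category while keeping the original 'label' column.
multi_data = multi_data.join(pd.get_dummies(multi_data["label"]))

print(multi_data.shape)
```

The result has seven label-related columns (original, encoded, and five one-hot columns), consistent with the 100 - 93 = 7 attributes excluded later for multi-class classification.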
- ### Feature Extraction
    - No. of attributes of **'bin_data'** - 45
    - No. of attributes of **'multi_data'** - 48
    - The attributes of **'bin_data'** and **'multi_data'** are selected using the **Pearson correlation coefficient**.
    - Attributes with a correlation coefficient greater than 0.5 with the target attribute **'intrusion'** were selected.
    - 9 attributes were selected: **'count'**, **'srv_serror_rate'**, **'serror_rate'**, **'dst_host_serror_rate'**, **'dst_host_srv_serror_rate'**, **'logged_in'**, **'dst_host_same_srv_rate'**, **'dst_host_srv_count'**, **'same_srv_rate'**.
    - No. of attributes of **'bin_data'** after feature selection and joining the **'categorical'** DataFrame - 97
    - No. of attributes of **'multi_data'** after feature selection and joining the **'categorical'** DataFrame - 100

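A minimal sketch of correlation-based selection on synthetic data. It assumes the absolute value of the Pearson coefficient is compared against the 0.5 threshold, which the text above does not state; the column values are randomly generated, not taken from NSL-KDD.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
intrusion = rng.integers(0, 2, size=n)

data = pd.DataFrame({
    # Strongly related to the target: target plus a little noise.
    "serror_rate": intrusion + rng.normal(0.0, 0.3, size=n),
    # Unrelated to the target: pure noise.
    "duration": rng.normal(0.0, 1.0, size=n),
    "intrusion": intrusion,
})

# Pearson correlation of every attribute with the target attribute.
corr = data.corr(method="pearson")["intrusion"].drop("intrusion")

# Assumption: threshold applied to the absolute value of the coefficient.
selected = corr[corr.abs() > 0.5].index.tolist()

print(corr.round(2).to_dict())
print(selected)
```

Only the attribute built from the target passes the threshold; the noise column is discarded.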
## Splitting the dataset
- The dataset is split in a 1:4 ratio for testing and training (20% testing, 80% training).
- For binary classification, 93 of the 97 attributes were selected as input features, excluding the target attribute in all its forms (encoded, one-hot-encoded, original).
- The **'intrusion'** attribute was selected as the target attribute.
- For multi-class classification, 93 of the 100 attributes were selected as input features, excluding the target attribute in all its forms (encoded, one-hot-encoded, original).

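The 1:4 split corresponds to `test_size=0.2`. The sketch below assumes scikit-learn's `train_test_split` and uses random placeholder features with 93 columns; the `random_state` value is an arbitrary choice for reproducibility, not something the text specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 93 input features and a binary target.
X = np.random.default_rng(0).normal(size=(100, 93))
y = np.arange(100) % 2

# 1:4 ratio of testing to training data means 20% of the rows are held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```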
## Linear Support Vector Machine
- Binary Classification Accuracy - **96.69 %**
- Multi-class Classification Accuracy - **95.24 %**
- Kernel Type used - **Linear**
- `SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)`

## Quadratic Support Vector Machine
- Binary Classification Accuracy - **95.71 %**
- Multi-class Classification Accuracy - **92.86 %**
- Kernel Type used - **Poly** (the configuration below sets `degree=3`)
- `SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)`

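The two SVM configurations above differ only in the kernel. A hedged sketch of fitting both on synthetic data follows; the dataset from `make_classification` and the `random_state` values are assumptions, so the printed accuracies will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same settings as the documented configurations, apart from the kernel.
linear_svm = SVC(kernel="linear", C=1.0, gamma="auto").fit(X_train, y_train)
poly_svm = SVC(kernel="poly", degree=3, C=1.0, gamma="auto").fit(X_train, y_train)

# score() returns the mean accuracy on the held-out data.
linear_acc = linear_svm.score(X_test, y_test)
poly_acc = poly_svm.score(X_test, y_test)
print(f"linear: {linear_acc:.3f}, poly: {poly_acc:.3f}")
```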
## K-Nearest-Neighbor
- Binary Classification Accuracy - **98.55 %**
- Multi-class Classification Accuracy - **98.29 %**
- No. of neighbors - **5**
- Weights - **Uniform**
- `KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')`

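To make the effect of `n_neighbors=5` with uniform weights concrete, here is a tiny hand-checkable sketch on one-dimensional toy points (the data is invented for illustration).

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups on a line: class 0 near 0-4, class 1 near 10-14.
X = [[0], [1], [2], [3], [4], [10], [11], [12], [13], [14]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Each prediction is a majority vote among the 5 nearest training points,
# with every neighbor counting equally (uniform weights).
knn = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)

predictions = knn.predict([[2.5], [11.5]])
print(predictions.tolist())
```

The five nearest neighbors of 2.5 all belong to class 0 and those of 11.5 all belong to class 1, so the predictions are 0 and 1 respectively.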
## Linear Discriminant Analysis
- Binary Classification Accuracy - **96.70 %**
- Multi-class Classification Accuracy - **93.19 %**
- Solver used - **svd (singular value decomposition)**
- `LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None, solver='svd', store_covariance=False, tol=0.0001)`

## Quadratic Discriminant Analysis
- Binary Classification Accuracy - **68.79 %**
- Multi-class Classification Accuracy - **44.96 %**
- `QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0, store_covariance=False, tol=0.0001)`

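Both discriminant analysis models share the same fit and score interface. The sketch below fits them on two synthetic Gaussian clusters, which is an assumption made for illustration; it says nothing about how either model behaves on the NSL-KDD features.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Two well-separated Gaussian clusters in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 4)),
    rng.normal(3.0, 1.0, size=(100, 4)),
])
y = np.array([0] * 100 + [1] * 100)

# LDA assumes one covariance matrix shared by all classes;
# QDA estimates a separate covariance matrix per class.
lda = LinearDiscriminantAnalysis(solver="svd").fit(X, y)
qda = QuadraticDiscriminantAnalysis(reg_param=0.0).fit(X, y)

lda_acc = lda.score(X, y)
qda_acc = qda.score(X, y)
print(f"LDA: {lda_acc:.3f}, QDA: {qda_acc:.3f}")
```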
## Multi Layer Perceptron
- Binary Classification Accuracy - **97.79 %**
    - **Input layer** with **93** input dimensions
    - **1 Hidden layer** with **50 Neurons** and **relu** activation function
    - Output layer with **1 neuron** and **sigmoid** activation function
    - Loss - **binary_crossentropy**
    - Optimizer - **adam**
    - Batch size - **5000**
    - Epochs - **100**
- Multi-class Classification Accuracy - **96.92 %**
    - **Input layer** with **93** input dimensions
    - **1 Hidden layer** with **50 Neurons** and **relu** activation function
    - Output layer with **5 neurons** and **softmax** activation function
    - Loss - **categorical_crossentropy**
    - Optimizer - **adam**
    - Batch size - **5000**
    - Epochs - **100**

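The layer shapes described above can be checked with a small NumPy forward pass. This is only a sketch of the architecture with random, untrained weights; it is not the training code, and the helper names (`init_layer`, `forward`, `n_params`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Return small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def forward(x, hidden, output, out_activation):
    """Input -> dense + relu -> dense + output activation."""
    (w1, b1), (w2, b2) = hidden, output
    return out_activation(relu(x @ w1 + b1) @ w2 + b2)


def n_params(*layers):
    """Count trainable weights and biases across the given layers."""
    return sum(w.size + b.size for w, b in layers)


x = rng.normal(size=(8, 93))      # a batch of 8 samples, 93 features each
hidden = init_layer(93, 50)       # hidden layer with 50 neurons
out_binary = init_layer(50, 1)    # 1 output neuron (sigmoid)
out_multi = init_layer(50, 5)     # 5 output neurons (softmax)

binary_out = forward(x, hidden, out_binary, sigmoid)
multi_out = forward(x, hidden, out_multi, softmax)

print(binary_out.shape, multi_out.shape)
print(n_params(hidden, out_binary), n_params(hidden, out_multi))
```

The binary network has 93 × 50 + 50 + 50 × 1 + 1 = 4751 parameters and the multi-class network has 93 × 50 + 50 + 50 × 5 + 5 = 4955.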
## Long Short-Term Memory
- Binary Classification Accuracy - **83.05 %**
    - **Input layer** with **93** input dimensions
    - **LSTM** layer with **50 encoding cells**
    - Output layer with **1 neuron** and **sigmoid** activation function
    - Loss - **binary_crossentropy**
    - Optimizer - **adam**
    - Batch size - **5000**
    - Epochs - **100**

## Autoencoder
- Binary Classification Accuracy - **92.26 %**
- Multi-class Classification Accuracy - **91.22 %**
- **Input layer**
- **Encoding layer** with **50 encoding cells**
- **Output layer** and **Decoding Layer** with **softmax** activation function
- Loss - **mean_squared_error**
- Optimizer - **adam**
- Batch size - **500**
- Epochs - **100**

## Citations
- Cosimo Ieracitano, Ahsan Adeel, Francesco Carlo Morabito, Amir Hussain, A Novel Statistical Analysis and Autoencoder Driven Intelligent Intrusion Detection Approach, Neurocomputing (2019), doi: [https://doi.org/10.1016/j.neucom.2019.11.016](https://doi.org/10.1016/j.neucom.2019.11.016)

- The NSL-KDD dataset from the Canadian Institute for Cybersecurity, an updated version of the original KDD Cup 1999 data (KDD99): [https://www.unb.ca/cic/datasets/nsl.html](https://www.unb.ca/cic/datasets/nsl.html)

> Written with [StackEdit](https://stackedit.io/).
