|
| 1 | +# Documentation |
| 2 | +## Overview |
| 3 | +Loosely based on the research paper **A Novel Statistical Analysis and Autoencoder Driven Intelligent Intrusion Detection Approach** |
| 4 | +[https://doi.org/10.1016/j.neucom.2019.11.016](https://doi.org/10.1016/j.neucom.2019.11.016) |
| 5 | + |
| 6 | +## Datasets |
| 7 | +- **bin_data.csv** - CSV Dataset file for Binary Classification |
| 8 | +- **multi_data.csv** - CSV Dataset file for Multi-class Classification |
| 9 | +- **KDDTrain+.txt** - Original dataset file as downloaded |
| 10 | + |
| 11 | +The NSL-KDD dataset from the Canadian Institute for Cybersecurity, an updated version of the original KDD Cup 1999 Data (KDD99): [https://www.unb.ca/cic/datasets/nsl.html](https://www.unb.ca/cic/datasets/nsl.html) |
| 12 | + |
| 13 | +## Machine Learning Models |
| 14 | + |
| 15 | + - Linear Support Vector Machine |
| 16 | + - Quadratic Support Vector Machine |
| 17 | + - K-Nearest-Neighbor |
| 18 | + - Linear Discriminant Analysis |
| 19 | + - Quadratic Discriminant Analysis |
| 20 | + - Multi Layer Perceptron |
| 21 | + - Long Short-Term Memory |
| 22 | + - Autoencoder |
| 23 | + |
| 24 | +## Data Preprocessing |
| 25 | +- The dataset had 43 attributes; the **'difficulty_level'** attribute was dropped. |
| 26 | + |
| 27 | +- ### Data Normalization |
| 28 | + - The 38 numeric columns of the DataFrame are scaled using **StandardScaler**. |
| 29 | + |
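The scaling step can be sketched as follows. This is a minimal illustration on a toy DataFrame, assuming scikit-learn's `StandardScaler`. The two numeric column names are taken from NSL-KDD, but the values are made up and `numeric_cols` is a hypothetical variable name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the NSL-KDD frame; values are invented for illustration.
df = pd.DataFrame({
    "duration": [0.0, 2.0, 4.0, 6.0],
    "src_bytes": [100.0, 200.0, 300.0, 400.0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
})

# Pick the numeric columns and scale each one to zero mean and unit variance.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df)
```

Categorical columns such as `protocol_type` are left untouched here; they are handled by one-hot encoding.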
| 30 | +- ### One-hot-encoding |
| 31 | + - Categorical Columns **'protocol_type'**, **'service'**, **'flag'** are one-hot-encoded using **pd.get_dummies()**. |
| 32 | + - The **'categorical'** DataFrame had 84 attributes after one-hot-encoding. |
| 33 | + |
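A small sketch of the one-hot-encoding step with `pd.get_dummies()`. The three rows are invented sample values; on the full dataset the same call produces one column per distinct category.

```python
import pandas as pd

# Invented sample rows using the three categorical columns named above.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "ftp_data"],
    "flag": ["SF", "SF", "S0"],
})

# Each category becomes its own indicator column, prefixed with the source column name.
categorical = pd.get_dummies(df[["protocol_type", "service", "flag"]])

print(list(categorical.columns))
```

Every row has exactly one active indicator per original column, so each row of `categorical` sums to 3 in this example.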
| 34 | +- ### Binary Classification |
| 35 | + - A copy of DataFrame is created for Binary Classification. |
| 36 | + - The attack label (**'label'** attribute) is grouped into two categories, **'normal'** and **'abnormal'**. |
| 37 | + - **'label'** is encoded using **LabelEncoder()**, and the encoded labels are saved in **'intrusion'**. |
| 38 | + - **'label'** is one-hot-encoded. |
| 39 | + |
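The binary labelling steps might look like the sketch below. The attack names are a few examples from NSL-KDD, and keeping the original **'label'** column next to the encoded and one-hot-encoded versions is an assumption based on the attribute counts reported under Feature Extraction.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few example labels; the real column holds many more attack names.
bin_data = pd.DataFrame({"label": ["normal", "neptune", "normal", "satan"]})

# Collapse every attack type into a single 'abnormal' class.
bin_data["label"] = bin_data["label"].map(
    lambda name: "normal" if name == "normal" else "abnormal"
)

# Integer-encode the label; LabelEncoder orders classes alphabetically.
encoder = LabelEncoder()
bin_data["intrusion"] = encoder.fit_transform(bin_data["label"])

# Append one indicator column per class while keeping the original label.
bin_data = bin_data.join(pd.get_dummies(bin_data["label"]))

print(bin_data)
```

Because the classes are sorted alphabetically, **'abnormal'** maps to 0 and **'normal'** maps to 1.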
| 40 | +- ### Multi-class Classification |
| 41 | + - A copy of DataFrame is created for Multi-class Classification. |
| 42 | + - The attack label (**'label'** attribute) is grouped into five categories: **'normal'**, **'U2R'**, **'R2L'**, **'Probe'**, **'Dos'**. |
| 43 | + - **'label'** is encoded using **LabelEncoder()**, and the encoded labels are saved in **'intrusion'**. |
| 44 | + - **'label'** is one-hot-encoded. |
| 45 | + |
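Grouping attack names into the five categories can be sketched with a lookup table. The mapping below is deliberately partial and illustrative (`ATTACK_CATEGORY` is a hypothetical name); the full dataset contains more attack names than are listed here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Partial, illustrative mapping from NSL-KDD attack names to the five categories.
ATTACK_CATEGORY = {
    "normal": "normal",
    "neptune": "Dos", "smurf": "Dos",
    "satan": "Probe", "ipsweep": "Probe",
    "guess_passwd": "R2L", "warezclient": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R",
}

multi_data = pd.DataFrame(
    {"label": ["normal", "neptune", "satan", "guess_passwd", "rootkit"]}
)
multi_data["label"] = multi_data["label"].map(ATTACK_CATEGORY)

# Integer-encode the five categories into the 'intrusion' column.
encoder = LabelEncoder()
multi_data["intrusion"] = encoder.fit_transform(multi_data["label"])

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

`LabelEncoder` sorts class names by character code, so the capitalised categories receive the lower integers and **'normal'** receives the highest.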
| 46 | +- ### Feature Selection |
| 47 | + - No. of attributes of **'bin_data'** - 45 |
| 48 | + - No. of attributes of **'multi_data'** - 48 |
| 49 | + - The attributes of **'bin_data'** and **'multi_data'** are selected using the **Pearson correlation coefficient**. |
| 50 | + - Attributes whose correlation coefficient with the target attribute **'intrusion'** exceeds 0.5 were selected. |
| 51 | + - 9 attributes were selected: **'count'**, **'srv_serror_rate'**, **'serror_rate'**, **'dst_host_serror_rate'**, **'dst_host_srv_serror_rate'**, **'logged_in'**, **'dst_host_same_srv_rate'**, **'dst_host_srv_count'**, **'same_srv_rate'**. |
| 52 | + - No. of attributes of **'bin_data'** after feature selection and joining **'categorical'** DataFrame - 97 |
| 53 | + - No. of attributes of **'multi_data'** after feature selection and joining **'categorical'** DataFrame - 100 |
| 54 | + |
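A minimal sketch of correlation-based selection on synthetic data. Using the absolute value of the coefficient is an assumption, since the text does not say whether strongly negative correlations are also kept. The column names are borrowed from the list above, but the values are randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
intrusion = rng.integers(0, 2, size=200)

# Synthetic columns: two that track the target and one that is pure noise.
df = pd.DataFrame({
    "serror_rate": intrusion + rng.normal(0, 0.3, size=200),
    "logged_in": -intrusion + rng.normal(0, 0.3, size=200),
    "duration": rng.normal(0, 1, size=200),
    "intrusion": intrusion,
})

# Pearson correlation of every column with the target, excluding the target itself.
corr = df.corr(method="pearson")["intrusion"].drop("intrusion")

# Keep columns whose absolute correlation exceeds the 0.5 threshold (assumption: abs).
selected = corr[corr.abs() > 0.5].index.tolist()

print(corr.round(3))
print(selected)
```

The selected columns would then be joined with the one-hot-encoded **'categorical'** DataFrame to form the final feature set.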
| 55 | +## Splitting the dataset |
| 56 | +- The dataset is split in a 1:4 ratio for testing and training (20% test, 80% train). |
| 57 | +- For Binary Classification, 93 of the 97 attributes were selected as features, excluding the target attribute in its original, encoded, and one-hot-encoded forms. |
| 58 | +- The **'intrusion'** attribute was selected as the target attribute. |
| 59 | +- For Multi-class Classification, 93 of the 100 attributes were selected as features, excluding the target attribute in its original, encoded, and one-hot-encoded forms. |
| 60 | + |
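The 1:4 split corresponds to `test_size=0.2` in scikit-learn's `train_test_split`. The sketch below uses placeholder arrays with 93 feature columns; the `random_state` value is an assumption added for reproducibility and is not stated in this document.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 93 features and a binary target.
X = np.arange(100 * 93, dtype=float).reshape(100, 93)
y = np.arange(100) % 2

# 20% of the rows go to the test set, 80% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```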
| 61 | +## Linear Support Vector Machine |
| 62 | +- Binary Classification Accuracy - **96.69 %** |
| 63 | +- Multi-class Classification Accuracy - **95.24 %** |
| 64 | +- Kernel Type used - **Linear** |
| 65 | +- `SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)` |
| 66 | + |
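A sketch of fitting a linear-kernel `SVC` with the parameters listed above. The data comes from `make_classification` rather than NSL-KDD, so the accuracy it prints is not comparable to the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification problem standing in for the real features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Linear kernel with the same C and gamma settings as in the listing above.
clf = SVC(kernel="linear", C=1.0, gamma="auto")
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy: {accuracy:.4f}")
```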
| 67 | +## Quadratic Support Vector Machine |
| 68 | +- Binary Classification Accuracy - **95.71 %** |
| 69 | +- Multi-class Classification Accuracy - **92.86 %** |
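| 70 | +- Kernel Type used - **Poly** (with `degree=3`, as shown in the parameters below, the polynomial kernel is cubic rather than quadratic) |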
| 71 | +- `SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)` |
| 72 | + |
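For reference, scikit-learn's polynomial kernel is `(gamma * <x, x'> + coef0) ** degree`, and `gamma='auto'` means `1 / n_features`. The sketch below evaluates that formula by hand for two made-up vectors and compares it with `sklearn.metrics.pairwise.polynomial_kernel`, using the `degree=3` and `coef0=0.0` values from the listing above.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Two made-up sample vectors with four features each.
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.5, 0.0, 1.0, 2.0]])

n_features = X.shape[1]
gamma = 1.0 / n_features   # what gamma='auto' resolves to
degree, coef0 = 3, 0.0     # values from the SVC parameters above

# Kernel matrix computed directly from the formula.
manual = (gamma * (X @ X.T) + coef0) ** degree

# The same matrix from scikit-learn's helper.
library = polynomial_kernel(X, X, degree=degree, gamma=gamma, coef0=coef0)

print(manual)
```

Passing `degree=2` to `SVC` would give a quadratic kernel instead.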
| 73 | +## K-Nearest-Neighbor |
| 74 | +- Binary Classification Accuracy - **98.55 %** |
| 75 | +- Multi-class Classification Accuracy - **98.29 %** |
| 76 | +- No. of neighbors - **5** |
| 77 | +- Weights - **Uniform** |
| 78 | +- `KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')` |
| 79 | + |
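A toy example of the classifier configuration above: with `n_neighbors=5` and uniform weights, each prediction is a majority vote among the five closest training points (Euclidean distance, since `p=2`). The one-dimensional data is invented so that the vote is easy to follow by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters on a line: class 0 near 0-4, class 1 near 10-14.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0],
              [10.0], [11.0], [12.0], [13.0], [14.0]])
y = np.array([0] * 5 + [1] * 5)

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", metric="minkowski", p=2)
knn.fit(X, y)

# 2.5 sits inside the first cluster, 9.5 is closest to the second.
pred = knn.predict([[2.5], [9.5]])
print(pred)  # [0 1]
```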
| 80 | +## Linear Discriminant Analysis |
| 81 | +- Binary Classification Accuracy - **96.70 %** |
| 82 | +- Multi-class Classification Accuracy - **93.19 %** |
| 83 | +- Solver used - **svd (singular value decomposition)** |
| 84 | +- `LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None, solver='svd', store_covariance=False, tol=0.0001)` |
| 85 | + |
| 86 | +## Quadratic Discriminant Analysis |
| 87 | +- Binary Classification Accuracy - **68.79 %** |
| 88 | +- Multi-class Classification Accuracy - **44.96 %** |
| 89 | +- `QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0, store_covariance=False, tol=0.0001)` |
| 90 | + |
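Both discriminant analysis models share the same `fit`/`score` interface, so they can be compared side by side. The sketch below does that on synthetic data with the solver and `reg_param` values listed above; its scores say nothing about the NSL-KDD results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Synthetic data without redundant features, to avoid collinearity warnings.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=10, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LDA": LinearDiscriminantAnalysis(solver="svd"),
    "QDA": QuadraticDiscriminantAnalysis(reg_param=0.0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
print(scores)
```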
| 91 | +## Multi Layer Perceptron |
| 92 | +- Binary Classification Accuracy - **97.79 %** |
| 93 | + - **Input layer** with **93** input dimensions |
| 94 | + - **1 Hidden layer** with **50 Neurons** and **relu** activation function |
| 95 | + - Output layer with **1 neuron** and **sigmoid** activation function |
| 96 | + - Loss - **binary_crossentropy** |
| 97 | + - Optimizer - **adam** |
| 98 | + - Batch size - **5000** |
| 99 | + - Epochs - **100** |
| 100 | +- Multi-class Classification Accuracy - **96.92 %** |
| 101 | + - **Input layer** with **93** input dimensions |
| 102 | + - **1 Hidden layer** with **50 Neurons** and **relu** activation function |
| 103 | + - Output layer with **5 neurons** and **softmax** activation function |
| 104 | + - Loss - **categorical_crossentropy** |
| 105 | + - Optimizer - **adam** |
| 106 | + - Batch size - **5000** |
| 107 | + - Epochs - **100** |
| 108 | + |
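To make the layer sizes concrete, here is a NumPy sketch of the forward pass for the binary model (93 inputs, 50 relu units, 1 sigmoid output). The weights are random and untrained, and the names `W1`, `b1`, `W2`, `b2` and `forward` are hypothetical. It only illustrates the shapes and the parameter count, not the training code used for the reported accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 93, 50, 1

# Randomly initialised (untrained) weights and zero biases.
W1 = rng.normal(0, 0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)


def forward(x):
    """Map a batch of shape (n, 93) to probabilities of shape (n, 1)."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # relu hidden layer
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid output


batch = rng.normal(size=(4, n_inputs))
probs = forward(batch)

# Trainable parameters: 93*50 + 50 for the hidden layer, 50*1 + 1 for the output.
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)  # (4, 1) 4751
```

The multi-class variant differs only in the output layer, which has 5 neurons with softmax instead of 1 neuron with sigmoid.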
| 109 | +## Long Short-Term Memory |
| 110 | +- Binary Classification Accuracy - **83.05 %** |
| 111 | +- **Input layer** with **93** input dimensions |
| 112 | +- **LSTM** layer with **50 encoding cells** |
| 113 | +- Output layer with **1 neuron** and **sigmoid** activation function |
| 114 | +- Loss - **binary_crossentropy** |
| 115 | +- Optimizer - **adam** |
| 116 | +- Batch Size - **5000** |
| 117 | +- Epochs - **100** |
| 118 | + |
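Recurrent layers in Keras-style frameworks expect three-dimensional input of shape `(samples, timesteps, features)`, so the flat 93-column feature matrix has to be reshaped before it reaches the LSTM layer. This document does not say which layout was used. The sketch below assumes a single timestep carrying all 93 features; 93 timesteps of 1 feature each is another possibility.

```python
import numpy as np

# Placeholder batch: 8 samples with 93 features each.
X = np.zeros((8, 93), dtype="float32")

# Assumed layout: one timestep per sample, 93 features per timestep.
X_seq = X.reshape(X.shape[0], 1, X.shape[1])

print(X_seq.shape)  # (8, 1, 93)
```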
| 119 | +## Autoencoder |
| 120 | +- Binary Classification Accuracy - **92.26 %** |
| 121 | +- Multi-class Classification Accuracy - **91.22 %** |
| 122 | +- **Input layer** |
| 123 | +- **Encoding layer** with **50 encoding cells** |
| 124 | +- **Output layer** and **Decoding Layer** with **softmax** activation function |
| 125 | +- Loss - **mean_squared_error** |
| 126 | +- Optimizer - **adam** |
| 127 | +- Batch Size - **500** |
| 128 | +- Epochs - **100** |
| 129 | + |
| 130 | +## Citations |
| 131 | +- Cosimo Ieracitano, Ahsan Adeel, Francesco Carlo Morabito, Amir Hussain, A Novel Statistical Analysis and Autoencoder Driven Intelligent Intrusion Detection Approach, Neurocomputing (2019), doi: [https://doi.org/10.1016/j.neucom.2019.11.016](https://doi.org/10.1016/j.neucom.2019.11.016) |
| 132 | + |
| 133 | +- The NSL-KDD dataset from the Canadian Institute for Cybersecurity, an updated version of the original KDD Cup 1999 Data (KDD99): [https://www.unb.ca/cic/datasets/nsl.html](https://www.unb.ca/cic/datasets/nsl.html) |
| 134 | + |
| 135 | +> Written with [StackEdit](https://stackedit.io/). |