In the evolving landscape of automotive technology, securing in-vehicle networks is crucial. This project proposes a Machine Learning-based Intrusion Detection System (IDS) with a multi-tier hybrid architecture that integrates signature-based detection (supervised learning) and anomaly-based detection (unsupervised learning). This approach combines the accuracy of signature-based detection for known threats with the adaptability of anomaly-based detection for new threats, offering a robust and comprehensive security solution for vehicular networks.
| Name | Algorithm Type | Strong Point | Utilized | Reason |
|---|---|---|---|---|
| Decision Tree | Supervised Learning | Interpretability, minimal data preparation, non-parametric, feature importance. | - | Random forests already use decision trees and are more accurate. Good for speed only. |
| Random Forests | Supervised Learning | Accuracy, robustness, versatility, feature importance. | YES | Best for a robust, accurate model that mitigates overfitting. |
| Extreme Gradient Boosting | Supervised Learning | Performance, regularization, handles missing data, parallel processing. | YES | Best for high performance and efficiency on large datasets. |
| Support Vector Machine | Supervised Learning | Effective in high dimensions, memory efficient, and versatile. | - | Less efficient and slower than Random forests for large datasets. |
| K-Means Clustering | Unsupervised Learning | Simplicity, scalability, and speed | - | Not applicable (supervised learning algorithms preferred for this task) |

| Dataset | Description | Link | Reference |
|---|---|---|---|
| Heavy duty truck CAN-bus dataset | This dataset features over 180 hours of CAN bus traffic from a Renault Euro VI heavy-duty truck across various driving conditions. | Dataset Link | University of Turku, May 31, 2021 |
| can-train-and-test | Controller Area Network (CAN) traffic for the 2017 Subaru Forester, the 2016 Chevrolet Silverado, the 2011 Chevrolet Traverse, and the 2011 Chevrolet Impala. | Dataset Link | Brooke Lampe, Weizhi Meng, January 17, 2024 |
| Car-Hacking Dataset for intrusion detection | Datasets covering DoS attacks, fuzzy attacks, drive gear spoofing, and RPM gauge spoofing. Constructed by logging CAN traffic via the OBD-II port of a real vehicle while message injection attacks were performed. Each dataset contains 300 message injection intrusions. | Dataset Link | Eunbi Seo, Hyun Min Song, Huy Kang Kim, July 18, 2019 |
Each test dataset category contains several attack subsets, which are used to evaluate the model trained on the training dataset.
| Category | Attacks | Test Purpose |
|---|---|---|
| Known Vehicle Known Attacks | DoS, force_neutral, rpm, standstill | To test trained model of known vehicle with known attacks. |
| Known Vehicle Unknown Attacks | Double, fuzzing, interval, speed, systematic, triple | To test trained model of known vehicle with unknown attacks. |
| Unknown Vehicle Known Attack | DoS, force_neutral, rpm, standstill | To test trained model of unknown vehicle with known attacks. |
| Unknown Vehicle Unknown Attack | Double, fuzzing, interval, speed, systematic, triple | To test trained model of unknown vehicle with unknown attacks. |
- Import the necessary data processing libraries.

      import pandas as pd
      from sklearn import preprocessing
- Read in all CSV training datasets and merge the separated files of each subset into one DataFrame.

      # for example:
      attack_free_1 = pd.read_csv(dataset_dir + "attack-free-1.csv")
      attack_free_2 = pd.read_csv(dataset_dir + "attack-free-2.csv")
      DoS_1 = pd.read_csv(dataset_dir + "DoS-1.csv")
      DoS_2 = pd.read_csv(dataset_dir + "DoS-2.csv")
      # the remaining files (accessory_1, accessory_2, ...) are read the same way
      attack_free = pd.concat([attack_free_1, attack_free_2])
      DoS = pd.concat([DoS_1, DoS_2])
      accessory = pd.concat([accessory_1, accessory_2])
- Relabel the `attack` column of each subset so that every attack type has its own unique class value.

      # accessory rows flagged 0 become class 1; attack rows (flag 1) in the
      # other subsets become classes 2 to 5
      accessory['attack'] = accessory['attack'].replace(0, 1)
      DoS['attack'] = DoS['attack'].replace(1, 2)
      force_neutral['attack'] = force_neutral['attack'].replace(1, 3)
      rpm['attack'] = rpm['attack'].replace(1, 4)
      standstill['attack'] = standstill['attack'].replace(1, 5)
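The relabeling scheme can be summarized as a small lookup, sketched here in plain Python. `RELABEL_RULES` and `relabel` are hypothetical names introduced for illustration and are not part of the project code; the class numbers are taken from the `replace` calls above, and rows without a matching rule are assumed to keep their original flag.

```python
# Hypothetical summary of the relabeling scheme (illustration only).
# Each subset maps one raw flag in its "attack" column to a class id.
RELABEL_RULES = {
    "accessory": (0, 1),       # raw flag 0 becomes class 1
    "DoS": (1, 2),             # raw flag 1 becomes class 2
    "force_neutral": (1, 3),
    "rpm": (1, 4),
    "standstill": (1, 5),
}


def relabel(subset: str, raw_flag: int) -> int:
    """Return the class id for one row; flags without a rule are unchanged."""
    old, new = RELABEL_RULES.get(subset, (None, None))
    return new if raw_flag == old else raw_flag


labels = [relabel("DoS", flag) for flag in (0, 1, 1, 0)]
print(labels)  # [0, 2, 2, 0]
```

Normal frames inside an attack file keep the flag 0, so only the injected frames receive the attack-specific class.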
- Concatenate all attack subsets into one as the training dataset.

      merged_datasets = pd.concat([attack_free, accessory, DoS, force_neutral, rpm, standstill])
- Encode the columns with categorical data and save the processed training dataset to CSV.

      label_encoder = preprocessing.LabelEncoder()
      # fit_transform refits the encoder on each column it is given
      merged_datasets["arbitration_id"] = label_encoder.fit_transform(merged_datasets["arbitration_id"])
      merged_datasets["data_field"] = label_encoder.fit_transform(merged_datasets["data_field"])
      merged_datasets.to_csv("updated_dataset.csv", sep=',', index=False, encoding='utf-8')
- Import and read the test dataset CSV files into a DataFrame for each test data subset.

      # for example:
      DoS_3 = pd.read_csv(dataset_dir + "DoS-3.csv")
      DoS_4 = pd.read_csv(dataset_dir + "DoS-4.csv")
      force_neutral_3 = pd.read_csv(dataset_dir + "force-neutral-3.csv")
      force_neutral_4 = pd.read_csv(dataset_dir + "force-neutral-4.csv")
- Merge the test subsets, resulting in one unique subset for each attack type.

      # merge related datasets
      double = pd.concat([double_3, double_4])
      fuzzing = pd.concat([fuzzing_3, fuzzing_4])
      interval = pd.concat([interval_3, interval_4])
      speed = pd.concat([speed_3, speed_4])
      systematic = pd.concat([systematic_3, systematic_4])
      triple = pd.concat([triple_3, triple_4])
- Encode the columns with categorical values.

      # encode each test subset (the encoder is refit on every call)
      double["arbitration_id"] = label_encoder.fit_transform(double["arbitration_id"])
      double["data_field"] = label_encoder.fit_transform(double["data_field"])
      fuzzing["arbitration_id"] = label_encoder.fit_transform(fuzzing["arbitration_id"])
      fuzzing["data_field"] = label_encoder.fit_transform(fuzzing["data_field"])
      interval["arbitration_id"] = label_encoder.fit_transform(interval["arbitration_id"])
      interval["data_field"] = label_encoder.fit_transform(interval["data_field"])
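Because `fit_transform` is called again on each test subset, the integer codes produced for the test data need not match the codes learned from the training data. One way to keep them aligned, sketched here in plain Python, is to build the mapping once from the training values and reuse it, sending unseen values to a sentinel. `build_codes`, `encode`, the `-1` sentinel, and the example IDs are all assumptions made for illustration, not the project's method.

```python
def build_codes(train_values):
    """Assign a stable integer code to each distinct training value."""
    return {value: code for code, value in enumerate(sorted(set(train_values)))}


def encode(values, codes, unknown=-1):
    """Encode values with an existing mapping; unseen values get `unknown`."""
    return [codes.get(value, unknown) for value in values]


train_ids = ["0C1", "1A0", "0C1", "2F4"]   # made-up arbitration IDs
test_ids = ["2F4", "0C1", "3B7"]           # "3B7" never appeared in training

codes = build_codes(train_ids)
print(encode(train_ids, codes))  # [0, 1, 0, 2]
print(encode(test_ids, codes))   # [2, 0, -1]
```

With this approach the same arbitration ID always receives the same code, whichever subset it appears in.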
- Save the processed testing dataset to CSV format.
The "Algo" class contains the implementation of the 3 selected algorithms for testing as seen in the table above, namely: Random Forests, Extreme Gradient Boosting, and KMeans Clustering Algorithm.
Random Forests is an ensemble learning method for classification and regression. It operates by constructing multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. This approach helps improve accuracy and control overfitting.
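The "mode of the classes" step can be sketched in a few lines of plain Python. The per-tree votes below are made up for illustration, and `majority_vote` is a hypothetical helper rather than part of the project code.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up votes from five trees for three CAN frames (0 = normal, 2 = DoS).
votes_per_frame = [
    [0, 0, 2, 0, 0],
    [2, 2, 2, 0, 2],
    [0, 2, 2, 2, 0],
]
print([majority_vote(votes) for votes in votes_per_frame])  # [0, 2, 2]
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch shows the general idea and not that library's exact procedure.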
The Random Forests algorithm is implemented using the scikit-learn machine learning library, specifically the `RandomForestClassifier`, as seen below:

    import time
    from sklearn.ensemble import RandomForestClassifier

    # method of the Algo class
    def impl_random_forests(self):
        # Initiate and train the model, timing the fit
        start_time = time.time()
        self.__rf = RandomForestClassifier(random_state=42, n_jobs=-1)
        self.__rf.fit(self.split_dataset.x_features, self.split_dataset.y_features.values.ravel())
        close_time = time.time()
        print(f"==> ml-algo [Random Forests Implemented in {close_time - start_time}]")

The `RandomForestClassifier` is fit on the training dataset, which the `read_dataset_and_split` function first splits into `x_features`, the independent variables (features), and `y_features`, the dependent (target) variable, i.e. the column to be predicted, flattened to a 1D array.
Extreme Gradient Boosting (XGBoost) is an optimized implementation of gradient boosting designed to be highly efficient and scalable. It builds an ensemble of trees sequentially, where each new tree attempts to correct the errors made by the previous ones. XGBoost includes advanced features like regularization to prevent overfitting and parallel processing to speed up training.
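The idea of each new tree correcting the errors of the previous ones can be illustrated with a deliberately small sketch: one-split trees (stumps) fit to the residuals of a squared-error regression on a single feature. This is plain Python written for illustration, with made-up data and hypothetical helper names; it is not how XGBoost or scikit-learn implement boosting, and it omits regularization and parallelism.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fit to the current residuals."""
    predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean target
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return predictions


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]   # made-up feature values
ys = [0, 0, 1, 0, 1, 1]   # made-up targets (1 = attack)
for rounds in (0, 1, 20):
    print(rounds, round(mean_squared_error(ys, boost(xs, ys, rounds)), 4))
```

The printed training error shrinks as rounds are added, because every stump is fit to whatever the ensemble so far still gets wrong.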
Gradient boosting is implemented here using the scikit-learn machine learning library, specifically the `GradientBoostingClassifier`, which is scikit-learn's own gradient boosting implementation rather than the separate XGBoost library, as seen below:

    import time
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    # method of the Algo class
    def impl_xgboost(self):
        # Initiate and train the model, timing the fit
        start_time = time.time()
        self.__xgb = GradientBoostingClassifier()
        # flatten the target column to the 1D array expected by fit
        y_features = np.ravel(self.split_dataset.y_features)
        self.__xgb.fit(self.split_dataset.x_features, y_features)
        close_time = time.time()
        print(f"==> ml-algo [Gradient Boosting Implemented in {close_time - start_time}]")

The `GradientBoostingClassifier` is fit on the training dataset, which the `read_dataset_and_split` function first splits into `x_features`, the independent variables (features), and `y_features`, the dependent (target) variable, i.e. the column to be predicted, flattened to a 1D array.
K-Means Algorithm is a popular clustering method used to partition a dataset into K distinct, non-overlapping subsets (clusters). Each data point is assigned to the cluster with the nearest mean, and the process is repeated iteratively to minimize the variance within clusters. It is widely used for exploratory data analysis and pattern recognition.
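The assign-then-update loop described above can be sketched in plain Python. The points and starting centroids are made up, and `nearest` and `kmeans` are hypothetical helpers; fixed starting centroids are used here so the result is reproducible, which differs from a random initialization.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, centroids, n_iterations=10):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    for _ in range(n_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, [nearest(point, centroids) for point in points]


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each centroid ends up at the mean of the points assigned to it, which is the step that reduces the variance within clusters.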
The K-Means algorithm is implemented using the scikit-learn machine learning library, specifically `KMeans`, as seen below:

    import time
    import pandas as pd
    from sklearn.cluster import KMeans

    # method of the Algo class
    def impl_kmeans(self):
        # combine features and target into a single DataFrame (no scaling is applied here)
        df = pd.concat([self.split_dataset.x_features, self.split_dataset.y_features], axis=1)
        start_time = time.time()
        self.__kmeans = KMeans(init="random", n_init='auto', random_state=1)
        self.__kmeans.fit(df)
        close_time = time.time()
        print(f"==> ml-algo [KMeans Implemented in {close_time - start_time}]")

K-Means is fit on the training dataset without splitting it into features (independent variables) and target (dependent variable). Because it is an unsupervised learning algorithm, the pre-processed training dataset is passed as is for fitting.
Testing used two datasets produced during data processing: the CAN training dataset and the testing dataset. The trained model was evaluated against two categories of test datasets, namely:
- Known Vehicle Known Attack Dataset
- Unknown Vehicle Known Attack Dataset
    # load the raw test subsets for the "known vehicle, known attack" category
    test_data_kv_ka_raw_dos = get_test_dataset(test_type=TestDataType.kv_ka, attack_type=AttackType.dos)
    test_data_kv_ka_raw_fn = get_test_dataset(test_type=TestDataType.kv_ka, attack_type=AttackType.force_neutral)
    test_data_kv_ka_raw_rpm = get_test_dataset(test_type=TestDataType.kv_ka, attack_type=AttackType.rpm)
    test_data_kv_ka_raw_ss = get_test_dataset(test_type=TestDataType.kv_ka, attack_type=AttackType.standstill)

    # drop the target column so that only the features are passed to the model
    test_data_kv_ka_dos = test_data_kv_ka_raw_dos.drop(columns=["attack"])
    test_data_kv_ka_fn = test_data_kv_ka_raw_fn.drop(columns=["attack"])
    test_data_kv_ka_rpm = test_data_kv_ka_raw_rpm.drop(columns=["attack"])
    test_data_kv_ka_ss = test_data_kv_ka_raw_ss.drop(columns=["attack"])

    # replace the attack column categorical values in all test datasets
    # to match the attack column values in the training dataset
    y_true_dos = test_data_kv_ka_raw_dos["attack"].replace(1, 2)  # 2 signifies DoS attacks
    y_true_fn = test_data_kv_ka_raw_fn["attack"].replace(1, 3)    # 3 signifies force_neutral attacks
    y_true_rpm = test_data_kv_ka_raw_rpm["attack"].replace(1, 4)  # 4 signifies rpm attacks
    y_true_ss = test_data_kv_ka_raw_ss["attack"].replace(1, 5)    # 5 signifies standstill attacks

    start_time = time.time()
    y_pred_dos = algo.predict(test_data_kv_ka_dos, AlgoToPredict.random_forest)
    y_pred_fn = algo.predict(test_data_kv_ka_fn, AlgoToPredict.random_forest)
    y_pred_rpm = algo.predict(test_data_kv_ka_rpm, AlgoToPredict.random_forest)
    y_pred_ss = algo.predict(test_data_kv_ka_ss, AlgoToPredict.random_forest)

After the predictions are generated for the two test dataset categories, "known vehicle known attack" and "unknown vehicle known attack", the `y_true_*` values (the ground-truth labels of the `attack` column) and the `y_pred_*` values (the labels predicted by the trained model) are passed to the `generate_metrics_report` function to generate a metrics report. This function calculates and returns the accuracy, precision, recall, F1 score, ROC curve, ROC AUC, and a confusion matrix.
More information on the generated metrics can be found in the table below:
| Scoring Metrics | Explanation |
|---|---|
| Confusion Matrix | Summarizes the performance of a classification model by showing the counts of correct and incorrect predictions for each class |
| Detection Accuracy | Proportion of all traffic samples, normal and intrusive, that are classified correctly |
| Precision | Proportion of true positive predictions among all positive predictions |
| Recall | Proportion of true positive predictions among all actual positive instances |
| ROC curve | Plot of true positive rate vs false positive rate at various classification thresholds |
| ROC AUC | Area under the ROC curve, measuring overall classification performance |
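The count-based metrics in the table can be computed directly from the true and predicted labels. The sketch below is plain Python with made-up labels and treats one class as "positive" against all others; `confusion_counts` and `basic_metrics` are hypothetical helpers, and this is not the implementation of `generate_metrics_report`, which is not shown in this document.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [0, 0, 2, 2, 2, 0, 2, 0]   # made-up labels: 0 = normal, 2 = DoS
y_pred = [0, 2, 2, 2, 0, 0, 2, 0]
print(confusion_counts(y_true, y_pred, positive=2))  # (3, 1, 1, 3)
print(basic_metrics(y_true, y_pred, positive=2))
```

In this example one normal frame is flagged as DoS and one DoS frame is missed, so accuracy, precision, recall, and F1 all come out to 0.75.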