Random Forest is one of the most popular supervised ensemble learning methods in machine learning. It builds a set of randomized decision trees and classifies data instances, both seen during training and unseen, by majority voting among the trees; decision tree induction serves as its base classifier.
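As a rough illustration of this bagging-plus-majority-voting scheme (not the paper's code), the following Python sketch trains randomized trees on bootstrap samples and lets them vote; the synthetic dataset and the choice of 25 trees are arbitrary assumptions:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each tree sees a bootstrap sample; max_features="sqrt" gives the random
# per-split feature subsets characteristic of Random Forest.
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), len(X))  # sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

def majority_vote(X_new):
    # Collect one vote per tree, then take the most common class per sample.
    votes = np.array([t.predict(X_new) for t in trees])  # (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

print("ensemble:", majority_vote(X[:5]), " true:", y[:5])
```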
A decision tree is built by a top-down, divide-and-conquer recursive algorithm that applies a feature selection measure to choose the best (root) feature at each split. Widely used induction algorithms include the following (a sketch of the shared selection step appears after the list):
- ID3 (Iterative Dichotomiser 3)
- C4.5 (an extension of ID3)
- CART (Classification and Regression Tree)
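These algorithms differ mainly in the selection measure: ID3 uses information gain, C4.5 normalizes it into gain ratio, and CART uses Gini impurity. A minimal Python sketch of information gain on a made-up categorical dataset (the features and labels are purely illustrative):

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum p * log2(p) over the observed class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    # Gain = H(labels) minus the weighted entropy of each child partition.
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy dataset: two candidate features, one binary label.
outlook = np.array(["sun", "sun", "rain", "rain", "sun", "rain"])
windy   = np.array([0, 1, 0, 1, 0, 1])
play    = np.array([1, 0, 1, 0, 1, 1])

# The feature with the larger gain would be chosen as the root.
for name, f in [("outlook", outlook), ("windy", windy)]:
    print(name, "gain =", round(information_gain(f, play), 3))
```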
Our approach follows a systematic two-stage process:
Stage 1: Data Clustering
We partition the data into several clusters with the K-Means clustering algorithm, creating homogeneous subgroups within the dataset.
Stage 2: Ensemble Classification
We then apply the Random Forest technique independently to each cluster, exploiting the smaller size and greater homogeneity of the per-cluster data.
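A minimal sketch of the two-stage pipeline, with scikit-learn's KMeans and RandomForestClassifier standing in for the paper's implementation; the synthetic data, k = 3 clusters, and 100 trees per forest are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: partition the training data into homogeneous subgroups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)

# Stage 2: train one Random Forest per cluster.
forests = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    forests[c] = RandomForestClassifier(
        n_estimators=100, random_state=0).fit(X_tr[mask], y_tr[mask])

# Prediction: route each test point to its nearest cluster's forest.
clusters = kmeans.predict(X_te)
y_pred = np.array([forests[c].predict(x.reshape(1, -1))[0]
                   for c, x in zip(clusters, X_te)])
print("accuracy:", (y_pred == y_te).mean())
```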
We conducted comprehensive experiments comparing the proposed clustering-based Random Forest approach with the traditional Random Forest algorithm on five benchmark datasets from the UCI Machine Learning Repository.
- Enhanced performance: improved accuracy over the traditional Random Forest
- Comprehensive evaluation: tested on five UCI benchmark datasets
- Scalability: suitable for Big Data applications
Key Findings: The proposed clustering-based Random Forest outperforms the traditional Random Forest algorithm on all five evaluated datasets and shows particular promise for large-scale data mining applications.
Published in IEEE Conference Proceedings