Ameliorating Performance of Random Forest using Data Clustering | Research for a novel approach for binary class classification problem


tinykishore/RFWOC

Ameliorating Performance of Random Forest using Data Clustering


Abstract

Random Forest is one of the most popular supervised ensemble learning methods in machine learning. It builds a set of randomized decision trees and classifies both known and unknown data instances by majority vote among the trees. Decision tree induction serves as the base classifier.
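The majority voting step described above can be sketched in a few lines. This is only an illustration: the names `majority_vote` and `forest_predict` are hypothetical, and the three "trees" are hand-written functions standing in for trained decision trees.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions."""
    # most_common(1) returns [(label, count)]; on a tie, the label that
    # was seen first wins.
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, instance):
    """Classify one instance: every tree votes, the majority label wins."""
    return majority_vote([tree(instance) for tree in trees])


# Three hypothetical "trees", written as plain functions for illustration.
trees = [
    lambda x: 1 if x["age"] > 30 else 0,
    lambda x: 1 if x["income"] > 50_000 else 0,
    lambda x: 0,
]

# The votes are 1, 1 and 0, so the forest predicts class 1.
prediction = forest_predict(trees, {"age": 42, "income": 64_000})
```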

Decision tree induction is a top-down, recursive divide-and-conquer algorithm that applies a feature selection measure to choose the best feature at each node, starting from the root. Common decision tree algorithms include:

  • ID3 (Iterative Dichotomiser 3)
  • C4.5 (an extension of ID3)
  • CART (Classification and Regression Tree)

Key Contribution: In this paper, we propose a new approach that improves the performance of the Random Forest classifier by using data clustering. The proposed idea can be applied to Big Data mining.
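For readers unfamiliar with the feature selection step used by the decision tree algorithms listed above, the sketch below computes two standard split measures. It relies on the usual textbook associations (information gain for ID3, the Gini index for CART), which this README does not state itself, and the function names and toy data are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy reduction from splitting the labels by a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy binary-class data: one feature separates the classes, one does not.
labels = ["yes", "yes", "no", "no"]
separating_feature = ["a", "a", "b", "b"]
uninformative_feature = ["a", "b", "a", "b"]

gain_separating = information_gain(labels, separating_feature)
gain_uninformative = information_gain(labels, uninformative_feature)
```

A tree builder would pick the feature with the higher gain (here, the separating one) as the split at the current node.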

Methodology

Our approach follows a systematic two-stage process:

Stage 1: Data Clustering
We partition the data into several clusters using the K-Means clustering algorithm, creating homogeneous subgroups within the dataset.
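As a rough illustration of this stage, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm). It is not the code used in the paper. The evenly spaced initial centroids are a simplification chosen to keep the example deterministic; practical implementations usually use random or k-means++ initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: deterministic, evenly spaced seeding for illustration only.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = None
    for _ in range(iterations):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments


# Two well-separated groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = k_means(points, k=2)
```

Each resulting cluster is then treated as its own training subset in Stage 2.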

Stage 2: Ensemble Classification
We apply the Random Forest technique independently to each cluster, leveraging the reduced complexity and improved data homogeneity within each cluster.
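The two stages can be put together as in the sketch below, which assumes scikit-learn and NumPy are available. The class name `ClusteredRandomForest`, its parameters, and the synthetic data are hypothetical. Routing each new instance to the forest of its nearest K-Means centroid is also an assumption, since this README does not describe how unseen instances are assigned to a cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


class ClusteredRandomForest:
    """Hypothetical wrapper: K-Means partitioning, then one forest per cluster."""

    def __init__(self, n_clusters=3, n_estimators=100, random_state=0):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                             random_state=random_state)
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.forests = {}

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.label_dtype_ = y.dtype
        # Stage 1: partition the training data into clusters.
        cluster_ids = self.kmeans.fit_predict(X)
        # Stage 2: train an independent Random Forest on each cluster.
        for c in np.unique(cluster_ids):
            mask = cluster_ids == c
            forest = RandomForestClassifier(n_estimators=self.n_estimators,
                                            random_state=self.random_state)
            self.forests[int(c)] = forest.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Assumption: a new instance goes to the forest of its nearest centroid.
        cluster_ids = self.kmeans.predict(X)
        predictions = np.empty(len(X), dtype=self.label_dtype_)
        for c in np.unique(cluster_ids):
            mask = cluster_ids == c
            predictions[mask] = self.forests[int(c)].predict(X[mask])
        return predictions


# Synthetic binary-class data, used only to show the calling convention.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = ClusteredRandomForest(n_clusters=3, n_estimators=25).fit(X, y)
predictions = model.predict(X[:10])
```

The number of clusters is a tuning choice in this sketch; a cluster that happens to contain a single class simply yields a forest that always predicts that class.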


System Architecture

Figure 1: Proposed System Architecture


Experimental Results

We ran experiments comparing our proposed clustering-based Random Forest approach with the traditional Random Forest algorithm. The evaluation used five benchmark datasets obtained from the UCI Machine Learning Repository.

  • Enhanced Performance: Improved accuracy over traditional RF
  • Comprehensive Evaluation: Tested on 5 UCI datasets
  • Scalability: Suitable for Big Data applications

Key Findings: Our proposed Random Forest technique achieved higher accuracy than the traditional Random Forest algorithm on all five evaluated datasets, and it shows particular promise for large-scale data mining applications.


Publication Details

Published in IEEE Conference Proceedings
