Random Forest is one of the most popular supervised ensemble learning methods in machine learning. It builds a set of randomized decision trees and classifies data instances, both seen during training and unseen, by majority voting among the trees; decision tree induction serves as its base classifier.
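As a rough illustration of this bagging-plus-majority-voting scheme (not the paper's code), the following Python sketch trains randomized trees on bootstrap samples and lets them vote; the synthetic dataset and the choice of 25 trees are arbitrary assumptions:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each tree sees a bootstrap sample; max_features="sqrt" gives the random
# per-split feature subsets characteristic of Random Forest.
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), len(X))  # sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

def majority_vote(X_new):
    # Collect one vote per tree, then take the most common class per sample.
    votes = np.array([t.predict(X_new) for t in trees])  # (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

print("ensemble:", majority_vote(X[:5]), " true:", y[:5])
```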
A decision tree is built by a top-down, divide-and-conquer recursive algorithm that applies a feature selection measure to choose the best (root) feature at each split. Widely used induction algorithms include the following (a sketch of the shared selection step appears after the list):
- ID3 (Iterative Dichotomiser 3)
- C4.5 (an extension of ID3)
- CART (Classification and Regression Tree)
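These algorithms differ mainly in the selection measure: ID3 uses information gain, C4.5 normalizes it into gain ratio, and CART uses Gini impurity. A minimal Python sketch of information gain on a made-up categorical dataset (the features and labels are purely illustrative):

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum p * log2(p) over the observed class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    # Gain = H(labels) minus the weighted entropy of each child partition.
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy dataset: two candidate features, one binary label.
outlook = np.array(["sun", "sun", "rain", "rain", "sun", "rain"])
windy   = np.array([0, 1, 0, 1, 0, 1])
play    = np.array([1, 0, 1, 0, 1, 1])

# The feature with the larger gain would be chosen as the root.
for name, f in [("outlook", outlook), ("windy", windy)]:
    print(name, "gain =", round(information_gain(f, play), 3))
```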
Our approach follows a systematic two-stage process:
Stage 1: Data Clustering
We partition the data into several clusters with the K-Means clustering algorithm, creating homogeneous subgroups within the dataset.
Stage 2: Ensemble Classification
We then apply the Random Forest technique independently to each cluster, exploiting the smaller size and greater homogeneity of the per-cluster data.
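A minimal sketch of the two-stage pipeline, with scikit-learn's KMeans and RandomForestClassifier standing in for the paper's implementation; the synthetic data, k = 3 clusters, and 100 trees per forest are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: partition the training data into homogeneous subgroups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)

# Stage 2: train one Random Forest per cluster.
forests = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    forests[c] = RandomForestClassifier(
        n_estimators=100, random_state=0).fit(X_tr[mask], y_tr[mask])

# Prediction: route each test point to its nearest cluster's forest.
clusters = kmeans.predict(X_te)
y_pred = np.array([forests[c].predict(x.reshape(1, -1))[0]
                   for c, x in zip(clusters, X_te)])
print("accuracy:", (y_pred == y_te).mean())
```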
We conducted comprehensive experiments comparing the proposed clustering-based Random Forest approach with the traditional Random Forest algorithm on five benchmark datasets from the UCI Machine Learning Repository.
- Enhanced performance: improved accuracy over the traditional Random Forest
- Comprehensive evaluation: tested on five UCI benchmark datasets
- Scalability: suitable for Big Data applications
Key Findings: The proposed clustering-based Random Forest outperforms the traditional Random Forest algorithm on all five evaluated datasets and shows particular promise for large-scale data mining applications.
Published in IEEE Conference Proceedings