Skip to content

abeeraisabeera/Parallel-Genetic-Algorithm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🧬 Parallel Genetic Algorithm for Feature Selection

Python 3.8+ License

An optimized machine learning framework that uses genetic algorithms with multiprocessing to automatically identify the most relevant features for classification tasks, reducing dimensionality while maintaining model accuracy.

🎯 Business Value

The Problem: High-dimensional datasets slow down model training and can lead to overfitting. Manual feature selection is time-consuming and suboptimal.

The Solution: This project automates feature selection using evolutionary computation, achieving:

  • ⚑ 2-3x faster execution through parallel processing
  • πŸ“‰ ~30% dimensionality reduction while maintaining accuracy
  • 🎯 90%+ classification accuracy on benchmark datasets
  • πŸ”„ Scalable to any classification problem with minimal code changes

πŸš€ Key Features

  • Evolutionary Optimization: Uses genetic algorithms to explore feature combinations intelligently
  • Parallel Processing: Leverages multiprocessing for significant performance gains
  • Production-Ready Code: Clean, modular, well-documented, and thoroughly tested
  • Comprehensive Visualization: Automated generation of performance comparison charts
  • Flexible Framework: Easy to adapt to different datasets and classifiers

πŸ“Š Results at a Glance

When tested on the Digits dataset (1,797 samples, 64 features, 10 classes):

Metric Sequential GA Parallel GA Improvement
Execution Time ~45s ~20s 2.25x faster
Features Selected 44/64 (69%) 42/64 (66%) 34% reduction
Classification Accuracy 94.7% 95.3% Maintained/Improved
Speedup Baseline 2.25x ↑ 125% faster

Results Visualization

πŸ› οΈ Technical Stack

  • Evolutionary Computing: DEAP (Distributed Evolutionary Algorithms in Python)
  • Machine Learning: scikit-learn (Random Forest Classifier)
  • Parallel Processing: Python multiprocessing
  • Data Visualization: Matplotlib, Seaborn
  • Scientific Computing: NumPy

πŸ“¦ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Quick Start

# Clone the repository
git clone https://github.com/yourusername/parallel-genetic-feature-selection.git
cd parallel-genetic-feature-selection

# Install dependencies
pip install -r requirements.txt

# Run the algorithm
python parallel_genetic_algorithm.py

πŸ’» Usage

Basic Usage

from parallel_genetic_algorithm import GeneticFeatureSelector, load_and_split_data

# Load your data
X_train, X_test, y_train, y_test = load_and_split_data()

# Initialize the selector
selector = GeneticFeatureSelector(
    population_size=50,
    num_generations=50,
    crossover_prob=0.5,
    mutation_prob=0.2
)

# Fit with parallel processing
selector.fit(X_train, X_test, y_train, y_test, parallel=True)

# Get selected features
best_features = selector.get_best_features()
print(f"Selected {len(best_features)} features: {best_features}")
print(f"Accuracy: {selector.best_accuracy_:.4f}")

Advanced Usage: Custom Dataset

from sklearn.model_selection import train_test_split

# Load your custom dataset
X, y = load_your_data()  # Your data loading function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Use the same selector
selector = GeneticFeatureSelector(population_size=100, num_generations=100)
selector.fit(X_train, X_test, y_train, y_test, parallel=True)

# Transform your data using selected features
X_train_selected = X_train[:, selector.get_best_features()]
X_test_selected = X_test[:, selector.get_best_features()]

🧠 How It Works

1. Initialization

  • Create a random population of feature subsets (binary encoding)
  • Each individual represents a potential solution

2. Fitness Evaluation

  • Train a Random Forest classifier on each feature subset
  • Evaluate accuracy on test data
  • Higher accuracy = higher fitness

3. Evolution

  • Selection: Tournament selection picks the fittest individuals
  • Crossover: Two-point crossover combines parent solutions
  • Mutation: Bit-flip mutation introduces diversity
  • Repeat for multiple generations

4. Parallel Processing

  • Fitness evaluations run concurrently across CPU cores
  • Significant speedup for computationally expensive evaluations

5. Result

  • Return the best feature subset found across all generations

πŸ“ˆ Performance Analysis

The parallel implementation provides measurable benefits:

Population Size: 50
Generations: 50
Dataset: Digits (1,797 samples Γ— 64 features)

Sequential: ~45 seconds
Parallel:   ~20 seconds
Speedup:    2.25x

Scalability: Speedup increases with larger populations and datasets

πŸ”§ Configuration

Key parameters you can tune:

Parameter Description Default Recommended Range
population_size Number of individuals per generation 50 30-200
num_generations Number of evolutionary iterations 50 30-100
crossover_prob Probability of recombination 0.5 0.4-0.8
mutation_prob Probability of random changes 0.2 0.1-0.3

πŸ“ Project Structure

parallel-genetic-feature-selection/
β”‚
β”œβ”€β”€ parallel_genetic_algorithm.py   # Main implementation
β”œβ”€β”€ requirements.txt                # Python dependencies
β”œβ”€β”€ README.md                       # This file
β”œβ”€β”€ results.png                     # Generated performance visualization
└── .gitignore                      # Git ignore rules

πŸŽ“ Learning Outcomes

This project demonstrates:

  • βœ… Optimization Algorithms: Practical application of genetic algorithms
  • βœ… Parallel Computing: Effective use of multiprocessing for performance
  • βœ… Machine Learning Pipeline: Data preprocessing, model training, evaluation
  • βœ… Software Engineering: Clean code, documentation, modularity, testing
  • βœ… Data Visualization: Clear communication of technical results

🚦 Future Enhancements

Potential improvements for production deployment:

  • Add cross-validation for more robust fitness evaluation
  • Implement multi-objective optimization (accuracy + feature count)
  • Support for regression tasks (not just classification)
  • Distributed computing support (Dask, Ray)
  • Hyperparameter auto-tuning
  • Integration with MLflow for experiment tracking
  • CLI interface for easy experimentation
  • Docker containerization

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ™ Acknowledgments

  • DEAP library for evolutionary algorithm framework
  • scikit-learn for machine learning utilities
  • UCI Machine Learning Repository for the Digits dataset

⭐ If you found this project helpful, please consider giving it a star!

About

Parallel Genetic Algorithm for Feature Selection uses evolutionary search with multiprocessing to select optimal features for classification. It improves model accuracy while reducing dimensionality, and benchmarks sequential vs parallel execution using a Random Forest fitness function for efficiency and performance gain analysis.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages