Commit 95d6d94

FEAT: Adding the basic structure of the genetic_algorithm.py
1 parent 273fe19 commit 95d6d94

1 file changed: +128 -0 lines changed

genetic_algorithm.py

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
"""
================================================================================
Genetic Algorithm Feature Selection & Analysis
================================================================================
Author : Breno Farias da Silva
Created : 2025-10-07
Description :
    This script runs a DEAP-based Genetic Algorithm (GA) to perform feature
    selection for classification problems. It provides an end-to-end pipeline:
    dataset loading and cleaning, scaling, GA setup and execution, candidate
    evaluation (with a Random Forest base estimator), and post-hoc analysis
    (RFE ranking correlation, CSV summaries and boxplot visualizations).

    Key features include:
        - DEAP-based GA for binary-mask feature selection
        - Fitness evaluation using a RandomForest, returning multiple metrics
          (accuracy, precision, recall, F1, FPR, FNR, elapsed time)
        - Population sweep support (run GA over a range of population sizes)
        - Integration with previously computed RFE rankings for cross-checking
        - Exports: best-subset text file, CSV summaries and per-feature boxplots
        - Progress bars via tqdm and safe filename handling for outputs
        - Cross-platform completion notification (optional sound)

Usage:
    1. Configure the dataset:
        - Edit the `csv_file` variable in the `main()` function to point to
          the CSV dataset you want to analyze (the script assumes the last
          column is the target and that only numeric features are used).
    2. Optionally tune GA parameters in `main()` or at the call sites:
        - n_generations, min_pop, max_pop, population_size, train_test_ratio
    3. Run the pipeline via the project's Makefile:
        $ make main
        (The Makefile is expected to set up the environment and dependencies, then execute this script.)
    NOTE:
        - If you prefer not to use the Makefile, you can run the module/script
          directly from Python in your dev environment, but the recommended
          workflow for the project is `make main`.

Outputs:
    - Feature_Analysis/Genetic_Algorithm_results.txt (best subset + RFE cross-info)
    - Feature_Analysis/<dataset>_feature_summary.csv (mean/std per class for selected features)
    - Feature_Analysis/<dataset>-<feature>.png (boxplots for top features)
    - Console summary of best subsets per population size (when sweeping)

TODOs:
    - Add CLI argument parsing (argparse) to avoid editing `main()` for different runs.
    - Add cross-validation or nested CV to make fitness evaluation more robust.
    - Support multi-objective optimization (e.g., F1 vs. model training time).
    - Parallelize individual evaluations (joblib / dask) to speed up GA fitness calls.
    - Save and version best individuals (pickle/JSON) and GA run metadata.
    - Implement reproducible seeding across DEAP, numpy, random and sklearn.
    - Add automatic handling of categorical features and missing-value imputation.
    - Add early stopping and convergence checks to the GA loop.
    - Produce a machine-readable summary (JSON) of final metrics and selected features.
    - Add unit tests for core functions (fitness evaluation, GA setup, I/O).

Dependencies:
    - Python >= 3.9
    - pandas, numpy, scikit-learn, deap, tqdm, matplotlib, seaborn, colorama

Assumptions & Notes:
    - Dataset format: CSV, last column = target. Only numeric features are used.
    - RFE results (if present) are read from `Feature_Analysis/RFE_results_RandomForestClassifier.txt`.
    - Sound notification is skipped on Windows by default.
    - The script uses RandomForestClassifier as the default evaluator; change as needed.
    - Inspect output directories (`Feature_Analysis/`) after runs for artifacts.
"""

import atexit # For playing a sound when the program finishes
import matplotlib.pyplot as plt # For plotting graphs
import numpy as np # For numerical operations
import os # For running a command in the terminal
import pandas as pd # For data manipulation
import platform # For getting the operating system name
import random # For random number generation
import re # For sanitizing filenames
import seaborn as sns # For enhanced plotting
import time # For measuring execution time
from colorama import Style # For coloring the terminal
from deap import base, creator, tools, algorithms # For the genetic algorithm
from tqdm import tqdm # For progress bars
from sklearn.ensemble import RandomForestClassifier # For the machine learning model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix # For model evaluation
from sklearn.model_selection import train_test_split # For splitting the dataset
from sklearn.preprocessing import StandardScaler # For feature scaling

# Macros:
class BackgroundColors: # Colors for the terminal
    CYAN = "\033[96m" # Cyan
    GREEN = "\033[92m" # Green
    YELLOW = "\033[93m" # Yellow
    RED = "\033[91m" # Red
    BOLD = "\033[1m" # Bold
    UNDERLINE = "\033[4m" # Underline
    CLEAR_TERMINAL = "\033[H\033[J" # Clear the terminal

# Execution Constants:
VERBOSE = False # Set to True to output verbose messages

# Sound Constants:
SOUND_COMMANDS = {"Darwin": "afplay", "Linux": "aplay", "Windows": "start"} # The commands to play a sound for each operating system
SOUND_FILE = "./.assets/Sounds/NotificationSound.wav" # The path to the sound file

# RUN_FUNCTIONS:
RUN_FUNCTIONS = {
    "Play Sound": True, # Set to True to play a sound when the program finishes
}

# Functions Definition

def main():
    """
    Main function.

    :param: None
    :return: None
    """

    pass

if __name__ == "__main__":
    """
    This is the standard boilerplate that calls the main() function.

    :return: None
    """

    main() # Call the main function
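
The commit defines `SOUND_COMMANDS`, `SOUND_FILE` and `RUN_FUNCTIONS` and imports `atexit`, but does not yet wire them together. The sketch below shows one way the completion notification could be registered. The helpers `build_sound_command` and `play_sound` are hypothetical names that do not appear in this commit, and the Windows skip simply mirrors the note in the module docstring.

```python
import atexit
import os
import platform

# Constants copied from the commit.
SOUND_COMMANDS = {"Darwin": "afplay", "Linux": "aplay", "Windows": "start"}
SOUND_FILE = "./.assets/Sounds/NotificationSound.wav"
RUN_FUNCTIONS = {"Play Sound": True}


def build_sound_command(system, sound_file=SOUND_FILE):
    """Return the shell command for `system`, or None when no sound should play.

    Hypothetical helper, not part of the commit.
    """
    if system == "Windows":  # Assumption: mirrors "skipped on Windows by default"
        return None
    player = SOUND_COMMANDS.get(system)
    if player is None:  # Unknown operating system: nothing to run
        return None
    return f'{player} "{sound_file}"'


def play_sound():
    """Hypothetical exit hook: play the notification sound if the file exists."""
    command = build_sound_command(platform.system())
    if command is not None and os.path.exists(SOUND_FILE):
        os.system(command)


if RUN_FUNCTIONS["Play Sound"]:
    atexit.register(play_sound)  # Runs when the interpreter exits normally
```

Keeping the command construction separate from the side effect makes the operating-system branching easy to check without actually playing audio.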

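The docstring describes individuals as binary masks over the feature columns and a fitness function that reports accuracy, precision, recall, F1, FPR and FNR. As a rough, dependency-free illustration of those two ideas, the sketch below applies a mask to a feature row and derives the metrics from binary labels. It is not the script's evaluator (the commit imports scikit-learn's metric functions for that), and `apply_mask` and `binary_metrics` are hypothetical names.

```python
def apply_mask(row, mask):
    """Keep only the feature values whose mask bit is 1."""
    return [value for value, bit in zip(row, mask) if bit == 1]


def binary_metrics(y_true, y_pred):
    """Compute the metrics named in the docstring for binary labels.

    Assumes 1 marks the positive class and 0 the negative class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # True positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # True negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # False positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # False negatives

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),  # False positive rate
        "fnr": ratio(fn, fn + tp),  # False negative rate
    }
```

In a GA run, each individual's mask would select the columns passed to the classifier, and a dictionary like this one (plus elapsed time) would be turned into the fitness value.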