A comprehensive machine learning project that predicts electrocatalyst performance using Random Forest classification with advanced feature engineering and permutation feature importance analysis.
This project develops a machine learning pipeline to classify electrocatalyst quality based on overpotential measurements at 50.0 mA/cm². By analyzing 500 electrochemical experiments with 11 compositional and operational features, the model predicts whether catalysts are "good" (|η| < 0.6 V) or "bad" (|η| ≥ 0.6 V).
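The good/bad labeling rule can be sketched as a simple threshold on the measured overpotential (the values below are synthetic, for illustration only):

```python
import numpy as np

# Hypothetical overpotential measurements (V) at 50.0 mA/cm^2
eta = np.array([-0.45, 0.72, 0.55, -0.81])

# "good" if |eta| < 0.6 V, "bad" otherwise (1 = good, 0 = bad)
labels = (np.abs(eta) < 0.6).astype(int)
print(labels)  # [1 0 1 0]
```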
Key Achievement: Built a Random Forest classifier achieving 70.6% precision on test data, enabling reliable catalyst quality screening and reducing the need for expensive experimental validation.
- Cost Reduction: Predicts catalyst quality before expensive electrochemical testing
- Accelerated R&D: Identifies key compositional factors (Co, Se, Ni) for catalyst optimization
- Data-Driven Design: Provides actionable insights for next-generation catalyst development
- Python 3.12 | scikit-learn | pandas | NumPy
- Matplotlib | Seaborn | PCA | Random Forest
- ✅ High-dimensional data visualization (PCA, correlation analysis)
- ✅ Hyperparameter tuning with stratified validation
- ✅ Permutation Feature Importance (PFI) analysis
- ✅ Model evaluation with imbalanced datasets
- ✅ Scale-invariance validation for tree-based models
- Exploratory Data Analysis → Visualized 11D feature space using PCA and correlation heatmaps
- Feature Engineering → Identified top correlated features and non-linear patterns
- Model Development → Trained and optimized Random Forest with 5 hyperparameter configurations
- Model Interpretation → Validated feature importance with/without normalization
- Performance Analysis → Evaluated precision-recall trade-offs for imbalanced classes
- Source: Electrochemical experiments dataset
- Size: 500 samples
- Features: 11 compositional and operational parameters
- Compositional: V, Cr, Mg, Fe, Co, Ni, Cu, S, Se, P
- Operational: Voltage, Time
- Target: Overpotential η at 50.0 mA/cm²
- Class Distribution: 33% good / 67% bad (moderate imbalance)
| Metric | Training | Validation | Test |
|---|---|---|---|
| Precision | 1.000 | 0.667 | 0.706 |
| Recall | 1.000 | 0.480 | 0.500 |
| F1 Score | 1.000 | 0.558 | 0.585 |
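The metrics in the table come from scikit-learn's standard scorers. A minimal sketch (the toy labels are illustrative, not the project's actual predictions):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth vs. predictions (1 = good catalyst)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```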
- Cobalt (Co) - Importance: 0.211 ± 0.079
- Selenium (Se) - Importance: 0.118 ± 0.076
- Nickel (Ni) - Importance: 0.078 ± 0.053
- Vanadium (V) - Importance: 0.049 ± 0.062
- Magnesium (Mg) - Importance: 0.044 ± 0.026
Insight: Transition metals (Co, Ni) and chalcogens (Se) dominate catalyst performance—aligning with electrochemistry domain knowledge.
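A minimal sketch of the PFI computation, using a synthetic stand-in for the 11-feature dataset (the feature count mirrors the project; everything else is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 11-feature catalyst dataset
X, y = make_classification(n_samples=500, n_features=11, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature column in turn and measure the drop in held-out F1
result = permutation_importance(rf, X_te, y_te, scoring="f1", n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most to least important
```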
Tested 5 Random Forest configurations, achieving best validation F1 (0.565) with:
`n_estimators=100, max_depth=20, min_samples_split=5, min_samples_leaf=2`
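The tuning loop can be sketched as a small manual search over candidate configurations (synthetic data; only the winning configuration is taken from the project, the second candidate is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, stratify=y, random_state=0)

# Candidate hyperparameter configurations to compare on the validation set
configs = [
    {"n_estimators": 100, "max_depth": 20, "min_samples_split": 5, "min_samples_leaf": 2},
    {"n_estimators": 50, "max_depth": 10, "min_samples_split": 2, "min_samples_leaf": 1},
]

# Pick the configuration with the highest validation F1
best = max(
    configs,
    key=lambda p: f1_score(
        y_val, RandomForestClassifier(**p, random_state=0).fit(X_tr, y_tr).predict(X_val)
    ),
)
```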
```
.
├── homework1_analysis.ipynb     # Complete analysis pipeline (32 cells)
├── ExerciseData.csv             # Dataset (500 experiments)
├── Homework1_Documentation.md   # Detailed technical report
└── README.md                    # This file
```
- PCA Analysis: 2D projection explains only 24.4% of the variance → the feature space is genuinely high-dimensional
- Correlation Heatmap: Identified Co (-0.393), Se (+0.352), V (+0.316) as top correlates
- Pairplots: Visualized non-linear class separability across top features
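The PCA projection can be sketched as follows (random stand-in data with the project's 11 features; on the real dataset the 2D projection explains 24.4% of the variance):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))  # stand-in for the 11 compositional/operational features

# Standardize first so each feature contributes equally to the components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
X_2d = pca.transform(X_std)
explained = pca.explained_variance_ratio_.sum()
```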
- 70% Training (350 samples) - Model learning
- 15% Validation (75 samples) - Hyperparameter selection
- 15% Test (75 samples) - Unbiased evaluation
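The 70/15/15 stratified split can be reproduced with two chained `train_test_split` calls (the class balance below mirrors the dataset's ~33/67 split; the features are dummies):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(500).reshape(-1, 1)        # dummy feature matrix, 500 samples
y = np.array([0] * 335 + [1] * 165)      # ~67% bad / 33% good, as in the dataset

# First carve out 70% for training, then split the remaining 30% in half
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)
```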
Ran PFI analysis with and without normalization to validate Random Forest's scale-invariance:
- ✅ Identical top-5 rankings in both scenarios
- ✅ Confirms importance stems from information gain, not feature magnitude
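A sketch of this scale-invariance check on synthetic data: run PFI on raw and standardized copies of the same features and compare rankings. Because tree splits depend only on feature ordering, not magnitude, the rankings should coincide:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

def pfi_ranking(features):
    """Fit a forest and return features ordered by permutation importance."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, y)
    imp = permutation_importance(rf, features, y, n_repeats=5, random_state=0)
    return np.argsort(imp.importances_mean)[::-1]

rank_raw = pfi_ranking(X)
rank_scaled = pfi_ranking(X_scaled)
# Standardization is a monotone per-feature transform, so the learned splits
# (and hence the importance ranking) are unchanged.
```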
Model Selection Rationale: Random Forest outperforms linear models due to:
- PCA's low explained variance (35% in 3D) → non-linear relationships dominate
- Heavy class intermixing in PCA space → complex decision boundaries required
Precision-Recall Trade-off: Model prioritizes precision (0.71) over recall (0.50):
- Conservative predictions minimize false positives
- Suitable for screening applications where experimental validation follows
Domain Alignment: Top features (Co, Se, Ni) match electrochemistry literature:
- Transition metals enable variable oxidation states for redox reactions
- Chalcogens modulate electronic structure and conductivity
```shell
pip install numpy pandas scikit-learn matplotlib seaborn jupyter
jupyter notebook homework1_analysis.ipynb
```

The notebook runs end-to-end in ~30 seconds on modern hardware.
The project includes 6 comprehensive visualizations:
- Overpotential Distribution - Histogram with good/bad threshold
- Correlation Heatmap - Feature-target relationships
- Pairwise Scatter Plots - Top 5 features colored by quality
- 2D PCA Projection - Class separation in reduced space
- 3D PCA Projection - Multi-dimensional class structure
- PFI Comparison - Normalized vs. non-normalized feature importance
This project demonstrates:
- ✅ End-to-end ML pipeline design and implementation
- ✅ Imbalanced data handling with stratified sampling and F1 optimization
- ✅ Model interpretability through feature importance and domain validation
- ✅ Scientific rigor via normalization studies and cross-validation
- ✅ Professional documentation and reproducible research practices
- Implement SHAP values for instance-level interpretability
- Test gradient boosting methods (XGBoost, LightGBM, CatBoost)
- Engineer interaction features (e.g., Co×Se, Ni×Voltage)
- Perform threshold optimization for precision-recall balance
- Deploy model as REST API for real-time predictions
Abel Saj
- GitHub: @abelsaj
- LinkedIn: Connect with me