Species Mapping Project

Overview

This project analyzes species distribution data across different geographic locations, combining environmental features to predict:

  1. Classification Task: Which species is present at a given location
  2. Regression Task: How many species observations occur at grid locations

Dataset

  • File: species_with_country_final.csv
  • Total Observations: ~270,000 (after cleaning)
  • Unique Species: 500
  • Geographic Coverage: Global (multiple countries)
  • Features:
    • Geographic: latitude, longitude, country
    • Environmental: temperature (avg/max/min), precipitation, solar radiation, water vapor pressure, wind speed
    • Target: species_id, species_count (aggregated)

Project Structure

📁 Main Analysis Scripts

1. species_pipeline_augmented.ipynb

Purpose: Complete ML pipeline for species prediction (both classification and regression)

Key Features:

  • Handles class imbalance using class_weight='balanced'
  • Log transformation for skewed species counts
  • Models tested:
    • Logistic Regression
    • Random Forest (best performer)
    • XGBoost
  • Cross-validation for robust evaluation
  • Comprehensive visualizations
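
As an illustration, the key pieces above (balanced class weights, a stratified split, cross-validation) could be wired together roughly as follows. This is a minimal sketch, not the notebook's exact code; the feature column names are placeholders.

# Minimal sketch of the classification pipeline; column names are hypothetical
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv("species_with_country_final.csv")
feature_cols = ["latitude", "longitude", "tavg", "tmax", "tmin", "prec", "srad", "vapr", "wind"]  # placeholders
X, y = df[feature_cols], df["species_id"]

# Stratified split keeps the 500-class proportions; balanced weights counter the imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy :", clf.score(X_test, y_test))
print("5-fold CV acc :", cross_val_score(clf, X_train, y_train, cv=5).mean())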

Results (Global Data):

  • Classification: Random Forest achieves ~33% accuracy (challenging due to 500 classes and imbalance)
  • Regression: Random Forest R² = 0.13 (species count prediction)

Output Directory: plots_augmented/


2. species_pipeline_augmented_USA.ipynb ⭐ NEW

Purpose: Same analysis as above but filtered to USA data only

Key Differences:

  • Data filtered to country == 'United States of America'
  • ~86,000 USA observations (32% of total dataset)
  • 116 unique species in USA
  • No country feature in model (single country)
  • Focused geographic analysis
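
The filtering step itself is a one-liner; a sketch, assuming the country column is spelled as above:

# Sketch of the USA filter on species_with_country_final.csv
import pandas as pd

df = pd.read_csv("species_with_country_final.csv")
usa = df[df["country"] == "United States of America"].copy()
print(f"{len(usa):,} USA observations, {usa['species_id'].nunique()} unique species")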

Why USA Only?

  • Homogeneous geographic region
  • Potentially better model performance
  • Regional species patterns
  • Easier to interpret results

Output Directory: plots_USA/


3. species_imbalance_skewness_analysis.ipynb ⭐ NEW

Purpose: Deep dive into data quality issues affecting model performance

What It Analyzes:

A. Class Imbalance (Classification Problem)
  • Imbalance Ratio: 40:1 (most common vs. least common species)
  • Data Concentration: Top 169 species (34%) contain 80% of all data
  • Rarity Categories:
    • Very Rare (<50 samples): 0% of species
    • Rare (50-200): 46% of species
    • Common (200-500): 23%
    • Very Common (500-1000): 20%
    • Abundant (1000+): 11%

Visualizations:

  • Comprehensive 9-panel class imbalance analysis
  • Cumulative distribution (Pareto principle)
  • Top/bottom species frequency
  • Category breakdowns

B. Skewness (Regression Problem)
  • Skewness: 25.93 (extremely right-skewed!)
  • Distribution:
    • Mean: 4.79 species/location
    • Median: 1.00 species/location (huge gap!)
  • 98.8% of locations have <50 species
  • Only 1.2% have 200+ species

Transformations Tested:

  • Original: Skewness = 25.93
  • Log(1+x): Skewness = 2.13 ✓ (best)
  • Sqrt(x): Skewness = 8.71
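
A minimal sketch of this comparison, assuming the counts come from a simple 1-degree grid aggregation (the notebook's actual grid definition may differ):

# Sketch: compare skewness of the raw, log1p, and sqrt-transformed counts
import numpy as np
import pandas as pd
from scipy.stats import skew

df = pd.read_csv("species_with_country_final.csv")
# Hypothetical grid aggregation: round coordinates to 1-degree cells and count observations
counts = df.groupby([df["latitude"].round(), df["longitude"].round()]).size().to_numpy()

print("Original skewness:", skew(counts))
print("log1p skewness   :", skew(np.log1p(counts)))
print("sqrt skewness    :", skew(np.sqrt(counts)))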

Visualizations:

  • Comprehensive 9-panel skewness analysis
  • Q-Q plots (original vs. transformed)
  • CDF and percentile analysis
  • Transformation comparisons

Output Directory: imbalance_analysis/

Key Findings:

  • ⚠️ Severe class imbalance makes classification very challenging
  • ⚠️ Extreme skewness requires log transformation for regression
  • ✓ Tree-based models (RF, XGBoost) handle these issues better than linear models

4. species_pipeline_deep_learning.ipynb ⭐ NEW

Purpose: Neural network approach as an alternative to traditional ML

Question Addressed: Can deep learning improve performance over Random Forest and XGBoost?

Architecture:

Classification Model
Input (features) → Dense(512) → BatchNorm → Dropout(0.3)
                 → Dense(256) → BatchNorm → Dropout(0.3)
                 → Dense(128) → BatchNorm → Dropout(0.3)
                 → Dense(500, softmax) → Output (species)

Regression Model
Input (features) → Dense(256) → BatchNorm → Dropout(0.3)
                 → Dense(128) → BatchNorm → Dropout(0.3)
                 → Dense(64)  → BatchNorm → Dropout(0.3)
                 → Dense(1, linear) → Output (count)

Deep Learning Techniques Used:

  • Batch Normalization: Stabilizes training
  • Dropout: Prevents overfitting
  • Class Weights: Handles imbalance
  • Early Stopping: Prevents overfitting
  • Learning Rate Scheduling: Adaptive learning
  • Log Transformation: Handles skewness
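
A sketch of how the classification architecture and the techniques above translate into Keras code. Hyperparameters are illustrative, and class_weight_dict is assumed to come from the imbalance-handling step rather than shown here.

# Sketch of the classification network; hyperparameters are illustrative
from tensorflow.keras import layers, models, callbacks

def build_classifier(n_features, n_classes=500):
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(512, activation="relu"), layers.BatchNormalization(), layers.Dropout(0.3),
        layers.Dense(256, activation="relu"), layers.BatchNormalization(), layers.Dropout(0.3),
        layers.Dense(128, activation="relu"), layers.BatchNormalization(), layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping, learning-rate scheduling, and checkpointing as listed above
cbs = [
    callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(factor=0.5, patience=5),
    callbacks.ModelCheckpoint("best_cls_model.h5", save_best_only=True),
]
# model.fit(X_train, y_train, validation_split=0.1, epochs=100, batch_size=256,
#           class_weight=class_weight_dict, callbacks=cbs)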

Expected Performance:

  • Neural networks typically excel with:

    • Very large datasets (millions of samples)
    • Complex feature interactions
    • Unstructured data (images, text)
  • For this tabular dataset with moderate size:

    • Random Forest likely remains competitive
    • Deep learning may not significantly outperform
    • Tree models are naturally suited for tabular data

Output Directory: plots_deep_learning/

Models Saved:

  • best_cls_model.h5 (classification)
  • best_reg_model.h5 (regression)
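
The saved models can be reloaded later for inference, for example:

# Sketch: reloading the saved Keras models
from tensorflow.keras.models import load_model

cls_model = load_model("best_cls_model.h5")
reg_model = load_model("best_reg_model.h5")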

5. detailed_analysis_and_viz.ipynb (Reference)

Purpose: Original exploratory data analysis (EDA)

Contains:

  • Interactive visualizations (Plotly)
  • Geographic distribution maps
  • Country-level analysis
  • Temperature vs. diversity patterns

Output Directory: detailed_analysis/


Key Challenges & Solutions

Challenge 1: Class Imbalance

Problem: Species distribution is highly imbalanced (40:1 ratio)

Solutions Implemented:

  • ✓ class_weight='balanced' in scikit-learn models
  • ✓ Stratified sampling (maintains class proportions)
  • ✓ Weighted loss in neural networks
  • ✓ Focus on weighted F1-score, not just accuracy

Not Implemented (computationally expensive):

  • SMOTE (Synthetic Minority Over-sampling)
  • Data augmentation
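
Continuing the pipeline sketch above, the implemented pieces (per-class weights for the neural network's weighted loss, weighted-F1 evaluation) might look roughly like this:

# Sketch: derive per-class weights and score with weighted F1
# (assumes X_test, y_train, y_test, and a fitted classifier `clf` from the earlier sketch)
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import f1_score

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, weights))   # can be passed to Keras model.fit(class_weight=...)

print("Weighted F1:", f1_score(y_test, clf.predict(X_test), average="weighted"))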

Challenge 2: Extreme Skewness

Problem: Species counts are extremely right-skewed (skewness = 25.93)

Solutions Implemented:

  • ✓ Log1p transformation: y_transformed = log(1 + y)
  • ✓ Inverse transform for predictions: y_pred = exp(y_pred_log) - 1
  • ✓ Tree-based models (robust to skewness)

Why Log1p:

  • Handles zero values (log(0) is undefined, but log(1+0) = 0)
  • Reduces skewness from 25.93 → 2.13
  • Allows use of MSE/RMSE metrics
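
A sketch of the round trip, assuming X_train/X_test and a species-count target from the grid aggregation:

# Sketch: train on log1p counts, convert predictions back with expm1
import numpy as np
from sklearn.ensemble import RandomForestRegressor

y_train_log = np.log1p(y_train_counts)          # log(1 + y): defined at zero, tames the skew
reg = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
reg.fit(X_train, y_train_log)

y_pred = np.expm1(reg.predict(X_test))          # exp(x) - 1: back to the original count scale
y_pred = np.clip(y_pred, 0, None)               # counts cannot be negative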

Challenge 3: High Dimensionality

Problem: 500 species classes + multiple countries

Solutions Implemented:

  • ✓ One-hot encoding for country features
  • ✓ StandardScaler for numerical features
  • ✓ Dimensionality reduction via feature selection
  • ✓ Ensemble methods (reduce overfitting)
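
A sketch of that preprocessing with scikit-learn; the column names are placeholders, not the dataset's exact names:

# Sketch: one-hot encode country, standardize numeric features
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["latitude", "longitude", "tavg", "tmax", "tmin", "prec", "srad", "vapr", "wind"]  # placeholders
categorical_cols = ["country"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_processed = preprocess.fit_transform(df[numeric_cols + categorical_cols])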

Model Comparison

Classification Results (Global Data)

| Model | Test Accuracy | CV Accuracy | Training Time | Notes |
| --- | --- | --- | --- | --- |
| Logistic Regression | 0.162 | 0.161 | ~22 min | Linear model struggles |
| Random Forest | 0.331 | 0.327 | ~3 min | Best performer |
| XGBoost | 0.097 | 0.084 | ~10 min | Underperforms unexpectedly |
| Neural Network (MLP) | TBD | TBD | TBD | See deep learning notebook |

Why Random Forest Wins:

  • Handles class imbalance well with class_weight='balanced'
  • Robust to feature scaling
  • Can model complex non-linear relationships
  • Less prone to overfitting than XGBoost on this dataset

Regression Results (Global Data)

| Model | Test R² | CV R² | RMSE | Training Time | Notes |
| --- | --- | --- | --- | --- | --- |
| Linear Regression | -0.011 | 0.051 | 16.18 | ~1s | Fails on skewed data |
| Random Forest | 0.126 | 0.224 | 15.05 | ~38s | Best performer |
| XGBoost | 0.076 | 0.183 | 15.47 | ~2s | Decent performance |
| Neural Network (MLP) | TBD | TBD | TBD | TBD | See deep learning notebook |

Why Random Forest Wins:

  • Robust to outliers and skewness
  • Handles spatial patterns well
  • No need for complex feature engineering
  • Fast training on moderate-sized datasets

Can Deep Learning Help?

Expected Scenarios:

✅ Deep Learning May Help If:

  1. Very large dataset (millions of samples)
    • Current: ~270k samples → moderate size
  2. Complex feature interactions (non-linear, high-order)
    • Current: mostly environmental features (temperature, precipitation)
  3. Unstructured data (images, text, time series)
    • Current: tabular data
  4. Need for transfer learning
    • Current: no pre-trained models applicable

❌ Deep Learning May NOT Help Because:

  1. Tabular data: Tree models (RF, XGBoost) are state-of-the-art
  2. Moderate dataset size: Not enough data to train deep networks effectively
  3. High class imbalance: Neural networks struggle more than tree models
  4. Interpretability: Tree models provide feature importance easily

Verdict:

Traditional ML (Random Forest) likely remains the best choice for this specific dataset. However, the deep learning notebook is provided as:

  • An alternative approach to explore
  • A learning exercise in neural network architectures
  • A baseline for future improvements (e.g., with more data)

How to Use This Project

1. Run the Notebooks in Order

For Complete Analysis:

1. detailed_analysis_and_viz.ipynb         # EDA (optional)
2. species_imbalance_skewness_analysis.ipynb  # Understand data issues
3. species_pipeline_augmented.ipynb        # Train ML models (global)
4. species_pipeline_augmented_USA.ipynb    # Train ML models (USA only)
5. species_pipeline_deep_learning.ipynb    # Try deep learning (optional)

For Quick Results:

1. species_pipeline_augmented.ipynb        # Global ML models
OR
2. species_pipeline_augmented_USA.ipynb    # USA-only ML models

2. Dependencies

# Core libraries
pip install pandas numpy matplotlib seaborn scikit-learn

# Advanced ML
pip install xgboost

# Deep learning (optional)
pip install tensorflow

# Interactive visualizations (optional)
pip install plotly

3. Expected Outputs

Each notebook generates:

  • PNG plots (static visualizations)
  • HTML dashboards (interactive, where applicable)
  • TXT reports (model performance summaries)
  • CSV summaries (statistics and comparisons)

Output Directories:

  • plots_augmented/ - Global ML results
  • plots_USA/ - USA-only ML results
  • imbalance_analysis/ - Data quality analysis
  • detailed_analysis/ - EDA outputs
  • plots_deep_learning/ - Neural network results

Key Insights

1. Data Quality Issues Are Major Challenges

  • Class imbalance (40:1 ratio) severely limits classification accuracy
  • Extreme skewness (25.93) makes regression difficult
  • Top 34% of species contain 80% of data

2. Random Forest Is the Best Traditional ML Model

  • Classification: 33% accuracy (given 500 classes and imbalance, this is decent)
  • Regression: R² = 0.13 (moderate performance)
  • Robust to imbalance and skewness

3. Feature Engineering Could Help

  • Spatial features (k-NN neighbors, distance to biodiversity hotspots)
  • Temporal features (season, year)
  • Interaction terms (temperature × precipitation)
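
Two of these ideas, sketched with hypothetical column names (a neighbour-count feature should be computed on training data only to avoid leakage):

# Sketch: an interaction term and a simple spatial-context feature
from sklearn.neighbors import NearestNeighbors

df["temp_x_precip"] = df["tavg"] * df["prec"]         # temperature × precipitation (placeholder names)

coords = df[["latitude", "longitude"]].to_numpy()
nn = NearestNeighbors(n_neighbors=6).fit(coords)
_, idx = nn.kneighbors(coords)                        # first neighbour of each point is the point itself
df["neighbour_mean_count"] = df["species_count"].to_numpy()[idx[:, 1:]].mean(axis=1)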

4. More Data Would Help

  • Current: 50-2000 samples per species
  • Ideal: 1000+ samples per species for rare classes
  • Additional features: soil type, elevation, habitat characteristics

Recommendations for Improvement

For Classification:

  1. Focus on top N species (e.g., top 100 with most data)
    • Reduces classes, improves accuracy
  2. Hierarchical classification (family → genus → species)
    • Break down the problem into stages
  3. Ensemble methods (stacking, blending)
    • Combine multiple models
  4. More features (habitat type, ecosystem)
    • Richer information for predictions

For Regression:

  1. Add spatial features (neighboring cell counts)
    • Species cluster geographically
  2. Quantile regression (instead of mean prediction)
    • Better handles outliers
  3. Time series analysis (if temporal data available)
    • Seasonal patterns in species counts
  4. Geographically weighted regression
    • Local models for different regions

File Descriptions

Notebooks (.ipynb)

  • species_pipeline_augmented.ipynb - Main ML pipeline (global)
  • species_pipeline_augmented_USA.ipynb - ML pipeline (USA only) ⭐ NEW
  • species_imbalance_skewness_analysis.ipynb - Data quality analysis ⭐ NEW
  • species_pipeline_deep_learning.ipynb - Neural network approach ⭐ NEW
  • detailed_analysis_and_viz.ipynb - EDA and visualizations
  • species_pipeline_augmented_latest.ipynb - Alternative version
  • species_pipeline_original.ipynb - Original baseline

Data Files

  • species_with_country_final.csv - Main dataset (with country info)
  • species_train.npz - Training data (numpy format)
  • species_test.npz - Test data (numpy format)

Output Directories

  • plots_augmented/ - Global ML results
  • plots_USA/ - USA-only results ⭐ NEW
  • imbalance_analysis/ - Data issue analysis ⭐ NEW
  • plots_deep_learning/ - Neural network results ⭐ NEW
  • detailed_analysis/ - EDA outputs

Frequently Asked Questions

Q1: Why is classification accuracy only 33%?

A: With 500 species classes and severe imbalance (40:1), this is actually reasonable:

  • Random guessing would give 0.2% accuracy (1/500)
  • 33% means the model is learning meaningful patterns
  • Focus on top-k accuracy (top-5 or top-10) for a more informative picture of performance
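
Top-k accuracy is available directly in scikit-learn; a sketch, assuming the fitted classifier from the pipeline sketch earlier:

# Sketch: top-k accuracy for a 500-class problem
from sklearn.metrics import top_k_accuracy_score

proba = clf.predict_proba(X_test)
print("Top-5 accuracy :", top_k_accuracy_score(y_test, proba, k=5, labels=clf.classes_))
print("Top-10 accuracy:", top_k_accuracy_score(y_test, proba, k=10, labels=clf.classes_))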

Q2: Why use log transformation for regression?

A: The target (species count) is extremely skewed:

  • Mean (4.79) >> Median (1.00)
  • Skewness = 25.93 (very high)
  • Log1p reduces this to 2.13, enabling better model training

Q3: Should I use deep learning or Random Forest?

A: For this dataset, Random Forest is recommended because:

  • Tabular data (tree models excel here)
  • Moderate dataset size (not big enough for deep learning)
  • Class imbalance (trees handle better)
  • Faster training and better interpretability

However, try deep learning if:

  • You have more data (millions of samples)
  • You want to learn neural network techniques
  • You can leverage GPUs for faster training

Q4: Why analyze USA data separately?

A: Several reasons:

  1. Homogeneous region: Similar climate/geography patterns
  2. Data concentration: USA has 32% of all observations
  3. Simpler model: No need for country feature
  4. Better performance: Potentially higher accuracy in focused region

Q5: Can I apply this to other species datasets?

A: Yes! The methodology is generalizable:

  1. Handle class imbalance (class weights, stratified sampling)
  2. Transform skewed targets (log1p)
  3. Use tree-based models (Random Forest, XGBoost)
  4. Evaluate with appropriate metrics (weighted F1, R²)

Citation

If you use this project, please cite:

Species Mapping Project
Applied Machine Learning Mini Project
November 2025

Contact

For questions or issues, please refer to the notebook comments or reach out via the project repository.


Version History

  • v1.0 (Nov 2025) - Initial release with augmented pipeline
  • v2.0 (Nov 2025) - Added USA-specific analysis, imbalance/skewness analysis, and deep learning approach

Happy Modeling! 🌿🐾
