This project analyzes species distribution data across different geographic locations, combining environmental features to predict:
- Classification Task: Which species is present at a given location
- Regression Task: How many species observations occur at grid locations
- File: `species_with_country_final.csv`
- Total Observations: ~270,000 (after cleaning)
- Unique Species: 500
- Geographic Coverage: Global (multiple countries)
- Features:
  - Geographic: latitude, longitude, country
  - Environmental: temperature (avg/max/min), precipitation, solar radiation, water vapor pressure, wind speed
- Targets: species_id (classification), species_count (regression; aggregated per grid location)
Purpose: Complete ML pipeline for species prediction (both classification and regression)
Key Features:
- Handles class imbalance using `class_weight='balanced'`
- Log transformation for skewed species counts
- Models tested:
- Logistic Regression
- Random Forest (best performer)
- XGBoost
- Cross-validation for robust evaluation
- Comprehensive visualizations
Results (Global Data):
- Classification: Random Forest achieves ~33% accuracy (challenging due to 500 classes and imbalance)
- Regression: Random Forest R² = 0.13 (species count prediction)
Output Directory: plots_augmented/
Purpose: Same analysis as above but filtered to USA data only
Key Differences:
- Data filtered to `country == 'United States of America'`
- ~86,000 USA observations (32% of total dataset)
- 116 unique species in USA
- No country feature in model (single country)
- Focused geographic analysis
Why USA Only?
- Homogeneous geographic region
- Potentially better model performance
- Regional species patterns
- Easier to interpret results
Output Directory: plots_USA/
Purpose: Deep dive into data quality issues affecting model performance
What It Analyzes:
- Imbalance Ratio: 40:1 (most common vs. least common species)
- Data Concentration: Top 169 species (34%) contain 80% of all data
- Rarity Categories (see the binning sketch below):
  - Very Rare (<50 samples): 0% of species
  - Rare (50-200): 46% of species
  - Common (200-500): 23%
  - Very Common (500-1000): 20%
  - Abundant (1000+): 11%
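As a rough illustration, this binning can be reproduced with `pandas.cut`; the DataFrame and column names here (`df`, `species_id`) are assumptions, not the notebook's exact code:

```python
import pandas as pd

# Count how many observations each species has (assumes a `species_id` column)
samples_per_species = df["species_id"].value_counts()

# Bin species into the rarity categories above (bin edges are right-inclusive)
rarity = pd.cut(
    samples_per_species,
    bins=[0, 50, 200, 500, 1000, float("inf")],
    labels=["Very Rare", "Rare", "Common", "Very Common", "Abundant"],
)

# Share of species in each category
print(rarity.value_counts(normalize=True).round(2))
```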
Visualizations:
- Comprehensive 9-panel class imbalance analysis
- Cumulative distribution (Pareto principle)
- Top/bottom species frequency
- Category breakdowns
Species Count Skewness:
- Skewness: 25.93 (extremely right-skewed!)
- Distribution:
  - Mean: 4.79 species/location
  - Median: 1.00 species/location (a huge mean-median gap)
  - 98.8% of locations have <50 species
  - Only 1.2% have 200+ species
Transformations Tested:
- Original: Skewness = 25.93
- Log(1+x): Skewness = 2.13 ✓ (best)
- Sqrt(x): Skewness = 8.71
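A minimal sketch of this comparison, assuming `counts` holds the per-location species counts (the variable name is illustrative):

```python
import numpy as np
from scipy.stats import skew

# counts: 1-D array of species counts per grid location (assumed to exist)
for name, values in [
    ("Original", counts),
    ("Log(1+x)", np.log1p(counts)),
    ("Sqrt(x)", np.sqrt(counts)),
]:
    print(f"{name}: skewness = {skew(values):.2f}")
```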
Visualizations:
- Comprehensive 9-panel skewness analysis
- Q-Q plots (original vs. transformed)
- CDF and percentile analysis
- Transformation comparisons
Output Directory: imbalance_analysis/
Key Findings:
- ⚠️ Severe class imbalance makes classification very challenging
- ⚠️ Extreme skewness requires a log transformation for regression
- ✓ Tree-based models (RF, XGBoost) handle these issues better than linear models
Purpose: Neural network approach as an alternative to traditional ML
Question Answered: Can deep learning improve performance over Random Forest and XGBoost?
Architecture:

Classification model:
```
Input (features) → Dense(512) → BatchNorm → Dropout(0.3)
                 → Dense(256) → BatchNorm → Dropout(0.3)
                 → Dense(128) → BatchNorm → Dropout(0.3)
                 → Dense(500, softmax) → Output (species)
```

Regression model:
```
Input (features) → Dense(256) → BatchNorm → Dropout(0.3)
                 → Dense(128) → BatchNorm → Dropout(0.3)
                 → Dense(64) → BatchNorm → Dropout(0.3)
                 → Dense(1, linear) → Output (count)
```
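A minimal Keras sketch of the classification network above; the layer sizes follow the diagram, while the optimizer, loss, and activation choices are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(n_features: int, n_classes: int = 500) -> keras.Model:
    """Dense(512) → Dense(256) → Dense(128) → softmax over species."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(512, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(256, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The regression model follows the same pattern with widths 256/128/64 and a single linear output unit.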
Deep Learning Techniques Used:
- ✓ Batch Normalization: Stabilizes training
- ✓ Dropout: Prevents overfitting
- ✓ Class Weights: Handles imbalance
- ✓ Early Stopping: Prevents overfitting
- ✓ Learning Rate Scheduling: Adaptive learning
- ✓ Log Transformation: Handles skewness
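These pieces are typically wired together as sketched below, assuming the compiled `model` from the sketch above and training arrays `X_train`/`y_train` with integer labels; the patience values and epoch count are placeholder assumptions (the checkpoint filename matches the saved models listed further down):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow import keras

# Per-class weights to counteract the 40:1 imbalance
# (assumes labels are integers 0..n_classes-1)
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                  restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    keras.callbacks.ModelCheckpoint("best_cls_model.h5", save_best_only=True),
]

model.fit(X_train, y_train,
          validation_split=0.2, epochs=100,
          class_weight=class_weight, callbacks=callbacks)
```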
Expected Performance:
- Neural networks typically excel with:
  - Very large datasets (millions of samples)
  - Complex feature interactions
  - Unstructured data (images, text)
- For this tabular dataset of moderate size:
  - Random Forest likely remains competitive
  - Deep learning may not significantly outperform it
  - Tree models are naturally suited to tabular data
Output Directory: plots_deep_learning/
Models Saved:
- `best_cls_model.h5` (classification)
- `best_reg_model.h5` (regression)
Purpose: Original exploratory data analysis (EDA)
Contains:
- Interactive visualizations (Plotly)
- Geographic distribution maps
- Country-level analysis
- Temperature vs. diversity patterns
Output Directory: detailed_analysis/
Problem: Species distribution is highly imbalanced (40:1 ratio)
Solutions Implemented:
- ✓ `class_weight='balanced'` in scikit-learn models
- ✓ Stratified sampling (maintains class proportions)
- ✓ Weighted loss in neural networks
- ✓ Focus on weighted F1-score, not just accuracy
Not Implemented (computationally expensive):
- SMOTE (Synthetic Minority Over-sampling)
- Data augmentation
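A minimal sketch of the implemented pieces working together, assuming a feature matrix `X` and species labels `y` (hyperparameters are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stratified split preserves per-species proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Balanced class weights up-weight rare species during training
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", n_jobs=-1)
clf.fit(X_train, y_train)

# Weighted F1 is more informative than plain accuracy under 40:1 imbalance
print("weighted F1:", f1_score(y_test, clf.predict(X_test), average="weighted"))
```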
Problem: Species counts are extremely right-skewed (skewness = 25.93)
Solutions Implemented:
- ✓ Log1p transformation: `y_transformed = log(1 + y)`
- ✓ Inverse transform for predictions: `y_pred = exp(y_pred_log) - 1`
- ✓ Tree-based models (robust to skewness)
Why Log1p:
- Handles zero values (`log(0)` is undefined, but `log(1+0) = 0`)
- Reduces skewness from 25.93 → 2.13
- Allows use of MSE/RMSE metrics
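In code, `np.log1p` and `np.expm1` are the numerically stable forms of these two transforms; a sketch, assuming `X_train`, `X_test`, and raw counts `y_train` already exist:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=200, n_jobs=-1)
reg.fit(X_train, np.log1p(y_train))      # train in log space: log(1 + y)

y_pred = np.expm1(reg.predict(X_test))   # invert: exp(y_pred_log) - 1
y_pred = np.clip(y_pred, 0, None)        # counts cannot be negative
```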
Problem: 500 species classes + multiple countries
Solutions Implemented:
- ✓ One-hot encoding for country features
- ✓ StandardScaler for numerical features
- ✓ Dimensionality reduction via feature selection
- ✓ Ensemble methods (reduce overfitting)
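One way to express this preprocessing, assuming `num_cols` names the numeric environmental features and `country` is the lone categorical column (both are assumptions about the notebook's column names):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                           # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # one-hot country
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(class_weight="balanced", n_jobs=-1)),
])
```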
| Model | Test Accuracy | CV Accuracy | Training Time | Notes |
|---|---|---|---|---|
| Logistic Regression | 0.162 | 0.161 | ~22 min | Linear model struggles |
| Random Forest | 0.331 | 0.327 | ~3 min | Best performer |
| XGBoost | 0.097 | 0.084 | ~10 min | Underperforms unexpectedly |
| Neural Network (MLP) | TBD | TBD | TBD | See deep_learning notebook |
Why Random Forest Wins:
- Handles class imbalance well with `class_weight='balanced'`
- Robust to feature scaling
- Can model complex non-linear relationships
- Less prone to overfitting than XGBoost on this dataset
| Model | Test R² | CV R² | RMSE | Training Time | Notes |
|---|---|---|---|---|---|
| Linear Regression | -0.011 | 0.051 | 16.18 | ~1s | Fails on skewed data |
| Random Forest | 0.126 | 0.224 | 15.05 | ~38s | Best performer |
| XGBoost | 0.076 | 0.183 | 15.47 | ~2s | Decent performance |
| Neural Network (MLP) | TBD | TBD | TBD | TBD | See deep_learning notebook |
Why Random Forest Wins:
- Robust to outliers and skewness
- Handles spatial patterns well
- No need for complex feature engineering
- Fast training on moderate-sized datasets
When deep learning tends to help:
- Very large dataset (millions of samples)
  - Current: ~270k samples → moderate size
- Complex feature interactions (non-linear, high-order)
  - Current: mostly environmental features (temperature, precipitation)
- Unstructured data (images, text, time series)
  - Current: tabular data
- Need for transfer learning
  - Current: no pre-trained models applicable

Why traditional ML fits this dataset:
- Tabular data: tree models (RF, XGBoost) are state-of-the-art
- Moderate dataset size: not enough data to train deep networks effectively
- High class imbalance: neural networks struggle more than tree models
- Interpretability: tree models provide feature importance easily
Traditional ML (Random Forest) likely remains the best choice for this specific dataset. However, the deep learning notebook is provided as:
- An alternative approach to explore
- A learning exercise in neural network architectures
- A baseline for future improvements (e.g., with more data)
For Complete Analysis:
1. detailed_analysis_and_viz.ipynb # EDA (optional)
2. species_imbalance_skewness_analysis.ipynb # Understand data issues
3. species_pipeline_augmented.ipynb # Train ML models (global)
4. species_pipeline_augmented_USA.ipynb # Train ML models (USA only)
5. species_pipeline_deep_learning.ipynb       # Try deep learning (optional)

For Quick Results:
1. species_pipeline_augmented.ipynb           # Global ML models
OR
2. species_pipeline_augmented_USA.ipynb       # USA-only ML models

Requirements:

```bash
# Core libraries
pip install pandas numpy matplotlib seaborn scikit-learn

# Advanced ML
pip install xgboost

# Deep learning (optional)
pip install tensorflow

# Interactive visualizations (optional)
pip install plotly
```

Each notebook generates:
- PNG plots (static visualizations)
- HTML dashboards (interactive, where applicable)
- TXT reports (model performance summaries)
- CSV summaries (statistics and comparisons)
Output Directories:
- `plots_augmented/` - Global ML results
- `plots_USA/` - USA-only ML results
- `imbalance_analysis/` - Data quality analysis
- `detailed_analysis/` - EDA outputs
- `plots_deep_learning/` - Neural network results
Data Challenges:
- Class imbalance (40:1 ratio) severely limits classification accuracy
- Extreme skewness (25.93) makes regression difficult
- Top 34% of species contain 80% of the data

Best Model (Random Forest):
- Classification: 33% accuracy (decent, given 500 classes and severe imbalance)
- Regression: R² = 0.13 (moderate performance)
- Robust to imbalance and skewness
Feature Engineering:
- Spatial features (k-NN neighbors, distance to biodiversity hotspots)
- Temporal features (season, year)
- Interaction terms (temperature × precipitation)

More Data:
- Current: 50-2000 samples per species
- Ideal: 1000+ samples per species for rare classes
- Additional features: soil type, elevation, habitat characteristics
Improving Classification:
- Focus on the top N species (e.g., the top 100 with the most data): fewer classes, higher accuracy
- Hierarchical classification (family → genus → species): break the problem into stages
- Ensemble methods (stacking, blending): combine multiple models
Improving Regression:
- More features (habitat type, ecosystem): richer information for predictions
- Spatial features (neighboring cell counts): species cluster geographically
- Quantile regression instead of mean prediction: better handles outliers (see the sketch after this list)
- Time series analysis (if temporal data become available): seasonal patterns in species counts
- Geographically weighted regression: local models for different regions
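For the quantile-regression idea above, a hedged sketch using scikit-learn's gradient boosting with a pinball (quantile) loss; the hyperparameters are placeholders:

```python
from sklearn.ensemble import GradientBoostingRegressor

# alpha=0.5 targets the median count, which is far less outlier-sensitive
# than the mean on a distribution with skewness ≈ 26
median_model = GradientBoostingRegressor(loss="quantile", alpha=0.5)
median_model.fit(X_train, y_train)
```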
Notebooks:
- `species_pipeline_augmented.ipynb` - Main ML pipeline (global)
- `species_pipeline_augmented_USA.ipynb` - ML pipeline (USA only) ⭐ NEW
- `species_imbalance_skewness_analysis.ipynb` - Data quality analysis ⭐ NEW
- `species_pipeline_deep_learning.ipynb` - Neural network approach ⭐ NEW
- `detailed_analysis_and_viz.ipynb` - EDA and visualizations
- `species_pipeline_augmented_latest.ipynb` - Alternative version
- `species_pipeline_original.ipynb` - Original baseline
Data Files:
- `species_with_country_final.csv` - Main dataset (with country info)
- `species_train.npz` - Training data (NumPy format)
- `species_test.npz` - Test data (NumPy format)
Output Directories:
- `plots_augmented/` - Global ML results
- `plots_USA/` - USA-only results ⭐ NEW
- `imbalance_analysis/` - Data issue analysis ⭐ NEW
- `plots_deep_learning/` - Neural network results ⭐ NEW
- `detailed_analysis/` - EDA outputs
Q: Why is the classification accuracy only ~33%?
A: With 500 species classes and severe imbalance (40:1), this is actually reasonable:
- Random guessing would give 0.2% accuracy (1/500)
- 33% means the model is learning meaningful patterns
- Focus on top-k accuracy (top-5, top-10) for better metrics
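Top-k accuracy can be computed directly from predicted probabilities; a sketch, assuming a fitted classifier `clf` that exposes `predict_proba` (Random Forest qualifies):

```python
from sklearn.metrics import top_k_accuracy_score

proba = clf.predict_proba(X_test)
for k in (1, 5, 10):
    score = top_k_accuracy_score(y_test, proba, k=k, labels=clf.classes_)
    print(f"top-{k} accuracy: {score:.3f}")
```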
Q: Why log-transform the species counts?
A: The target (species count) is extremely skewed:
- Mean (4.79) >> Median (1.00)
- Skewness = 25.93 (very high)
- Log1p reduces this to 2.13, enabling better model training
Q: Should I use Random Forest or deep learning?
A: For this dataset, Random Forest is recommended because:
- Tabular data (tree models excel here)
- Moderate dataset size (not big enough for deep learning)
- Class imbalance (trees handle better)
- Faster training and better interpretability
However, try deep learning if:
- You have more data (millions of samples)
- You want to learn neural network techniques
- You can leverage GPUs for faster training
Q: Why a separate USA-only analysis?
A: Several reasons:
- Homogeneous region: Similar climate/geography patterns
- Data concentration: USA has 32% of all observations
- Simpler model: No need for country feature
- Better performance: Potentially higher accuracy in focused region
Q: Can this approach be applied to other datasets?
A: Yes! The methodology is generalizable:
- Handle class imbalance (class weights, stratified sampling)
- Transform skewed targets (log1p)
- Use tree-based models (Random Forest, XGBoost)
- Evaluate with appropriate metrics (weighted F1, R²)
If you use this project, please cite:
Species Mapping Project
Applied Machine Learning Mini Project
November 2025
For questions or issues, please refer to the notebook comments or reach out via the project repository.
- v1.0 (Nov 2025) - Initial release with augmented pipeline
- v2.0 (Nov 2025) - Added USA-specific analysis, imbalance/skewness analysis, and deep learning approach
Happy Modeling! 🌿🐾