This project analyzes species distribution data across different geographic locations, combining environmental features to predict:
- Classification Task: Which species is present at a given location
- Regression Task: How many species observations occur at grid locations
- File: `species_with_country_final.csv`
- Total Observations: ~270,000 (after cleaning)
- Unique Species: 500
- Geographic Coverage: Global (multiple countries)
- Features:
  - Geographic: latitude, longitude, country
  - Environmental: temperature (avg/max/min), precipitation, solar radiation, water vapor pressure, wind speed
- Targets: species_id (classification), species_count (regression; aggregated per grid location)
Purpose: Complete ML pipeline for species prediction (both classification and regression)
Key Features:
- Handles class imbalance using `class_weight='balanced'`
- Log transformation for skewed species counts
- Models tested:
- Logistic Regression
- Random Forest (best performer)
- XGBoost
- Cross-validation for robust evaluation
- Comprehensive visualizations
Results (Global Data):
- Classification: Random Forest achieves ~33% accuracy (challenging due to 500 classes and imbalance)
- Regression: Random Forest R² = 0.13 (species count prediction)
Output Directory: plots_augmented/
Purpose: Same analysis as above but filtered to USA data only
Key Differences:
- Data filtered to `country == 'United States of America'`
- ~86,000 USA observations (32% of total dataset)
- 116 unique species in USA
- No country feature in model (single country)
- Focused geographic analysis
Why USA Only?
- Homogeneous geographic region
- Potentially better model performance
- Regional species patterns
- Easier to interpret results
Output Directory: plots_USA/
Purpose: Deep dive into data quality issues affecting model performance
What It Analyzes:
- Imbalance Ratio: 40:1 (most common vs. least common species)
- Data Concentration: Top 169 species (34%) contain 80% of all data
- Rarity Categories (see the binning sketch below):
  - Very Rare (<50 samples): 0% of species
  - Rare (50-200): 46% of species
  - Common (200-500): 23%
  - Very Common (500-1000): 20%
  - Abundant (1000+): 11%
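As a rough illustration, this binning can be reproduced with `pandas.cut`; the DataFrame and column names here (`df`, `species_id`) are assumptions, not the notebook's exact code:

```python
import pandas as pd

# Count how many observations each species has (assumes a `species_id` column)
samples_per_species = df["species_id"].value_counts()

# Bin species into the rarity categories above (bin edges are right-inclusive)
rarity = pd.cut(
    samples_per_species,
    bins=[0, 50, 200, 500, 1000, float("inf")],
    labels=["Very Rare", "Rare", "Common", "Very Common", "Abundant"],
)

# Share of species in each category
print(rarity.value_counts(normalize=True).round(2))
```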
Visualizations:
- Comprehensive 9-panel class imbalance analysis
- Cumulative distribution (Pareto principle)
- Top/bottom species frequency
- Category breakdowns
Species Count Skewness:
- Skewness: 25.93 (extremely right-skewed!)
- Distribution:
  - Mean: 4.79 species/location
  - Median: 1.00 species/location (a huge mean-median gap)
  - 98.8% of locations have <50 species
  - Only 1.2% have 200+ species
Transformations Tested:
- Original: Skewness = 25.93
- Log(1+x): Skewness = 2.13 ✓ (best)
- Sqrt(x): Skewness = 8.71
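A minimal sketch of this comparison, assuming `counts` holds the per-location species counts (the variable name is illustrative):

```python
import numpy as np
from scipy.stats import skew

# counts: 1-D array of species counts per grid location (assumed to exist)
for name, values in [
    ("Original", counts),
    ("Log(1+x)", np.log1p(counts)),
    ("Sqrt(x)", np.sqrt(counts)),
]:
    print(f"{name}: skewness = {skew(values):.2f}")
```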
Visualizations:
- Comprehensive 9-panel skewness analysis
- Q-Q plots (original vs. transformed)
- CDF and percentile analysis
- Transformation comparisons
Output Directory: imbalance_analysis/
Key Findings:
- ⚠️ Severe class imbalance makes classification very challenging
- ⚠️ Extreme skewness requires a log transformation for regression
- ✓ Tree-based models (RF, XGBoost) handle these issues better than linear models
Purpose: Neural network approach as an alternative to traditional ML
Question Answered: Can deep learning improve performance over Random Forest and XGBoost?
Architecture:

Classification model:
```
Input (features) → Dense(512) → BatchNorm → Dropout(0.3)
                 → Dense(256) → BatchNorm → Dropout(0.3)
                 → Dense(128) → BatchNorm → Dropout(0.3)
                 → Dense(500, softmax) → Output (species)
```

Regression model:
```
Input (features) → Dense(256) → BatchNorm → Dropout(0.3)
                 → Dense(128) → BatchNorm → Dropout(0.3)
                 → Dense(64) → BatchNorm → Dropout(0.3)
                 → Dense(1, linear) → Output (count)
```
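A minimal Keras sketch of the classification network above; the layer sizes follow the diagram, while the optimizer, loss, and activation choices are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(n_features: int, n_classes: int = 500) -> keras.Model:
    """Dense(512) → Dense(256) → Dense(128) → softmax over species."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(512, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(256, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The regression model follows the same pattern with widths 256/128/64 and a single linear output unit.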
Deep Learning Techniques Used:
- ✓ Batch Normalization: Stabilizes training
- ✓ Dropout: Prevents overfitting
- ✓ Class Weights: Handles imbalance
- ✓ Early Stopping: Prevents overfitting
- ✓ Learning Rate Scheduling: Adaptive learning
- ✓ Log Transformation: Handles skewness
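These pieces are typically wired together as sketched below, assuming the compiled `model` from the sketch above and training arrays `X_train`/`y_train` with integer labels; the patience values and epoch count are placeholder assumptions (the checkpoint filename matches the saved models listed further down):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow import keras

# Per-class weights to counteract the 40:1 imbalance
# (assumes labels are integers 0..n_classes-1)
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                  restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    keras.callbacks.ModelCheckpoint("best_cls_model.h5", save_best_only=True),
]

model.fit(X_train, y_train,
          validation_split=0.2, epochs=100,
          class_weight=class_weight, callbacks=callbacks)
```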
Expected Performance:
- Neural networks typically excel with:
  - Very large datasets (millions of samples)
  - Complex feature interactions
  - Unstructured data (images, text)
- For this tabular dataset of moderate size:
  - Random Forest likely remains competitive
  - Deep learning may not significantly outperform it
  - Tree models are naturally suited to tabular data
Output Directory: plots_deep_learning/
Models Saved:
- `best_cls_model.h5` (classification)
- `best_reg_model.h5` (regression)
Purpose: Original exploratory data analysis (EDA)
Contains:
- Interactive visualizations (Plotly)
- Geographic distribution maps
- Country-level analysis
- Temperature vs. diversity patterns
Output Directory: detailed_analysis/
Problem: Species distribution is highly imbalanced (40:1 ratio)
Solutions Implemented:
- ✓ `class_weight='balanced'` in scikit-learn models
- ✓ Stratified sampling (maintains class proportions)
- ✓ Weighted loss in neural networks
- ✓ Focus on weighted F1-score, not just accuracy
Not Implemented (computationally expensive):
- SMOTE (Synthetic Minority Over-sampling)
- Data augmentation
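A minimal sketch of the implemented pieces working together, assuming a feature matrix `X` and species labels `y` (hyperparameters are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stratified split preserves per-species proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Balanced class weights up-weight rare species during training
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", n_jobs=-1)
clf.fit(X_train, y_train)

# Weighted F1 is more informative than plain accuracy under 40:1 imbalance
print("weighted F1:", f1_score(y_test, clf.predict(X_test), average="weighted"))
```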
Problem: Species counts are extremely right-skewed (skewness = 25.93)
Solutions Implemented:
- ✓ Log1p transformation: `y_transformed = log(1 + y)`
- ✓ Inverse transform for predictions: `y_pred = exp(y_pred_log) - 1`
- ✓ Tree-based models (robust to skewness)
Why Log1p:
- Handles zero values (`log(0)` is undefined, but `log(1+0) = 0`)
- Reduces skewness from 25.93 → 2.13
- Allows use of MSE/RMSE metrics
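In code, `np.log1p` and `np.expm1` are the numerically stable forms of these two transforms; a sketch, assuming `X_train`, `X_test`, and raw counts `y_train` already exist:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=200, n_jobs=-1)
reg.fit(X_train, np.log1p(y_train))      # train in log space: log(1 + y)

y_pred = np.expm1(reg.predict(X_test))   # invert: exp(y_pred_log) - 1
y_pred = np.clip(y_pred, 0, None)        # counts cannot be negative
```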
Problem: 500 species classes + multiple countries
Solutions Implemented:
- ✓ One-hot encoding for country features
- ✓ StandardScaler for numerical features
- ✓ Dimensionality reduction via feature selection
- ✓ Ensemble methods (reduce overfitting)
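One way to express this preprocessing, assuming `num_cols` names the numeric environmental features and `country` is the lone categorical column (both are assumptions about the notebook's column names):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                           # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # one-hot country
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(class_weight="balanced", n_jobs=-1)),
])
```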
| Model | Test Accuracy | CV Accuracy | Training Time | Notes |
|---|---|---|---|---|
| Logistic Regression | 0.162 | 0.161 | ~22 min | Linear model struggles |
| Random Forest | 0.331 | 0.327 | ~3 min | Best performer |
| XGBoost | 0.097 | 0.084 | ~10 min | Underperforms unexpectedly |
| Neural Network (MLP) | TBD | TBD | TBD | See deep_learning notebook |
Why Random Forest Wins:
- Handles class imbalance well with `class_weight='balanced'`
- Robust to feature scaling
- Can model complex non-linear relationships
- Less prone to overfitting than XGBoost on this dataset
| Model | Test R² | CV R² | RMSE | Training Time | Notes |
|---|---|---|---|---|---|
| Linear Regression | -0.011 | 0.051 | 16.18 | ~1s | Fails on skewed data |
| Random Forest | 0.126 | 0.224 | 15.05 | ~38s | Best performer |
| XGBoost | 0.076 | 0.183 | 15.47 | ~2s | Decent performance |
| Neural Network (MLP) | TBD | TBD | TBD | TBD | See deep_learning notebook |
Why Random Forest Wins:
- Robust to outliers and skewness
- Handles spatial patterns well
- No need for complex feature engineering
- Fast training on moderate-sized datasets
When deep learning tends to help:
- Very large dataset (millions of samples)
  - Current: ~270k samples → moderate size
- Complex feature interactions (non-linear, high-order)
  - Current: mostly environmental features (temperature, precipitation)
- Unstructured data (images, text, time series)
  - Current: tabular data
- Need for transfer learning
  - Current: no pre-trained models applicable

Why traditional ML fits this dataset:
- Tabular data: tree models (RF, XGBoost) are state-of-the-art
- Moderate dataset size: not enough data to train deep networks effectively
- High class imbalance: neural networks struggle more than tree models
- Interpretability: tree models provide feature importance easily
Traditional ML (Random Forest) likely remains the best choice for this specific dataset. However, the deep learning notebook is provided as:
- An alternative approach to explore
- A learning exercise in neural network architectures
- A baseline for future improvements (e.g., with more data)
For Complete Analysis:
1. detailed_analysis_and_viz.ipynb # EDA (optional)
2. species_imbalance_skewness_analysis.ipynb # Understand data issues
3. species_pipeline_augmented.ipynb # Train ML models (global)
4. species_pipeline_augmented_USA.ipynb # Train ML models (USA only)
5. species_pipeline_deep_learning.ipynb       # Try deep learning (optional)

For Quick Results:
1. species_pipeline_augmented.ipynb           # Global ML models
OR
2. species_pipeline_augmented_USA.ipynb       # USA-only ML models

Requirements:

```bash
# Core libraries
pip install pandas numpy matplotlib seaborn scikit-learn

# Advanced ML
pip install xgboost

# Deep learning (optional)
pip install tensorflow

# Interactive visualizations (optional)
pip install plotly
```

Each notebook generates:
- PNG plots (static visualizations)
- HTML dashboards (interactive, where applicable)
- TXT reports (model performance summaries)
- CSV summaries (statistics and comparisons)
Output Directories:
- `plots_augmented/` - Global ML results
- `plots_USA/` - USA-only ML results
- `imbalance_analysis/` - Data quality analysis
- `detailed_analysis/` - EDA outputs
- `plots_deep_learning/` - Neural network results
Data Challenges:
- Class imbalance (40:1 ratio) severely limits classification accuracy
- Extreme skewness (25.93) makes regression difficult
- Top 34% of species contain 80% of the data

Best Model (Random Forest):
- Classification: 33% accuracy (decent, given 500 classes and severe imbalance)
- Regression: R² = 0.13 (moderate performance)
- Robust to imbalance and skewness
Feature Engineering:
- Spatial features (k-NN neighbors, distance to biodiversity hotspots)
- Temporal features (season, year)
- Interaction terms (temperature × precipitation)

More Data:
- Current: 50-2000 samples per species
- Ideal: 1000+ samples per species for rare classes
- Additional features: soil type, elevation, habitat characteristics
Improving Classification:
- Focus on the top N species (e.g., the top 100 with the most data): fewer classes, higher accuracy
- Hierarchical classification (family → genus → species): break the problem into stages
- Ensemble methods (stacking, blending): combine multiple models
Improving Regression:
- More features (habitat type, ecosystem): richer information for predictions
- Spatial features (neighboring cell counts): species cluster geographically
- Quantile regression instead of mean prediction: better handles outliers (see the sketch after this list)
- Time series analysis (if temporal data become available): seasonal patterns in species counts
- Geographically weighted regression: local models for different regions
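For the quantile-regression idea above, a hedged sketch using scikit-learn's gradient boosting with a pinball (quantile) loss; the hyperparameters are placeholders:

```python
from sklearn.ensemble import GradientBoostingRegressor

# alpha=0.5 targets the median count, which is far less outlier-sensitive
# than the mean on a distribution with skewness ≈ 26
median_model = GradientBoostingRegressor(loss="quantile", alpha=0.5)
median_model.fit(X_train, y_train)
```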
Notebooks:
- `species_pipeline_augmented.ipynb` - Main ML pipeline (global)
- `species_pipeline_augmented_USA.ipynb` - ML pipeline (USA only) ⭐ NEW
- `species_imbalance_skewness_analysis.ipynb` - Data quality analysis ⭐ NEW
- `species_pipeline_deep_learning.ipynb` - Neural network approach ⭐ NEW
- `detailed_analysis_and_viz.ipynb` - EDA and visualizations
- `species_pipeline_augmented_latest.ipynb` - Alternative version
- `species_pipeline_original.ipynb` - Original baseline
Data Files:
- `species_with_country_final.csv` - Main dataset (with country info)
- `species_train.npz` - Training data (NumPy format)
- `species_test.npz` - Test data (NumPy format)
Output Directories:
- `plots_augmented/` - Global ML results
- `plots_USA/` - USA-only results ⭐ NEW
- `imbalance_analysis/` - Data issue analysis ⭐ NEW
- `plots_deep_learning/` - Neural network results ⭐ NEW
- `detailed_analysis/` - EDA outputs
Q: Why is the classification accuracy only ~33%?
A: With 500 species classes and severe imbalance (40:1), this is actually reasonable:
- Random guessing would give 0.2% accuracy (1/500)
- 33% means the model is learning meaningful patterns
- Focus on top-k accuracy (top-5, top-10) for better metrics
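Top-k accuracy can be computed directly from predicted probabilities; a sketch, assuming a fitted classifier `clf` that exposes `predict_proba` (Random Forest qualifies):

```python
from sklearn.metrics import top_k_accuracy_score

proba = clf.predict_proba(X_test)
for k in (1, 5, 10):
    score = top_k_accuracy_score(y_test, proba, k=k, labels=clf.classes_)
    print(f"top-{k} accuracy: {score:.3f}")
```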
Q: Why log-transform the species counts?
A: The target (species count) is extremely skewed:
- Mean (4.79) >> Median (1.00)
- Skewness = 25.93 (very high)
- Log1p reduces this to 2.13, enabling better model training
Q: Should I use Random Forest or deep learning?
A: For this dataset, Random Forest is recommended because:
- Tabular data (tree models excel here)
- Moderate dataset size (not big enough for deep learning)
- Class imbalance (trees handle better)
- Faster training and better interpretability
However, try deep learning if:
- You have more data (millions of samples)
- You want to learn neural network techniques
- You can leverage GPUs for faster training
Q: Why a separate USA-only analysis?
A: Several reasons:
- Homogeneous region: Similar climate/geography patterns
- Data concentration: USA has 32% of all observations
- Simpler model: No need for country feature
- Better performance: Potentially higher accuracy in focused region
Q: Can this approach be applied to other datasets?
A: Yes! The methodology is generalizable:
- Handle class imbalance (class weights, stratified sampling)
- Transform skewed targets (log1p)
- Use tree-based models (Random Forest, XGBoost)
- Evaluate with appropriate metrics (weighted F1, R²)
If you use this project, please cite:
Species Mapping Project
Applied Machine Learning Mini Project
November 2025
For questions or issues, please refer to the notebook comments or reach out via the project repository.
- v1.0 (Nov 2025) - Initial release with augmented pipeline
- v2.0 (Nov 2025) - Added USA-specific analysis, imbalance/skewness analysis, and deep learning approach
Happy Modeling! 🌿🐾