A comprehensive data science investigation of 26,551 wildfire incidents across two decades
This project analyzes 26,551 wildfire incidents from Alberta, Canada (2006-2025) using regression, hypothesis testing, geospatial analysis, and machine learning to answer six questions about fire patterns, causes, and predictability.
| # | Question | Methods | Key Finding |
|---|---|---|---|
| 1 | Are wildfires increasing? | Linear regression, trend analysis | High variability, no simple trend |
| 2 | Where do fires concentrate? | Geospatial clustering (EPSG:3403) | Three distinct regions identified |
| 3 | What causes fires by region? | Chi-square test, contingency analysis | Causes vary significantly between north and south |
| 4 | Does fast response reduce size? | Correlation analysis | Weak correlation (r ≈ 0.3) |
| 5 | What weather predicts fire behavior? | Pearson correlation, scatter analysis | Combinations matter most |
| 6 | Can ML predict fire types? | K-means, Random Forest | 87% accuracy, 4 fire types |
- 87% ML prediction accuracy for fire size classification
- 4 distinct fire behavior types identified through clustering
- Regional differences support tailored management strategies
- High year-to-year variability dominates temporal patterns
- Weather combinations predict risk better than individual variables
- Python 3.12 or higher
- Jupyter Notebook
- 4GB+ RAM recommended
# Clone repository
git clone https://github.com/yourusername/alberta-wildfire-analysis.git
cd alberta-wildfire-analysis
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Launch Jupyter
jupyter notebook Wildfire_DataStory_Enhanced.ipynb
Option 1: Download from Source
- Visit Alberta Wildfire Historical Data
- Download complete dataset (2006-2025)
- Save as `data/wildfire_data.csv`
Option 2: Use Sample Data
- Sample dataset available in the `/data` folder (10% random sample for testing)
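Once the CSV is in place, a quick completeness check is a reasonable first step. The sketch below uses only the standard library and reports the share of empty cells per column. The column names in the demo (`fire_year`, `temperature`, `wind_speed`) are hypothetical placeholders, not the dataset's actual headers, and the helper `missing_share` is an illustration rather than code from the notebook.

```python
import csv
import io


def missing_share(csv_file):
    """Return {column: fraction of rows where the cell is empty}."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in counts:
            # Treat None (short row) and blank strings as missing.
            if not (row[name] or "").strip():
                counts[name] += 1
    return {name: (c / n_rows if n_rows else 0.0) for name, c in counts.items()}


# Tiny in-memory stand-in for data/wildfire_data.csv (hypothetical columns).
demo = io.StringIO(
    "fire_year,temperature,wind_speed\n"
    "2016,24.5,12\n"
    "2019,,\n"
    "2023,18.0,\n"
    "2023,,7\n"
)
shares = missing_share(demo)
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

For the real file, open it with `open("data/wildfire_data.csv", newline="")` and pass the handle to `missing_share`; the file's encoding is not stated here, so it may need to be set explicitly.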
alberta-wildfire-analysis/
│
├── Wildfire_DataStory_Enhanced.ipynb   # Main analysis notebook
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── LICENSE                             # MIT License
├── .gitignore                          # Git ignore rules
│
├── data/
│   ├── README.md                       # Data source info
│   └── wildfire_data.csv               # Dataset (download separately)
│
├── images/                             # Visualizations
│   ├── eda_*.png                       # Exploratory charts
│   ├── q1_*.png                        # Question 1 visuals
│   ├── q2_*.png                        # Question 2 visuals
│   └── ...                             # All generated charts
│
└── docs/                               # Additional documentation
    ├── methodology.md                  # Detailed methods
    └── data_dictionary.md              # Variable definitions
| Library | Purpose | Version |
|---|---|---|
| pandas | Data manipulation | 2.0+ |
| numpy | Numerical computing | 1.24+ |
| scipy | Statistical analysis | 1.10+ |
| scikit-learn | Machine learning | 1.3+ |
| Library | Purpose | Version |
|---|---|---|
| matplotlib | Static plots | 3.7+ |
| seaborn | Statistical graphics | 0.12+ |
| plotly | Interactive charts | 5.14+ |
| Library | Purpose | Version |
|---|---|---|
| geopandas | Geographic data structures | 0.13+ |
| pyproj | Coordinate transformations | 3.5+ |
| contextily | Basemap tiles | 1.3+ |
| shapely | Geometric operations | 2.0+ |
Coordinate System: EPSG:3403 (NAD83 Alberta 10-TM Forest)
- Import 26,551 fire records
- Assess data quality (completeness, types, distributions)
- Identify missing data patterns
- Handle missing values appropriately
- Engineer features (Fire Weather Index, periods, regions)
- Convert dates and categorize variables
- Visualize distributions
- Identify temporal and spatial patterns
- Compute initial correlations
- Each question follows: Motivation → Methods → Analysis → Findings → Implications
- Statistical rigor: hypothesis tests, significance levels, confidence intervals
- Multiple visualization types for each question
- Unsupervised: K-means clustering (k=4) to discover fire types
- Supervised: Random Forest to predict fire size categories
- Validation: Silhouette scores, confusion matrices, precision/recall
- Connect findings across questions
- Identify actionable insights
- Acknowledge limitations
- Recommend next steps
Annual fire frequency shows high year-to-year variability with extreme years (2016, 2019, 2023) rather than a consistent linear increase.
Geographic analysis reveals three distinct fire environments: Northern boreal (remote, lightning-caused), Central transition zone, and Southern grassland (human-caused).
K-means clustering identified 4 fire behavior types, and a Random Forest classifier predicted fire size categories with 87% accuracy.
Note: All visualizations are generated automatically when running the notebook
- Linear Regression - Trend detection (coefficients, R², p-values)
- Pearson Correlation - Association strength and significance
- Chi-Square Test - Independence testing (categorical variables)
- Hypothesis Testing - α = 0.05 significance level throughout
- K-Means Clustering
  - Optimal k selection via elbow method and silhouette scores
  - Feature standardization (StandardScaler)
  - 9 variables: weather, location, timing, fire characteristics
- Random Forest Classification
  - 70/30 train/test split
  - Hyperparameter tuning via grid search
  - Performance metrics: accuracy, precision, recall, F1
  - Feature importance analysis
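The clustering steps listed above (standardize the features, run k-means, compare inertia across k for the elbow method) can be sketched without any dependencies. This is a simplified stand-in, not the notebook's code: the project lists scikit-learn, whose `KMeans` and `StandardScaler` would normally do this work. The farthest-point seeding used here is a substitution chosen to keep the example deterministic, and the data are synthetic.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, which would divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            updated.append(
                [mean(col) for col in zip(*members)] if members else centers[j]
            )
        if updated == centers:
            break
        centers = updated
    return labels, centers


def inertia(points, labels, centers):
    """Within-cluster sum of squared distances (the elbow-method quantity)."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Synthetic two-feature data with two obvious groups (illustration only).
raw = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
scaled = standardize(raw)
for k in (1, 2, 3):
    labels, centers = kmeans(scaled, k)
    print(k, round(inertia(scaled, labels, centers), 3))
```

In the elbow method, k is chosen where the drop in inertia levels off. In this toy data the large drop happens from k=1 to k=2, matching the two groups that were built in.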
- Projection: EPSG:3403 (NAD83 Alberta 10-TM Forest)
- Grid Resolution: 5 km × 5 km cells
- Smoothing: Gaussian kernel (σ = 2.5 cells)
- Density Mapping: 2D histograms with interpolation
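A minimal sketch of the density-mapping idea follows: bin projected coordinates into 5 km cells, then blur the counts with a Gaussian of σ = 2.5 cells (12.5 km on this grid). It assumes the points are already in a metre-based projection such as EPSG:3403, treats cells outside the grid as zero, and truncates the kernel at 4σ. All of these are assumptions of the example, and the function names are hypothetical. With NumPy and SciPy available, `numpy.histogram2d` and `scipy.ndimage.gaussian_filter` would be the more usual route.

```python
import math


def gaussian_kernel(sigma, truncate=4.0):
    """Normalized 1-D Gaussian weights out to `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def bin_points(points, x_min, y_min, n_cols, n_rows, cell=5000.0):
    """Count points per grid cell; coordinates are projected metres."""
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = int((x - x_min) // cell)
        row = int((y - y_min) // cell)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] += 1.0
    return grid


def smooth_rows(grid, kernel):
    """Convolve every row with the kernel, treating outside cells as zero."""
    radius = len(kernel) // 2
    n = len(grid[0])
    out = []
    for row in grid:
        out.append([
            sum(kernel[k + radius] * row[j + k]
                for k in range(-radius, radius + 1) if 0 <= j + k < n)
            for j in range(n)
        ])
    return out


def gaussian_smooth(grid, sigma=2.5):
    """Separable 2-D Gaussian blur: smooth rows, then columns."""
    kernel = gaussian_kernel(sigma)
    by_rows = smooth_rows(grid, kernel)
    transposed = [list(col) for col in zip(*by_rows)]
    by_cols = smooth_rows(transposed, kernel)
    return [list(row) for row in zip(*by_cols)]


# Three synthetic ignitions in the same cell of a 41 x 41 grid (illustration only).
points = [(102_500.0, 102_500.0)] * 3
counts = bin_points(points, x_min=0.0, y_min=0.0, n_cols=41, n_rows=41)
density = gaussian_smooth(counts, sigma=2.5)
print(f"peak cell value: {density[20][20]:.4f}")
```

Because the kernel is normalized, smoothing spreads the counts over neighbouring cells without changing their total, as long as the points sit far enough from the grid edge.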
- Pre-position resources based on identified geographic hotspots
- Use cluster profiles for initial fire risk assessment
- Differentiate strategies by region (North vs. Central vs. South)
- Peak suppression capacity needed June-August
- Evidence supports regional (not province-wide) strategies
- Invest in northern detection (helicopters, remote sensing)
- Invest in southern prevention (public education, fuel mgmt)
- Climate adaptation: prepare for high-variability future
- Demonstrates ML feasibility for fire classification
- Identifies data gaps (fuel moisture, suppression effort)
- Provides baseline for climate change studies
- Methodology transferable to other regions
- 60% missing environmental data (weather measurements)
- Small fires receive abbreviated assessments
- May over-represent larger fires in correlations
- Complete-case analysis is defensible but biases results toward fires with full weather records
- Correlation ≠ Causation - We show associations, not proven causes
- 20-year window - May be too short for climate trend detection
- Suppression effects - Final fire size reflects both behavior AND firefighting
- Missing variables - Fuel moisture, suppression effort, economic costs
- 87% accuracy - Means 13% error rate (168 large fires misclassified)
- Cannot replace experts - Models support, don't replace human judgment
- Temporal validity - Patterns may shift with climate change
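The accuracy caveat above can be made concrete: overall accuracy can look strong while a rare class is mostly missed. The sketch below derives accuracy, precision, recall, and F1 from a confusion matrix. The counts are invented for illustration and are not the project's results, and `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(confusion, labels):
    """Accuracy plus precision/recall/F1 per class.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, name in enumerate(labels):
        tp = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Invented counts for a two-class problem (illustration only, not project results).
confusion = [
    [85, 5],  # true "small": 85 predicted small, 5 predicted large
    [8, 2],   # true "large": 8 predicted small, 2 predicted large
]
report = per_class_metrics(confusion, ["small", "large"])
print(f"accuracy: {report['accuracy']:.2f}")
print(f"recall for large fires: {report['large']['recall']:.2f}")
```

In this toy matrix the classifier is right 87 times out of 100 yet finds only 2 of the 10 large fires, which is why per-class recall is worth reporting alongside accuracy.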
See notebook for complete limitations discussion
Contributions welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit your changes (`git commit -m 'Add improvement'`)
- Push to branch (`git push origin feature/improvement`)
- Open a Pull Request
- Additional analyses - Temporal forecasting, fuel type deep-dive
- New visualizations - Interactive dashboards, animated maps
- Model improvements - Alternative ML algorithms, ensemble methods
- Documentation - Data dictionary expansion, methodology details
- Bug fixes - Code optimization, error handling
- Follow PEP 8 guidelines
- Add docstrings to functions
- Include comments for complex logic
- Update requirements.txt if adding dependencies
Questions? Open an issue
Project Maintainer:
- Name: Ifeanyi Njoku
- Email: ifeanyinjoku2@gmail.com
- LinkedIn: www.linkedin.com/in/ifeanyi-e-njoku
- Portfolio: https://github.com/cnero101/alberta-wildfire-analysis.git
Want to collaborate? Reach out directly or open a discussion!
This project is licensed under the MIT License - see LICENSE file for details.
Alberta Wildfire data is public domain (Government of Alberta).
Attribution required: "Data provided by Alberta Wildfire Management"
- Alberta Wildfire Management for maintaining comprehensive public records
- Government of Alberta for open data commitment
- Fire management professionals whose expertise keeps communities safe
- Data science community for tools and best practices
- Open source contributors for excellent libraries
- Python ecosystem (pandas, scikit-learn, matplotlib, geopandas)
- Jupyter Project for notebook environment
- GitHub for version control and collaboration
- Alberta Wildfire Historical Data: https://wildfire.alberta.ca/resources/historical-data/
- Canadian Wildland Fire Information System: https://cwfis.cfs.nrcan.gc.ca/
- Flannigan, M., et al. (2013). Global wildland fire season severity in the 21st century. Forest Ecology and Management.
- Rodrigues, M., & de la Riva, J. (2014). An insight into machine-learning algorithms to model wildfire susceptibility. Environmental Modelling & Software.
- Tymstra, C., et al. (2010). Development of Prometheus: Canadian Wildland Fire Growth Model. Natural Resources Canada.
- FireSmart Canada: https://www.firesmartcanada.ca/
- NASA FIRMS (Fire Information for Resource Management): https://firms.modaps.eosdis.nasa.gov/
If you found this analysis useful:
- Star this repository
- Fork for your own use
- Share with colleagues
- Provide feedback
- Contribute improvements
Every star helps make data science research more visible!
- Initial release
- Complete 6-question analysis
- Machine learning implementation
- EPSG:3403 geospatial visualization
- Comprehensive documentation
Analyzing wildfires with data science to build a more resilient Alberta
Last Updated: February 2026