This project investigates the relationship between seed oil intake (focusing on linoleic acid, LA) and metabolic health outcomes in Australia from 1980 to the present. Through data engineering and statistical analysis, we explore associations between dietary patterns and health outcomes while acknowledging important limitations and confounding factors.
Our analysis reveals several significant patterns:
- Strong positive correlations between LA intake and obesity/diabetes prevalence (r > 0.85)
- Strong negative correlation with CVD mortality (r ≈ -0.94)
- Moderate to strong positive correlations with CVD and dementia prevalence
- Lag analysis suggests potential delayed effects, particularly for obesity
- GAMs reveal complex non-linear relationships between LA intake and health outcomes
- Time series models effectively capture temporal patterns and seasonality
- Tree-based models highlight LA intake as a consistently important predictor
- Cross-validation across multiple modelling approaches increases result robustness
For detailed findings, limitations, and caveats, see reports/findings_and_limitations.md.
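The correlation and lag findings above rest on Pearson's r computed between an intake series and an outcome series shifted by a number of years. The sketch below illustrates that idea with the standard library only. The function names `pearson_r` and `lagged_correlations` and the sample numbers are hypothetical and are not taken from this repository's code or data.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlations(
    intake: Sequence[float], outcome: Sequence[float], max_lag: int
) -> Dict[int, float]:
    """Correlate intake in year t with the outcome in year t + lag."""
    results = {}
    for lag in range(max_lag + 1):
        x = intake[: len(intake) - lag]  # drop the last `lag` intake years
        y = outcome[lag:]                # drop the first `lag` outcome years
        results[lag] = pearson_r(x, y)
    return results


# Synthetic illustration only: the outcome follows intake with a 2-year delay.
intake = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0]
outcome = [0.0, 0.0] + [2 * v for v in intake[:-2]]

lags = lagged_correlations(intake, outcome, max_lag=3)
best_lag = max(lags, key=lambda k: lags[k])
print(best_lag, round(lags[best_lag], 3))
```

With annual series that both trend over decades, a high r at any lag can reflect the shared trend rather than a delayed effect, which is one reason the limitations report matters when reading these numbers.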
The analysis combines data from multiple authoritative sources:
- FAOSTAT Food Balance Sheets (dietary data, 1961-present)
- NCD Risk Factor Collaboration (diabetes, cholesterol, BMI, 1980-2022)
- IHME Global Burden of Disease Study (dementia, CVD, 1990-present)
- Australian Bureau of Statistics (mortality data, ~1980-present)
- Comprehensive data processing and validation
- Multiple statistical modelling approaches:
  - Generalized Additive Models (GAMs)
  - Time Series Models (ARIMA, Prophet)
  - Tree-Based Models (Random Forest, XGBoost)
- Extensive correlation and lag analyses
- Robust data validation using Pydantic
- Time series visualisations with zoom/pan
- Correlation heatmaps with detailed hover information
- Scatter plots with trend lines and confidence intervals
- Feature importance visualisations
- Model comparison plots
- GAM partial dependence plots
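The Pydantic-based validation listed among the features could look roughly like the sketch below. The model name `AnnualRecord`, its field names, and the bounds are assumptions made for illustration, and the two sample rows are invented values. The project's actual schemas live in the source tree and may differ.

```python
from pydantic import BaseModel, Field, ValidationError


class AnnualRecord(BaseModel):
    """One country-year row. Field names and bounds are hypothetical."""

    year: int = Field(..., ge=1961, le=2100)
    la_intake_g_per_day: float = Field(..., ge=0)
    obesity_prevalence_pct: float = Field(..., ge=0, le=100)


def validate_rows(rows):
    """Split raw dicts into validated records and (row index, error text) pairs."""
    valid, errors = [], []
    for i, row in enumerate(rows):
        try:
            valid.append(AnnualRecord(**row))
        except ValidationError as exc:
            errors.append((i, str(exc)))
    return valid, errors


# Invented sample rows: the second one violates every constraint.
rows = [
    {"year": 1990, "la_intake_g_per_day": 10.5, "obesity_prevalence_pct": 12.0},
    {"year": 1850, "la_intake_g_per_day": -1.0, "obesity_prevalence_pct": 150.0},
]
valid, errors = validate_rows(rows)
print(len(valid), "valid,", len(errors), "rejected")
```

Collecting errors instead of stopping at the first bad row makes it easier to see whether a source file has a few typos or a systematic problem such as a changed unit.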
- Python 3.8+
- pip package manager
- Git
- Clone the repository:
  git clone [repository-url]
  cd SeedoilsML
- Create and activate a virtual environment (recommended):
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Download required data files:
  - Place IHME GBD data zip file in data/raw/
  - Download ABS Causes of Death data (if needed)
  - NCD-RisC and FAOSTAT data are downloaded automatically
- Run the ETL pipeline:
  python src/run_etl.py
- Run the analysis:
  python src/run_analysis.py
- Generate the interactive dashboard:
  python src/visualisation/create_dashboard.py
- Start the local server:
  python -m http.server 8000
- View the dashboard:
  Open http://localhost:8000/figures/dashboard.html in your browser
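The ETL, analysis, and dashboard scripts from the steps above can also be chained from a single Python entry point. This is a minimal sketch, not a script shipped with the repository: `PIPELINE_SCRIPTS`, `build_commands`, and `run_pipeline` are hypothetical names, and only the three script paths come from the steps above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Script paths as listed in the steps above, in the order they must run.
PIPELINE_SCRIPTS = [
    "src/run_etl.py",
    "src/run_analysis.py",
    "src/visualisation/create_dashboard.py",
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Build one command per script, using the current interpreter by default."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(repo_root: Path, dry_run: bool = False) -> List[List[str]]:
    """Run each step from the repository root, or only print it when dry_run is set."""
    commands = build_commands()
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
            continue
        # check=True raises CalledProcessError on failure, so later steps
        # never run against stale or missing outputs.
        subprocess.run(cmd, cwd=repo_root, check=True)
    return commands


# Dry run only: prints the planned commands without executing them.
planned = run_pipeline(Path("."), dry_run=True)
```

Calling `run_pipeline(Path("."))` from the repository root, inside the activated virtual environment, would execute the three steps for real.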
- Zoom: Use the zoom tools in the toolbar to focus on specific time periods
- Pan: Click and drag to move through time periods
- Reset: Double-click to reset the view
- Legend: Click items to show/hide specific series
- Hover: Move mouse over points for detailed values
- Hover Details: Move mouse over cells for exact correlation values
- Scale: Colour intensity indicates correlation strength
- Interpretation: Red = positive correlation, Blue = negative correlation
- Trend Lines: Show relationship direction and confidence intervals
- Hover: View exact values and additional metrics
- Zoom: Focus on specific regions of interest
- Export: Use camera icon to save plots as PNG files
- Performance Metrics: Compare RMSE, MAPE, and R² across models
- Feature Importance: View relative importance of predictors
- Hover: See exact values and descriptions
- Legend: Toggle different models and metrics
- Partial Dependence: Visualise non-linear relationships
- Confidence Bands: Show uncertainty in relationships
- Interpretation Guide: Available in hover text
- Export Options: Save plots for presentations
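The performance metrics compared in the dashboard (RMSE, MAPE, and R²) follow standard definitions, sketched here with the standard library so the formulas are explicit. The function names and the sample values are illustrative assumptions. The project's own modelling code may compute these through a library instead.

```python
from math import sqrt
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the units of the outcome."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


# Invented values for illustration, not model output from this project.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(rmse(actual, predicted), mape(actual, predicted), r_squared(actual, predicted))
```

For these sample values the three metrics come out at roughly 2.55, 11.9 % and 0.948. RMSE penalises large misses more heavily, while MAPE weights every point by its relative error, so the two can rank models differently.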
SeedoilsML/
├── data/ # Data files
│ ├── processed/ # Processed datasets
│ ├── raw/ # Original data sources
│ └── staging/ # Intermediate processing
├── figures/ # Generated visualisations
│ ├── interactive/ # Interactive plot files
│ ├── gam_analysis/ # GAM visualisations
│ └── time_series/ # Time series plots
├── reports/ # Analysis reports
├── src/ # Source code
│ ├── analysis/ # Analysis modules
│ ├── data_processing/ # Data processing scripts
│ ├── models/ # Statistical models
│ └── visualisation/ # Visualisation code
├── tests/ # Test files
└── requirements.txt # Python dependencies
- Fork the repository
- Create your feature branch
- Make your changes
- Run tests:
  pytest
- Submit a pull request
This project is licensed under the terms of the LICENSE file included in the repository.
- NCD Risk Factor Collaboration for health metrics data
- FAOSTAT for dietary data
- IHME for Global Burden of Disease data
- Australian Bureau of Statistics for mortality data