geoluck is a supervised prediction project. The modeling stack should stay broad enough to test nonlinear geography patterns, but disciplined enough that results remain auditable.

| Model | Family | Use |
|---|---|---|
| `baseline_mean` | naive baseline | Minimum benchmark. |
| `baseline_region_mean` | grouped baseline | Tests whether geography beats broad regional averages. |
| `ridge` | penalized linear | Stable linear benchmark with shrinkage. |
| `lasso` | penalized linear | Sparse linear benchmark. |
| `elastic_net` | penalized linear | Mixed sparse/dense linear benchmark. |
| `huber` | robust linear | Downweights outliers while keeping a linear specification. |
| `random_forest` | tree ensemble | Nonlinear interactions with robust default behavior. |
| `extra_trees` | tree ensemble | Higher-variance tree benchmark. |
| `gradient_boosting` | boosted trees | Slower classic boosting benchmark with feature-importance support. |
| `hist_gb` | boosted trees | Strong default nonlinear model on tabular data. |

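
The two baseline rows in the table can be made concrete with a small plain-Python sketch. This is not the project's implementation: the record layout, the `region` key, and the helper names `fit_baseline_mean` and `fit_baseline_region_mean` are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_baseline_mean(train_rows, target):
    """Return a predictor that always outputs the training mean."""
    overall = mean(row[target] for row in train_rows)
    return lambda row: overall


def fit_baseline_region_mean(train_rows, target, region_key="region"):
    """Return a predictor that outputs the training mean of the row's region.

    Regions that never appear in training fall back to the overall mean.
    """
    overall = mean(row[target] for row in train_rows)
    by_region = defaultdict(list)
    for row in train_rows:
        by_region[row[region_key]].append(row[target])
    region_means = {region: mean(values) for region, values in by_region.items()}
    return lambda row: region_means.get(row[region_key], overall)


# Made-up toy records; real rows would be countries within a decade.
train = [
    {"region": "A", "income_rank_pct": 0.2},
    {"region": "A", "income_rank_pct": 0.4},
    {"region": "B", "income_rank_pct": 0.9},
]
predict_mean = fit_baseline_mean(train, "income_rank_pct")
predict_region = fit_baseline_region_mean(train, "income_rank_pct")
```

A geography model that cannot beat `predict_region` on held-out rows adds little beyond broad regional averages, which is the comparison the grouped baseline exists to make.
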
- Primary maintained public target: `income_rank_pct`
- Supported exploratory targets via `train-level-models --target`: `income`, `life_expectancy`, `inequality`, `wealth`, `gender_inequality`, `female_lfpr`, `women_business_law`
- Current target columns behind those names:
  - `income` -> `income_rank_pct`
  - `life_expectancy` -> `life_expectancy_rank_pct`
  - `inequality` -> `gini_disp_rank_pct`
  - `wealth` -> `produced_capital_per_capita_rank_pct`
  - `gender_inequality` -> `gender_inequality_rank_pct`
  - `female_lfpr` -> `female_labor_force_participation_rank_pct`
  - `women_business_law` -> `women_business_law_rank_pct`
- Target-specific feature guardrails:
  - `life_expectancy` excludes the exact WPP life-expectancy column and the crude-death-rate column from the feature matrix to avoid same-source outcome leakage.
  - `inequality` excludes the merged SWIID Gini outcome columns plus near-target Fragile States Index inequality and total-score columns from the feature matrix.
  - `wealth` excludes the merged Wealth Accounts level/log/rank outcome columns from the feature matrix.
  - `gender_inequality`, `female_lfpr`, and `women_business_law` exclude the merged gender-target outcome columns plus the UNDP GII feature block from the feature matrix to avoid same-theme target leakage.
- Unit of training: country within decade
- Primary split: within-decade cross-validation
- Current metrics: `r2`, `rmse`, `mae`, `spearman`
- Additional association exports: numeric feature correlations with income, population, life expectancy, inequality, wealth, gender inequality, female LFPR, and Women, Business and the Law target columns when they are present in the shared outcomes table
- Near-term additions: leave-region-out checks, calibration diagnostics, rank-hit metrics
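
The target aliases, leakage guardrails, and within-decade split described above can be sketched together in plain Python. The alias mapping is copied from the list. Everything else is an assumption: the guarded column names and `area_km2` are hypothetical placeholders, `select_features` and `within_decade_folds` are illustrative helper names, and "within-decade cross-validation" is read here as k-fold splitting applied separately inside each decade.

```python
import random
from collections import defaultdict

# Target alias -> target column, copied from the list above.
TARGET_COLUMNS = {
    "income": "income_rank_pct",
    "life_expectancy": "life_expectancy_rank_pct",
    "inequality": "gini_disp_rank_pct",
    "wealth": "produced_capital_per_capita_rank_pct",
    "gender_inequality": "gender_inequality_rank_pct",
    "female_lfpr": "female_labor_force_participation_rank_pct",
    "women_business_law": "women_business_law_rank_pct",
}

# HYPOTHETICAL column names standing in for the guarded columns; the real
# names are not given in this document.
GUARDED_COLUMNS = {
    "life_expectancy": {"wpp_life_expectancy", "crude_death_rate"},
}


def select_features(candidate_columns, target_alias):
    """Drop the target column and any guarded near-target columns."""
    blocked = {TARGET_COLUMNS[target_alias]} | GUARDED_COLUMNS.get(target_alias, set())
    return [col for col in candidate_columns if col not in blocked]


def within_decade_folds(rows, n_folds=5, seed=0):
    """Yield (decade, train_rows, test_rows) with folds drawn inside one decade."""
    by_decade = defaultdict(list)
    for row in rows:
        by_decade[row["decade"]].append(row)
    rng = random.Random(seed)
    for decade in sorted(by_decade):
        members = list(by_decade[decade])
        rng.shuffle(members)
        k = min(n_folds, len(members))
        if k < 2:
            continue  # a single country cannot be split into train and test
        for fold in range(k):
            test = members[fold::k]
            train = [row for i, row in enumerate(members) if i % k != fold]
            yield decade, train, test


# Toy rows (made up): four countries in 1990, six in 2000.
rows = [{"country": f"c{i}", "decade": 1990 if i < 4 else 2000} for i in range(10)]
folds = list(within_decade_folds(rows, n_folds=2))
features = select_features(
    ["area_km2", "crude_death_rate", "life_expectancy_rank_pct"], "life_expectancy"
)
```

Because each fold draws its training and test rows from the same decade, a score under this reading reflects how well the model separates countries within a period rather than how well it tracks change over time.
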

- `deep_geo_no_region_controls_v1`: geometry-only features without region-category controls
- `deep_geo_v1`: geometry-derived features only
- `deep_geo_plus_wdi_controls_v1`: geometry plus the core WDI land/resource controls
- `deep_geo_plus_wdi_agri_water_v1`: geometry plus broader WDI agriculture/water variables
- `deep_geo_plus_wdi_resources_v1`: geometry plus expanded WDI resource variables, including rents, depletion, fisheries, and resource-export mix
- `deep_geo_plus_climate_normals_v1`: geometry plus WorldClim baseline climate
- `deep_geo_plus_climate_variability_v1`: geometry plus decade climate variability/shock features
- `deep_geo_plus_hydro_terrain_v1`: geometry plus coastline, rivers, lakes, and terrain-structure features
- `deep_geo_plus_aquastat_dams_v1`: geometry plus AQUASTAT dam/storage infrastructure features
- `natural_endowment_full_v1`: geometry plus climate normals and climate variability, without region controls
- `natural_endowment_hydro_terrain_v1`: geometry plus climate and hydro/terrain structure, without region controls
- `wdi_controls_agri_water_only_v1`: WDI controls without geometry inputs
- `wdi_resources_only_v1`: resource-heavy WDI block without geometry inputs
- `combined_geo_wdi_climate_v1`: geometry, WDI controls, and climate normals
- `combined_geo_wdi_climate_full_v1`: geometry, WDI controls, climate normals, and climate variability
- `combined_geo_wdi_agri_water_climate_v1`: geometry, broader WDI agri/water controls, and climate normals
- `combined_geo_wdi_agri_water_climate_full_v1`: current fullest maintained natural-plus-controls tabular stack
- `combined_geo_wdi_agri_water_climate_hydro_terrain_full_v1`: wide combined stack including hydro/terrain structure
- `combined_geo_wdi_resources_agri_water_climate_hydro_terrain_full_v1`: current widest combined stack with hydro/terrain plus expanded WDI resources
- `combined_geo_wdi_resources_agri_water_climate_hydro_terrain_aquastat_full_v1`: widened combined stack that adds AQUASTAT dams/storage infrastructure

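
One way to read the feature-set names above is as ordered lists of feature blocks. The sketch below encodes four of them that way. The block labels (`geometry`, `wdi_controls`, `climate_normals`, `climate_variability`) are shorthand for the descriptions, every column name is hypothetical, and region-category controls are left out entirely, so treat it as an illustration of the composition idea rather than the project's configuration.

```python
# Block lists read off the descriptions above (a subset of the feature sets).
# Region-category controls are deliberately not modeled in this sketch.
FEATURE_SET_BLOCKS = {
    "deep_geo_plus_wdi_controls_v1": ["geometry", "wdi_controls"],
    "deep_geo_plus_climate_normals_v1": ["geometry", "climate_normals"],
    "combined_geo_wdi_climate_v1": ["geometry", "wdi_controls", "climate_normals"],
    "combined_geo_wdi_climate_full_v1": [
        "geometry", "wdi_controls", "climate_normals", "climate_variability",
    ],
}

# HYPOTHETICAL columns per block, for illustration only.
BLOCK_COLUMNS = {
    "geometry": ["area_km2", "coastline_ratio"],
    "wdi_controls": ["arable_land_pct"],
    "climate_normals": ["mean_temp_c", "annual_precip_mm"],
    "climate_variability": ["precip_anomaly_sd"],
}


def columns_for(feature_set):
    """Return the feature set's columns in block order, without duplicates."""
    columns = []
    for block in FEATURE_SET_BLOCKS[feature_set]:
        for col in BLOCK_COLUMNS[block]:
            if col not in columns:
                columns.append(col)
    return columns


full = columns_for("combined_geo_wdi_climate_full_v1")
narrow = columns_for("deep_geo_plus_climate_normals_v1")
```

Keeping the block membership explicit makes it straightforward to compare a narrow stack against a wider one that contains it, since the wider set's columns are a superset by construction.
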
- No model gets reported without a baseline comparison.
- Source coverage and missingness must be inspectable by feature block.
- Latest-decade diagnostics should expose top tree importances, top linear coefficients, and weakest-coverage features for the selected public model.
- Sensitive controls should be evaluated separately from the "nature alone" model family.
- Public claims should prefer stable multi-model patterns over a single best score.
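
The rule that no model is reported without a baseline comparison can be enforced in code. The sketch below computes the four current metrics (`r2`, `rmse`, `mae`, `spearman`) in plain Python and refuses to build a report row when baseline predictions are missing. The function names, the row layout, and the `rmse_gain_vs_baseline` field are assumptions for illustration, and the numbers are made up.

```python
import math


def _ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def _pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def regression_metrics(y_true, y_pred):
    """Compute r2, rmse, mae, and Spearman rank correlation."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "spearman": _pearson(_ranks(y_true), _ranks(y_pred)),
    }


def report_row(model_name, y_true, model_pred, baseline_pred):
    """Build a report row; refuse to build one without baseline predictions."""
    if baseline_pred is None:
        raise ValueError("baseline predictions are required for every report")
    model = regression_metrics(y_true, model_pred)
    baseline = regression_metrics(y_true, baseline_pred)
    return {
        "model": model_name,
        **model,
        "baseline_rmse": baseline["rmse"],
        "rmse_gain_vs_baseline": baseline["rmse"] - model["rmse"],
    }


# Made-up numbers: the baseline predicts the mean of y_true everywhere.
y_true = [1.0, 2.0, 3.0, 4.0]
row = report_row("ridge", y_true, [1.0, 2.0, 3.0, 5.0], [2.5] * 4)
```

Note that a constant predictor such as `baseline_mean` has no rank variation, so its Spearman correlation is undefined (NaN here). Error-based metrics such as `rmse` are therefore the safer basis for the baseline comparison.
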