Commit 6dbb282

Add gender metrics targets and site support
1 parent c1ad211 commit 6dbb282

29 files changed, +610285 -19 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ __pycache__/
 # Private local agent/project planning files
 AGENTS.md
 geoluck_codex_plan.md
+geoluck-R/
 
 # Private agent workflow state
 do/

DATA_SOURCES.md

Lines changed: 2 additions & 0 deletions
@@ -46,6 +46,8 @@ Track every external dataset here before or when it is added.
 | Polity 5 | http://www.systemicpeace.org/inscr/p5v2018.xls | 2026-03-11 | Public academic workbook distributed from Systemic Peace; keep the Polity 5 citation and note the manual browser download path because scripted requests currently return HTTP `406` | Raw XLS is kept locally in `data_raw/polity/`; normalized country-year regime-authority rows and compact trailing-decade governance features may be redistributed as derived outputs after citation review and with explicit notes about excluded dissolved states | `src/geoluck/etl/fetch_polity.py`, `src/geoluck/features/build_polity_features.py`, `data_intermediate/polity/` | active |
 | SWIID 9.91 summary CSV | https://fsolt.org/swiid/swiid_downloads/ | 2026-03-11 | SWIID is distributed for academic use with citation; keep the SWIID version note and source-page / Dataverse references in provenance | Raw summary CSV is fetched locally into `data_raw/swiid/`; normalized country-year inequality rows and merged country-decade outcome targets may be redistributed as compact derived outputs after citation review | `src/geoluck/etl/fetch_swiid.py`, `src/geoluck/features/build_outcomes_panel.py`, `data_intermediate/swiid/` | active |
 | World Bank Wealth Accounts produced capital per capita | https://api.worldbank.org/v2/indicator/NW.PCA.PC?format=json | 2026-03-11 | World Bank Wealth Accounts / Changing Wealth of Nations indicator; keep the indicator code `NW.PCA.PC`, source id `59`, and CWON citation in provenance | Raw API payloads are fetched locally into `data_raw/wealth_accounts/`; normalized country-year produced-capital rows and merged country-decade wealth targets may be redistributed as compact derived outputs with source citation review | `src/geoluck/etl/fetch_wealth_accounts.py`, `src/geoluck/features/build_outcomes_panel.py`, `data_intermediate/wealth_accounts/` | active |
+| World Bank female labor-force participation rate (`SL.TLF.CACT.FE.ZS`) | https://api.worldbank.org/v2/indicator/SL.TLF.CACT.FE.ZS?format=json | 2026-03-13 | World Bank open data terms apply; keep the indicator code and source id in provenance | Raw API payloads are fetched locally into `data_raw/female_lfpr/`; normalized country-year female labor-force participation rows and merged country-decade outcome targets may be redistributed as compact derived outputs with source citation review | `src/geoluck/etl/fetch_female_lfpr.py`, `src/geoluck/features/build_outcomes_panel.py`, `data_intermediate/female_lfpr/` | active |
+| World Bank Women, Business and the Law (`SG.LAW.INDX`) | https://api.worldbank.org/v2/indicator/SG.LAW.INDX?format=json | 2026-03-13 | World Bank open data terms apply; keep the indicator code and source id in provenance | Raw API payloads are fetched locally into `data_raw/women_business_law/`; normalized country-year Women, Business and the Law rows and merged country-decade outcome targets may be redistributed as compact derived outputs with source citation review | `src/geoluck/etl/fetch_women_business_law.py`, `src/geoluck/features/build_outcomes_panel.py`, `data_intermediate/women_business_law/` | active |
 | UN World Population Prospects 2024 | https://population.un.org/wpp/downloads | 2026-03-09 | UN DESA public downloads; cite WPP 2024 and keep note that workbook schemas may change across releases | Raw WPP 2024 workbooks are fetched locally into `data_raw/wpp/`; normalized country-year demographic tables and compact decade features may be redistributed as derived outputs with source citation review | `src/geoluck/etl/fetch_wpp.py`, `src/geoluck/features/build_wpp_features.py`, `data_intermediate/wpp/` | active |
 | UNDP Gender Inequality Index 2025 | https://hdr.undp.org/sites/default/files/2025_HDR/HDR25_Statistical_Annex_GII_Table.xlsx | 2026-03-09 | UNDP HDR workbook is publicly downloadable; cite HDR 2025 and preserve the mixed component-year note in provenance | Raw workbook is fetched locally into `data_raw/undp_gii/`; normalized country-level GII and component indicators plus derived gender-gap features may be redistributed as compact derived outputs with source citation review | `src/geoluck/etl/fetch_undp_gii.py`, `src/geoluck/features/build_undp_gii_features.py`, `data_intermediate/undp_gii/` | active |
 | Glottolog CLDF languages | https://raw.githubusercontent.com/glottolog/glottolog-cldf/master/cldf/languages.csv | 2026-03-09 | Open scholarly dataset with attribution; keep note that the raw GitHub branch path is a moving snapshot unless pinned to a release | Raw CSV is fetched locally into `data_raw/glottolog/`; normalized country-language inventory and compact country-level language-count features may be redistributed as derived outputs with source citation review | `src/geoluck/etl/fetch_glottolog.py`, `src/geoluck/features/build_glottolog_features.py`, `data_intermediate/glottolog/` | active |
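
Note on the two World Bank rows added above: both point at World Bank Indicators API v2 endpoints, which normally return a two-element JSON array (paging metadata first, observation rows second). The sketch below shows one way such a payload could be flattened into country-year rows. It assumes that response shape; the function name `tidy_indicator_rows`, the output field names, and the sample values are hypothetical and are not taken from `fetch_female_lfpr.py` or `fetch_women_business_law.py`.

```python
from typing import Any


def tidy_indicator_rows(payload: list[Any]) -> list[dict[str, Any]]:
    """Flatten a World Bank v2 indicator payload into country-year rows.

    Assumes the usual `[paging_metadata, observations]` response shape.
    Error responses (a single-element list) yield no rows.
    """
    if len(payload) < 2 or payload[1] is None:
        return []
    rows = []
    for obs in payload[1]:
        if obs.get("value") is None:  # skip country-years with no observation
            continue
        rows.append(
            {
                "iso3": obs["countryiso3code"],
                "indicator": obs["indicator"]["id"],
                "year": int(obs["date"]),
                "value": float(obs["value"]),
            }
        )
    return rows


# Toy payload with made-up values, shaped like an API response.
sample = [
    {"page": 1, "pages": 1, "per_page": 50, "total": 2},
    [
        {"indicator": {"id": "SL.TLF.CACT.FE.ZS"}, "countryiso3code": "AAA",
         "date": "2020", "value": 50.0},
        {"indicator": {"id": "SL.TLF.CACT.FE.ZS"}, "countryiso3code": "AAA",
         "date": "2021", "value": None},
    ],
]
print(tidy_indicator_rows(sample))
```

Dropping null observations at this stage keeps the tidy table limited to country-years that can actually contribute to a decade-level target.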

README.md

Lines changed: 6 additions & 3 deletions
@@ -2,7 +2,7 @@
 
 **How much of relative country prosperity can be predicted from geography, natural endowments, resource development, and social structure — and who beats their geography?**
 
-Geoluck is an open-source research project that builds a country-decade panel (1900–2020) and trains machine learning models to predict four prosperity outcomes from tiered feature sets. The results are published as an interactive static site.
+Geoluck is an open-source research project that builds a country-decade panel (1900–2020) and trains machine learning models to predict country-level income, wellbeing, inequality, wealth, and gender outcomes from tiered feature sets. The results are published as an interactive static site.
 
 This is explicitly about **predictive association**, not causal effect.

@@ -12,14 +12,17 @@ This is explicitly about **predictive association**, not causal effect.
 
 ## What the site shows
 
-The static site models four outcome metrics, each converted to within-decade percentile ranks:
+The static site models seven outcome metrics, each converted to within-decade percentile ranks:
 
 | Outcome | Definition | Source |
 |---|---|---|
 | **Income** | Log GDP per capita rank | Maddison Project Database 2023 |
 | **Wealth** | Produced capital per capita rank | World Bank Changing Wealth of Nations |
 | **Life expectancy** | Life expectancy at birth rank | World Bank WDI / UN Population Division |
-| **Inequality** | Disposable-income Gini rank (higher = more equal) | SWIID |
+| **Inequality** | Disposable-income Gini rank (higher = more unequal) | SWIID |
+| **Gender inequality** | UNDP Gender Inequality Index rank (higher = more unequal) | UNDP HDR 2025 |
+| **Female LFPR** | Female labor-force participation rate rank | World Bank WDI / ILO |
+| **Women, Business and the Law** | Women, Business and the Law index rank | World Bank |
 
 Predictor features are organized into three independently toggleable tiers:
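
Note on "within-decade percentile ranks" in the README hunk above: the diff does not show how the ranks are computed, so the following is only a sketch of one common convention (average ranks for ties, scaled as `rank / n * 100`). The function name `within_decade_rank_pct` and the toy country codes and values are hypothetical.

```python
from collections import defaultdict


def within_decade_rank_pct(rows):
    """Return {(country, decade): percentile rank} computed per decade.

    Ties share their average rank, and ranks are scaled as rank / n * 100.
    The project's actual tie and scaling rules are not shown in this diff.
    """
    by_decade = defaultdict(list)
    for country, decade, value in rows:
        by_decade[decade].append((country, value))

    out = {}
    for decade, items in by_decade.items():
        values = [v for _, v in items]
        n = len(values)
        for country, value in items:
            below = sum(1 for v in values if v < value)
            equal = sum(1 for v in values if v == value)
            avg_rank = below + (equal + 1) / 2  # 1-based average rank
            out[(country, decade)] = 100.0 * avg_rank / n
    return out


# Toy rows: (country, decade, raw outcome value); all values are made up.
rows = [
    ("AAA", 2010, 1.0), ("BBB", 2010, 3.0), ("CCC", 2010, 2.0), ("DDD", 2010, 3.0),
    ("AAA", 2020, 5.0), ("BBB", 2020, 4.0),
]
ranks = within_decade_rank_pct(rows)
print(ranks[("BBB", 2010)], ranks[("AAA", 2020)])
```

Because each decade is ranked separately, a country's rank reflects its position relative to peers in that decade rather than its absolute level, which is why `AAA` can sit at the bottom in 2010 and at the top in 2020.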

docs/MODEL_SPECS.md

Lines changed: 7 additions & 3 deletions
@@ -20,20 +20,24 @@
 ## Evaluation protocol
 
 - Primary maintained public target: `income_rank_pct`
-- Supported exploratory targets via `train-level-models --target`: `income`, `life_expectancy`, `inequality`, `wealth`
+- Supported exploratory targets via `train-level-models --target`: `income`, `life_expectancy`, `inequality`, `wealth`, `gender_inequality`, `female_lfpr`, `women_business_law`
 - Current target columns behind those names:
   - `income` -> `income_rank_pct`
   - `life_expectancy` -> `life_expectancy_rank_pct`
   - `inequality` -> `gini_disp_rank_pct`
   - `wealth` -> `produced_capital_per_capita_rank_pct`
+  - `gender_inequality` -> `gender_inequality_rank_pct`
+  - `female_lfpr` -> `female_labor_force_participation_rank_pct`
+  - `women_business_law` -> `women_business_law_rank_pct`
 - Target-specific feature guardrails:
   - `life_expectancy` excludes the exact WPP life-expectancy column and the crude-death-rate column from the feature matrix to avoid same-source outcome leakage.
-  - `inequality` excludes the merged SWIID Gini outcome columns from the feature matrix.
+  - `inequality` excludes the merged SWIID Gini outcome columns plus near-target Fragile States Index inequality and total-score columns from the feature matrix.
   - `wealth` excludes the merged Wealth Accounts level/log/rank outcome columns from the feature matrix.
+  - `gender_inequality`, `female_lfpr`, and `women_business_law` exclude the merged gender-target outcome columns plus the UNDP GII feature block from the feature matrix to avoid same-theme target leakage.
 - Unit of training: country within decade
 - Primary split: within-decade cross-validation
 - Current metrics: `r2`, `rmse`, `mae`, `spearman`
-- Additional association exports: numeric feature correlations with income, population, life expectancy, inequality, and wealth target columns when they are present in the shared outcomes table
+- Additional association exports: numeric feature correlations with income, population, life expectancy, inequality, wealth, gender inequality, female LFPR, and Women, Business and the Law target columns when they are present in the shared outcomes table
 - Near-term additions: leave-region-out checks, calibration diagnostics, rank-hit metrics
 
 ## Maintained feature-set experiments
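
Note on the target mapping and gender guardrail in the hunk above: the sketch below restates the `--target` name to column mapping as a dictionary and shows how a same-theme exclusion could be applied to a list of candidate feature columns. The mapping is copied from the documented list; everything else is an assumption for illustration, including the helper name `feature_columns_for`, the `undp_gii_` prefix standing in for the UNDP GII feature block, and the example column names. Only the gender guardrail is modeled here.

```python
TARGET_COLUMNS = {
    "income": "income_rank_pct",
    "life_expectancy": "life_expectancy_rank_pct",
    "inequality": "gini_disp_rank_pct",
    "wealth": "produced_capital_per_capita_rank_pct",
    "gender_inequality": "gender_inequality_rank_pct",
    "female_lfpr": "female_labor_force_participation_rank_pct",
    "women_business_law": "women_business_law_rank_pct",
}
GENDER_TARGETS = ("gender_inequality", "female_lfpr", "women_business_law")
GII_FEATURE_PREFIX = "undp_gii_"  # hypothetical prefix for the UNDP GII feature block


def feature_columns_for(target, columns):
    """Return the columns usable as features for `target`.

    Always drops the target's own column. For the three gender targets it
    also drops every gender-target column and the (assumed) GII feature block.
    """
    if target not in TARGET_COLUMNS:
        raise ValueError(f"unknown target: {target!r}")
    excluded = {TARGET_COLUMNS[target]}
    drop_gii = target in GENDER_TARGETS
    if drop_gii:
        excluded |= {TARGET_COLUMNS[name] for name in GENDER_TARGETS}
    return [
        col
        for col in columns
        if col not in excluded
        and not (drop_gii and col.startswith(GII_FEATURE_PREFIX))
    ]


# Hypothetical candidate columns, for illustration only.
candidates = [
    "elevation_mean",
    "undp_gii_maternal_mortality",
    "women_business_law_rank_pct",
    "female_labor_force_participation_rank_pct",
]
print(feature_columns_for("female_lfpr", candidates))
```

Excluding all three gender-target columns for any one gender target, rather than only its own column, is what makes the guardrail "same-theme" instead of "same-column".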

docs/UI_DATA_PAYLOADS.md

Lines changed: 20 additions & 0 deletions
@@ -22,6 +22,7 @@ The UI should read the mirrored `web/public/data/` copies.
 - `bundle_summary_path`
 - `bundle_feature_effects_path`
 - `bundle_country_contributions_index_path`
+- `bundle_permutation_importance_path`
 
 ### `model_summary.json`
 - Selected public income model diagnostics.
@@ -47,6 +48,9 @@ These are the new UI-builder payloads for all maintained targets and all non-emp
 - `life_expectancy`
 - `inequality`
 - `wealth`
+- `gender_inequality`
+- `female_lfpr`
+- `women_business_law`
 - For each target:
   - `best_overall`
   - `bundles[]`
@@ -87,6 +91,9 @@ Note:
 - `bundle_country_contributions_life_expectancy.json`
 - `bundle_country_contributions_inequality.json`
 - `bundle_country_contributions_wealth.json`
+- `bundle_country_contributions_gender_inequality.json`
+- `bundle_country_contributions_female_lfpr.json`
+- `bundle_country_contributions_women_business_law.json`
 - Each target payload contains `bundles[]`.
 - Each bundle contains `countries[]`.
 - Each country row contains:
@@ -101,6 +108,19 @@ Important:
 - `prediction` is aligned to the cross-validated bundle prediction export for the selected contributing spec.
 - For some bundles, the exact best-scoring spec may not have contribution rows. In those cases the payload uses the best available exported spec for that bundle instead of dropping the bundle entirely.
 
+### `bundle_permutation_importance.json`
+- Held-out permutation-importance summary for each target and bundle.
+- Covers the same maintained targets as `bundle_summary.json`.
+- Each target has `bundles[]`.
+- Each bundle includes:
+  - `top_permutation_features`
+  - `block_summary`
+
+Interpretation:
+- `delta_r2_mean` is the mean drop in held-out `R^2` when a feature or feature block is shuffled.
+- Larger positive `delta_r2_mean` means the model depends more on that feature or block for predictive accuracy.
+- This is a predictive-value metric, not a causal-effect estimate and not the same thing as a country-level up/down contribution.
+
 ## Tier semantics
 
 Bundle payloads use these maintained labels:
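
Note on `delta_r2_mean` in the `bundle_permutation_importance.json` section added above: the quantity described there (mean drop in held-out `R^2` when a feature is shuffled) can be sketched in a few lines. This is a generic single-feature illustration of permutation importance, not the project's exporter; the helper names, the `repeats` and `seed` parameters, and the toy stand-in model are all assumptions.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def delta_r2_mean(predict, X, y, column, repeats=20, seed=0):
    """Mean drop in R^2 on held-out rows when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        X_perm = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(X, shuffled)
        ]
        drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
    return sum(drops) / len(drops)


def toy_model(row):
    """Stand-in fitted model: uses feature 0 and ignores feature 1."""
    return 2.0 * row[0]


# Toy held-out set with made-up values.
X_test = [[float(i), float(i % 3)] for i in range(30)]
y_test = [2.0 * row[0] for row in X_test]

used = delta_r2_mean(toy_model, X_test, y_test, column=0)
ignored = delta_r2_mean(toy_model, X_test, y_test, column=1)
print(used > 0.0, ignored == 0.0)
```

Shuffling a feature the model ignores leaves predictions unchanged, so its `delta_r2_mean` is zero, while shuffling a feature the model relies on lowers held-out `R^2`. As the payload notes say, this measures predictive dependence only, not a causal effect.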

src/geoluck/cli.py

Lines changed: 48 additions & 1 deletion
@@ -12,6 +12,7 @@
 from geoluck.etl.fetch_cru_cy import run_fetch as run_fetch_cru_cy
 from geoluck.etl.fetch_eia_company_imports import run_fetch as run_fetch_eia_company_imports
 from geoluck.etl.fetch_energy_institute_reserves import run_fetch as run_fetch_energy_institute
+from geoluck.etl.fetch_female_lfpr import run_fetch as run_fetch_female_lfpr
 from geoluck.etl.fetch_freedom_house import run_fetch as run_fetch_freedom_house
 from geoluck.etl.fetch_fsi import run_fetch as run_fetch_fsi
 from geoluck.etl.fetch_gcmt import run_fetch as run_fetch_gcmt
@@ -45,6 +46,9 @@
 from geoluck.etl.fetch_wealth_accounts import run_fetch as run_fetch_wealth_accounts
 from geoluck.etl.fetch_wgi import run_fetch as run_fetch_wgi
 from geoluck.etl.fetch_wocqi import run_fetch as run_fetch_wocqi
+from geoluck.etl.fetch_women_business_law import (
+    run_fetch as run_fetch_women_business_law,
+)
 from geoluck.etl.fetch_worldclim import run_fetch as run_fetch_worldclim
 from geoluck.etl.fetch_wpp import run_fetch as run_fetch_wpp
 from geoluck.features.build_alesina_fractionalization_features import (
@@ -156,7 +160,8 @@
     "income",
     "--target",
     help=(
-        "Prediction target to use. Available: income, life_expectancy, inequality, wealth."
+        "Prediction target to use. Available: income, life_expectancy, inequality, "
+        "wealth, gender_inequality, female_lfpr, women_business_law."
     ),
 )
 PERMUTATION_IMPORTANCE_OPTION = typer.Option(
@@ -267,14 +272,23 @@ def build_outcomes_panel() -> None:
     typer.echo(f"wpp_input={result.wpp_input_path}")
     if result.swiid_input_path is not None:
         typer.echo(f"swiid_input={result.swiid_input_path}")
+    if result.undp_gii_input_path is not None:
+        typer.echo(f"undp_gii_input={result.undp_gii_input_path}")
+    if result.female_lfpr_input_path is not None:
+        typer.echo(f"female_lfpr_input={result.female_lfpr_input_path}")
     if result.wealth_input_path is not None:
         typer.echo(f"wealth_input={result.wealth_input_path}")
+    if result.women_business_law_input_path is not None:
+        typer.echo(f"women_business_law_input={result.women_business_law_input_path}")
     typer.echo(f"output={result.output_path}")
     typer.echo(f"rows={result.row_count}")
     typer.echo(f"decades={result.decades}")
     typer.echo(f"life_expectancy_rows={result.life_expectancy_rows}")
     typer.echo(f"inequality_rows={result.inequality_rows}")
+    typer.echo(f"gender_inequality_rows={result.gender_inequality_rows}")
+    typer.echo(f"female_lfpr_rows={result.female_lfpr_rows}")
     typer.echo(f"wealth_rows={result.wealth_rows}")
+    typer.echo(f"women_business_law_rows={result.women_business_law_rows}")
 
 
 @app.command("fetch-natural-earth")
@@ -357,6 +371,39 @@ def fetch_undp_gii(
     typer.echo(f"unmatched_countries={result.unmatched_country_count}")
 
 
+@app.command("fetch-female-lfpr")
+def fetch_female_lfpr(
+    force: bool = typer.Option(False, help="Redownload the female LFPR indicator payload."),
+) -> None:
+    """Fetch and normalize the World Bank / ILO female LFPR series."""
+    result = run_fetch_female_lfpr(force=force)
+    typer.echo(f"raw_countries={result.raw_countries_path}")
+    typer.echo(f"raw_indicators={result.raw_indicators_path}")
+    typer.echo(f"tidy={result.tidy_path}")
+    typer.echo(f"provenance={result.provenance_path}")
+    typer.echo(f"rows={result.row_count}")
+    typer.echo(f"countries={result.country_count}")
+    typer.echo(f"years={result.year_min}-{result.year_max}")
+
+
+@app.command("fetch-women-business-law")
+def fetch_women_business_law(
+    force: bool = typer.Option(
+        False,
+        help="Redownload the Women, Business and the Law indicator payload.",
+    ),
+) -> None:
+    """Fetch and normalize the World Bank Women, Business and the Law index."""
+    result = run_fetch_women_business_law(force=force)
+    typer.echo(f"raw_countries={result.raw_countries_path}")
+    typer.echo(f"raw_indicators={result.raw_indicators_path}")
+    typer.echo(f"tidy={result.tidy_path}")
+    typer.echo(f"provenance={result.provenance_path}")
+    typer.echo(f"rows={result.row_count}")
+    typer.echo(f"countries={result.country_count}")
+    typer.echo(f"years={result.year_min}-{result.year_max}")
+
+
 @app.command("fetch-global-solar-atlas")
 def fetch_global_solar_atlas(
     force: bool = typer.Option(False, help="Refetch the Global Solar Atlas point samples."),
