Use csv2sqlite.py to save the raw csv.gz data file into an sqlite database.
- `acs`: table from ACS microdata

```
python csv2sqlite.py --gzip acs_08-16.csv.gz acs_08-16.db acs
```

- `mig2met`: table to convert migration state/PUMA to metro
  - run `data-prep.r` to build the csv from the two csv files mapping PUMA and MSA, then load it into the sqlite database as another table:

```
python csv2sqlite.py mig2met.csv acs_08-16.db mig2met
```

Then use SQL queries to get aggregated values, to avoid loading the entire dataset into memory. Queries apply categorizations (race, edu) on the fly, so there is no need to pre-clean the data.
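As an illustration of such an on-the-fly aggregation, here is a minimal sketch using Python's `sqlite3` module; the table schema and the education cutoff below are assumptions for the example, not the project's actual schema:

```python
import sqlite3

# Small in-memory example; with the real data you would connect to
# acs_08-16.db instead. Column names and values here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acs (age INTEGER, race INTEGER, edu INTEGER, "
             "married INTEGER, perwt REAL)")
rows = [(25, 1, 16, 1, 100.0), (25, 1, 12, 0, 80.0), (40, 2, 16, 1, 120.0)]
conn.executemany("INSERT INTO acs VALUES (?, ?, ?, ?, ?)", rows)

# Aggregate weighted counts by age and a coarse education category,
# applying the categorization in SQL so the full microdata never
# has to be loaded into memory.
query = """
SELECT age,
       CASE WHEN edu >= 16 THEN 'college' ELSE 'no-college' END AS edu_cat,
       SUM(perwt) AS pop
FROM acs
WHERE married = 1
GROUP BY age, edu_cat
"""
for age, edu_cat, pop in conn.execute(query):
    print(age, edu_cat, pop)
```

The same pattern (a `CASE` expression inside `GROUP BY`) is what lets the categorization happen at query time rather than in a pre-cleaning pass.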
To run, first set up the data: check these files for correct filenames per model (different specifications by age and type).

- `smooth-pops.r`: query and smooth aggregated population counts in each desired metro
  - total/single/married populations, marriage/divorce flows, migration flows
  - smoothing by non-parametric regression (local polynomial), using a hand-rolled "diagonal" smoothing kernel with a manually chosen bandwidth
  - saves smoothed data to csv for loading into Julia
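The smoothing step can be illustrated with a minimal local-linear (degree-1 local-polynomial) regression; the Gaussian kernel and fixed bandwidth below are stand-ins for the hand-rolled diagonal kernel and manual bandwidth used in `smooth-pops.r`:

```python
import math

def local_linear_smooth(x, y, grid, bw):
    """Local-linear kernel regression with a Gaussian kernel.

    For each grid point g, fit a weighted least-squares line to (x, y)
    with weights K((x - g) / bw) and return the fitted value at g.
    """
    out = []
    for g in grid:
        # kernel weights around the evaluation point
        w = [math.exp(-0.5 * ((xi - g) / bw) ** 2) for xi in x]
        # weighted sums for the 2x2 normal equations of the local line
        s0 = sum(w)
        s1 = sum(wi * (xi - g) for wi, xi in zip(w, x))
        s2 = sum(wi * (xi - g) ** 2 for wi, xi in zip(w, x))
        t0 = sum(wi * yi for wi, yi in zip(w, y))
        t1 = sum(wi * (xi - g) * yi for wi, xi, yi in zip(w, x, y))
        det = s0 * s2 - s1 * s1
        # intercept of the local fit = smoothed value at g
        out.append((s2 * t0 - s1 * t1) / det)
    return out

# Illustrative age profile of population counts
ages = list(range(18, 66))
counts = [1000 - 5 * (a - 18) for a in ages]
smoothed = local_linear_smooth(ages, counts, ages, bw=3.0)
```

A useful property for sanity-checking: local-linear regression reproduces exactly linear data regardless of the kernel or bandwidth.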
- `mort-rates.r`: interpolates and saves death rates
- `main-estim.jl`: runs the show, but set options first
  - loads populations from saved JLD files, or calls `prepare-pops.jl` to generate them anew
  - estimates arrival rates and then non-parametric objects using `estim-functions.jl` and `compute-npobj.jl`
  - can also do a parameter grid search or Monte Carlo estimation
- `prepare-pops.jl`: loads the csv files generated by the R scripts above, then converts the DataFrames to multidimensional arrays (per metro) and saves them as JLD files
- `plot-results.r`: plots model-data fit and estimated objects
- `tikz-conversion.R`: produces tikz figures from saved plot objects
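`prepare-pops.jl` itself is Julia, but its DataFrame-to-array conversion can be sketched in Python with NumPy; the (metro, sex, age) schema below is illustrative, not the project's actual layout:

```python
import numpy as np

# Long-format rows: (metro, sex, age, count) -- illustrative schema.
rows = [
    (35620, 0, 25, 1500.0),
    (35620, 1, 25, 1600.0),
    (31080, 0, 25, 1200.0),
    (31080, 1, 25, 1300.0),
]

# Enumerate the levels of each categorical dimension
metros = sorted({r[0] for r in rows})
sexes = sorted({r[1] for r in rows})
ages = sorted({r[2] for r in rows})

# One dense array per metro, indexed [sex, age]; missing cells stay 0.
pops = {m: np.zeros((len(sexes), len(ages))) for m in metros}
for metro, sex, age, count in rows:
    pops[metro][sexes.index(sex), ages.index(age)] = count
```

The per-metro dict of dense arrays mirrors the per-metro JLD files: downstream code can index by position instead of repeatedly filtering a long table.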
Run scripts in order to set up resampled datasets, run smoothing, and then estimation. Uses GNU Parallel for efficient batch processing.
- `Rscript bootstrap-resampler.r`: creates directories `data/bootstrap-samples/resamp_00` with resampled csv data
- `bash bootstrap-create-db.sh`: creates sqlite db from csv files
- `bash bootstrap-smooth.sh`: runs `smooth-data.r` for both ageonly and racedu specifications
  - took 40 hours for 100 resamples on 8 cores, low memory usage (<2GB)
- `bash bootstrap-cp-psi.sh`: copies the death rate data into the smoothed populations directories for each resample
- `bash bootstrap-estim.sh`: runs `main-estim.jl` for both ageonly and racedu specifications
  - took 100 minutes for 100 resamples on 8 cores, low memory usage (<4GB)
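The resampling step amounts to drawing rows with replacement, once per bootstrap sample. A minimal, hypothetical Python sketch of the idea (the real `bootstrap-resampler.r` is an R script that reads the ACS csv and writes one directory per resample):

```python
import csv
import io
import random

def bootstrap_resample(rows, seed):
    """Draw len(rows) rows with replacement, reproducibly per resample."""
    rng = random.Random(seed)  # per-resample seed keeps runs reproducible
    return [rng.choice(rows) for _ in rows]

# Illustrative in-memory csv; the real input is the ACS extract.
data = io.StringIO("age,perwt\n25,100\n30,120\n35,90\n")
rows = list(csv.DictReader(data))
resamp = bootstrap_resample(rows, seed=0)
```

Seeding by resample index is what makes each `resamp_*` directory independently regenerable, which matters when the batch is spread over cores with GNU Parallel.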
- 35620: 14.5m - New York-Newark-Jersey City, NY-NJ-PA
- 31080: 9.4m - Los Angeles-Long Beach-Anaheim, CA
- 16980: 6.8m - Chicago-Naperville-Elgin, IL-IN-WI
- 19100: 4.6m - Dallas-Fort Worth-Arlington, TX
- 37980: 4.4m - Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
- 26420: 4.2m - Houston-The Woodlands-Sugar Land, TX
- 47900: 4.1m - Washington-Arlington-Alexandria, DC-VA-MD-WV
- 33100: 4.1m - Miami-Fort Lauderdale-West Palm Beach, FL
- 12060: 3.8m - Atlanta-Sandy Springs-Roswell, GA
- 14460: 3.5m - Boston-Cambridge-Newton, MA-NH
- 41860: 3.3m - San Francisco-Oakland-Hayward, CA
- 19820: 3.1m - Detroit-Warren-Dearborn, MI
- 38060: 3.1m - Phoenix-Mesa-Scottsdale, AZ
- 40140: 3.0m - Riverside-San Bernardino-Ontario, CA
- 42660: 2.6m - Seattle-Tacoma-Bellevue, WA
- 33460: 2.4m - Minneapolis-St. Paul-Bloomington, MN-WI
- 41740: 2.3m - San Diego-Carlsbad, CA
- 45300: 2.1m - Tampa-St. Petersburg-Clearwater, FL
- 41180: 2.0m - St. Louis, MO-IL
- 12580: 2.0m - Baltimore-Columbia-Towson, MD
30th metro is 1.4m, 40th is 0.9m.
- `estimate-rates-full.r` and `estimate-rates.r`: very poor accuracy due to noisy inference on divorce flows
- Need marriage and divorce rates for each couple-type (globally)
- Marriage rate (directly observable): SQL queries for flows and stocks to compute rates
- Divorce rate (infer from non-divorce rate and death rate)
- Weighted OLS (by stocks of couples)
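The divorce-rate inference in the bullets above can be sketched with a simple stock accounting identity; the identity used here (stock' = stock·(1 − divorce − death) + marriage inflow) is an illustrative assumption, not the exact specification in the scripts:

```python
def infer_divorce_rate(stock_t, stock_t1, marriage_inflow, death_rate):
    """Back out the divorce rate from a stock accounting identity:

        stock_t1 = stock_t * (1 - divorce - death) + marriage_inflow

    so  divorce = 1 - death - (stock_t1 - marriage_inflow) / stock_t.
    """
    return 1.0 - death_rate - (stock_t1 - marriage_inflow) / stock_t

# Example: 1000 couples, 990 next period, 50 new marriages, 1% death rate
rate = infer_divorce_rate(stock_t=1000.0, stock_t1=990.0,
                          marriage_inflow=50.0, death_rate=0.01)
```

Because the divorce rate enters only through this residual, small noise in the stocks or flows moves it one-for-one, which is why the bootstrap above is needed to gauge its precision.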