update for 2022

Josh Levinger · Josh Levinger · commit 2dd9baa335ca · 2023-01-04T10:18:51.000-05:00
uses census 2020 data source
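
For reviewers, here is a minimal sketch of the join this commit moves to: block equivalency rows (`GEOID`, `CDFP`) are rolled up to tract GEOIDs, then matched to ZCTAs through the tract relationship rows (`GEOID_TRACT_20`, `GEOID_ZCTA5_20`). The sample rows, the function names, and the choice to join on every tract of a ZCTA are assumptions made for illustration; they are not necessarily what `merge_data.py` does.

```python
from collections import defaultdict

# Hypothetical sample rows; the real inputs are raw/cd118/*.txt and
# raw/zcta520_tract20_natl.txt.
BLOCK_ROWS = [
    {"GEOID": "100010401001000", "CDFP": "00"},
    {"GEOID": "100010401001001", "CDFP": "00"},
    {"GEOID": "360010001001000", "CDFP": "20"},
    {"GEOID": "360010002001000", "CDFP": "21"},
    {"GEOID": "110010001001000", "CDFP": "98"},
    {"GEOID": "360010003001000", "CDFP": "ZZ"},
]
TRACT_ROWS = [
    {"GEOID_ZCTA5_20": "19901", "GEOID_TRACT_20": "10001040100"},
    {"GEOID_ZCTA5_20": "12207", "GEOID_TRACT_20": "36001000100"},
    {"GEOID_ZCTA5_20": "12207", "GEOID_TRACT_20": "36001000200"},
    {"GEOID_ZCTA5_20": "20001", "GEOID_TRACT_20": "11001000100"},
    {"GEOID_ZCTA5_20": "", "GEOID_TRACT_20": "36001000300"},
]


def tract_districts(block_rows):
    """Map each 11-digit tract GEOID to the set of districts its blocks fall in."""
    by_tract = defaultdict(set)
    for row in block_rows:
        # a block GEOID is state(2) + county(3) + tract(6) + block(4)
        by_tract[row["GEOID"][:11]].add(row["CDFP"])
    return by_tract


def normalize_cd(cd):
    """Return a cleaned district string, or None for rows that should be skipped."""
    if cd in ("null", "", "ZZ"):
        return None  # empty or mostly-water rows
    if cd == "98":
        return "0"  # non-voting districts are written as 0
    return str(int(cd))  # strip zero padding


def zcta_districts(block_rows, tract_rows):
    """Return sorted, de-duplicated (state_fips, zcta, cd) tuples."""
    by_tract = tract_districts(block_rows)
    pairs = set()
    for row in tract_rows:
        zcta = row["GEOID_ZCTA5_20"]
        if not zcta:
            continue  # tract with no ZCTA
        tract = row["GEOID_TRACT_20"]
        for cd in by_tract.get(tract, ()):
            cleaned = normalize_cd(cd)
            if cleaned is not None:
                pairs.add((tract[:2], zcta, cleaned))
    return sorted(pairs)


if __name__ == "__main__":
    for state_fips, zcta, cd in zcta_districts(BLOCK_ROWS, TRACT_ROWS):
        print(state_fips, zcta, cd)
```

Collecting pairs in a set handles de-duplication, and keeping the state FIPS prefix in each tuple means a ZCTA that spans state lines keeps one row per state.
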
diff --git a/LICENSE b/LICENSE
@@ -2,7 +2,7 @@ Raw data are United States Government Works, and no copyright is expressed or in
 
 Conversion scripts provided under the MIT License (MIT)
 
-Copyright (c) 2017, Spacedog XYZ
+Copyright (c) 2023, Spacedog XYZ
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/Makefile b/Makefile
@@ -4,31 +4,20 @@ clean:
 	rm -f raw/*
 	rm -f zccd.csv
 
-zccd.csv: raw/natl_zccd_delim.txt  raw/zcta_county_rel_10.txt raw/state_fips.txt raw/zccd_updates.txt
+zccd.csv: raw/cd118 raw/zcta520_tract20_natl.txt raw/state_fips.txt
 	python merge_data.py
 
 zccd_hud.csv: raw/hud_crosswalk.xlsx
 	python hud_crosswalk.py
 
-# Congressional districts by zip code tabulation area (ZCTA) national, comma delimited
-# NB: does not include at-large districts for AK, DE, MT, ND, SD, VT, WY, PR or DC
-raw/natl_zccd_delim.txt:
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/natl/natl_zccd_delim.txt" -o raw/natl_zccd_delim.txt
-
-# inter-censal changes to congressional districts are released only for updated states
-# necessary for CO, FL, MN, NC, PA, VA
-raw/zccd_updates.txt:
-	curl "https://www2.census.gov/geo/relfiles/cdsld18/08/zc_cd_delim_08.txt" -o raw/zc_cd_delim_08.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/12/zc_cd_delim_12.txt" -o raw/zc_cd_delim_12.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld18/27/zc_cd_delim_27.txt" -o raw/zc_cd_delim_27.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/37/zc_cd_delim_37.txt" -o raw/zc_cd_delim_37.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld18/42/zc_cd_delim_42.txt" -o raw/zc_cd_delim_42.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/51/zc_cd_delim_51.txt" -o raw/zc_cd_delim_51.txt
-
-# 2010 ZCTA to state & county
-# TODO, try to find an updated version
-raw/zcta_county_rel_10.txt:
-	curl 'https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt' -o $@
+# Districts for 118th Congress, post redistricting
+raw/cd118:
+	curl "https://www2.census.gov/programs-surveys/decennial/rdo/mapping-files/2023/118-congressional-district-bef/cd118.zip" -o raw/cd118.zip
+	unzip raw/cd118.zip -d raw/cd118
+
+# 2020 ZCTA to census tract
+raw/zcta520_tract20_natl.txt:
+	curl 'https://www2.census.gov/geo/docs/maps-data/data/rel2020/zcta520/tab20_zcta520_tract20_natl.txt' -o $@
 
 # FIPS State/Territory codes to names
 raw/state_fips.txt:
diff --git a/README.md b/README.md
@@ -16,17 +16,16 @@ There are many commercial sources of zipcode data available, and some of them in
 
 ## How does this work?
 
-We start with the most recent Census mapping for the 115th Congress, which includes redistricting in 2016 for FL, MN, NC and VA. It does not however include data for states and territories with at-large representation (AK, DE, MT, ND, SD, VT, WY, PR, and DC). We  add all available ZCTAs for those states as well at the US Minor Outlying Islands, using 2010 data. This is unfortunately the latest available. We de-duplicate this data, ensuring not to alter ZCTAs that span state lines. We also clean it, to remove unsightly `null` strings, and obviously incorrect values in Colorado that start with `000`.
+We start with the 2020 Census tabulation blocks, which [include redistricting for the 118th Congress](https://www.census.gov/geographies/mapping-files/2023/dec/rdo/118-congressional-district-bef.html) as submitted on December 16, 2022. We match these blocks to zipcodes through the 2020 ZCTA-to-tract relationship file. We de-duplicate the resulting pairs, taking care not to alter ZCTAs that span state lines. We also clean them by removing unsightly `null` strings and renaming non-voting districts, which the Census codes as `98`, to `0`.
 
 We are left with a reasonably clean dataset. When tested against older publically available ones from the [Sunlight Foundation](https://sunlightlabs.github.io/congress/#zip-codes-to-congressional-districts]) (`RIP`) and [18F](https://github.com/18F/openFEC/blob/master/data/natl_zccd_delim.csv), we show that we are not missing any ZCTAs, and have updated 1079 out of 39435 to new congressional districts. Run `make test` to see exact changes.
 
 We have also included a crosswalk file [sourced from HUD](https://www.huduser.gov/portal/datasets/usps_crosswalk.html#codebook), parsed from Excel and split to match the format of the above file. This may be more complete, as it is derived from in the quarterly [USPS Vacancy Data](https://www.huduser.gov/portal/datasets/usps.html) and last updated in September 2020. It is available only for government entities and non-profit organizations related to the ["stated purpose"](https://www.huduser.gov/portal/usps/sublicense_agreement.html#statedpurpose) of the HUD Sublicensing Agreement (*measuring and forecasting neighborhood changes, assessing neighborhood needs, and measuring/assessing various HUD programs*).
 
 ## Data Sources
 
-- [2016 US Gazetteer](https://www.census.gov/geo/maps-data/data/gazetteer2016.html)
-- [2010 ZCTA Relationships](https://www.census.gov/geo/maps-data/data/zcta_rel_overview.html)
-- [Guam Zip Codes](http://mcog.guam.gov/guam_zip_codes.html)
+- [2020 US Census Block Equivalency Files](https://www.census.gov/geographies/mapping-files/2023/dec/rdo/118-congressional-district-bef.html)
+- [2020 US Census ZIP Code Tabulation Areas (ZCTAs) Relationship Files](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#zctacomp)
 - [HUD USPS ZIP code Crosswalk](https://www.huduser.gov/portal/datasets/usps_crosswalk.html#data)
 - Checked against state overlaps noted on [GIS StackExchange](http://gis.stackexchange.com/questions/53918/determining-which-us-zipcodes-map-to-more-than-one-state-or-more-than-one-city)
 
diff --git a/merge_data.py b/merge_data.py
@@ -1,20 +1,34 @@
 import utils
 import logging
-import collections
+import sys
 
 log = logging.getLogger(__name__)
-log.addHandler(logging.StreamHandler())
+log.addHandler(logging.StreamHandler(sys.stdout))
+log.setLevel(logging.WARNING)
 
-def load_zccd(fn):
+def load_districts(fn):
     column_map = {
-        'State': 'state_fips',
-        'ZCTA': 'zcta',
-        'Congressional District': 'cd',
-        'CongressionalDistrict': 'cd' # different spellings in natl and state specific files...
+        'GEOID': 'block',
+        'CDFP': 'cd',
     }
 
-    zccd = utils.load_csv_columns(fn, column_map, skip=1)
-    return zccd
+    block_list = utils.load_csv_columns(fn, column_map)
+    # trim the 15-digit block GEOID to its 11-digit tract GEOID, so we can use a dict for faster lookup
+    for block in block_list:
+        block['tract'] = block['block'][:-4]
+
+    tracts = utils.list_key_values(block_list, 'tract')
+    return tracts
+
+def load_tracts(fn):
+    column_map = {
+        'GEOID_TRACT_20': 'tract',
+        'GEOID_ZCTA5_20': 'zcta',
+    }
+
+    tracts_list = utils.load_csv_columns(fn, column_map, delimiter='|')
+    zcta = utils.list_key_values(tracts_list, 'zcta')
+    return zcta
 
 def load_fips(fn):
     column_map = {
@@ -27,120 +41,43 @@ def load_fips(fn):
         fips_dict[row['state_fips']] = row['state']
     return fips_dict
 
-def replace_state_zips(zccd, state_updates):
-    state_fips_to_update = []
-    for s in state_updates.keys():
-        state_fips_to_update.append(STATE_TO_FIPS[s])
-
-    # remove existing data for updated states
-    zccd[:] = [z for z in zccd if z['state_fips'] not in state_fips_to_update]
-    # works in place
-
-    for state_zips in state_updates.values():
-        zccd.extend(state_zips)
-
-    return zccd
-
-def append_missing_zips(zccd, states_list):
-    states_fips = []
-    for s in states_list:
-        states_fips.append(STATE_TO_FIPS[s])
-
-    # load zcta_county_rel, which has full entries for each state
-    column_map = {
-        'ZCTA5': 'zcta',
-        'STATE': 'state_fips'
-    }
-    all_zips_list = utils.load_csv_columns('raw/zcta_county_rel_10.txt', column_map)
-    missing_zips_states = collections.defaultdict(set)
-
-    for z in all_zips_list:
-        # dedupe with a defaultdict
-        if z['state_fips'] in missing_zips_states[z['zcta']]:
-            log.info('zcta %s already in %s' % (z['zcta'], z['state_fips']))
+def merge_by_tract(cd_dict, zcta_dict):
+    merged = []
+    for (zcta, zcta_row) in zcta_dict.items():
+        if not zcta:
+            # skip initial blanks
             continue
-        else:
-            missing_zips_states[z['zcta']].add(z['state_fips'])
-
-        if z['state_fips'] in states_fips:
-            zccd.append({
-                'zcta': z['zcta'],
-                'state_fips': z['state_fips'],
-                'cd': '0' # at-large
-            })
-
-    # also include zipcodes from US Minor and Outlying Islands
-    # which are not included in the zcta_county_rel file
-    # these are copied from govt websites as available
-    missing_islands = {
-        'AS': ['96799'],
-        'GU': ['96910', '96913', '96915', '96916', '96917', '96921', '96928', '96929', '96931', '96932'],
-        'MP': ['96950', '96951', '96952'],
-        'VI': ['00801', '00802', '00820', '00823', '00824', '00830', '00831','00841', '00840', '00850', '00851'],
-        'PR': ['00981'] # not sure why this isn't in the country_rel, because there are a bunch of others listed
-    }
 
-    for (abbr, zcta_list) in missing_islands.items():
-        for z in zcta_list:
-            zccd.append({
-                    'zcta': z,
-                    'state_fips': STATE_TO_FIPS[abbr],
-                    'state_abbr': abbr,
-                    'cd': '0', # at-large
-                })
-
-    # Include some zipcodes that have small populations (so no ZCTA) but are otherwise noteworthy
-    # from https://about.usps.com/who-we-are/postal-facts/fun-facts.htm
-    # There are ~2,500 others used exclusively by businesses, but we don't have a list.
-    missing_small_zips = {
-        'AK': {
-            '99950': '0', # Ketchikan has highest zip 
-        },
-        'AZ': {
-            '85001': '7', # Phoenix convention center
-            '85002': '7'  #
-        },
-        'NY': {
-            '00501': '1', # Holtsville has IRS processing center with lowest zip
-            '00544': '1', #
-            '11249': '7,12', # Williamsburg split in 2011, not reflected in census
-            '12301': '20', # Schenectady has GE plant with memorable zip
-            '12345': '20'
-        },
-        'TX': {
-            '78599': '15' # near US-Mexico border
-        },
-        'VA': {
-            '22350': '8' # Botanical preserve in Alexandria
-        }
-    }
+        tract = zcta_row[0]['tract']
 
-    for (abbr, zcta_cd_dict) in missing_small_zips.items():
-        for (z, cd_list) in zcta_cd_dict.items():
-            for cd in cd_list.split(','):
-                zccd.append({
-                        'zcta': z,
-                        'state_fips': STATE_TO_FIPS[abbr],
-                        'state_abbr': abbr,
-                        'cd': cd,
-                    })
+        matched_cds = cd_dict[tract]
+        matched_list = list(m['cd'] for m in matched_cds)
+        matched_unique = list(set(matched_list))
 
-    return zccd
+        for matched_cd in matched_unique:
+            new_zcta = {'zcta': zcta, 'cd': matched_cd, 'state_fips': tract[:2]}
+            log.info(new_zcta)
+            merged.append(new_zcta)
+    return merged
 
 def state_fips_to_name(zccd):
     # append state abbreviation from FIPS
-    merged = {}
     for row in zccd:
         row['state_abbr'] = FIPS_TO_STATE[row['state_fips']]
     return zccd
 
 def remove_district_padding(zccd):
     cleaned = []
     for row in zccd:
-        if row['cd'] == 'null':
-            # natl_zccd_delim includes several rows with 'null' for uninhabited areas
-            # skip them
+        if row['cd'] in ['null', '', 'ZZ']:
+            # skip empty rows
+            # ZZ means mostly water
             continue
+
+        # non-voting districts are noted as 98 in census, but 0 in other sources
+        if row['cd'] == '98':
+            row['cd'] = '0'
+
         row['cd'] = str(int(row['cd']))
         # do this weird conversion to get rid of zero padding
         cleaned.append(row)
@@ -166,38 +103,33 @@ def sanity_check(zccd, incorrect_states_dict):
 if __name__ == "__main__":
     # load state FIPS codes
     FIPS_TO_STATE = load_fips('raw/state_fips.txt')
-    STATE_TO_FIPS = {v: k for k, v in FIPS_TO_STATE.iteritems()}
+    STATE_TO_FIPS = {v: k for k, v in FIPS_TO_STATE.items()}
+
+    # load national tract file
+    tract_to_zcta = load_tracts('raw/zcta520_tract20_natl.txt')
+    zccd_national = []
 
-    # load national zccd file
-    zccd_missing = load_zccd('raw/natl_zccd_delim.txt')
+    for (state,fips) in STATE_TO_FIPS.items():
+        # load statewide districts file
+        cd_to_tract = load_districts(f"raw/cd118/{fips}_{state}_CD118.txt")
 
-    # update for inter-censal changes
-    zccd_updated = replace_state_zips(zccd_missing,
-        {'CO': load_zccd('raw/zc_cd_delim_08.txt'),
-         'FL': load_zccd('raw/zc_cd_delim_12.txt'),
-         'MN': load_zccd('raw/zc_cd_delim_27.txt'),
-         'NC': load_zccd('raw/zc_cd_delim_37.txt'),
-         'PA': load_zccd('raw/zc_cd_delim_42.txt'),
-         'VA': load_zccd('raw/zc_cd_delim_51.txt'),
-        }
-    )
+        # merge by the tract geoid
+        zccd = merge_by_tract(cd_to_tract, tract_to_zcta)
 
-    # append zipcodes for at-large states
-    at_large_states = ['AK', 'DE', 'MT', 'ND', 'SD', 'VT', 'WY', 'PR', 'DC']
-    zccd_complete = append_missing_zips(zccd_updated, at_large_states)
+        # clean output
+        zccd_cleaned = remove_district_padding(zccd)
 
-    # clean output
-    zccd_cleaned = remove_district_padding(zccd_complete)
+        # insert state abbreviation column
+        zccd_named = state_fips_to_name(zccd_cleaned)
 
-    # insert state abbreviation column
-    zccd_named = state_fips_to_name(zccd_cleaned)
+        print("got %s ZCTA->CD mappings for %s" % (len(zccd_named), state))
+        zccd_national.extend(zccd_named)
 
-    # and sanity check to remove obvious outliers
-    zccd_checked = sanity_check(zccd_named, {'CO': '0'})
+    print("got %s ZCTA->CD mappings for %s" % (len(zccd_national), 'national'))
 
     # re-sort by state FIPS code
-    zccd_sorted = sorted(zccd_checked, key=lambda k: (k['state_fips'], k['zcta'], k['cd']))
-    print("got %s ZCTA->CD mappings" % len(zccd_sorted))
+    zccd_sorted = sorted(zccd_national, key=lambda k: (k['state_fips'], k['zcta'], k['cd']))
 
+
     # write output
-    utils.csv_writer('zccd.csv', zccd_sorted, ['state_fips', 'state_abbr', 'zcta', 'cd'])
+    utils.csv_writer('zccd.csv', zccd_sorted, ['state_fips', 'state_abbr', 'zcta', 'cd'])
diff --git a/utils.py b/utils.py
diff --git a/zccd.csv b/zccd.csv