Commit 2dd9baa

Author: Josh Levinger

update for 2022
uses census 2020 data source

1 parent 8054b00, commit 2dd9baa

6 files changed: +10625 -14456 lines changed

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ Raw data are United States Government Works, and no copyright is expressed or in

 Conversion scripts provided under the MIT License (MIT)

-Copyright (c) 2017, Spacedog XYZ
+Copyright (c) 2023, Spacedog XYZ

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

Makefile

Lines changed: 9 additions & 20 deletions
@@ -4,31 +4,20 @@ clean:
 	rm -f raw/*
 	rm -f zccd.csv

-zccd.csv: raw/natl_zccd_delim.txt raw/zcta_county_rel_10.txt raw/state_fips.txt raw/zccd_updates.txt
+zccd.csv: raw/cd118 raw/zcta520_tract20_natl.txt raw/state_fips.txt
 	python merge_data.py

 zccd_hud.csv: raw/hud_crosswalk.xlsx
 	python hud_crosswalk.py

-# Congressional districts by zip code tabulation area (ZCTA) national, comma delimited
-# NB: does not include at-large districts for AK, DE, MT, ND, SD, VT, WY, PR or DC
-raw/natl_zccd_delim.txt:
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/natl/natl_zccd_delim.txt" -o raw/natl_zccd_delim.txt
-
-# inter-censal changes to congressional districts are released only for updated states
-# necessary for CO, FL, MN, NC, PA, VA
-raw/zccd_updates.txt:
-	curl "https://www2.census.gov/geo/relfiles/cdsld18/08/zc_cd_delim_08.txt" -o raw/zc_cd_delim_08.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/12/zc_cd_delim_12.txt" -o raw/zc_cd_delim_12.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld18/27/zc_cd_delim_27.txt" -o raw/zc_cd_delim_27.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/37/zc_cd_delim_37.txt" -o raw/zc_cd_delim_37.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld18/42/zc_cd_delim_42.txt" -o raw/zc_cd_delim_42.txt
-	curl "https://www2.census.gov/geo/relfiles/cdsld16/51/zc_cd_delim_51.txt" -o raw/zc_cd_delim_51.txt
-
-# 2010 ZCTA to state & county
-# TODO, try to find an updated version
-raw/zcta_county_rel_10.txt:
-	curl 'https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt' -o $@
+# Districts for 118th Congress, post redistricting
+raw/cd118:
+	curl "https://www2.census.gov/programs-surveys/decennial/rdo/mapping-files/2023/118-congressional-district-bef/cd118.zip" -o raw/cd118.zip
+	unzip raw/cd118.zip -d raw/cd118
+
+# 2020 ZCTA to census block
+raw/zcta520_tract20_natl.txt:
+	curl 'https://www2.census.gov/geo/docs/maps-data/data/rel2020/zcta520/tab20_zcta520_tract20_natl.txt' -o $@

 # FIPS State/Territory codes to names
 raw/state_fips.txt:

README.md

Lines changed: 3 additions & 4 deletions
@@ -16,17 +16,16 @@ There are many commercial sources of zipcode data available, and some of them in

 ## How does this work?

-We start with the most recent Census mapping for the 115th Congress, which includes redistricting in 2016 for FL, MN, NC and VA. It does not however include data for states and territories with at-large representation (AK, DE, MT, ND, SD, VT, WY, PR, and DC). We add all available ZCTAs for those states as well at the US Minor Outlying Islands, using 2010 data. This is unfortunately the latest available. We de-duplicate this data, ensuring not to alter ZCTAs that span state lines. We also clean it, to remove unsightly `null` strings, and obviously incorrect values in Colorado that start with `000`.
+We start with the most recent 2020 Census tabulation blocks, which [include redistricting for the 118th Congress](https://www.census.gov/geographies/mapping-files/2023/dec/rdo/118-congressional-district-bef.html) as submitted on December 16, 2022. We match these to zipcodes through the ZCTA relationship. We de-duplicate these, ensuring not to alter ZCTAs that span state lines. We also clean them, to remove unsightly `null` strings, and rename at-large districts from `98` to `0`.

 We are left with a reasonably clean dataset. When tested against older publicly available ones from the [Sunlight Foundation](https://sunlightlabs.github.io/congress/#zip-codes-to-congressional-districts) (`RIP`) and [18F](https://github.com/18F/openFEC/blob/master/data/natl_zccd_delim.csv), we show that we are not missing any ZCTAs, and have updated 1079 out of 39435 to new congressional districts. Run `make test` to see exact changes.

 We have also included a crosswalk file [sourced from HUD](https://www.huduser.gov/portal/datasets/usps_crosswalk.html#codebook), parsed from Excel and split to match the format of the above file. This may be more complete, as it is derived from the quarterly [USPS Vacancy Data](https://www.huduser.gov/portal/datasets/usps.html) and last updated in September 2020. It is available only for government entities and non-profit organizations related to the ["stated purpose"](https://www.huduser.gov/portal/usps/sublicense_agreement.html#statedpurpose) of the HUD Sublicensing Agreement (*measuring and forecasting neighborhood changes, assessing neighborhood needs, and measuring/assessing various HUD programs*).

 ## Data Sources

-- [2016 US Gazetteer](https://www.census.gov/geo/maps-data/data/gazetteer2016.html)
-- [2010 ZCTA Relationships](https://www.census.gov/geo/maps-data/data/zcta_rel_overview.html)
-- [Guam Zip Codes](http://mcog.guam.gov/guam_zip_codes.html)
+- [2020 US Census Block Equivalency Files](https://www.census.gov/geographies/mapping-files/2023/dec/rdo/118-congressional-district-bef.html)
+- [2020 US Census ZIP Code Tabulation Areas (ZCTAs) Relationship Files](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#zctacomp)
 - [HUD USPS ZIP code Crosswalk](https://www.huduser.gov/portal/datasets/usps_crosswalk.html#data)
 - Checked against state overlaps noted on [GIS StackExchange](http://gis.stackexchange.com/questions/53918/determining-which-us-zipcodes-map-to-more-than-one-state-or-more-than-one-city)

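The matching step the README describes (blocks to districts, then districts to ZCTAs via shared tracts, then de-duplication) can be sketched roughly as follows. This is a minimal illustration, not the repository's `merge_by_tract`: the function name, the row keys `block_geoid` and `tract_geoid`, and the sample GEOIDs are all assumptions made up for the example. It also considers every tract row for a ZCTA, whereas the committed `merge_by_tract` reads only the first row per ZCTA.

```python
from collections import defaultdict


def match_zctas_to_districts(block_rows, tract_rows):
    """Join block->district rows to tract->ZCTA rows on the 11-digit tract GEOID.

    Hypothetical helper for illustration; the row keys are assumptions.
    """
    # Collect the districts seen in each tract; a set de-duplicates repeats.
    districts_by_tract = defaultdict(set)
    for row in block_rows:
        tract_geoid = row["block_geoid"][:11]  # state(2) + county(3) + tract(6)
        districts_by_tract[tract_geoid].add(row["cd"])

    # Emit one (state_fips, zcta, cd) triple per distinct combination.
    matches = set()
    for row in tract_rows:
        if not row["zcta"]:
            continue  # tract that is not covered by any ZCTA
        for cd in districts_by_tract.get(row["tract_geoid"], ()):
            matches.add((row["tract_geoid"][:2], row["zcta"], cd))
    return sorted(matches)


# Made-up sample rows, for illustration only.
sample_blocks = [
    {"block_geoid": "500070001001000", "cd": "00"},
    {"block_geoid": "500070001001001", "cd": "00"},
    {"block_geoid": "330190001001000", "cd": "02"},
]
sample_tracts = [
    {"tract_geoid": "50007000100", "zcta": "05401"},
    {"tract_geoid": "33019000100", "zcta": "03601"},
    {"tract_geoid": "33019000100", "zcta": ""},
]
print(match_zctas_to_districts(sample_blocks, sample_tracts))
```

Keeping the state FIPS prefix in each triple is what lets a ZCTA that spans a state line appear once per state rather than being collapsed.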
merge_data.py

Lines changed: 65 additions & 133 deletions
@@ -1,20 +1,34 @@
 import utils
 import logging
-import collections
+import sys

 log = logging.getLogger(__name__)
-log.addHandler(logging.StreamHandler())
+log.addHandler(logging.StreamHandler(sys.stdout))
+log.setLevel(logging.WARNING)

-def load_zccd(fn):
+def load_districts(fn):
     column_map = {
-        'State': 'state_fips',
-        'ZCTA': 'zcta',
-        'Congressional District': 'cd',
-        'CongressionalDistrict': 'cd' # different spellings in natl and state specific files...
+        'GEOID': 'tract',
+        'CDFP': 'cd',
     }

-    zccd = utils.load_csv_columns(fn, column_map, skip=1)
-    return zccd
+    tract_list = utils.load_csv_columns(fn, column_map)
+    # trim tract to block geoid, so we can use dict for faster lookup
+    for tract in tract_list:
+        tract['block'] = tract['tract'][:-4]
+
+    blocks = utils.list_key_values(tract_list, 'block')
+    return blocks
+
+def load_tracts(fn):
+    column_map = {
+        'GEOID_TRACT_20': 'tract',
+        'GEOID_ZCTA5_20': 'zcta',
+    }
+
+    tracts_list = utils.load_csv_columns(fn, column_map, delimiter='|')
+    zcta = utils.list_key_values(tracts_list, 'zcta')
+    return zcta

 def load_fips(fn):
     column_map = {
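The `[:-4]` slice in `load_districts` relies on the fixed layout of a census block GEOID: 2 digits of state FIPS, 3 of county, 6 of tract, and 4 of block, so dropping the last four characters leaves the 11-digit tract GEOID. A small sketch of that decomposition is below; the `BlockGeoid` type and `parse_block_geoid` function are hypothetical names, not part of the repository, and the sample GEOID is made up.

```python
from typing import NamedTuple


class BlockGeoid(NamedTuple):
    """Components of a 15-digit census block GEOID (hypothetical helper type)."""
    state: str   # 2-digit state FIPS code
    county: str  # 3-digit county code
    tract: str   # 6-digit tract code
    block: str   # 4-digit block code

    @property
    def tract_geoid(self) -> str:
        # The 11-digit tract GEOID is the block GEOID minus its last 4 digits.
        return self.state + self.county + self.tract


def parse_block_geoid(geoid: str) -> BlockGeoid:
    """Split a block GEOID into its parts, rejecting malformed input."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geoid!r}")
    return BlockGeoid(geoid[:2], geoid[2:5], geoid[5:11], geoid[11:])


# Made-up GEOID, for illustration only.
parsed = parse_block_geoid("500070001001000")
print(parsed.state, parsed.tract_geoid)
```

Validating the length up front guards against silently producing a wrong join key if a file ever carries tract-level rather than block-level identifiers.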
@@ -27,120 +41,43 @@ def load_fips(fn):
         fips_dict[row['state_fips']] = row['state']
     return fips_dict

-def replace_state_zips(zccd, state_updates):
-    state_fips_to_update = []
-    for s in state_updates.keys():
-        state_fips_to_update.append(STATE_TO_FIPS[s])
-
-    # remove existing data for updated states
-    zccd[:] = [z for z in zccd if z['state_fips'] not in state_fips_to_update]
-    # works in place
-
-    for state_zips in state_updates.values():
-        zccd.extend(state_zips)
-
-    return zccd
-
-def append_missing_zips(zccd, states_list):
-    states_fips = []
-    for s in states_list:
-        states_fips.append(STATE_TO_FIPS[s])
-
-    # load zcta_county_rel, which has full entries for each state
-    column_map = {
-        'ZCTA5': 'zcta',
-        'STATE': 'state_fips'
-    }
-    all_zips_list = utils.load_csv_columns('raw/zcta_county_rel_10.txt', column_map)
-    missing_zips_states = collections.defaultdict(set)
-
-    for z in all_zips_list:
-        # dedupe with a defaultdict
-        if z['state_fips'] in missing_zips_states[z['zcta']]:
-            log.info('zcta %s already in %s' % (z['zcta'], z['state_fips']))
+def merge_by_tract(cd_dict, zcta_dict):
+    merged = []
+    for (zcta, zcta_row) in zcta_dict.items():
+        if not zcta:
+            # skip initial blanks
             continue
-        else:
-            missing_zips_states[z['zcta']].add(z['state_fips'])
-
-        if z['state_fips'] in states_fips:
-            zccd.append({
-                'zcta': z['zcta'],
-                'state_fips': z['state_fips'],
-                'cd': '0' # at-large
-            })
-
-    # also include zipcodes from US Minor and Outlying Islands
-    # which are not included in the zcta_county_rel file
-    # these are copied from govt websites as available
-    missing_islands = {
-        'AS': ['96799'],
-        'GU': ['96910', '96913', '96915', '96916', '96917', '96921', '96928', '96929', '96931', '96932'],
-        'MP': ['96950', '96951', '96952'],
-        'VI': ['00801', '00802', '00820', '00823', '00824', '00830', '00831','00841', '00840', '00850', '00851'],
-        'PR': ['00981'] # not sure why this isn't in the country_rel, because there are a bunch of others listed
-    }

-    for (abbr, zcta_list) in missing_islands.items():
-        for z in zcta_list:
-            zccd.append({
-                'zcta': z,
-                'state_fips': STATE_TO_FIPS[abbr],
-                'state_abbr': abbr,
-                'cd': '0', # at-large
-            })
-
-    # Include some zipcodes that have small populations (so no ZCTA) but are otherwise noteworthy
-    # from https://about.usps.com/who-we-are/postal-facts/fun-facts.htm
-    # There are ~2,500 others used exclusively by businesses, but we don't have a list.
-    missing_small_zips = {
-        'AK': {
-            '99950': '0', # Ketchikan has highest zip
-        },
-        'AZ': {
-            '85001': '7', # Phoenix convention center
-            '85002': '7' #
-        },
-        'NY': {
-            '00501': '1', # Holtsville has IRS processing center with lowest zip
-            '00544': '1', #
-            '11249': '7,12', # Williamsburg split in 2011, not reflected in census
-            '12301': '20', # Schenectady has GE plant with memorable zip
-            '12345': '20'
-        },
-        'TX': {
-            '78599': '15' # near US-Mexico border
-        },
-        'VA': {
-            '22350': '8' # Botanical preserve in Alexandria
-        }
-    }
+        tract = zcta_row[0]['tract']

-    for (abbr, zcta_cd_dict) in missing_small_zips.items():
-        for (z, cd_list) in zcta_cd_dict.items():
-            for cd in cd_list.split(','):
-                zccd.append({
-                    'zcta': z,
-                    'state_fips': STATE_TO_FIPS[abbr],
-                    'state_abbr': abbr,
-                    'cd': cd,
-                })
+        matched_cds = cd_dict[tract]
+        matched_list = list(m['cd'] for m in matched_cds)
+        matched_unique = list(set(matched_list))

-    return zccd
+        for matched_cd in matched_unique:
+            new_zcta = {'zcta': zcta, 'cd': matched_cd, 'state_fips': tract[:2]}
+            log.info(new_zcta)
+            merged.append(new_zcta)
+    return merged

 def state_fips_to_name(zccd):
     # append state abbreviation from FIPS
-    merged = {}
     for row in zccd:
         row['state_abbr'] = FIPS_TO_STATE[row['state_fips']]
     return zccd

 def remove_district_padding(zccd):
     cleaned = []
     for row in zccd:
-        if row['cd'] == 'null':
-            # natl_zccd_delim includes several rows with 'null' for uninhabited areas
-            # skip them
+        if row['cd'] in ['null', '', 'ZZ']:
+            # skip empty rows
+            # ZZ means mostly water
             continue
+
+        # non-voting districts are noted as 98 in census, but 0 in other sources
+        if row['cd'] == '98':
+            row['cd'] = 0
+
         row['cd'] = str(int(row['cd']))
         # do this weird conversion to get rid of zero padding
         cleaned.append(row)
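The cleaning rules in `remove_district_padding` (skip `null`, empty, and `ZZ` codes, map `98` to `0`, strip zero padding) can be expressed as a pure function on a single code, which makes them easy to test in isolation. The sketch below is one possible restatement; `normalize_district` and `SKIP_CODES` are hypothetical names that do not appear in the repository.

```python
# Codes treated as "no district" in the diff above: 'null', empty, and 'ZZ' (mostly water).
SKIP_CODES = {"null", "", "ZZ"}


def normalize_district(code):
    """Return a cleaned district string, or None if the row should be skipped.

    Hypothetical helper mirroring the rules in remove_district_padding.
    """
    if code in SKIP_CODES:
        return None
    if code == "98":
        # Non-voting districts are coded 98 by the census, 0 in other sources.
        return "0"
    # int() then str() drops zero padding, e.g. "07" becomes "7".
    return str(int(code))


for raw in ["07", "00", "98", "ZZ", "12"]:
    print(repr(raw), "->", repr(normalize_district(raw)))
```

Returning `None` for skipped codes keeps the filtering decision with the caller, which can then drop those rows in a list comprehension.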
@@ -166,38 +103,33 @@ def sanity_check(zccd, incorrect_states_dict):
 if __name__ == "__main__":
     # load state FIPS codes
     FIPS_TO_STATE = load_fips('raw/state_fips.txt')
-    STATE_TO_FIPS = {v: k for k, v in FIPS_TO_STATE.iteritems()}
+    STATE_TO_FIPS = {v: k for k, v in FIPS_TO_STATE.items()}
+
+    # load national tract file
+    tract_to_zcta = load_tracts('raw/zcta520_tract20_natl.txt')
+    zccd_national = []

-    # load national zccd file
-    zccd_missing = load_zccd('raw/natl_zccd_delim.txt')
+    for (state,fips) in STATE_TO_FIPS.items():
+        # load statewide districts file
+        cd_to_tract = load_districts(f"raw/cd118/{fips}_{state}_CD118.txt")

-    # update for inter-censal changes
-    zccd_updated = replace_state_zips(zccd_missing,
-        {'CO': load_zccd('raw/zc_cd_delim_08.txt'),
-         'FL': load_zccd('raw/zc_cd_delim_12.txt'),
-         'MN': load_zccd('raw/zc_cd_delim_27.txt'),
-         'NC': load_zccd('raw/zc_cd_delim_37.txt'),
-         'PA': load_zccd('raw/zc_cd_delim_42.txt'),
-         'VA': load_zccd('raw/zc_cd_delim_51.txt'),
-        }
-    )
+        # merge by the tract geoid
+        zccd = merge_by_tract(cd_to_tract, tract_to_zcta)

-    # append zipcodes for at-large states
-    at_large_states = ['AK', 'DE', 'MT', 'ND', 'SD', 'VT', 'WY', 'PR', 'DC']
-    zccd_complete = append_missing_zips(zccd_updated, at_large_states)
+        # clean output
+        zccd_cleaned = remove_district_padding(zccd)

-    # clean output
-    zccd_cleaned = remove_district_padding(zccd_complete)
+        # insert state abbreviation column
+        zccd_named = state_fips_to_name(zccd_cleaned)

-    # insert state abbreviation column
-    zccd_named = state_fips_to_name(zccd_cleaned)
+        print("got %s ZCTA->CD mappings for %s" % (len(zccd_named), state))
+        zccd_national.extend(zccd_named)

-    # and sanity check to remove obvious outliers
-    zccd_checked = sanity_check(zccd_named, {'CO': '0'})
+    print("got %s ZCTA->CD mappings for %s" % (len(zccd_national), 'national'))

     # re-sort by state FIPS code
-    zccd_sorted = sorted(zccd_checked, key=lambda k: (k['state_fips'], k['zcta'], k['cd']))
-    print("got %s ZCTA->CD mappings" % len(zccd_sorted))
+    zccd_sorted = sorted(zccd_national, key=lambda k: (k['state_fips'], k['zcta'], k['cd']))

+
     # write output
-    utils.csv_writer('zccd.csv', zccd_sorted, ['state_fips', 'state_abbr', 'zcta', 'cd'])
+    utils.csv_writer('zccd.csv', zccd_national, ['state_fips', 'state_abbr', 'zcta', 'cd'])
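The final sort-then-write step can be sketched with the standard library alone. The repository's `utils.csv_writer` is not shown in this diff, so the example below substitutes `csv.DictWriter` and should be read as an assumption about the output shape, not as that helper's implementation; `write_sorted` is a hypothetical name and the sample rows are made up.

```python
import csv
import io

FIELDS = ["state_fips", "state_abbr", "zcta", "cd"]


def write_sorted(rows, handle):
    """Sort rows by (state_fips, zcta, cd) and write them as CSV to handle."""
    ordered = sorted(rows, key=lambda r: (r["state_fips"], r["zcta"], r["cd"]))
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)  # write the sorted list, not the unsorted input
    return ordered


# Made-up rows, written to an in-memory buffer instead of zccd.csv.
buffer = io.StringIO()
write_sorted(
    [
        {"state_fips": "50", "state_abbr": "VT", "zcta": "05401", "cd": "0"},
        {"state_fips": "33", "state_abbr": "NH", "zcta": "03601", "cd": "2"},
    ],
    buffer,
)
print(buffer.getvalue())
```

Because `cd` is stored as a string, the tie-break on district is lexicographic, so `'10'` sorts before `'2'` within the same ZCTA.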
