
Commit 9c790fd

fix conflicts

Signed-off-by: John Seekins <[email protected]>
2 parents 06162c3 + 581b8b1

File tree

14 files changed, +579 -501 lines changed


README.md

Lines changed: 5 additions & 5 deletions

````diff
@@ -14,8 +14,9 @@ which will help with documenting the facilities appropriately. As these entries
 your CSV results change almost immediately.
 
 You can also use `--load-existing` to leverage an existing
-scrape of the data from ICE.gov. This is stored in default_data.py and includes the official current addresses of facilities.
-(Note ICE has been renaming known "detention center" sites to "processing center", and so on.)
+scrape of the data from ICE.gov. This is stored in `default_data.py` and includes the official current addresses of facilities.
+
+> Note ICE has been renaming known "detention center" sites to "processing center", and so on.
 
 The initial scrape data also keeps a `base64` encoded string containing the original HTML that was scraped from ice.gov about the
 facility. Keeping this initial data allows us to verify the resulting extracted data if we need to.
@@ -53,7 +54,7 @@ directory.
 uv run python main.py --load-existing --enrich --debug
 
 # With custom output file
-uv run python main.py --load-existing --enrich --debug-wikipedia -o debug_facilities.csv
+uv run python main.py --load-existing --enrich --debug-wikipedia -o debug_facilities
 ```
 
 ## Requirements
@@ -110,9 +111,8 @@ in hopes of finding similarly named pages but this is too aggressive, and it vee
 that have simpler names, like the county name instead of `county + detention center`). Use the debug mode to see what
 it is doing.
 * ICE scraping is not robustly tested. The image URL extraction needs some work. (should be able to get the detention center image URLs.)
-* OSM enrichment submits to OSM Nominatim API search with an extra comma between address number and street name.
 * The user-agent for running ice.gov scrape web requests calls itself `'User-Agent': 'ICE-Facilities-Research/1.0 (Educational Research Purpose)'`.
-  You can change this in scraper.py and enricher.py.
+  You can change this in `utils.py`.
 * It tells some pretty inaccurate percentages in the final summary - a lot of false positives, the Wikipedia debug percent
   seems wrong.
 * The remote query rate limiting is (I think) done in series but would go faster with parallel/async processing.
````
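
The user-agent bullet now points at `utils.py`, which is not included in this diff. Going only by the `default_headers` and `session` names that `enrichers/__init__.py` imports from it, the user-agent presumably lives in something like this hypothetical sketch:

```python
# Hypothetical sketch of the relevant piece of utils.py -- the real file is not
# part of this diff; only its `default_headers` and `session` names are imported
# by enrichers/__init__.py below.
import requests

default_headers = {
    # Edit this string to change the user-agent used for scrape/enrichment requests.
    "User-Agent": "ICE-Facilities-Research/1.0 (Educational Research Purpose)",
}

session = requests.Session()
session.headers.update(default_headers)
```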

enricher.py

Lines changed: 0 additions & 479 deletions
This file was deleted.

enrichers/README.md

Lines changed: 30 additions & 0 deletions

```markdown
# Facility enrichment scrapers

These functions let us collect data about facilities from additional sources.

## Enrichment class

The base class we can build enrichment tools from. It largely ensures some consistency in functionality between enrichment tools.

### Available functions

Sub-classing `Enrichment` provides the following functions/objects:

* `self.resp_info`
  * Pre-created response object following our expected schema
* `self._wait_time`
  * Simple rate limiting through `time.sleep()` calls; `_wait_time` tells us how long to sleep between calls to an individual API/site.
  * Defaults to `1` (seconds)
* `self._req(...)`
  * Wrapper function around a call to `requests.get` (using a properly configured `session` object)
  * Handles redirects
  * Supports most normal `requests` keyword arguments (`params`, `timeout`, `stream`, custom headers)
  * Raises for non-2xx/3xx status
  * Returns the entire `requests.Response` object for manipulation
* `_minimal_clean_facility_name(str)`
  * Standardizes the facility name for searching
* `_clean_facility_name(str)`
  * Standardizes the facility name for searching
  * More aggressive formatting than `_minimal_...` above

> All child classes should implement the `search()` function, which should return a dictionary using the `enrich_resp_schema` schema.
```
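
To make that contract concrete, here is a minimal hypothetical subclass. It assumes the helpers listed above plus `url`, `title`, and `search_query_steps` keys in `enrich_resp_schema` (the keys the real enrichers in this commit use); the endpoint and its response shape are invented for illustration.

```python
from enrichers import Enrichment
from utils import logger


class ExampleSource(Enrichment):
    """Hypothetical enricher sketch; example.org and its response shape are made up."""

    _wait_time: float = 2  # sleep 2 seconds between calls to this (imaginary) API

    def search(self) -> dict:
        name = self._clean_facility_name(self.search_args["facility_name"])
        self.resp_info["enrichment_type"] = "example"
        self.resp_info["search_query_steps"].append(name)
        try:
            # _req() merges in default_headers, follows redirects, raises on bad
            # status codes, and sleeps _wait_time seconds before returning
            response = self._req("https://example.org/api/search", params={"q": name})
        except Exception as e:
            logger.debug("Example search failed for '%s': %s", name, e)
            return self.resp_info
        results = response.json()
        if results:
            self.resp_info["title"] = results[0].get("title", "")
            self.resp_info["url"] = results[0].get("url", "")
        return self.resp_info
```

Construction requires the `facility_name` keyword (the `_required_keys` check in `Enrichment.__init__`, shown below, raises `KeyError` otherwise), so calling it is simply `ExampleSource(facility_name="Some County Jail").search()`.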

enrichers/__init__.py

Lines changed: 124 additions & 0 deletions

```python
"""
Import order here is a touch weird, but we need it so
types exist before attempting to import functions that
may call them
"""

import copy
import requests
from schemas import enrich_resp_schema
import time
from utils import (
    default_headers,
    session,
)


class Enrichment(object):
    _required_keys = [
        "facility_name",
    ]
    # in seconds
    _wait_time: float = 1

    def __init__(self, **kwargs):
        self.resp_info = copy.deepcopy(enrich_resp_schema)
        for k in self._required_keys:
            if k not in kwargs.keys():
                raise KeyError("Missing required key %s in %s", k, kwargs)
        self.search_args = copy.deepcopy(kwargs)

    def search(self) -> dict:
        """Child objects should implement this"""
        return {}

    def _req(self, url: str, **kwargs) -> requests.Response:
        """requests response wrapper to ensure we honor waits"""
        headers = kwargs.get("headers", {})
        # ensure we get all headers configured correctly
        # but manually applied headers win the argument
        for k, v in default_headers.items():
            if k in headers.keys():
                continue
            headers[k] = v

        response = session.get(
            url,
            allow_redirects=True,
            timeout=kwargs.get("timeout", 10),
            params=kwargs.get("params", {}),
            stream=kwargs.get("stream", False),
            headers=headers,
        )
        response.raise_for_status()
        time.sleep(self._wait_time)
        return response

    def _minimal_clean_facility_name(self, name: str) -> str:
        """Minimal cleaning that preserves important context like 'County Jail'"""
        cleaned = name

        # Remove pipe separators and take the main name
        if "|" in cleaned:
            parts = cleaned.split("|")
            cleaned = max(parts, key=len).strip()

        # Only remove very generic suffixes, keep specific ones like "County Jail"
        generic_suffixes = [
            "Service Processing Center",
            "ICE Processing Center",
            "Immigration Processing Center",
            "Contract Detention Facility",
            "Adult Detention Facility",
        ]

        for suffix in generic_suffixes:
            if cleaned.endswith(suffix):
                cleaned = cleaned[: -len(suffix)].strip()
                break

        return cleaned

    def _clean_facility_name(self, name: str) -> str:
        """Clean facility name for better search results"""
        # Remove common suffixes and prefixes that might interfere with search
        # This function may not be helpful - may be counterproductive.
        cleaned = name

        # Remove pipe separators and take the main name
        if "|" in cleaned:
            parts = cleaned.split("|")
            # Take the longer part as it's likely the full name
            cleaned = max(parts, key=len).strip()

        # Remove common facility type suffixes for broader search
        suffixes_to_remove = [
            "Detention Center",
            "Processing Center",
            "Correctional Center",
            "Correctional Facility",
            "Detention Facility",
            "Service Processing Center",
            "ICE Processing Center",
            "Immigration Processing Center",
            "Adult Detention Facility",
            "Contract Detention Facility",
            "Regional Detention Center",
            "County Jail",
            "County Detention Center",
            "Sheriff's Office",
            "Justice Center",
            "Safety Center",
            "Jail Services",
            "Correctional Complex",
            "Public Safety Complex",
        ]

        for suffix in suffixes_to_remove:
            if cleaned.endswith(suffix):
                cleaned = cleaned[: -len(suffix)].strip()
                break
        return cleaned


from .general import enrich_facility_data  # noqa: F401,E402
```

enrichers/general.py

Lines changed: 56 additions & 0 deletions

```python
from concurrent.futures import ProcessPoolExecutor
import copy
from enrichers import (
    openstreetmap,
    wikidata,
    wikipedia,
)
from schemas import (
    facilities_schema,
)
import time
from utils import logger


def enrich_facility_data(facilities_data: dict, workers: int = 3) -> dict:
    """wrapper function for multiprocessing of facility enrichment"""
    start_time = time.time()
    logger.info("Starting data enrichment with external sources...")
    enriched_data = copy.deepcopy(facilities_schema)
    total = len(facilities_data["facilities"])
    processed = 0

    with ProcessPoolExecutor(max_workers=workers) as pool:
        for res in pool.map(_enrich_facility, facilities_data["facilities"].items()):
            enriched_data["facilities"][res[0]] = res[1]  # type: ignore [index]
            processed += 1
            logger.info(" -> Finished %s, %s/%s completed", res[1]["name"], processed, total)

    logger.info("Data enrichment completed!")
    enriched_data["enrich_runtime"] = time.time() - start_time
    logger.info(" Completed in %s seconds", enriched_data["enrich_runtime"])
    return enriched_data


def _enrich_facility(facility_data: tuple) -> tuple:
    """enrich a single facility"""
    facility_id, facility = facility_data
    facility_name = facility["name"]
    logger.info("Enriching facility %s...", facility_name)
    enriched_facility = copy.deepcopy(facility)

    wiki_res = wikipedia.Wikipedia(facility_name=facility_name).search()
    wd_res = wikidata.Wikidata(facility_name=facility_name).search()
    osm = openstreetmap.OpenStreetMap(facility_name=facility_name, address=facility.get("address", {}))
    osm_res = osm.search()
    enriched_facility["wikipedia"]["page_url"] = wiki_res.get("url", "")
    enriched_facility["wikipedia"]["search_query"] = wiki_res.get("search_query_steps", "")
    enriched_facility["wikidata"]["page_url"] = wd_res.get("url", "")
    enriched_facility["wikidata"]["search_query"] = wd_res.get("search_query_steps", "")
    enriched_facility["osm"]["latitude"] = osm_res.get("details", {}).get("latitude", osm.default_coords["latitude"])
    enriched_facility["osm"]["longitude"] = osm_res.get("details", {}).get("longitude", osm.default_coords["longitude"])
    enriched_facility["osm"]["url"] = osm_res.get("url", "")
    enriched_facility["osm"]["search_query"] = osm_res.get("search_query_steps", "")

    logger.debug(enriched_facility)
    return facility_id, enriched_facility
```
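
`enrich_facility_data` expects the record layout defined by `schemas.facilities_schema`, which is not part of this diff; the sketch below is a hypothetical input inferred from the keys that `_enrich_facility` and the OSM enricher read and write, just to show the calling convention.

```python
# Hypothetical input; real records come from the ICE scrape / default_data.py and
# follow schemas.facilities_schema, which this diff does not include.
facilities_data = {
    "facilities": {
        "example-id": {
            "name": "Example County Detention Center",  # required by every enricher
            "address": {                                 # optional; only the OSM enricher uses it
                "street": "100 Main St",
                "locality": "Example City",
                "administrative_area": "TX",
                "postal_code": "75001",
            },
            "wikipedia": {},
            "wikidata": {},
            "osm": {},
        },
    },
}

enriched = enrich_facility_data(facilities_data, workers=3)
print(enriched["facilities"]["example-id"]["osm"])  # url, latitude, longitude, search_query
print(enriched["enrich_runtime"])                   # wall-clock seconds for the whole run
```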

enrichers/openstreetmap.py

Lines changed: 106 additions & 0 deletions

```python
from enrichers import Enrichment
from utils import logger


class OpenStreetMap(Enrichment):
    # default to Washington, D.C.?
    default_coords: dict = {
        "latitude": 38.89511000,
        "longitude": -77.03637000,
    }

    def search(self) -> dict:
        facility_name = self.search_args["facility_name"]
        address = self.search_args.get("address", {})
        search_name = self._clean_facility_name(facility_name)
        search_url = "https://nominatim.openstreetmap.org/search"
        self.resp_info["enrichment_type"] = "openstreetmap"
        data = []
        if not address:
            logger.debug("No address for %s, simply searching for name", facility_name)
            params = {
                "q": search_name,
                "format": "json",
                "limit": 5,
                "dedupe": 1,
            }
            logger.debug("Searching OSM for %s", search_name)
            self.resp_info["search_query_steps"].append(search_name)  # type: ignore [attr-defined]
            try:
                response = self._req(search_url, params=params, timeout=15)
                logger.debug("Response: %s", response.text)
                data = response.json()
            except Exception as e:
                logger.debug(" OSM search error for '%s': %s", facility_name, e)
                self.resp_info["search_query_steps"].append(f"(Failed -> {e})")  # type: ignore [attr-defined]
                return self.resp_info
        else:
            full_address = (
                f"{address['street']} {address['locality']}, {address['administrative_area']} {address['postal_code']}"
            )
            locality = f"{address['locality']}, {address['administrative_area']} {address['postal_code']}"
            search_url = "https://nominatim.openstreetmap.org/search"
            search_params = {
                "facility_name": {
                    "q": f"{search_name} {full_address}",
                    "format": "json",
                    "limit": 5,
                    "dedupe": 1,
                },
                "street_address": {
                    "q": f"{full_address}",
                    "format": "json",
                    "limit": 5,
                    "dedupe": 1,
                },
                "locality": {
                    "q": f"{locality}",
                    "format": "json",
                    "limit": 5,
                    "dedupe": 1,
                },
            }
            for search_name, params in search_params.items():
                logger.debug("Searching OSM for %s", params["q"])
                self.resp_info["search_query_steps"].append(params["q"])  # type: ignore [attr-defined]
                try:
                    response = self._req(search_url, params=params, timeout=15)
                    data = response.json()
                except Exception as e:
                    logger.debug(" OSM search error for '%s': %s", facility_name, e)
                    self.resp_info["search_query_steps"].append(f"(Failed -> {e})")  # type: ignore [attr-defined]
                    continue
        if not data:
            return self.resp_info
        # when the URL result is a "way" this is usually correct.
        # checks top five results.
        match_terms = ["prison", "detention", "correctional", "jail"]
        for result in data:
            osm_type = result.get("type", "").lower()
            lat = result.get("lat", self.default_coords["latitude"])
            lon = result.get("lon", self.default_coords["longitude"])
            display_name = result.get("display_name", "").lower()
            if any(term in osm_type for term in match_terms) or any(term in display_name for term in match_terms):
                # todo courthouse could be added, or other tags such as "prison:for=migrant" as a clear positive search result.
                osm_id = result.get("osm_id", "")
                osm_type_prefix = result.get("osm_type", "")
                title = result.get("display_name", "")
                if osm_id and osm_type_prefix:
                    self.resp_info["url"] = f"https://www.openstreetmap.org/{osm_type_prefix}/{osm_id}"
                    self.resp_info["details"]["latitude"] = lat  # type: ignore [index]
                    self.resp_info["details"]["longitude"] = lon  # type: ignore [index]
                    self.resp_info["title"] = title
                    return self.resp_info
        # fallback to first result
        first_result = data[0]
        logger.debug("Address searches didn't directly find anything, just using the first result: %s", first_result)
        title = first_result.get("display_name", "")
        lat = first_result.get("lat", self.default_coords["latitude"])
        lon = first_result.get("lon", self.default_coords["longitude"])
        self.resp_info["search_query_steps"].append(f"{lat}&{lon}")  # type: ignore [attr-defined]
        if lat and lon:
            self.resp_info["url"] = f"https://www.openstreetmap.org/?mlat={lat}&mlon={lon}&zoom=15"
            self.resp_info["details"]["latitude"] = lat  # type: ignore [index]
            self.resp_info["details"]["longitude"] = lon  # type: ignore [index]
            self.resp_info["title"] = title
        return self.resp_info
```
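
For context, each element of the Nominatim `/search` JSON response looks roughly like the dictionary below (trimmed to the fields this class reads; values are illustrative). The `match_terms` check accepts a result when any of the terms appears in its `type` or `display_name`; only then is the `osm_type`/`osm_id` pair turned into a canonical openstreetmap.org URL, otherwise the first result's coordinates become a map-pin fallback URL.

```python
# Trimmed, illustrative Nominatim result -- values are made up, but the field
# names are the ones OpenStreetMap.search() reads.
result = {
    "osm_type": "way",
    "osm_id": 123456789,
    "lat": "31.5000000",
    "lon": "-97.2000000",
    "type": "prison",
    "display_name": "Example Detention Center, Example City, Texas, United States",
}

match_terms = ["prison", "detention", "correctional", "jail"]
is_match = any(t in result.get("type", "").lower() for t in match_terms) or any(
    t in result.get("display_name", "").lower() for t in match_terms
)
# is_match -> True, so the enrichment URL becomes
# f"https://www.openstreetmap.org/{result['osm_type']}/{result['osm_id']}"
```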

0 commit comments
