
Commit 4791b09

Initial commit - ICE Detention Facilities Mapper.
0 parents  commit 4791b09

8 files changed: +1111 -0 lines changed

LICENSE.txt

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2025 - Dan Feidt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 83 additions & 0 deletions
# ICE Detention Facilities Scraper

_ICE Detention Facilities Data Scraper and Enricher_, a Python script managed by the [Open Security Mapping Project](https://github.com/Open-Security-Mapping-Project).

In short, this script helps identify the online profile of each ICE detention facility. Please see the [project home page](https://github.com/Open-Security-Mapping-Project) for more about mapping these facilities and other detailed info sources.

The script scrapes ICE detention facility data from ICE.gov and enriches it with information from [Wikipedia](https://en.wikipedia.org), [Wikidata](https://wikidata.org), and [OpenStreetMap](https://openstreetmap.org).

The main purpose right now is to identify whether the detention facilities have entries on Wikipedia, Wikidata, and OpenStreetMap, which will help with documenting the facilities appropriately. As these entries get fixed up, you should see your CSV results change almost immediately.

You can also use `--load-existing` to leverage an existing scrape of the data from ICE.gov. This is stored in data_loader.py and includes the official current addresses of facilities. (Note that ICE has been renaming known "detention center" sites to "processing center", and so on.)
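For reference, each facility record is a plain Python dict whose keys match the base CSV columns written by csv_utils.py; the values here are illustrative placeholders, not real facility data:

```
facility = {
    "name": "Example Detention Center",      # facility name as listed on ICE.gov
    "field_office": "Example Field Office",  # managing ICE field office
    "address": "123 Example Rd, Example City, EX 00000",
    "phone": "(000) 000-0000",
    "image_url": "",                         # image extraction still needs work (see Todo)
}
```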
The output also shows the ICE "field office" managing each detention facility.

In the OpenStreetMap (OSM) columns of the CSV results, if the URL includes a "way" then the script has probably identified the correctly tagged polygon. If you visit that URL you should see the courthouse or "prison grounds" way/area info. (This info can always be improved, but at least it exists.)

For Wikipedia, if the script can't find the page directly, the result will tend to be the first hit on the list of suggested pages.
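As an illustration of that fallback, here is a minimal sketch against the public MediaWiki opensearch endpoint (the helper name is made up, and the exact query the script builds may differ):

```
import requests

def first_suggested_page(query):
    """Return the first suggested English Wikipedia page title, or None."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "opensearch", "search": query, "limit": 5, "format": "json"},
        headers={"User-Agent": "ICE-Facilities-Research/1.0 (Educational Research Purpose)"},
        timeout=30,
    )
    resp.raise_for_status()
    # opensearch returns [query, titles, descriptions, urls]
    titles = resp.json()[1]
    return titles[0] if titles else None
```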
The script is MIT-licensed; please feel free to fork it and/or submit patches.

The script should comply with these websites' rate limits for queries.

At this point in development you probably want to enable all debugging (see Usage below) to make sense of the results.
## Usage:

Run the script and by default it will put a CSV file called `ice_detention_facilities_enriched.csv` in the same directory.

```
python main.py --scrape            # Scrape fresh data from ICE website
python main.py --enrich            # Enrich existing data with external sources
python main.py --scrape --enrich   # Do both operations
python main.py --help              # Show help

# Enable Wikipedia debugging (extra col in CSV)
python main.py --load-existing --enrich --debug-wikipedia

# Enable all debugging (extra cols in CSV) - this is recommended right now:
python main.py --load-existing --enrich --debug-wikipedia --debug-wikidata --debug-osm

# With custom output file
python main.py --load-existing --enrich --debug-wikipedia -o debug_facilities.csv
```
## Requirements:

```
pip install requests beautifulsoup4 lxml
# or for globally managed environments (e.g. Debian and Ubuntu):
sudo apt install python3-requests python3-bs4 python3-lxml
```
## Todo / Known Issues:

* The enrichment on both Wikidata and Wikipedia is pretty messy and inaccurate right now. It tries to truncate common words in hopes of finding similarly named pages, but this is too aggressive and it veers way off. (That is, it ends up looking for places with simpler names, like the county name instead of `county + detention center`.) Use the debug modes to see what it is doing.
* ICE scraping is not robustly tested, and the image URL extraction needs some work (it should be able to get the detention center image URLs).
* OSM enrichment submits searches to the OSM Nominatim API with an extra comma between the address number and the street name; see the sketch after this list.
* The user agent for ice.gov scrape requests identifies itself as `'User-Agent': 'ICE-Facilities-Research/1.0 (Educational Research Purpose)'`. You can change this in scraper.py and enricher.py.
* The final summary reports some pretty inaccurate percentages - there are a lot of false positives, and the Wikipedia debug percentage seems wrong.
* The rate-limited remote queries are (I think) run in series; parallel/async processing would be faster.
* Only English (EN) Wikipedia is targeted currently, but multi-lingual page checks would help a wider audience.
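A minimal sketch of the Nominatim comma fix flagged above (the helper name is made up for illustration; the real query-building lives in enricher.py):

```
import re

def normalize_address(address):
    """Collapse the stray comma between house number and street name,
    e.g. '123, Example Rd, Example City' -> '123 Example Rd, Example City'."""
    return re.sub(r"^(\d+),\s+", r"\1 ", address.strip())
```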
## Credit:

Original version by Dan Feidt ([@HongPong](https://github.com/HongPong)), with assistance from various AI gizmos. (My first real Python program, please clap.)

## License:

MIT License.

csv_utils.py

Lines changed: 95 additions & 0 deletions
import csv


class CSVHandler:
    """CSV export and summary reporting for scraped facility data."""

    @staticmethod
    def export_to_csv(facilities_data, filename="ice_detention_facilities_enriched.csv"):
        """Write facility records to CSV, adding enrichment and debug
        columns only when they are present in the data."""
        if not facilities_data:
            print("No data to export!")
            return None

        base_fields = ["name", "field_office", "address", "phone", "image_url"]
        enrichment_fields = ["wikipedia_page_url", "wikidata_page_url", "osm_result_url"]
        debug_fields = ["wikipedia_search_query", "wikidata_search_query", "osm_search_query"]

        fieldnames = base_fields.copy()

        if any(field in facilities_data[0] for field in enrichment_fields):
            fieldnames.extend(enrichment_fields)

        if any(field in facilities_data[0] for field in debug_fields):
            fieldnames.extend(debug_fields)

        try:
            with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writeheader()
                for facility in facilities_data:
                    row_data = {field: facility.get(field, '') for field in fieldnames}
                    writer.writerow(row_data)

            print(f"CSV file '{filename}' created successfully with {len(facilities_data)} facilities.")
            return filename

        except Exception as e:
            print(f"Error writing CSV file: {e}")
            return None

    @staticmethod
    def print_summary(facilities_data):
        """Print summary statistics about the facilities"""
        if not facilities_data:
            print("No data to summarize!")
            print("\n=== ICE Detention Facilities Scraper: Run completed ===")
            return

        total_facilities = len(facilities_data)
        print("\n=== ICE Detention Facilities Scraper Summary ===")
        print(f"Total facilities: {total_facilities}")

        # Count facilities by field office
        field_offices = {}
        for facility in facilities_data:
            office = facility.get('field_office', 'Unknown')
            field_offices[office] = field_offices.get(office, 0) + 1

        print("\nFacilities by Field Office:")
        for office, count in sorted(field_offices.items(), key=lambda x: x[1], reverse=True):
            print(f"  {office}: {count}")

        # Report enrichment coverage if available; a truthy URL counts as
        # found, so empty strings and False are excluded.
        if 'wikipedia_page_url' in facilities_data[0]:
            wiki_found = sum(1 for f in facilities_data if f.get('wikipedia_page_url'))
            wikidata_found = sum(1 for f in facilities_data if f.get('wikidata_page_url'))
            osm_found = sum(1 for f in facilities_data if f.get('osm_result_url'))

            print("\n=== External Data Enrichment Results ===")
            print(f"Wikipedia pages found: {wiki_found}/{total_facilities} ({wiki_found / total_facilities * 100:.1f}%)")
            print(f"Wikidata entries found: {wikidata_found}/{total_facilities} ({wikidata_found / total_facilities * 100:.1f}%)")
            print(f"OpenStreetMap results found: {osm_found}/{total_facilities} ({osm_found / total_facilities * 100:.1f}%)")

        # Debug information if available
        if 'wikipedia_search_query' in facilities_data[0]:
            print("\n=== Wikipedia Debug Information ===")
            false_positives = 0
            errors = 0
            for facility in facilities_data:
                query = facility.get('wikipedia_search_query', '')
                if 'REJECTED' in query:
                    false_positives += 1
                elif 'ERROR' in query:
                    errors += 1

            print(f"False positives detected and rejected: {false_positives}")
            print(f"Search errors encountered: {errors}")
            print("Note: Review 'wikipedia_search_query' column for detailed search information")

        if 'wikidata_search_query' in facilities_data[0]:
            print("Note: Review 'wikidata_search_query' column for detailed search information")

        if 'osm_search_query' in facilities_data[0]:
            print("Note: Review 'osm_search_query' column for detailed search information")

        print("\n=== ICE Detention Facilities Scraper: Run completed ===")

0 commit comments
