@johnseekins
Contributor

vera.org has a list of >1300 facilities we may be able to leverage to expand our dataset:
https://github.com/vera-institute/ice-detention-trends/blob/main/metadata/facilities.csv

The caveat is that their license is somewhat restrictive: https://github.com/vera-institute/ice-detention-trends/blob/main/License.md

@johnseekins johnseekins marked this pull request as ready for review September 24, 2025 21:15
@johnseekins
Contributor Author

johnseekins commented Sep 24, 2025

This isn't done, but I thought it'd be worth exploring more.

Currently matching 107 facilities, so getting that up to 191/192 would be nice.

@HongPong
Contributor

I added the Susupe, Saipan place to OpenStreetMap; it was not marked at all. Let's see if the alt_name attribute can help carry this forward. Nominatim now returns the way for the query "Vicente T Seman Bldg Civic Center".

I set alt_name using semicolon separators, which seems to be the right approach: https://wiki.openstreetmap.org/wiki/Key:alt_name

johnseekins and others added 9 commits September 24, 2025 17:30
Signed-off-by: John Seekins <[email protected]>
Signed-off-by: John Seekins <[email protected]>
Signed-off-by: John Seekins <[email protected]>
Signed-off-by: John Seekins <[email protected]>
Signed-off-by: John Seekins <[email protected]>
Signed-off-by: John Seekins <[email protected]>
@johnseekins
Contributor Author

This may help in addressing #47, although it won't completely solve it.

@HongPong
Contributor

HongPong commented Sep 29, 2025

Can we add a skip option for the Vera data? Probably a CLI toggle like --skip-vera, to avoid (I think) running this call:

     facilities_data = collect_vera_facility_data(facilities_data, keep_sheet, force_download)

It is taking quite a long time to go through 1419 facilities in the broader American prison industrial complex.

Also, maybe add a switch like --vera-ice-facilities-only, which would skip over all the non-ICE-managed ones (that is, limit the processing to the ones we have already grabbed via the other methods).

I am running the test now. It will take some time to see how the results come out.

Also, using Ctrl-C I was not able to get a clean break in the middle of processing (I think around 600 facilities in), which is the first time I have had this issue.
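
For illustration, the two toggles proposed above could be wired up roughly like this. This is only a sketch using the standard-library argparse module; `build_parser` and `vera_mode` are hypothetical names, and how the project actually parses its arguments is an assumption here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the two switches suggested in this thread."""
    parser = argparse.ArgumentParser(description="Facility enrichment (illustrative)")
    parser.add_argument(
        "--skip-vera",
        action="store_true",
        help="Do not collect Vera facility data at all.",
    )
    parser.add_argument(
        "--vera-ice-facilities-only",
        action="store_true",
        help="Only use Vera rows for facilities already collected elsewhere.",
    )
    return parser


def vera_mode(args: argparse.Namespace) -> str:
    """Decide how the Vera step should run; --skip-vera takes precedence."""
    if args.skip_vera:
        return "skip"
    if args.vera_ice_facilities_only:
        return "ice-only"
    return "all"


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    # The caller would only invoke collect_vera_facility_data(...) when the
    # mode is not "skip"; that wiring is left out of this sketch.
    print(vera_mode(cli_args))
```

Letting --skip-vera win when both flags are passed is a design choice made for this sketch; a mutually exclusive group would be a reasonable alternative.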

@johnseekins
Contributor Author

You're talking about skipping Vera data during enrichment, aren't you? I had only tested it during the initial scrape, which was nice and fast.

Maybe we don't want to add this just yet? Not hard to leave a PR as a thought for later...

The Ctrl-C not working actually makes sense. I used multiprocessing to speed up enrichment, which forks subprocesses, so you have to cancel all of them to fully exit.

It's a pain, and we could probably look at using threads instead (although historically that's been messy in Python), but it seemed a pretty reasonable trade-off for the speedup.
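
As a rough sketch of the thread-based alternative mentioned above (not what this PR does), a ThreadPoolExecutor can be shut down on Ctrl-C by cancelling queued work and signalling running workers through an event. `enrich_one` and `enrich_all` are hypothetical stand-ins, and this assumes the per-facility work is mostly waiting on I/O, since threads would not speed up CPU-bound work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Optional


def enrich_one(name: str, stop: threading.Event) -> Optional[str]:
    """Stand-in for a slow per-facility lookup; returns None once cancelled."""
    if stop.is_set():
        return None
    time.sleep(0.01)  # placeholder for a network call
    return name.upper()


def enrich_all(
    names: List[str],
    workers: int = 3,
    stop: Optional[threading.Event] = None,
) -> List[str]:
    """Run enrich_one over names in a thread pool, stopping cleanly on Ctrl-C."""
    if stop is None:
        stop = threading.Event()
    results: List[str] = []
    executor = ThreadPoolExecutor(max_workers=workers)
    try:
        futures = [executor.submit(enrich_one, n, stop) for n in names]
        for future in as_completed(futures):
            value = future.result()
            if value is not None:
                results.append(value)
    except KeyboardInterrupt:
        # Tell running workers to bail out and drop anything still queued
        # (cancel_futures requires Python 3.9+).
        stop.set()
        executor.shutdown(wait=False, cancel_futures=True)
        raise
    finally:
        executor.shutdown(wait=True)
    return results
```

Workers only notice the stop event when they check it, so a long blocking call inside a worker would still delay the exit.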

Signed-off-by: John Seekins <[email protected]>
@johnseekins
Contributor Author

@HongPong With the changes I've made, enrichment performance is decent again:

Data enrichment completed!
 Completed in 637.2328317165375 seconds

Not amazing (that's about 10.6 minutes), but significantly better than before when enriching all Vera data.

Signed-off-by: John Seekins <[email protected]>
@johnseekins
Contributor Author

Using the (just added) --skip-vera switch gives about the same performance:

Data enrichment completed!
 Completed in 678.2765390872955 seconds

@johnseekins
Contributor Author

Interestingly, there are diminishing returns on the worker count.

Using 7 workers (up from the default of 3) only saves a minute or so:

Data enrichment completed!
 Completed in 600.4133110046387 seconds
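
To put the three timings in this thread side by side, here is a small worked calculation. The numbers are copied from the log lines above; the run labels are assumptions, since the thread does not say which earlier run the 7-worker result is being compared against or whether it used --skip-vera.

```python
# Wall-clock seconds reported in this thread (labels are assumptions).
timings = {
    "default workers, with Vera": 637.2328317165375,
    "default workers, --skip-vera": 678.2765390872955,
    "7 workers": 600.4133110046387,
}


def to_minutes(seconds: float) -> float:
    """Convert seconds to minutes, rounded to one decimal place."""
    return round(seconds / 60, 1)


def saved_seconds(baseline: float, faster: float) -> float:
    """Seconds saved by the faster run relative to the baseline."""
    return round(baseline - faster, 1)


if __name__ == "__main__":
    seven = timings["7 workers"]
    for label, seconds in timings.items():
        print(f"{label}: {to_minutes(seconds)} min, "
              f"{saved_seconds(seconds, seven)} s slower than 7 workers")
```

Either way, more than doubling the worker count saves well under two minutes on a run of roughly ten minutes.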

@HongPong HongPong merged commit 5b81f51 into Open-Security-Mapping-Project:main Oct 26, 2025
@HongPong
Contributor

HongPong commented Oct 26, 2025

Alright, I got it in. Thank you!!! Sorry about the delay on that.
