Description
I needed to look at the core data for Chicago and Austin, as well as the extended data for Austin, and discovered an odd issue: the same code seemed to produce different datasets on different days. After further investigation, I think the package sometimes gets confused when you download multiple datasets for the same city in one session.
In this particular example, the problem doesn't seem to happen when I have all three datasets cached. If I recall correctly, I started using cache = FALSE in the first place to try to fix this problem, so I'm not sure what role caching plays.
Anyhow, if you run this code in a fresh R session, you should see that while AustinExtended1 and AustinExtended2 are produced by the exact same code, AustinExtended2 has more observations.
library(crimedata)
library(tidyverse)
AustinExtended1 <- get_crime_data(cities = c("Austin"), years = 2010:2022, type = "extended", cache = FALSE)
ChiAustCore <- get_crime_data(cities = c("Chicago", "Austin"), years = 2010:2022, type = "core", cache = FALSE)
AustinExtended2 <- get_crime_data(cities = c("Austin"), years = 2010:2022, type = "extended", cache = FALSE)
Specifically, AustinExtended2 appears identical to the larger ChiAustCore data.
all.equal(ChiAustCore, AustinExtended2)
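A side note on reading that result: `all.equal()` returns `TRUE` when the objects match and otherwise a character vector describing the differences, so wrapping it in `isTRUE()` gives a plain yes/no. A minimal sketch on made-up data frames (`a`, `b` and `longer` are toy stand-ins, not real crimedata output):

```r
# Toy stand-ins for three downloads; these are NOT real crimedata results.
a <- data.frame(uid = 1:3, city_name = "Austin")
b <- data.frame(uid = 1:3, city_name = "Austin")
longer <- data.frame(uid = 1:4, city_name = "Austin")

# all.equal() gives TRUE or a description of the differences,
# so isTRUE() collapses the result to a single logical value.
same_ab <- isTRUE(all.equal(a, b))
same_a_longer <- isTRUE(all.equal(a, longer))

print(same_ab)
print(same_a_longer)
```

The same pattern should apply to `isTRUE(all.equal(ChiAustCore, AustinExtended2))` with the objects downloaded above.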
This might be okay if you could get the right data by limiting to Austin, but in fact the number of Austin observations is larger in AustinExtended2 (and thus also in ChiAustCore). The problems seem to start with ChiAustCore, the second download, which somehow has more Austin crimes than the "extended" data we started with did.
AustinExtended1 |> filter(city_name == "Austin") |> count()
AustinExtended2 |> filter(city_name == "Austin") |> count()
As best I can tell, what's happening is that both ChiAustCore and AustinExtended2 have "phantom" observations in some years where every real observation is paired with a duplicate. ChiAustCore also seems to have addresses, which I don't believe are usually in the core datasets.
You can see here that in some years but not others, the number of observations in AustinExtended2 (and thus also the identical ChiAustCore) is exactly double what we originally got in AustinExtended1.
AustinExtended1$year <- substr(AustinExtended1$date_single, start = 1, stop = 4)
AustinExtended2$year <- substr(AustinExtended2$date_single, start = 1, stop = 4)
AustinExtended1 |> filter(city_name == "Austin") |> group_by(year) |> count()
AustinExtended2 |> filter(city_name == "Austin") |> group_by(year) |> count()
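If it helps with debugging, here is a rough sketch of how the doubled rows could be counted directly rather than inferred from the yearly totals. It assumes that uid is meant to be unique per offence (I haven't confirmed that), and it runs on a small made-up data frame called `toy` rather than on the real download:

```r
# Toy data, NOT real crimedata output: uids 2 and 3 each appear twice
# in 2015, mimicking the suspected duplicates.
toy <- data.frame(
  uid = c(1, 2, 2, 3, 3, 4),
  date_single = c("2014-03-01 10:00:00", "2015-06-10 12:00:00",
                  "2015-06-10 12:00:00", "2015-07-04 09:30:00",
                  "2015-07-04 09:30:00", "2016-01-15 18:45:00")
)

# Same year extraction as in the code above
toy$year <- substr(toy$date_single, start = 1, stop = 4)

# TRUE for every row whose uid has already appeared earlier in the data
toy$is_dup <- duplicated(toy$uid)

# Number of repeated uids in each year
dups_by_year <- tapply(toy$is_dup, toy$year, sum)
print(dups_by_year)
```

Replacing `toy` with AustinExtended2 should show whether the extra rows in the affected years are all repeated uids.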
In the duplicate observations, it also appears to be replacing the variables starting with date_year with NAs. One example:
AustinExtended1 |> filter(uid == 476781) |> View()
AustinExtended2 |> filter(uid == 476781) |> View()
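For a check that doesn't rely on View(), something like the following could list which columns are NA in one copy of a row but filled in the other. This is only a sketch on made-up data: the values, and the columns other than uid and date_year, are illustrative rather than taken from the real dataset.

```r
# Toy data, NOT real crimedata output: the second copy of uid 476781
# has NAs in the date columns, as described above.
toy <- data.frame(
  uid = c(476781, 476781, 999999),
  offense_code = c("13A", "13A", "23H"),  # illustrative values
  date_year = c(2015, NA, 2016),
  date_month = c(6, NA, 1)                # illustrative column
)

# All copies of the row in question
copies <- toy[toy$uid == 476781, ]

# Count NAs per column, then keep columns that are NA in some copies
# but not in all of them
na_counts <- colSums(is.na(copies))
partly_na <- names(na_counts)[na_counts > 0 & na_counts < nrow(copies)]
print(partly_na)
```

Running the same steps with AustinExtended2 in place of `toy` should list the variables that get blanked out in the duplicate.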