"Phantom" Observations When Downloading Multiple Datasets from Same City #19

@robertv4311

I needed to look at the core data for Chicago and Austin, as well as the extended data for Austin, and ran into an odd issue: the same code seemed to produce different datasets on different days. After further investigation, I think the package sometimes gets confused when you download multiple datasets for the same city in one session.

In this particular example, the problem doesn't seem to happen when I have all three datasets cached. If I recall correctly, I started using cache = FALSE in the first place to try to fix this problem, so I'm not sure what role caching plays.

Anyhow, if you run this code in a fresh R session, you should see that while AustinExtended1 and AustinExtended2 are produced by the exact same code, AustinExtended2 has more observations.

library(crimedata)
library(tidyverse)
AustinExtended1 <- get_crime_data(cities = c("Austin"), years = 2010:2022, type = "extended", cache = FALSE)
ChiAustCore <- get_crime_data(cities = c("Chicago", "Austin"), years = 2010:2022, type = "core", cache = FALSE)
AustinExtended2 <- get_crime_data(cities = c("Austin"), years = 2010:2022, type = "extended", cache = FALSE)

Specifically, AustinExtended2 appears identical to the larger ChiAustCore data.

all.equal(ChiAustCore, AustinExtended2)
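One caveat when reading that result: `all.equal()` returns a character vector describing the differences rather than `FALSE`, so wrapping it in `isTRUE()` gives a plain yes/no answer. A minimal sketch on toy data frames (stand-ins I made up, not the real downloads):

```r
# Toy stand-ins for two downloads; the values are invented for illustration.
a <- data.frame(uid = 1:3, city_name = "Austin")
b <- a                                  # exact copy of a
c_diff <- transform(a, uid = uid + 1L)  # same shape, different uid values

# isTRUE() collapses the "TRUE or a description of differences" result
# of all.equal() into a single logical value.
same_ab <- isTRUE(all.equal(a, b))
same_ac <- isTRUE(all.equal(a, c_diff))

same_ab
same_ac
```

With the real objects, the equivalent check would be `isTRUE(all.equal(ChiAustCore, AustinExtended2))`.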

This might be okay if you could get the right data by limiting to Austin, but in fact the number of Austin observations is larger in AustinExtended2 (and thus also in ChiAustCore). The problems seem to start with ChiAustCore, the second download, which somehow has more Austin crimes than the "extended" data we started with did.

AustinExtended1 |> filter(city_name == "Austin") |> count()
AustinExtended2 |> filter(city_name == "Austin") |> count()

As best I can tell, what's happening is that both ChiAustCore and AustinExtended2 have "phantom" observations in some years where every real observation is paired with a duplicate. ChiAustCore also seems to have addresses, which I don't believe are usually in the core datasets.

You can see here that in some years but not others, the number of observations in AustinExtended2 (and thus also the identical ChiAustCore) is exactly double what we originally got in AustinExtended1.

AustinExtended1$year <- substr(AustinExtended1$date_single, start = 1, stop = 4)
AustinExtended2$year <- substr(AustinExtended2$date_single, start = 1, stop = 4)
AustinExtended1 |> filter(city_name == "Austin") |> group_by(year) |> count()
AustinExtended2 |> filter(city_name == "Austin") |> group_by(year) |> count()
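To make the doubling easier to spot than by eyeballing two printed tables, the per-year counts can be put side by side with a ratio column. This is only a rough sketch in base R: `compare_year_counts()` is a hypothetical helper (not part of crimedata), and the toy data frames below merely stand in for AustinExtended1 and AustinExtended2 so the snippet runs without a download.

```r
# Hypothetical helper: per-year row counts of two data frames that both
# have a `year` column, plus the ratio of the second count to the first.
compare_year_counts <- function(first, second) {
  n_first <- as.data.frame(table(year = first$year),
                           responseName = "n_first",
                           stringsAsFactors = FALSE)
  n_second <- as.data.frame(table(year = second$year),
                            responseName = "n_second",
                            stringsAsFactors = FALSE)
  out <- merge(n_first, n_second, by = "year", all = TRUE)
  out$ratio <- out$n_second / out$n_first
  out
}

# Toy stand-ins (invented values): 2010 is doubled in `second`, 2011 is not.
first  <- data.frame(year = c("2010", "2010", "2011"))
second <- data.frame(year = c("2010", "2010", "2010", "2010", "2011"))

counts <- compare_year_counts(first, second)
counts
```

On the real data, a ratio of exactly 2 in a given year would correspond to the doubled years described above, and a ratio of 1 to the unaffected ones.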

In the duplicate observations, the variables starting with date_year also appear to be replaced with NAs. One example:

AustinExtended1 |> filter(uid == 476781) |> View()
AustinExtended2 |> filter(uid == 476781) |> View()
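For checking this across the whole dataset rather than one uid at a time, something like the following might help. It is a sketch under the assumption that each "phantom" row shares its uid with the real row it duplicates; `find_duplicate_uids()` is a hypothetical helper, and the toy data frame is invented so the code runs on its own.

```r
# Hypothetical helper: return every row whose `uid` occurs more than once,
# including the first occurrence.
find_duplicate_uids <- function(crimes) {
  dup <- duplicated(crimes$uid) | duplicated(crimes$uid, fromLast = TRUE)
  crimes[dup, , drop = FALSE]
}

# Toy stand-in (invented values): uid 2 appears twice, once with NA date_year.
toy <- data.frame(
  uid = c(1, 2, 2, 3),
  date_year = c(2010, 2011, NA, 2012)
)

dups <- find_duplicate_uids(toy)
n_dup_rows <- nrow(dups)                  # rows involved in a repeated uid
n_dup_na <- sum(is.na(dups$date_year))    # of those, rows missing date_year

n_dup_rows
n_dup_na
```

Running `find_duplicate_uids(AustinExtended2)` would then show whether the rows with NA in date_year are exactly the extra copies.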
