Explicitly handling missingness in join columns #2499

@adkabo

Julia has a great philosophy of taking missingness seriously. For example, unlike in Pandas or Postgres, sum([1,2,missing]) gives missing. However, this philosophy hasn't yet been applied to all of the functions in the JuliaData ecosystem. I'll give an example to illustrate.
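As a quick aside before the join example, the `sum` behaviour mentioned above can be checked in plain Julia. This is only a minimal sketch using Base functions; the variable names are arbitrary.

```julia
# Base Julia only; no packages required.
xs = [1, 2, missing]

total = sum(xs)                   # missing propagates through the reduction
explicit = sum(skipmissing(xs))   # the caller opts in to ignoring missing values

println(total)      # missing
println(explicit)   # 3
```

The point is that ignoring the missing value is something the caller has to ask for.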

My goal is to find the relationship between age and salary. To do that, I will combine observations from two datasets.

One with age,

DataFrame([(name="John", employer=missing, age=40)])

and one with salary.

DataFrame([(name="John", employer="Julia Computing", salary=9999)])

In a complete-data environment, if these observations correspond to the same person, I want one row in the joined dataframe; on the other hand, if they correspond to different people, I want zero rows in the join.

Given the missing value in the "employer" column, we don't know whether these rows describe the same person. So when I join them on ("name", "employer"), the right answer cannot be known. Yet

using DataFrames

innerjoin(
    DataFrame([(name="John", employer=missing, age=40)]),
    DataFrame([(name="John", employer="Julia Computing", salary=9999)]),
    on=[:name, :employer]
)

makes a decision implicitly, returning an empty dataframe. If the observations correspond to the same person, this result -- failing to match the two observations -- is a false negative.
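The uncertainty in the join key can be seen with plain comparisons on the two rows. The sketch below uses only Base Julia; `keys_match` is a hypothetical helper made up for this illustration, not something from DataFrames.jl. It relies on `==` and `all` following three-valued logic, so the result can be `true`, `false`, or `missing` (unknown).

```julia
# Base Julia only. `keys_match` is a hypothetical, illustration-only helper.
# It compares two rows on the given key fields and returns true, false,
# or missing (unknown) according to three-valued logic.
keys_match(a::NamedTuple, b::NamedTuple, on) = all(a[k] == b[k] for k in on)

left  = (name="John", employer=missing, age=40)
right = (name="John", employer="Julia Computing", salary=9999)

println(keys_match(left, right, (:name, :employer)))   # missing
println(keys_match(left, right, (:name,)))             # true
println(isequal(left.employer, right.employer))        # false
```

Comparing with `==` reports that the match is unknown, while `isequal` quietly answers `false`. A join that returns zero rows here is behaving like the `isequal` line.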

To avoid drawing mistaken conclusions from analysis, I would like to extend the practice of enforcing explicit handling of missing values to joins. Concretely, I would like this case to raise an error by default, so that the caller has to tell innerjoin() explicitly how it should handle missing values.

The passmissing() and skipmissing() patterns used elsewhere in JuliaData are a great reassurance that Julia is looking out for missing data problems. When applied to joins, I would like to consider:

  • In an outer join, should rows with missingness all be dropped from the output or all be kept in?
  • In an inner join, should they be considered matches or nonmatches?
  • What if there is missingness on both sides or one side only?

I'm not sure if something like passmissing(innerjoin)(a,b) would work, or if it should be more like innerjoin(a,b, missingrule=:drop) or something else. But I do want to start the conversation about it.
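To make the second option a bit more concrete, here is a rough sketch of what an explicit rule could look like as a wrapper. Everything about it is an assumption for discussion: `innerjoin_checked` and the `missingrule` keyword are hypothetical names, only the `:error` and `:drop` rules are covered, and it is not how DataFrames.jl implements joins. It only uses `innerjoin` and `dropmissing` from DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper for discussion; not part of DataFrames.jl.
# `missingrule` is an illustration-only keyword:
#   :error  throw if any join column contains a missing value (default)
#   :drop   treat rows with missing keys as non-matches and remove them first
function innerjoin_checked(a::AbstractDataFrame, b::AbstractDataFrame;
                           on::Vector{Symbol}, missingrule::Symbol=:error)
    has_missing(df) = any(col -> any(ismissing, df[!, col]), on)
    if missingrule === :error
        if has_missing(a) || has_missing(b)
            throw(ArgumentError(
                "missing values in join columns $(on); pass a missingrule explicitly"))
        end
        return innerjoin(a, b; on=on)
    elseif missingrule === :drop
        return innerjoin(dropmissing(a, on), dropmissing(b, on); on=on)
    else
        throw(ArgumentError("unknown missingrule: $(missingrule)"))
    end
end

# The two observations from the example, with the key column typed explicitly.
a = DataFrame(name=["John"], employer=Union{Missing,String}[missing], age=[40])
b = DataFrame(name=["John"], employer=["Julia Computing"], salary=[9999])

# The default refuses to guess and throws an ArgumentError:
# innerjoin_checked(a, b; on=[:name, :employer])

# Opting in to dropping rows with missing keys leaves no rows to match.
dropped = innerjoin_checked(a, b; on=[:name, :employer], missingrule=:drop)
println(nrow(dropped))   # 0
```

A rule that treats missing keys as matches, and the outer-join questions in the list above, are deliberately left out of this sketch.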


#2243 has some discussion of missingness and joins, mainly focused on the a.fillna(b) use case.

Labels: breaking (the proposed change is breaking), feature
