LinkedDataFrame: Handle overwriting of columns with linkages / linking on linkages more explicitly #5

@aclarry

Overview

When linking LinkedDataFrames (LDFs), a new linkage can effectively overwrite an existing column without the caller asking for it. Because LDFs will also naïvely try to link on a linkage "column" when it is passed as an on argument, the resulting error can be non-obvious.

This caused a bug in the GGHM demand model when two LDFs were linked on each other, presumably due to a change in behaviour from a previous LinkedDataFrame/pandas version.

Example

In cheval 0.2 with pandas 1.4:

import pandas as pd
from cheval import LinkedDataFrame

df1 = LinkedDataFrame(pd.DataFrame({"df2": [1, 2, 3, 1, 2, 3]}))
df2 = LinkedDataFrame(pd.DataFrame({"col1": ["a", "b", "c"]}))

# The alias "df2" shadows the column df1["df2"], which becomes inaccessible
df1.link_to(df2, "df2", on_self="df2")
# Raises AttributeError: to_numpy, because "df2" now resolves to the linkage
df2.link_to(df1, "df1", on_other="df2")

Here, the original column df1["df2"], which provided the index for the join to df2, is no longer accessible through:

  • Item lookup df1["df2"]
  • Attribute access df1.df2
  • Future linkages df2.link_to(df1, "df1", on_other="df2")

The last item is the cause of the specific issue in the GGHM model - it produced an error because df2 was trying to use the linkage df1.df2 as an index.

Explanation

In normal pandas usage, it is impossible to "accidentally" mutate or overwrite a column, whereas in LinkedDataFrames, "columns" are created implicitly by link_to. LinkedDataFrames treat linkages as columns everywhere, including in link_to calls. The result is an error whose cause may be non-obvious in the calling code, reported through a non-specific error message raised from pandas.

Proposed Solutions

  • Issue a warning when a linkage is created which supersedes an existing column (or have an explicit overwrite kwarg in link_to)
  • Allow LDFs to be linked back on a linkage (df2.link_to(df1, "df1", on_other="df2")), since this is realistically the only place where this issue would come up. This could either refer back to the original linkage column, or use the linkage itself to provide the index for the new linkage.
  • Check if the LDF is trying to link using a linkage as an "on" instead of a normal pd.Series, and raise an explicit Exception if it does so.
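
As a rough illustration of the warning/overwrite proposal and the explicit-exception proposal, the two checks could be sketched as standalone guards. This is only a sketch: check_alias, check_join_key, LinkageAsKeyError and the overwrite flag are hypothetical names, not part of cheval, and the code assumes link_to can see the frame's column labels and the names of its existing linkages.

```python
import warnings


class LinkageAsKeyError(TypeError):
    """Hypothetical error: a join key names a linkage, not a data column."""


def check_alias(columns, alias, overwrite=False):
    """Detect a link alias that would shadow an existing column.

    Returns False when there is no collision. On a collision, raises
    ValueError unless overwrite=True, in which case it warns and returns True.
    """
    if alias not in columns:
        return False
    if not overwrite:
        raise ValueError(
            f"Link alias {alias!r} would shadow an existing column; "
            "pass overwrite=True to allow this"
        )
    warnings.warn(f"Link alias {alias!r} shadows an existing column", stacklevel=2)
    return True


def check_join_key(linkages, on):
    """Reject a join key that refers to a linkage instead of a column."""
    if on is not None and on in linkages:
        raise LinkageAsKeyError(
            f"{on!r} is a linkage, not a column, and cannot be used as a join key"
        )
```

Applied to the example above, check_alias(["df2"], "df2") would raise at the first link_to call, and check_join_key({"df2"}, "df2") would replace the AttributeError from pandas with a message that names the linkage.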
