Overview
When linking LinkedDataFrames, it is possible to effectively overwrite a column without explicitly doing so. Combined with the fact that LDFs will naïvely try to link on a linkage "column" when it is specified as an on argument (on_self or on_other), the resulting error can be non-obvious.
This caused a bug in the GGHM demand model when two LDFs were being linked on each other, presumably because of a change in behaviour from a previous LinkedDataFrame/pandas version.
Example
In cheval 0.2 with pandas 1.4:
df1 = LinkedDataFrame(pd.DataFrame({"df2": [1, 2, 3, 1, 2, 3]}))
df2 = LinkedDataFrame(pd.DataFrame({"col1": ["a", "b", "c"]}))
df1.link_to(df2, "df2", on_self="df2") # The original df1["df2"] column is inaccessible
df2.link_to(df1, "df1", on_other="df2") # AttributeError: to_numpy

Here, the original column df1["df2"], which provided the index to join on df2, is inaccessible from:
- Item lookup: df1["df2"]
- Attribute access: df1.df2
- Future linkages: df2.link_to(df1, "df1", on_other="df2")
The last item is the cause of the specific issue in the GGHM model - it produced an error because df2 was trying to use the linkage df1.df2 as an index.
Explanation
In normal pandas usage, it is impossible to "accidentally" mutate/overwrite a column, whereas in LinkedDataFrames, "columns" are created implicitly by link_to. LinkedDataFrames handle linkages as columns everywhere, including in link_to calls. The result is an error whose cause may be non-obvious in the source, reported through a non-specific error message raised from pandas.
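The shadowing described above can be illustrated with a small stand-alone sketch. It does not use cheval or pandas. `ToyLinkedFrame` and `ToyLink` are hypothetical names invented for this illustration, and the lookup order (linkages before columns) is an assumption chosen to mirror the behaviour reported in the example, not a description of cheval's internals.

```python
class ToyLink:
    """Hypothetical stand-in for a linkage object; it is not array-like."""

    def __init__(self, other, key):
        self.other = other
        self.key = key


class ToyLinkedFrame:
    """Hypothetical, simplified frame whose lookups check linkages first."""

    def __init__(self, columns):
        self.columns = dict(columns)  # real data columns
        self.links = {}               # alias -> ToyLink

    def link_to(self, other, alias, on_self=None):
        # Resolve the key column before the alias is registered.
        key = self[on_self] if on_self is not None else None
        # No name-clash check: the alias may silently shadow a column.
        self.links[alias] = ToyLink(other, key)

    def __getitem__(self, name):
        # Assumption: linkages take priority over columns of the same name.
        if name in self.links:
            return self.links[name]
        return self.columns[name]


df2 = ToyLinkedFrame({"col1": ["a", "b", "c"]})

# Alias equal to the key column's name: the column becomes unreachable.
shadowed = ToyLinkedFrame({"df2": [1, 2, 3, 1, 2, 3]})
shadowed.link_to(df2, "df2", on_self="df2")
print(type(shadowed["df2"]).__name__)  # ToyLink

# A distinct alias (hypothetical name "df2_link") leaves the column reachable.
distinct = ToyLinkedFrame({"df2": [1, 2, 3, 1, 2, 3]})
distinct.link_to(df2, "df2_link", on_self="df2")
print(distinct["df2"])  # [1, 2, 3, 1, 2, 3]
```

In this toy model, any later code that asks for `shadowed["df2"]` as a join key receives a link object instead of data, which is the same shape of failure as the `to_numpy` error in the example.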
Proposed Solutions
- Issue a warning when a linkage is created which supersedes an existing column (or have an explicit overwrite kwarg in link_to).
- Allow LDFs to be linked back on a linkage (df2.link_to(df1, "df1", on_other="df2")), since this is realistically the only place where this issue would come up. This could either refer back to the original linkage column, or use the linkage itself to provide the index for the linkage.
- Check if the LDF is trying to link using a linkage as an "on" instead of a normal pd.Series, and raise an explicit Exception if it does so.
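The first and third proposals could look roughly like the following stand-alone sketch. It is written against a toy class rather than cheval itself. `GuardedFrame`, `Linkage`, `_resolve_on`, the `overwrite` default of `False`, and the choice of `UserWarning` and `TypeError` are all assumptions made for illustration.

```python
import warnings


class Linkage:
    """Hypothetical stand-in for a linkage object."""

    def __init__(self, other, keys):
        self.other = other
        self.keys = keys


class GuardedFrame:
    """Hypothetical toy frame; not cheval's LinkedDataFrame."""

    def __init__(self, columns):
        self.columns = dict(columns)
        self.links = {}

    def _resolve_on(self, name):
        # Third proposal: fail loudly when "on" names a linkage, not a column.
        if name in self.links:
            raise TypeError(
                f"'{name}' is a linkage, not a column; "
                "it cannot be used as a join key"
            )
        return self.columns[name]

    def link_to(self, other, alias, on_self=None, on_other=None, overwrite=False):
        self_key = self._resolve_on(on_self) if on_self is not None else None
        other_key = other._resolve_on(on_other) if on_other is not None else None
        # First proposal: warn when the alias would shadow an existing column,
        # unless the caller opted in with overwrite=True.
        if alias in self.columns and not overwrite:
            warnings.warn(
                f"linkage '{alias}' shadows an existing column of the same name",
                UserWarning,
                stacklevel=2,
            )
        self.links[alias] = Linkage(other, (self_key, other_key))


df1 = GuardedFrame({"df2": [1, 2, 3, 1, 2, 3]})
df2 = GuardedFrame({"col1": ["a", "b", "c"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df1.link_to(df2, "df2", on_self="df2")  # shadows the column, so it warns

try:
    df2.link_to(df1, "df1", on_other="df2")  # "df2" on df1 is now a linkage
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(len(caught), error_message)
```

The warning points at the call that creates the shadowing, and the exception names the offending key at the call that trips over it, so either check alone would have made the original failure easier to trace. The second proposal (falling back to the original key column) is not sketched here because it depends on how the linkage stores its index internally.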