Open
Labels
Enhancement, Needs Discussion (requires discussion from core team before further action)
Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
When comparing pandas DataFrames that contain floating-point numbers, it can be extremely useful to compare with an absolute tolerance (atol), as pandas.testing.assert_frame_equal already allows.
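To make the gap concrete, here is a minimal sketch with made-up example data (the frames `left` and `right` are illustrative, not from my real use case). It shows assert_frame_equal accepting a tiny floating-point difference while DataFrame.compare reports it:

```python
import pandas as pd

# Two frames that differ only by floating-point noise in the second row.
left = pd.DataFrame({"x": [1.0, 2.0]})
right = pd.DataFrame({"x": [1.0, 2.0 + 1e-9]})

# The testing helper accepts an absolute tolerance, so this does not raise.
pd.testing.assert_frame_equal(
    left, right, check_exact=False, rtol=0.0, atol=1e-6
)

# DataFrame.compare has no tolerance argument, so the same noise is
# reported as a difference in row 1.
diff = left.compare(right)
print(diff)
```

The testing helper can only tell me whether the frames match within tolerance; it does not return the differing cells the way compare does.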
Feature Description
I propose we add an argument to the function signature of pd.DataFrame.compare() as follows:
class DataFrame(NDFrame, OpsMixin):
    def __init__(...)
    ...
    def compare(self, ..., atol: float | None = None):
        # implement numeric comparison with tolerance
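As a rough illustration of the behaviour I have in mind, the sketch below wraps the existing compare method. The function name `compare_with_atol` is hypothetical and is not part of pandas. It assumes both frames are identically labelled (as compare already requires) and that numeric columns convert cleanly to float. Cells within `atol` are overwritten with the left-hand value before calling compare, so they are treated as equal.

```python
import numpy as np
import pandas as pd


def compare_with_atol(
    left: pd.DataFrame, right: pd.DataFrame, atol: float = 0.0, **compare_kwargs
) -> pd.DataFrame:
    """Hypothetical helper: DataFrame.compare with an absolute tolerance."""
    adjusted = right.copy()
    for col in left.columns:
        both_numeric = pd.api.types.is_numeric_dtype(
            left[col]
        ) and pd.api.types.is_numeric_dtype(right[col])
        if not both_numeric:
            continue  # non-numeric columns keep exact comparison
        # True where |left - right| <= atol (NaN pairs count as equal).
        close = np.isclose(
            left[col].to_numpy(dtype=float),
            right[col].to_numpy(dtype=float),
            rtol=0.0,
            atol=atol,
            equal_nan=True,
        )
        # Overwrite "close enough" cells with the left value so that
        # compare() no longer sees them as different.
        adjusted[col] = right[col].where(~close, left[col])
    return left.compare(adjusted, **compare_kwargs)


left = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
right = pd.DataFrame({"a": [1.0, 2.0000001, 3.5], "b": ["x", "q", "z"]})
diff = compare_with_atol(left, right, atol=1e-3)
print(diff)
```

One caveat of this approach: with keep_equal=True, cells within tolerance would display the left-hand value in the "other" column rather than the original right-hand value, which a built-in implementation could avoid.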
Alternative Solutions
This is some workaround code that works for my specific use case, but it is most definitely not general:
import numpy as np
import pandas as pd


def deep_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float
) -> pd.DataFrame:
    """Compare two pandas dataframes at a deep level. This will
    return a dataframe with the differences between the two frames
    explicitly shown.

    Args:
        df1 (pd.DataFrame): The left dataframe
        df2 (pd.DataFrame): The right dataframe
        atol (float): Absolute tolerance

    Returns:
        pd.DataFrame: A dataframe with the differences between the two frames
    """
    diff_df = pd.DataFrame(index=df1.index, columns=df1.columns)
    for col in df1.columns:
        # Numeric columns are compared with a tolerance, everything else exactly.
        if check_cols_are_numeric(df1, df2, col):
            diff_df[col] = tolerance_compare(df1, df2, atol, col)
        else:
            diff_df[col] = exact_compare(df1, df2, col)
    diff_df = remove_rows_cols_all_na(diff_df)
    diff_columns = diff_df.columns
    right_df = df2[diff_columns]
    # Left (df1) values get the "_pg" suffix, right (df2) values get "_snf".
    diff_df = diff_df.merge(
        right_df, left_index=True, right_index=True, suffixes=("_pg", "_snf")
    )
    return diff_df


def exact_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> np.ndarray:
    # Keep the left value where the two columns differ, NaN elsewhere.
    return np.where(df1[col] != df2[col], df1[col], np.nan)


def tolerance_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float, col: str
) -> np.ndarray:
    # Keep the left value where the absolute difference exceeds atol.
    return np.where(np.abs(df1[col] - df2[col]) > atol, df1[col], np.nan)


def remove_rows_cols_all_na(diff_df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows, then columns, that contain no differences at all.
    diff_df = diff_df.dropna(how="all")
    diff_df = diff_df.dropna(axis=1, how="all")
    return diff_df


def check_cols_are_numeric(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> bool:
    return pd.api.types.is_numeric_dtype(
        df1[col]
    ) and pd.api.types.is_numeric_dtype(df2[col])
Additional Context
No response