ENH: add atol to pd.DataFrame.compare() #54677

@JonahBreslow


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When comparing pandas DataFrames that contain floating-point numbers, it can be extremely useful to compare with an absolute tolerance (atol), as pandas.testing.assert_frame_equal already allows.
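To make the gap concrete, here is a small sketch. The frames `left` and `right` are made-up example data. It shows that `DataFrame.compare()` reports floating-point noise as a difference, while `assert_frame_equal` tolerates the noise but only passes or raises, without returning the differing cells.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up example data: row 1 differs only by noise, row 2 differs for real.
left = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"x": [1.0, 2.0 + 1e-9, 3.5]})

# DataFrame.compare is exact, so the 1e-9 noise in row 1 is reported
# alongside the real difference in row 2.
exact_diff = left.compare(right)

# assert_frame_equal accepts atol/rtol: the first two rows are equal
# within the tolerance, so this call does not raise.
assert_frame_equal(left.iloc[:2], right.iloc[:2], rtol=0.0, atol=1e-6)

# On the full frames it raises because of row 2, but it gives no
# DataFrame of differences to work with.
try:
    assert_frame_equal(left, right, rtol=0.0, atol=1e-6)
    frames_match = True
except AssertionError:
    frames_match = False
```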

Feature Description

I propose we add an argument to the function signature of pd.DataFrame.compare() as follows:

class DataFrame(NDFrame, OpsMixin):
    def __init__(...): ...
    ...
    def compare(self, ..., atol: float | None = None):
        # compare numeric columns using the absolute tolerance `atol`
        ...
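As a rough illustration of the behaviour this argument could have, the sketch below emulates it on top of the existing `compare()`. The function name `compare_with_atol` is hypothetical and not part of pandas. It assumes both frames have identical labels and matching dtypes per column, and it only demonstrates the idea rather than a proposed implementation.

```python
import pandas as pd


def compare_with_atol(
    left: pd.DataFrame, right: pd.DataFrame, atol: float
) -> pd.DataFrame:
    """Hypothetical helper: like left.compare(right), but numeric cells
    that differ by at most `atol` are treated as equal."""
    numeric = left.select_dtypes(include="number").columns
    patched = right.copy()
    if len(numeric) > 0:
        # True where the two values are within the absolute tolerance.
        close = (left[numeric] - right[numeric]).abs() <= atol
        # Overwrite "close enough" cells with the left value so that the
        # exact comparison below no longer flags them.
        patched[numeric] = right[numeric].where(~close, left[numeric])
    return left.compare(patched)


# Made-up example data.
left = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
right = pd.DataFrame({"a": [1.0 + 1e-9, 2.5, 3.0], "b": ["x", "y", "q"]})
result = compare_with_atol(left, right, atol=1e-6)
```

Here row 0 is dropped from the result because its difference is within the tolerance, while the numeric difference in row 1 and the string difference in row 2 are still reported.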

Alternative Solutions

This is some workaround code that works for my specific use case, but it is most definitely not general:

import numpy as np
import pandas as pd


def deep_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float
) -> pd.DataFrame:
    """Compare two pandas dataframes at a deep level. This will
    return a dataframe with the differences between the two frames
    explicitly shown.

    Args:
        df1 (pd.DataFrame): The left dataframe
        df2 (pd.DataFrame): The right dataframe
        atol (float): Absolute tolerance

    Returns:
        pd.DataFrame: A dataframe with the differences between the two frames
    """
    diff_df = pd.DataFrame(index=df1.index, columns=df1.columns)
    for col in df1.columns:
        if check_cols_are_numeric(df1, df2, col):
            diff_df[col] = tolerance_compare(df1, df2, atol, col)
        else:
            diff_df[col] = exact_compare(df1, df2, col)

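    # Keep only rows and columns that contain at least one difference,
    # then pull the matching columns from the right frame.
    diff_df = remove_rows_cols_all_na(diff_df)
    diff_columns = diff_df.columns
    right_df = df2[diff_columns]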

    diff_df = diff_df.merge(
        right_df, left_index=True, right_index=True, suffixes=("_pg", "_snf")
    )

    return diff_df

def exact_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> np.ndarray:
    return np.where(df1[col] != df2[col], df1[col], np.nan)


def tolerance_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float, col: str
) -> np.ndarray:
    return np.where(np.abs(df1[col] - df2[col]) > atol, df1[col], np.nan)


def remove_rows_cols_all_na(diff_df: pd.DataFrame) -> pd.DataFrame:
    diff_df = diff_df.dropna(how="all")
    diff_df = diff_df.dropna(axis=1, how="all")
    return diff_df


def check_cols_are_numeric(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> bool:
    return pd.api.types.is_numeric_dtype(
        df1[col]
    ) and pd.api.types.is_numeric_dtype(df2[col])
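One caveat of `tolerance_compare` above: a NaN on either side makes `np.abs(df1[col] - df2[col]) > atol` evaluate to False, so a NaN-versus-number mismatch is silently dropped. A minimal sketch of a NaN-aware mask using `np.isclose` follows. The helper name `differs_beyond_atol` is hypothetical, and it assumes both inputs can be converted to float.

```python
import numpy as np
import pandas as pd


def differs_beyond_atol(s1: pd.Series, s2: pd.Series, atol: float) -> np.ndarray:
    """Hypothetical helper: True where two numeric Series differ by more
    than `atol`. NaN vs NaN counts as equal, NaN vs a number as different."""
    return ~np.isclose(
        s1.to_numpy(dtype=float),
        s2.to_numpy(dtype=float),
        rtol=0.0,  # absolute tolerance only
        atol=atol,
        equal_nan=True,
    )


# Made-up example data covering noise, NaN/NaN, NaN/number and a real difference.
s1 = pd.Series([1.0, np.nan, np.nan, 4.0])
s2 = pd.Series([1.0 + 1e-9, np.nan, 3.0, 4.5])
mask = differs_beyond_atol(s1, s2, atol=1e-6)
```

The resulting mask could replace the condition passed to `np.where` in `tolerance_compare`.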

Additional Context

No response

Labels: Enhancement; Needs Discussion (requires discussion from core team before further action)