ENH: Allow custom aggregation functions with multiple return values. #59781

@noppelmax

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I want to aggregate multiple columns with bootstrapping (scipy.stats.bootstrap). However, this aggregation produces multiple scalar outputs per column, e.g., the lower and upper bounds of the confidence interval and the mean (so at least 3 output columns). I want to apply this aggregation to multiple columns independently, so apply is not really suitable. However, agg can only handle aggregation functions with a single scalar output. Bootstrapping is a basic task in data analysis, and I don't really understand why this is so complicated to implement in pandas. I know that in theory I could write three separate aggregation functions for the lower bound, the upper bound, and the mean. But that would (1) take three times as long and (2) be mathematically questionable, since the bounds and the mean could come from different random resamples. Fixing the random seed would be one way around the second point, but taking three times as long is still not good.
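To make the shape of the problem concrete, here is a minimal sketch of an aggregation that yields three scalars from a single resampling run. It uses a plain percentile bootstrap written with NumPy as a simplified stand-in for scipy.stats.bootstrap (which defaults to the BCa method), so the numbers will not match SciPy exactly. The function name, its parameters, and the output keys are assumptions made for illustration.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=2000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean (illustrative only)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed makes the run reproducible
    # Draw all resamples at once: one row of indices per resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    resampled_means = values[idx].mean(axis=1)
    alpha = 1.0 - confidence_level
    lower, upper = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    # All three outputs come from the same data and the same resampling run.
    return {
        "mean_ci_lower": float(lower),
        "mean_ci_upper": float(upper),
        "real_mean": float(values.mean()),
    }


ci = bootstrap_mean_ci([0.496070, 0.1, 0.3, 0.5])
print(ci)
```

Because the bounds and the mean are computed in one call, splitting this into three single-output functions would repeat the resampling three times.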

Feature Description

Here is some demo code:

import pandas as pd
import numpy as np

def custom_aggregate(data):
    # return 1  # Works
    return pd.Series({
        'bs_mean_ci_lower': 0.3,  # hardcoded for demo
        'bs_mean_ci_upper': 0.4,  # hardcoded for demo
        'real_mean': np.mean(data),
    })

def main():
    data = {
        'acctrain': [0.496070, 0.579231, 0.1, 0.3],
        'acctest':  [0.455256, 0.147513, 0.1, 0.5],
        'experimentname': ['experimentA', 'experimentB', 'experimentA', 'experimentB'],
    }

    df = pd.DataFrame(data)
    print(df)
    print("Aggregated: ")
    df2 = df.groupby(["experimentname"]).agg({
        "acctrain": custom_aggregate,
        "acctest": custom_aggregate,
    }).reset_index()
    print(df2)

if __name__ == '__main__':
    main()

I would expect to get a DataFrame with MultiIndex columns like:

                acctrain                                               acctest                                                   
                bs_mean_ci_lower    bs_mean_ci_upper    real_mean      bs_mean_ci_lower    bs_mean_ci_upper    real_mean    
experimentname                                                                                                              
experimentA     0.3                 0.4                 0.298035       0.3                 0.4                 0.277628     
experimentB     0.3                 0.4                 0.439616       0.3                 0.4                 0.323757     
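For comparison, the layout above can be produced with existing pandas building blocks by calling the function once per group and column and concatenating the results. This is only a sketch of one possible approach, not a proposal for how agg should implement it. The helper name aggregate_multi is hypothetical, and the confidence bounds are still hardcoded placeholders.

```python
import numpy as np
import pandas as pd


def custom_aggregate(data):
    # Placeholder bounds; a real version would compute them from `data`.
    return pd.Series({
        "bs_mean_ci_lower": 0.3,
        "bs_mean_ci_upper": 0.4,
        "real_mean": np.mean(data),
    })


def aggregate_multi(df, by, columns, func):
    """Hypothetical helper: call func once per (group, column) and stack the results."""
    per_column = {}
    for col in columns:
        # One Series of statistics per group, keyed by the group label.
        stats = {key: func(group) for key, group in df.groupby(by)[col]}
        frame = pd.concat(stats, axis=1).T  # groups become rows
        frame.index.name = by
        per_column[col] = frame
    # The dict keys become the outer level of the column MultiIndex.
    return pd.concat(per_column, axis=1)


df = pd.DataFrame({
    "acctrain": [0.496070, 0.579231, 0.1, 0.3],
    "acctest": [0.455256, 0.147513, 0.1, 0.5],
    "experimentname": ["experimentA", "experimentB", "experimentA", "experimentB"],
})
result = aggregate_multi(df, "experimentname", ["acctrain", "acctest"], custom_aggregate)
print(result)
```

The function runs exactly once per group and column, so the bounds and the mean always come from the same call. The cost is a Python-level loop and some boilerplate that direct support in agg would remove.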

Alternative Solutions

The hacky workaround I use right now is something along the lines of

def custom_aggregate(data):
    return (1,2)

and later

df2[['acctrain_ci_lower', 'acctrain_ci_upper']] = pd.DataFrame(df2['acctrain'].tolist(), index=df2.index)
df2[['acctest_ci_lower', 'acctest_ci_upper']] = pd.DataFrame(df2['acctest'].tolist(), index=df2.index)

So I return a tuple and then expand it into multiple columns afterwards. This requires a lot of hardcoding.
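The hardcoding in the tuple workaround can be reduced somewhat by expanding every tuple column in a loop. The sketch below assumes each aggregated cell holds a tuple of the same length; the helper name expand_tuple_columns and the STAT_NAMES labels are hypothetical.

```python
import pandas as pd

STAT_NAMES = ["ci_lower", "ci_upper"]  # hypothetical labels for the tuple positions


def custom_aggregate(data):
    # Placeholder values, as in the workaround above.
    return (1, 2)


def expand_tuple_columns(agg_df, value_columns, stat_names):
    """Hypothetical helper: replace each tuple-valued column with one column per element."""
    out = agg_df.drop(columns=value_columns)
    for col in value_columns:
        expanded = pd.DataFrame(
            agg_df[col].tolist(),
            index=agg_df.index,
            columns=[f"{col}_{name}" for name in stat_names],
        )
        out = out.join(expanded)
    return out


df = pd.DataFrame({
    "acctrain": [0.496070, 0.579231, 0.1, 0.3],
    "acctest": [0.455256, 0.147513, 0.1, 0.5],
    "experimentname": ["experimentA", "experimentB", "experimentA", "experimentB"],
})
value_columns = ["acctrain", "acctest"]
df2 = (
    df.groupby("experimentname")
    .agg({col: custom_aggregate for col in value_columns})
    .reset_index()
)
df3 = expand_tuple_columns(df2, value_columns, STAT_NAMES)
print(df3)
```

This still relies on agg storing tuples in object-dtype columns, so it remains a workaround rather than a replacement for the requested feature.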

Additional Context

No response
