Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I want to aggregate multiple columns with bootstrapping (scipy.stats.bootstrap). However, this aggregation produces multiple scalar outputs, e.g., the lower and upper bounds of the confidence interval and the mean (at least 3 columns). I want to apply this aggregation to multiple columns independently (hence apply is not really suitable). However, agg can only handle aggregation functions with one scalar output. Bootstrapping is a 101 task in data analysis, and I don't really understand why it is so complicated to implement in pandas. I know that, in theory, I could write separate aggregation functions for the lower bound, the upper bound, and the mean, but that would 1. take three times longer and 2. be mathematically questionable, as the bounds and the mean might come from different random processes. Fixing the random seed would be one way around the second point, but taking three times longer is still not good...
Feature Description
Here is some demo code
import pandas as pd
import numpy as np

def custom_aggregate(data):
    # return 1 # Works
    return pd.Series({
        'mean_ci_lower': [0.3], # hardcoded for demo
        'mean_ci_upper': [0.5], # hardcoded for demo
        'real_mean': np.mean(data),
    })

def main():
    data = {
        'acctrain': [0.496070, 0.579231, 0.1, 0.3],
        'acctest': [0.455256, 0.147513, 0.1, 0.5],
        'experimentname': ['experimentA', 'experimentB', 'experimentA', 'experimentB']
    }
    df = pd.DataFrame(data)
    print(df)
    print("Aggregated: ")
    df2 = df.groupby(["experimentname"]).agg({
        "acctrain": custom_aggregate,
        "acctest": custom_aggregate,
    }).reset_index()
    print(df2)

if __name__ == '__main__':
    main()
I would expect to get a multi-indexed DataFrame like
                       acctrain                                       acctest
               bs_mean_ci_lower bs_mean_ci_upper real_mean bs_mean_ci_lower bs_mean_ci_upper real_mean
experimentname
experimentA                 0.3              0.4  0.298035              0.3              0.4  0.277628
experimentB                 0.3              0.4  0.439616              0.3              0.4  0.323757
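The requested behavior can be sketched outside of agg as follows. This is only a rough illustration, not a proposal for the actual pandas implementation: `multi_output_agg` and `bootstrap_mean` are hypothetical helper names, and the bootstrap is a plain NumPy percentile bootstrap used as a stand-in for scipy.stats.bootstrap so that the snippet has no SciPy dependency. Each group/column pair is resampled exactly once, and all three statistics come from that single run.

```python
import numpy as np
import pandas as pd


def bootstrap_mean(values, n_resamples=1000, confidence=0.95, seed=0):
    """Hypothetical stand-in for scipy.stats.bootstrap (percentile method)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw all resamples at once: one row of indices per resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(means, [alpha, 1.0 - alpha])
    # All three outputs come from the same bootstrap run.
    return {
        "mean_ci_lower": lower,
        "mean_ci_upper": upper,
        "real_mean": values.mean(),
    }


def multi_output_agg(df, by, columns, func):
    """Hypothetical helper: apply a dict-returning func per group and column.

    Returns a frame indexed by the group key with (column, statistic)
    MultiIndex columns. `by` is assumed to be a single column name.
    """
    pieces = {}
    for col in columns:
        keys, rows = [], []
        for key, group in df.groupby(by)[col]:
            keys.append(key)
            rows.append(func(group))
        pieces[col] = pd.DataFrame(rows, index=pd.Index(keys, name=by))
    # Dict keys become the top level of the column MultiIndex.
    return pd.concat(pieces, axis=1)


df = pd.DataFrame({
    "acctrain": [0.496070, 0.579231, 0.1, 0.3],
    "acctest": [0.455256, 0.147513, 0.1, 0.5],
    "experimentname": ["experimentA", "experimentB", "experimentA", "experimentB"],
})
result = multi_output_agg(df, "experimentname", ["acctrain", "acctest"], bootstrap_mean)
print(result)
```

The explicit Python loop over groups is slower than a native agg would be, which is part of why built-in support would be preferable.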
Alternative Solutions
The hacky workaround I use right now is something along the lines of
def custom_aggregate(data):
    return (1,2)
and later
df2[['acctrain_ci_lower', 'acctrain_ci_upper']] = pd.DataFrame(df2['acctrain'].tolist(), index=df2.index)
df2[['acctest_ci_lower', 'acctest_ci_upper']] = pd.DataFrame(df2['acctest'].tolist(), index=df2.index)
So I return a tuple and later unpack it into multiple columns. This requires a lot of hardcoding...
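The unpacking step could be generalized a little so that the column names are not repeated by hand. The following is a minimal sketch under some assumptions: `expand_tuple_columns` is a hypothetical helper, every cell in the listed columns is assumed to hold a tuple of the same length, and the numbers in the demo frame are made-up placeholders rather than real bootstrap results.

```python
import pandas as pd


def expand_tuple_columns(frame, columns, names):
    """Hypothetical helper: split tuple-valued columns into sub-columns.

    Each column in `columns` must hold tuples of len(names) elements.
    The result has (original column, name) MultiIndex columns.
    """
    parts = {
        col: pd.DataFrame(frame[col].tolist(), index=frame.index, columns=names)
        for col in columns
    }
    return pd.concat(parts, axis=1)


# Placeholder frame shaped like the output of the tuple-returning agg above.
df2 = pd.DataFrame(
    {
        "acctrain": [(0.1, 0.5), (0.3, 0.6)],
        "acctest": [(0.1, 0.4), (0.2, 0.5)],
    },
    index=pd.Index(["experimentA", "experimentB"], name="experimentname"),
)
expanded = expand_tuple_columns(df2, ["acctrain", "acctest"], ["ci_lower", "ci_upper"])
print(expanded)
```

This still goes through an intermediate object-dtype column of tuples, so it only reduces the hardcoding and does not remove the need for the workaround.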
Additional Context
No response