Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
When estimating the Pearson correlation, Pandas computes the sample means of the variables involved. Although this is a reasonable default, it can bias the estimate when the true mean is known.
For instance, if I generate n random series and compute their autocorrelations, I see a bias due to the mean estimation and to the correlation between the covariance and the variance.
I would like Pandas to let me pass the true mean.
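The effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the series are Gaussian white noise with a known mean of 0, the series length of 10 and the helper name `lag1_corr` are chosen for illustration, and the lag-1 autocorrelation is computed as the Pearson correlation between the series and its shifted copy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, length = 20_000, 10   # many short series make the bias visible
true_mean = 0.0                 # assumption: the population mean is known


def lag1_corr(x, mean=None):
    """Lag-1 autocorrelation as a Pearson correlation of x[:-1] and x[1:].

    If `mean` is None, each segment is centered on its own sample mean;
    otherwise both segments are centered on the supplied mean.
    """
    a, b = x[:-1], x[1:]
    a_mean = a.mean() if mean is None else mean
    b_mean = b.mean() if mean is None else mean
    a_, b_ = a - a_mean, b - b_mean
    return (a_ * b_).mean() / np.sqrt((a_ ** 2).mean() * (b_ ** 2).mean())


# Independent draws, so the true lag-1 autocorrelation is 0.
samples = rng.normal(loc=true_mean, scale=1.0, size=(n_series, length))

with_sample_mean = np.array([lag1_corr(x) for x in samples])
with_true_mean = np.array([lag1_corr(x, mean=true_mean) for x in samples])

print("average estimate, sample mean:", with_sample_mean.mean())
print("average estimate, true mean:  ", with_true_mean.mean())
```

With the sample mean, the average estimate comes out clearly negative even though the data are uncorrelated; with the known mean it stays close to zero.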
Feature Description
I am not sure how to insert this into the existing API. Maybe:
def corr(method='pearson', x_mean=None, y_mean=None):
    if method == 'pearson':
        # Fall back to the sample means when the true means are not given.
        if x_mean is None:
            x_mean = compute_sample_mean_of_x  # placeholder
        if y_mean is None:
            y_mean = compute_sample_mean_of_y  # placeholder
Alternative Solutions
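import numpy as np

def pearson(x, y, x_mean=None, y_mean=None):
    # Fall back to the sample means when the true means are not given.
    if x_mean is None:
        x_mean = x.mean()
    if y_mean is None:
        y_mean = y.mean()
    # Center both variables on the chosen means.
    x_ = x - x_mean
    y_ = y - y_mean
    cov = (x_ * y_).mean()
    den = (x_ ** 2).mean() * (y_ ** 2).mean()
    return cov / np.sqrt(den)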
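A possible way to use this workaround next to the built-in method is sketched below. The data are synthetic and the known means of 0 are an assumption made for the example; the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def pearson(x, y, x_mean=None, y_mean=None):
    # Same workaround as above: use the supplied means when given,
    # otherwise fall back to the sample means.
    if x_mean is None:
        x_mean = x.mean()
    if y_mean is None:
        y_mean = y.mean()
    x_ = x - x_mean
    y_ = y - y_mean
    cov = (x_ * y_).mean()
    den = (x_ ** 2).mean() * (y_ ** 2).mean()
    return cov / np.sqrt(den)


rng = np.random.default_rng(42)
x = pd.Series(rng.normal(size=50))            # population mean 0 by construction
y = 0.5 * x + pd.Series(rng.normal(size=50))  # population mean 0 as well

default = pearson(x, y)                        # sample means
known = pearson(x, y, x_mean=0.0, y_mean=0.0)  # known population means
builtin = x.corr(y)                            # pandas, always sample means

print(default, builtin, known)
```

Without explicit means the function agrees with `Series.corr`, since the normalization factors cancel in the ratio; passing the known means gives a slightly different estimate on the same data.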
Additional Context
I would like to point out that this is not the only bias in my problem. Still, I don't understand why Pandas does not let me pass the population means of my variables when I know them, and instead expects me to have enough data for the sample mean estimation to be unbiased.