ENH: pearson correlation with population mean #55145

@Xylambda

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When estimating the Pearson correlation, Pandas computes the sample mean of each variable involved. Although this is a reasonable default, it can bias the estimate when the true mean is known.

For instance, if I generate n random series and compute their autocorrelations, I see a bias caused by the mean estimation and by the correlation between the covariance and variance estimates.

I would like Pandas to let me pass the true mean.
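
The mean-estimation part of this bias can be reproduced with a small simulation. This is only a minimal sketch under stated assumptions: Gaussian white noise with true mean 0, lag-1 autocorrelation, and a hypothetical helper `lag1_autocorr` that is not part of pandas. It does not try to isolate the other sources of bias mentioned above.

```python
import numpy as np

def lag1_autocorr(x, mean=None):
    """Lag-1 Pearson autocorrelation (hypothetical helper, not a pandas API).

    Uses the sample means of the two shifted sub-series unless a known
    `mean` is passed.
    """
    a, b = x[:-1], x[1:]
    a_mean = a.mean() if mean is None else mean
    b_mean = b.mean() if mean is None else mean
    a_, b_ = a - a_mean, b - b_mean
    return (a_ * b_).mean() / np.sqrt((a_ ** 2).mean() * (b_ ** 2).mean())

rng = np.random.default_rng(0)
n_series, length = 20_000, 20  # many short series make the bias visible
series = rng.standard_normal((n_series, length))  # true mean 0, no autocorrelation

# Average the estimates over all series to approximate their expectation.
avg_sample_mean = np.mean([lag1_autocorr(s) for s in series])
avg_true_mean = np.mean([lag1_autocorr(s, mean=0.0) for s in series])

print(f"sample mean: {avg_sample_mean:+.4f}")
print(f"true mean:   {avg_true_mean:+.4f}")
```

With series this short, the average estimate based on sample means should come out clearly negative, while the one centered on the known mean should average close to zero.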

Feature Description

I am not sure how to insert this into the existing API. Maybe:

def corr(x, y, method='pearson', x_mean=None, y_mean=None):
    if method == 'pearson':
        # Fall back to the sample means when no known means are passed.
        if x_mean is None:
            x_mean = x.mean()
        if y_mean is None:
            y_mean = y.mean()
        # ... compute the correlation around x_mean and y_mean

Alternative Solutions

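import numpy as np

def pearson(x, y, x_mean=None, y_mean=None):
    # Fall back to the sample means when no known means are passed.
    if x_mean is None:
        x_mean = x.mean()
    if y_mean is None:
        y_mean = y.mean()

    # Center each variable around its (known or estimated) mean.
    x_ = x - x_mean
    y_ = y - y_mean

    cov = (x_ * y_).mean()
    den = (x_ ** 2).mean() * (y_ ** 2).mean()
    return cov / np.sqrt(den)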
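
A possible usage sketch for the workaround above, assuming synthetic data whose true means are 0 by construction. The function is restated so the snippet runs on its own. With no means passed, it should agree with the existing `Series.corr`, because the normalization cancels in the ratio.

```python
import numpy as np
import pandas as pd

def pearson(x, y, x_mean=None, y_mean=None):
    # Same logic as the workaround above, restated to keep this self-contained.
    if x_mean is None:
        x_mean = x.mean()
    if y_mean is None:
        y_mean = y.mean()
    x_ = x - x_mean
    y_ = y - y_mean
    cov = (x_ * y_).mean()
    den = (x_ ** 2).mean() * (y_ ** 2).mean()
    return cov / np.sqrt(den)

rng = np.random.default_rng(42)
x = pd.Series(rng.standard_normal(50))
y = 0.5 * x + rng.standard_normal(50)  # both true means are 0 by construction

r_sample = pearson(x, y)                         # sample means, like Series.corr
r_known = pearson(x, y, x_mean=0.0, y_mean=0.0)  # known population means

print(r_sample, x.corr(y))
print(r_known)
```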

Additional Context

I would like to point out that this is not the only source of bias in my problem. Still, I don't understand why Pandas does not let me pass the population means of my variables when I know them, and instead expects me to have enough data for the bias from sample-mean estimation to be negligible.
