Estimating mean and variance / standard deviation #226

@npiguet

I was wondering if there was a way to calculate the mean and standard deviation of the distribution summarized by a T-Digest?

Disclaimer: I don't really understand all the clever math behind T-Digest

Mean:
It seems that calculating the average of all centroids weighted by their respective count should get me close enough.
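A minimal sketch of that weighted average, for illustration only. It assumes the centroids have already been pulled out of the digest as plain `(mean, count)` pairs; the `weighted_mean` helper and the sample values are hypothetical and not part of any t-digest API.

```python
def weighted_mean(centroids):
    """Average of the centroid means, weighted by each centroid's count.

    `centroids` is an iterable of (mean, count) pairs. This is a stand-in
    for whatever the digest exposes, not a real t-digest type.
    """
    centroids = list(centroids)
    total = sum(count for _, count in centroids)
    if total == 0:
        raise ValueError("no samples in digest")
    return sum(mean * count for mean, count in centroids) / total


# Hypothetical centroids, not read from a real digest.
example = [(1.0, 1), (2.5, 2), (6.0, 3)]
print(weighted_mean(example))  # (1*1 + 2.5*2 + 6*3) / 6 = 4.0
```

If each centroid stores the exact average of the samples merged into it, this weighted average should match the sample mean up to floating-point rounding.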

Variance:

I was thinking of:

  1. Calculate the mean
  2. For each centroid, calculate (centroid.mean - mean)^2 * centroid.count, and then sum the results over all centroids
  3. Then the variance is equal to the value calculated in step 2, divided by sum(centroid.count)
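The three steps above can be sketched as follows. As before, the centroids are assumed to be available as plain `(mean, count)` pairs, and `centroid_variance` is a hypothetical helper name, not an existing method of the TDigest class.

```python
def centroid_variance(centroids):
    """Variance estimate that treats every sample as sitting at its centroid.

    `centroids` is an iterable of (mean, count) pairs (a stand-in type,
    not a real t-digest class).
    """
    centroids = list(centroids)
    total = sum(count for _, count in centroids)
    if total == 0:
        raise ValueError("no samples in digest")
    # Step 1: count-weighted mean of the centroids.
    mean = sum(m * c for m, c in centroids) / total
    # Step 2: count-weighted squared deviations, summed over all centroids.
    squared = sum(c * (m - mean) ** 2 for m, c in centroids)
    # Step 3: divide by the total count (population variance, not n - 1).
    return squared / total


# Hypothetical centroids, not read from a real digest.
example = [(1.0, 1), (2.5, 2), (6.0, 3)]
print(centroid_variance(example))  # (9 + 4.5 + 12) / 6 = 4.25
```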

This is obviously wrong, since it assumes that all the samples represented by a centroid have the same value, but I can't figure out a reasonable way to get more accurate values.
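To make the size of that error concrete: by the law of total variance, the overall variance splits into a between-centroid part (what the steps above compute) and a within-centroid part (what they drop), so the centroid-only estimate can never exceed the true value. The sketch below checks this on a toy example. Grouping sorted samples into pairs is only a stand-in for clustering, not how a t-digest actually forms centroids.

```python
from statistics import fmean, pvariance

# Toy data; pairs of neighbours play the role of centroids.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
groups = [samples[i:i + 2] for i in range(0, len(samples), 2)]

n = len(samples)
overall_mean = fmean(samples)

# Between-group part: every sample is treated as sitting at its group mean.
between = sum(len(g) * (fmean(g) - overall_mean) ** 2 for g in groups) / n

# Within-group part: the spread inside each group, which a centroid
# (mean and count only) does not record.
within = sum(len(g) * pvariance(g) for g in groups) / n

true_variance = pvariance(samples)
print(between, within, true_variance)
```

Here the between-group estimate comes out below the true variance, and the gap is exactly the within-group term. A centroid holding only a mean and a count carries no information about that term, so recovering it would need either extra per-centroid state or an assumption about how samples are spread inside each centroid.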

Do you know a better way to calculate the mean and variance? Would it be possible to add the corresponding methods to the TDigest class?
