Performance optimizations of various statistics

**Is your feature request related to a problem? Please describe.**

Not really a problem, more like a potential optimization (I haven't worked out the details to see if it actually works).

So, if I understood how the algorithm works under the hood, we basically move the points in the dataset, one at a time, then, at each iteration, compute the mean, the standard deviation, and the correlation coefficient of this new dataset.
One thing that stands out performance-wise is that we currently use _all_ of the points to compute the statistics at each step, which seems a bit wasteful.

**Describe the solution you'd like**

Instead of computing the statistics of the whole dataset, which requires at least iterating over all $n$ points (even more ops for the stdev/corrcoef), we can use the fact that we are only moving one point, and rewrite the new statistics in terms of old ones + a perturbation. For instance, for the new value of the mean statistic, we get:

$$
\langle X' \rangle = \langle X \rangle + \frac{\delta}{n}
$$

where $\delta = x'_i - x_i$, and $n$ is the number of points in the dataset. Analogous formulas can be derived for the variance, which is the square of the stdev anyway (it's possible some tweaking of the denominators is needed when taking into account the Bessel correction):

$$
\text{Var}(X') = \text{Var}(X) + 2 \frac{\delta}{n}(x_i - \langle X \rangle) + \frac{\delta^2}{n} - \frac{\delta^2}{n^2}
$$

and probably for the correlation coefficient (or better, its square) as well. This would allow us to compute all of the statistics in basically $O(1)$ time, instead of $O(n)$ or larger.

There's at least one problem which I haven't worked out yet: is this numerically stable? Since numerical accuracy is paramount for the code to work properly, if the above has a large loss of accuracy, then it's not very useful, but if it's stable, it could be worthwhile to explore implementing it.

Some references that could be of use (regarding both the computation and numerical stability):

- https://changyaochen.github.io/welford/
- https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
- https://stackoverflow.com/questions/5147378/rolling-variance-algorithm

**Describe alternatives you've considered**

None.

**Additional context**

None.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Performance optimizations of various statistics #201

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Performance optimizations of various statistics #201

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions