
More stable algorithm for variance, standard deviation #456


Draft: jemmajeffree wants to merge 32 commits into main

Conversation

jemmajeffree (Author):

Updated algorithm for nanvar to use an adapted version of the algorithm from the Schubert and Gertz (2018) paper mentioned in #386, following discussion in #422.
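For context, a minimal NumPy sketch of the idea (illustrative only, not this PR's actual code): each chunk is reduced once to per-group (sum, sum of squared deviations about the chunk mean, count), and two chunks are merged with the standard parallel-variance correction term.

import numpy as np

def var_chunk_sketch(array, group_idx, ngroups):
    # Per-chunk intermediates: (sum, sum of squared deviations, count) per group
    sums = np.zeros(ngroups)
    counts = np.zeros(ngroups)
    np.add.at(sums, group_idx, array)
    np.add.at(counts, group_idx, 1)
    means = sums / np.maximum(counts, 1)
    sq_dev = np.zeros(ngroups)
    np.add.at(sq_dev, group_idx, (array - means[group_idx]) ** 2)
    return sums, sq_dev, counts

def var_combine_sketch(a, b):
    # Merge two chunks' intermediates; the correction term accounts for the
    # difference between the two chunk means
    sums_a, m2_a, n_a = a
    sums_b, m2_b, n_b = b
    n = n_a + n_b
    with np.errstate(invalid="ignore", divide="ignore"):
        delta = sums_b / n_b - sums_a / n_a
        correction = np.where(n > 0, delta**2 * n_a * n_b / n, 0.0)
    return sums_a + sums_b, m2_a + m2_b + np.nan_to_num(correction), n

The final per-group variance is then m2 / (count - ddof). Each chunk touches its data only once, and the combine step avoids the cancellation that the naive E[x^2] - E[x]^2 formulation suffers when the mean is much larger than the standard deviation.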


def __init__(self, arrays):
    self.arrays = arrays  # something else needed here to be more careful about types (not sure what)
    # Do we want to coerce arrays into a tuple and make sure it's immutable? Do we want it to be immutable?
dcherian (Collaborator):

this is fine as-is

return MULTIARRAY_HANDLED_FUNCTIONS[func](*args, **kwargs)

# Shape is needed, seems likely that the other two might be
# Making some strong assumptions here that all the arrays are the same shape, and I don't really like this
dcherian (Collaborator):

yeah this data structure isn't useful in general, and is only working around some limitations in the design where we need to pass in multiple intermediates to the combine function. So there will be some ugliness. You have good instincts.
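For reference, the NumPy __array_function__ protocol this hooks into looks roughly like the following. This is a sketch: the names MultiArray and MULTIARRAY_HANDLED_FUNCTIONS mirror the diff, everything else is illustrative.

import numpy as np

MULTIARRAY_HANDLED_FUNCTIONS = {}

def implements(numpy_function):
    # Register a MultiArray-aware implementation for a NumPy function
    def decorator(func):
        MULTIARRAY_HANDLED_FUNCTIONS[numpy_function] = func
        return func
    return decorator

class MultiArray:
    def __init__(self, arrays):
        self.arrays = tuple(arrays)

    @property
    def shape(self):
        # assumes all component arrays share a shape, as noted above
        return self.arrays[0].shape

    def __array_function__(self, func, types, args, kwargs):
        if func not in MULTIARRAY_HANDLED_FUNCTIONS:
            return NotImplemented
        return MULTIARRAY_HANDLED_FUNCTIONS[func](*args, **kwargs)

@implements(np.concatenate)
def _concatenate(multiarrays, axis=0):
    # apply np.concatenate component-wise across the wrapped arrays
    n = len(multiarrays[0].arrays)
    return MultiArray(
        np.concatenate([ma.arrays[i] for ma in multiarrays], axis=axis) for i in range(n)
    )

With this in place, np.concatenate([MultiArray([a1, b1]), MultiArray([a2, b2])]) dispatches to _concatenate and returns another MultiArray.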


sum_squared_deviations = sum(
group_idx,
(array - array_means[..., group_idx]) ** 2,
dcherian (Collaborator):

👏 👏🏾

@@ -235,7 +235,7 @@ def gen_array_by(size, func):
 @pytest.mark.parametrize("size", [(1, 12), (12,), (12, 9)])
 @pytest.mark.parametrize("nby", [1, 2, 3])
 @pytest.mark.parametrize("add_nan_by", [True, False])
-@pytest.mark.parametrize("func", ALL_FUNCS)
+@pytest.mark.parametrize("func", ["nanvar"])
dcherian (Collaborator):

we will revert before merging, but this is the test we need to make work first. It runs a number of complex cases.

@@ -343,12 +343,106 @@ def _mean_finalize(sum_, count):
     )


+def var_chunk(group_idx, array, *, engine: str, axis=-1, size=None, fill_value=None, dtype=None):
dcherian (Collaborator) commented Jul 18, 2025:

I moved this here so that we can generalize to "all" engines. It has some ugliness (notice that it now takes the engine kwarg).

array_sums = generic_aggregate(
group_idx,
array,
func="nansum",
dcherian (Collaborator):

This will need to be "sum" for "var".

jemmajeffree (Author):

My first thought is to pass some kind of "are NaNs okay" boolean through to var_chunk and var_combine. Is this what xarray's skipna does? Or I think I've seen it done as a string, "propagate" or "ignore"? And then to call var_chunk and var_combine as partials.

dcherian (Collaborator):

Yes, the way I do this in flox is to create var_chunk = partial(_var_chunk, skipna=False) and _nanvar_chunk = partial(_var_chunk, skipna=True). You can stick this in the Aggregation constructor, I think.
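A rough outline of that pattern (the real _var_chunk signature in this PR may differ; the "nansum"/"nanlen" func names are illustrative):

from functools import partial

def _var_chunk(group_idx, array, *, skipna, engine, axis=-1, size=None, fill_value=None, dtype=None):
    # choose NaN-skipping vs NaN-propagating intermediate reductions
    sum_func = "nansum" if skipna else "sum"
    len_func = "nanlen" if skipna else "len"
    ...  # compute (sum, sum of squared deviations, count) via generic_aggregate

# one implementation, two aggregations
var_chunk = partial(_var_chunk, skipna=False)
nanvar_chunk = partial(_var_chunk, skipna=True)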

@@ -1251,7 +1252,8 @@ def chunk_reduce(
     # optimize that out.
     previous_reduction: T_Func = ""
     for reduction, fv, kw, dt in zip(funcs, fill_values, kwargss, dtypes):
-        if empty:
+        # UGLY! but this is because the `var` breaks our design assumptions
+        if empty and reduction is not var_chunk:
dcherian (Collaborator) commented Jul 18, 2025:

This code path is an "optimization" for chunks that don't contain any valid groups, so group_idx is all -1.
We will need to override full in MultiArray. Look up what the like kwarg does here; it dispatches to the appropriate array type.


The next issue will be that fill_value is a scalar like np.nan but that doesn't work for all our intermediates (e.g. the "count").

  1. My first thought is that MultiArray will need to track a default fill_value per array. For var, this can be initialized to (None, None, 0). If None we use the fill_value passed in; else the default.
  2. The other way would be to hardcode some behaviour in _initialize_aggregation so that agg.fill_value["intermediate"] = ( (fill_value, fill_value, 0), ), and then multi-array can receive that tuple and do the "right thing".

The other place this will matter is in reindex_numpy, which is executed at the combine step. I suspect the second tuple approach is the best.

This bit is hairy, and ill-defined. Let me know if you want me to work through it.
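For the tuple approach (option 2), the override might look roughly like this. It's a sketch: it reuses the implements registry sketched earlier, and exactly how the fill values (and dtypes) get threaded through is the open question here.

import numpy as np

@implements(np.full)  # registered in MULTIARRAY_HANDLED_FUNCTIONS
def _full(shape, fill_value, dtype=None, **kwargs):
    # fill_value arrives as a tuple with one entry per intermediate,
    # e.g. (np.nan, np.nan, 0) for (sum, sum of squared deviations, count)
    if not isinstance(fill_value, tuple):
        fill_value = (fill_value,) * 3  # var carries three intermediates
    return MultiArray(np.full(shape, fv, dtype=dtype) for fv in fill_value)

np.full(shape, fill_value, like=some_multiarray) would then dispatch here via __array_function__; the **kwargs soaks up whatever else NumPy forwards (possibly including like itself).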

jemmajeffree (Author):

I'm partway through implementing something that works here.

  • How do I trigger this code path without brute-force overwriting if empty: with if True:?
  • When np.full is called, like is a NumPy array, not a MultiArray, because it's (I think) the chunk data bypassing var_chunk (could also be an artefact of the if True override above?). In a pinch, I guess I could add an elif that catches empty and reduction is var_chunk and coerces that into a MultiArray, but that's also ugly, so I'm hoping you might have better ideas.

jemmajeffree (Author):

Thinking some more, I may have misinterpreted what fill_value is used for. When is it needed for intermediates?

@dcherian (Collaborator):

This is great progress! Now we reach some much harder parts. I pushed a commit to show where I think the "chunk" function should go and left a few comments. I think the next steps should be to:

  1. address those comments;
  2. add a new test to test_core.py with your reproducer (though modified to work with pure numpy arrays);
  3. implement np.full for MultiArray;
  4. dig a bit more into the "fill value" bits. You'll see that test_groupby_reduce_all fails in a couple of places to do with fill_value and setitem. This will take some work to fix, but basically it has to do with adding a "fill value" for groups that have no values up to this point;
  5. there's another confusing failure where the MultiArray only has 2 arrays instead of 3. I don't understand how that happens.

@jemmajeffree (Author) commented Jul 28, 2025:

Do you think it likely that MultiArray would ever be used for anything else? I'm tempted to rename it VarChunkArray or somesuch, so the line between "expected behaviour" and "something's not right here" can be more clearly defined. Not sure it really changes the code's behaviour right now, but it would allow adding more checks and more defensive code.

By "add a new test to test_core.py with your reproducer (though modified to work with pure numpy arrays)", do you mean add the failing code from my original issue to the end of test_core.py, along the lines of:

@requires_dask
@pytest.mark.parametrize("func", ("nanvar",))  # Expect to expand this to other functions once written
@pytest.mark.parametrize("engine", ("flox",))  # Expect to expand this to other engines once written
# May also want labels parametrized in here?
def test_std_var_precision(func, engine, etc):
    # Generate a dataset with small variance and big mean
    # Check that func with engine gives you the same answer as numpy

with internals mostly modelled on a trimmed down version of test_groupby_reduce_all?
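Fleshed out a little, it might look like this (still a sketch, assuming test_core.py's usual imports and that groupby_reduce returns the reduced result first, followed by the group labels):

@pytest.mark.parametrize("func", ("nanvar",))
@pytest.mark.parametrize("engine", ("flox",))
def test_std_var_precision(func, engine):
    # small variance sitting on top of a large mean;
    # no @requires_dask, per the suggestion above to use pure numpy arrays
    rng = np.random.default_rng(12345)
    labels = np.repeat(np.arange(4), 25)
    array = 1e8 + rng.normal(scale=1e-2, size=labels.size)

    expected = np.array([np.var(array[labels == i], ddof=0) for i in range(4)])
    actual, *_ = groupby_reduce(array, labels, func=func, engine=engine)
    np.testing.assert_allclose(actual, expected, rtol=1e-6)  # tolerance to be tightened/justified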

@dcherian (Collaborator) commented Jul 28, 2025:

Do you think it likely that MultiArray would ever be used for anything else?

Possibly, but only within flox.

so the line between "expected behaviour" and "something's not right here" can be more clearly defined.

We can liberally make use of comments, assert statements, and NotImplementedError exceptions to document choices you have made for var specifically. We do not have to build the most general version now

By "add a new test to test_core.py with your reproducer (though modified to work with pure numpy arrays)", do you mean add the failing code from my original issue to the end of test_core.py, along the lines of:

Yes, a simple-ish one would be fine; groupby_reduce_all is quite the monster. More recently, I am using property tests to boost coverage, for which var and std are currently skipped:

SKIPPED_FUNCS = ["var", "std", "nanvar", "nanstd"]

Hopefully your changes will let us delete that line

@dcherian (Collaborator) commented Aug 5, 2025:

I pushed a commit. Your changes are looking good! I constructed an expected result from numpy and it matches! I'm not sure what the expectation should be for no_offset vs with_offset. Does the original paper make some claims for this kind of comparison?

Lastly, I noticed that you basically have a "property" test here (which is quite cool): a "metamorphic relation", grouped_nanvar(array) == grouped_nanvar(array + arbitrary_offset), though at the moment arbitrary_offset is within some bounded range. I think you'll find it fun to write that as a Hypothesis test.
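Roughly like this, with grouped_nanvar standing in for whatever wrapper calls the reduction under test:

import numpy as np
import hypothesis.extra.numpy as npst
from hypothesis import given, strategies as st

@given(
    array=npst.arrays(np.float64, shape=100, elements=st.floats(-1e3, 1e3)),
    offset=st.floats(-1e6, 1e6),
)
def test_grouped_var_shift_invariant(array, offset):
    labels = np.arange(array.size) % 4
    # metamorphic relation: variance should be (nearly) unchanged by a constant shift
    np.testing.assert_allclose(
        grouped_nanvar(array, labels),          # hypothetical wrapper around the reduction
        grouped_nanvar(array + offset, labels),
        rtol=1e-6,
    )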

To get the existing test suite to start passing, you'll have to add support for the ddof kwarg as in numpy

@jemmajeffree (Author):

I pushed a commit. Your changes are looking good! I constructed an expected result from numpy and it matches! I'm not sure what the expectation should be for no_offset vs with_offset. Does the original paper make some claims for this kind of comparison?

I can't find much of anything looking at this method specifically, just one entry in a table, and I'm not sure how it'd generalise. It seems like the variant of the derivation I used is more or less neglected in the speed/precision evaluation later in the paper, though the naming of the various algorithms is a little tricky to follow and I might have missed it. As a ballpark estimate from that one line in a table, I think we'd expect to gain another 4-6 decimal places over the $E[x^2]-E[x]^2$ method but not quite catch up to the $E[(x-E[x])^2]$ method that involves touching the data twice (which I think numpy uses?).
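For a rough feel of the gap, here is what plain float64 NumPy does (not numbers from the paper):

import numpy as np

rng = np.random.default_rng(0)
x = 1e8 + rng.normal(scale=1e-2, size=10_000)  # big mean, tiny variance

one_pass = np.mean(x**2) - np.mean(x) ** 2   # E[x^2] - E[x]^2: catastrophic cancellation
two_pass = np.mean((x - np.mean(x)) ** 2)    # E[(x - E[x])^2]: what np.var effectively does
print(one_pass, two_pass, np.var(x))
# the one-pass value has lost essentially all of its significant digits
# (it can even come out negative), while the two-pass value matches np.var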

Lastly, I noticed that you basically have a "property" test here (which is quite cool) - this is a "metamorphic relation" grouped_nanvar(array) == grouped_nanvar(array + arbitrary_offset) though at the moment arbitrary_offset is within some bounded range. I think you'll find it fun to write that as a Hypothesis

I'd expect it to fail for sufficiently big offsets; the aim is just to do better than the old algorithm. Does this present a problem?

To get the existing test suite to start passing, you'll have to add support for the ddof kwarg as in numpy

Oops, I can fix that. I think the finalize function expects a ddof kwarg but I probably didn't pass it through.

@dcherian (Collaborator) commented Aug 6, 2025:

I'd expect it to fail for sufficiently big offsets, just to do better than the old algorithm. Does this present a problem?

Absolutely not! This is a massive improvement for common cases.

For now we can skip the failing comparison for large offsets. As long as it's close to numpy I'm happy. One thing to do would be to find the minimum tolerance for which we match numpy across that range of offsets.
