Statistics.countnans: Fix sparse implementation and add axis support #2558
pavlin-policar wants to merge 18 commits into biolab:master
Conversation
Force-pushed 0359a98 to 57c6e28
@nikicc It appears that the existing code was not counting the number of NaNs, but rather the number of zero elements. Apparently, this functionality is used in
Force-pushed 811338f to 87e9276
nikicc left a comment
Kudos for finding and correcting the bug 👍 IMO countnans should count the number of NaNs, not zeros as it currently does.
About the failing tests, I quickly looked at the first one, and the test is incorrect, as was the previous countnans implementation. Could you also correct the tests for _compute_distributions according to the new implementation?
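A minimal sketch of what the corrected behaviour could look like for sparse input (this is an illustration, not the PR's actual code, and `countnans_sparse` is a hypothetical name): since only explicitly stored values of a sparse matrix can be NaN, inspecting `x.data` suffices, and implicit entries are zeros that must not be counted.

```python
import numpy as np
import scipy.sparse as sp

def countnans_sparse(x, axis=None):
    # Sketch: only stored values can be NaN; implicit entries are zeros.
    x = sp.csr_matrix(x)
    if axis is None:
        return np.isnan(x.data).sum()
    if axis == 1:
        # Per-row counts via the CSR row pointer boundaries
        return np.array([np.isnan(x.data[s:e]).sum()
                         for s, e in zip(x.indptr[:-1], x.indptr[1:])])
    if axis == 0:
        counts = np.zeros(x.shape[1], dtype=int)
        np.add.at(counts, x.indices[np.isnan(x.data)], 1)
        return counts
    raise ValueError("'axis' entry is out of bounds")
```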
Orange/tests/test_statistics.py
Outdated
        countnans(csr_matrix(x), axis=1), [1, 1],
        'Countnans fails on sparse data with `axis=1`')

    def test_countnans_with_2d_weights(self):
Could you also add a test for weights on sparse matrices?
This didn't work before, and I wasn't sure whether it was even worth the hassle: I assume we only use sparse matrices when the data is very big, and forming a dense 2d weight matrix of that size kind of defeats the purpose...
However, I've implemented it now. In any case, sparse weight matrices weren't supported before and aren't supported now. I don't know if that's a requirement, but it would require more work.
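For reference, counting NaNs on a sparse matrix with a dense 2D weight matrix could be sketched like this (`countnans_weighted` is a hypothetical helper, not the PR's code; each NaN contributes its weight instead of 1):

```python
import numpy as np
import scipy.sparse as sp

def countnans_weighted(x, weights, axis=None):
    # Sketch: map each stored NaN back to its (row, col) position,
    # then look up its weight in the dense 2D weight matrix.
    x = sp.csr_matrix(x)
    rows = np.repeat(np.arange(x.shape[0]), np.diff(x.indptr))
    mask = np.isnan(x.data)
    w = np.asarray(weights)[rows[mask], x.indices[mask]]
    if axis is None:
        return w.sum()
    out = np.zeros(x.shape[0] if axis == 1 else x.shape[1])
    np.add.at(out, rows[mask] if axis == 1 else x.indices[mask], w)
    return out
```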
You're right, we shouldn't worry about weights on sparse matrices.
Orange/statistics/util.py
Outdated
        isnan = isnan * weights

    # In order to keep return types consistent with sparse vectors, we will
    # handle `axis=1` given a regular 1d numpy array equivalently as
I suggest we drop this compatibility for 1D arrays. The only place I can think of where this would be handy is something like:
for row in data.X:
    countnans(row, axis=1)

But we probably shouldn't be passing 1D vectors with axis=1 anyway; we should calculate NaNs on the whole matrix.
I would drop this compatibility for 1D and we will worry about this if this ever becomes a problem.
Thinking a bit further, if we really want 100% compatibility, we could make it also raise an error for countnans with sparse 1D arrays (e.g. [[1 2 3]]) and axis=1. We treat them as 1D arrays anyhow, so I don't see a benefit of supporting both axis=0 and axis=1 if both return the same value.
I agree, this does seem to be the most sensible solution. Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.
> Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.
But on the other hand, changing user-provided input of axis=1 to axis=0 just because axis=1 would result in an error is IMO worse than some slight incompatibility between sparse and dense. If I could come up with at least one use-case where this incompatibility would really hurt us I might be less against it.
IMO if you want to stick to 100% same behavior, I would prefer if we also raise an error when a sparse row is provided and user passes axis=1. Hence we could essentially treat a sparse matrix of shape (1, X) as 1D array and get exactly the same behaviour as with 1D dense matrices.
Yes, I agree, changing the user input was a bad idea. The error is much better.
Orange/tests/test_statistics.py
Outdated
    def test_shape_matches_dense_and_sparse_given_array_and_axis_1(self):
        dense = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
        sparse = csr_matrix(dense)
        np.testing.assert_equal(countnans(dense, axis=1), countnans(sparse, axis=1))
I don't like this. For example, numpy crashes if you want to sum with axis=1 and only have a 1D array.
>>> y = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
>>> y.sum(axis=0)
nan
>>> y.sum(axis=1)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/Niko/anaconda/envs/orange3/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _sum
return umr_sum(a, axis, dtype, out, keepdims)
ValueError: 'axis' entry is out of bounds

Considering the idea from the above comments, I would probably just make both of these calls raise an error.
    Works kind of like np.bincount(), except that it also supports floating
    arrays with nans.
Would you be willing to also document the max_val argument? This doesn't seem to be a numpy argument, and it's not obvious what it is for.
I've added thorough documentation.
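For illustration, the rough semantics of max_val can be sketched as follows (`bincount_maxval` is a hypothetical stand-in; Orange's actual bincount may differ in details): it guarantees that bins up to max_val exist even when the data never reaches that value, which np.bincount's minlength argument can provide.

```python
import numpy as np

def bincount_maxval(x, max_val=None):
    # Sketch: like np.bincount, but NaN-aware, and with bins 0..max_val
    # guaranteed to exist even if max(x) < max_val.
    x = np.asarray(x, dtype=float)
    nans = np.isnan(x)
    counts = np.bincount(
        x[~nans].astype(int),
        minlength=(max_val + 1) if max_val is not None else 0)
    return counts, nans.sum()
```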
Orange/statistics/util.py
Outdated
    # Since `csr_matrix.values` only contain non-zero values, we must count
    # those separately and set the appropriate bin
    if sp.issparse(X_):
        bc[0] = np.prod(X_.shape) - X_.nnz
Should we maybe make np.prod(X_.shape) - X_.nnz a separate function, something like sparse_count_zeros? We already have _sparse_has_zeros, and we will need the number of zero elements in the Table widget sooner or later.
I think this is a good idea.
Perhaps it would also make sense to implement sparse_num_elements which would essentially be np.prod(x.shape) since it's also used a lot. This would make the functionality much clearer, but I'm hesitant because it seems far too excessive.
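A sketch of the suggested helper, assuming scipy's convention that nnz counts all stored entries (so the result covers implicit, unstored zeros only):

```python
import numpy as np
import scipy.sparse as sp

def sparse_count_zeros(x):
    # Number of entries not stored at all (implicit zeros).
    # Explicit zeros kept in x.data are included in x.nnz, so they
    # are not counted here.
    assert sp.issparse(x), "only makes sense for sparse matrices"
    return np.prod(x.shape) - x.nnz
```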
Orange/tests/test_statistics.py
Outdated
        np.testing.assert_equal(bincount(dense), expected)
        np.testing.assert_equal(bincount(sparse), expected)

        hist, n_nans = bincount([0., 1., 3], max_val=3)
What exactly does max_val=3 do here?
This was a remnant of the old tests. I've removed it and added an appropriate test.
        np.testing.assert_equal(countnans(x, weights=w, axis=1), [1, 2])


class TestBincount(unittest.TestCase):
Perhaps add one test with weights?
Force-pushed 23a7433 to cd71ba6
At first I was puzzled why zeros were counted as NaNs in sparse matrices. This has no apparent advantage and can be confusing when first dealing with it... This probably came from pandas, where the default fill for sparse matrices is NaN (but can be changed to anything). This doesn't make much sense to me, because if we separate the two, we can easily determine which data is missing and which is actually zero. Pandas does differentiate between the explicit and implicit NaNs, but then again, pandas is far more flexible than plain sparse matrices.
…ous distributions
Force-pushed 2485608 to 2a23396
Codecov Report
@@ Coverage Diff @@
## master #2558 +/- ##
==========================================
- Coverage 75.83% 75.03% -0.81%
==========================================
Files 338 327 -11
Lines 59532 57645 -1887
==========================================
- Hits 45145 43252 -1893
- Misses 14387 14393 +6
Force-pushed 9aff785 to 70354e7
The previous implementation didn't return zero counts for sparse matrices and treated them as NaNs. This was changed and the tests have been updated.
Force-pushed 70354e7 to 6f12808
Not sure where this came from, but it was incorrect. The current assumption is that values not stored in the sparse matrix correspond to zeros, not missing values! Missing values are explicitly stored in sparse matrices as

@nikicc Is there anything left to do with this PR, or do you think it could be merged?
@pavlin-policar sorry that this took so long 🙈
Overall it looks OK, there are only a few things left:
- fix bincount to consider explicit zeros,
- make sure that x is of the correct type whenever you call x.indices (just cast it to csr beforehand)
Also, when discovering the problem of explicit zeros, I was wondering if we could hack the dense_sparse decorator to also insert at least one explicit zero? Doing so, we would have caught the bincount problem.
     [0, 0, 0, 1, 0, 2, np.nan, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1.1, 0, 0, 0, 0, 0, 0]]
)
This test doesn't quite work for explicit zeros. E.g. if you add X[0,0] = 0 here, the test should pass and the results should be the same — but it fails. IMO the problem is in the bincount, check the comment above.
Also, once this is corrected, please add some explicit zeros to this example.
    if weights is not None:
        zero_weights = weights[zero_indices].sum()
        weights = weights[x.indices]
Does x.indices work here for both csc and csr? If not, please cast it to whichever (csc/csr) is required beforehand.
    if weights.ndim == 1:
        n_items = np.prod(x.shape)
        zero_indices = np.setdiff1d(np.arange(n_items), x.indices, assume_unique=True)
Does x.indices work here for both csc and csr? If not, please cast it to whichever (csc/csr) is required beforehand.
Orange/statistics/util.py
Outdated
    # Since `csr_matrix.values` only contain non-zero values, we must count
    # those separately and set the appropriate bin
    if sp.issparse(x_original):
        bc[0] = zero_weights
This should probably be bc[0] = bc[0] + zero_weights to account for explicit zeros stored in x.data.
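To illustrate why, here is a hedged sketch for a single sparse row with per-element weights (`zero_bin_weight` is a hypothetical helper, not the PR's code): the zero bin must combine the weights of explicit zeros kept in x.data with the weights of implicit, unstored zeros.

```python
import numpy as np
import scipy.sparse as sp

def zero_bin_weight(x, weights):
    # Sketch for a (1, n) sparse row: total weight of the zero bin is
    # the weight of explicitly stored zeros plus the weight of entries
    # that are not stored at all.
    x = sp.csr_matrix(x)
    weights = np.asarray(weights, dtype=float)
    explicit = weights[x.indices[x.data == 0]].sum()
    implicit_cols = np.setdiff1d(np.arange(x.shape[1]), x.indices,
                                 assume_unique=True)
    return explicit + weights[implicit_cols].sum()
```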
Orange/statistics/util.py
Outdated
    return np.fromiter((np.isnan(row.data).sum() for row in X), dtype=dtype)


def sparse_count_zeros(x):
Should we perhaps rename this to sparse_count_implicit_zeros to make sure explicit zeros aren't counted?
Orange/statistics/util.py
Outdated
    return np.prod(x.shape) - x.nnz


def sparse_has_zeros(x):
Should we perhaps rename this to sparse_has_implicit_zeros to make sure explicit zeros aren't considered?
Orange/statistics/util.py
Outdated
def bincount(X, max_val=None, weights=None, minlength=None):
def sparse_zero_weights(x, weights):
Is this only meant to be used when both x and weights are of the same shape, i.e. one-dimensional? If so, should we add a check for this?
Adding a check here causes more problems than it's worth, because this can be called in two ways that are valid:
- x has shape (1, n)
- x has shape (n, 1)
These are basically equivalent since they are both "1d", but this would make the check more complicated.
Force-pushed b5b4cdc to e4206e2
These commits were moved to PR #2698.
Issue

The statistics module implementation of countnans did not support the axis argument. Moreover, it appeared to have been computing the number of NaNs incorrectly on sparse data in general.

Description of changes

Add an implementation of countnans which correctly counts NaNs and supports the axis keyword.

Includes