Statistics.countnans: Fix sparse implementation and add axis support #2558
pavlin-policar wants to merge 18 commits into biolab:master
Conversation
Force-pushed 0359a98 to 57c6e28
@nikicc It appears that the existing code was not counting the number of NaNs, but rather the number of zero elements. Apparently, this functionality is used in
Force-pushed 811338f to 87e9276
nikicc left a comment
Kudos for finding and correcting the bug 👍 IMO countnans should count the number of NaNs, not zeros as it currently does.
About the failing tests, I quickly looked at the first one, and the test is incorrect, as was the previous countnans implementation. Could you also correct the tests for _compute_distributions according to the new implementation?
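A minimal sketch of what the corrected behaviour could look like for sparse input (this is an illustration, not the PR's actual code, and `countnans_sparse` is a hypothetical name): since only explicitly stored values of a sparse matrix can be NaN, inspecting `x.data` suffices, and implicit entries are zeros that must not be counted.

```python
import numpy as np
import scipy.sparse as sp

def countnans_sparse(x, axis=None):
    # Sketch: only stored values can be NaN; implicit entries are zeros.
    x = sp.csr_matrix(x)
    if axis is None:
        return np.isnan(x.data).sum()
    if axis == 1:
        # Per-row counts via the CSR row pointer boundaries
        return np.array([np.isnan(x.data[s:e]).sum()
                         for s, e in zip(x.indptr[:-1], x.indptr[1:])])
    if axis == 0:
        counts = np.zeros(x.shape[1], dtype=int)
        np.add.at(counts, x.indices[np.isnan(x.data)], 1)
        return counts
    raise ValueError("'axis' entry is out of bounds")
```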
Orange/tests/test_statistics.py
Outdated
        countnans(csr_matrix(x), axis=1), [1, 1],
        'Countnans fails on sparse data with `axis=1`')

    def test_countnans_with_2d_weights(self):
Could you also add a test for weights on sparse matrices?
This didn't work before, and I wasn't sure whether it was even worth the hassle: I assume we only use sparse matrices when the data is very big, and forming a dense 2d weight matrix of that size kind of defeats the purpose...
However, I've implemented it now. In any case, sparse weight matrices weren't supported before and aren't supported now. I don't know if that's a requirement, but it would require more work.
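For reference, counting NaNs on a sparse matrix with a dense 2D weight matrix could be sketched like this (`countnans_weighted` is a hypothetical helper, not the PR's code; each NaN contributes its weight instead of 1):

```python
import numpy as np
import scipy.sparse as sp

def countnans_weighted(x, weights, axis=None):
    # Sketch: map each stored NaN back to its (row, col) position,
    # then look up its weight in the dense 2D weight matrix.
    x = sp.csr_matrix(x)
    rows = np.repeat(np.arange(x.shape[0]), np.diff(x.indptr))
    mask = np.isnan(x.data)
    w = np.asarray(weights)[rows[mask], x.indices[mask]]
    if axis is None:
        return w.sum()
    out = np.zeros(x.shape[0] if axis == 1 else x.shape[1])
    np.add.at(out, rows[mask] if axis == 1 else x.indices[mask], w)
    return out
```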
You're right, we shouldn't worry about weights on sparse matrices.
Orange/statistics/util.py
Outdated
        isnan = isnan * weights

    # In order to keep return types consistent with sparse vectors, we will
    # handle `axis=1` given a regular 1d numpy array equivalently as
I suggest we drop this compatibility for 1D arrays. The only place I can think of where this would be handy is something like:
for row in data.X:
    countnans(row, axis=1)

But we probably shouldn't be passing 1D vectors with axis=1 anyway; we should calculate NaNs on the whole matrix.
I would drop this compatibility for 1D and we will worry about this if this ever becomes a problem.
Thinking a bit further, if we really want 100% compatibility, we could make it also raise an error for countnans with sparse 1D arrays (e.g. [[1 2 3]]) and axis=1. We treat them as 1D arrays anyhow, so I don't see a benefit of supporting both axis=0 and axis=1 if both return the same value.
I agree, this does seem to be the most sensible solution. Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.
> Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.
But on the other hand, changing user-provided input of axis=1 to axis=0 just because axis=1 would result in an error is IMO worse than some slight incompatibility between sparse and dense. If I could come up with at least one use-case where this incompatibility would really hurt us I might be less against it.
IMO if you want to stick to 100% same behavior, I would prefer if we also raise an error when a sparse row is provided and user passes axis=1. Hence we could essentially treat a sparse matrix of shape (1, X) as 1D array and get exactly the same behaviour as with 1D dense matrices.
Yes, I agree, changing the user input was a bad idea. The error is much better.
Orange/tests/test_statistics.py
Outdated
    def test_shape_matches_dense_and_sparse_given_array_and_axis_1(self):
        dense = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
        sparse = csr_matrix(dense)
        np.testing.assert_equal(countnans(dense, axis=1), countnans(sparse, axis=1))
I don't like this. For example, numpy crashes if you want to sum with axis=1 and only have a 1D array.
>>> y = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
>>> y.sum(axis=0)
nan
>>> y.sum(axis=1)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/Niko/anaconda/envs/orange3/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _sum
return umr_sum(a, axis, dtype, out, keepdims)
ValueError: 'axis' entry is out of bounds

Considering the idea from the above comments, I would probably just make both of these calls raise an error.
    Works kind of like np.bincount(), except that it also supports floating
    arrays with nans.
Would you be willing to also document the max_val argument? This doesn't seem to be a numpy argument, and it's not obvious what it is for.
I've added thorough documentation.
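For illustration, the rough semantics of max_val can be sketched as follows (`bincount_maxval` is a hypothetical stand-in; Orange's actual bincount may differ in details): it guarantees that bins up to max_val exist even when the data never reaches that value, which np.bincount's minlength argument can provide.

```python
import numpy as np

def bincount_maxval(x, max_val=None):
    # Sketch: like np.bincount, but NaN-aware, and with bins 0..max_val
    # guaranteed to exist even if max(x) < max_val.
    x = np.asarray(x, dtype=float)
    nans = np.isnan(x)
    counts = np.bincount(
        x[~nans].astype(int),
        minlength=(max_val + 1) if max_val is not None else 0)
    return counts, nans.sum()
```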
Orange/statistics/util.py
Outdated
    # Since `csr_matrix.values` only contain non-zero values, we must count
    # those separately and set the appropriate bin
    if sp.issparse(X_):
        bc[0] = np.prod(X_.shape) - X_.nnz
Should we maybe make np.prod(X_.shape) - X_.nnz a separate function, something like sparse_count_zeros? We already have _sparse_has_zeros, and we will need the number of zero elements in the Table widget sooner or later.
I think this is a good idea.
Perhaps it would also make sense to implement sparse_num_elements which would essentially be np.prod(x.shape) since it's also used a lot. This would make the functionality much clearer, but I'm hesitant because it seems far too excessive.
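A sketch of the suggested helper, assuming scipy's convention that nnz counts all stored entries (so the result covers implicit, unstored zeros only):

```python
import numpy as np
import scipy.sparse as sp

def sparse_count_zeros(x):
    # Number of entries not stored at all (implicit zeros).
    # Explicit zeros kept in x.data are included in x.nnz, so they
    # are not counted here.
    assert sp.issparse(x), "only makes sense for sparse matrices"
    return np.prod(x.shape) - x.nnz
```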
Orange/tests/test_statistics.py
Outdated
        np.testing.assert_equal(bincount(dense), expected)
        np.testing.assert_equal(bincount(sparse), expected)

        hist, n_nans = bincount([0., 1., 3], max_val=3)
What exactly does max_val=3 do here?
This was a remnant of the old tests. I've removed it and added an appropriate test.
        np.testing.assert_equal(countnans(x, weights=w, axis=1), [1, 2])


class TestBincount(unittest.TestCase):
Perhaps add one test with weights?
Force-pushed 23a7433 to cd71ba6
At first I was puzzled why zeros were counted as NaNs in sparse matrices. This has no apparent advantage and can be confusing when first dealing with it... This probably came from pandas, where the default fill for sparse matrices is NaN (but can be changed to anything). This doesn't make much sense to me, because if we separate the two, we can easily determine which data is missing and which is actually zero. Pandas does differentiate between the explicit and implicit NaNs, but then again, pandas is far more flexible than plain sparse matrices.
…ous distributions
Force-pushed 2485608 to 2a23396
Codecov Report
@@ Coverage Diff @@
## master #2558 +/- ##
==========================================
- Coverage 75.83% 75.03% -0.81%
==========================================
Files 338 327 -11
Lines 59532 57645 -1887
==========================================
- Hits 45145 43252 -1893
- Misses 14387 14393 +6
Force-pushed 9aff785 to 70354e7
The previous implementation didn't return zero counts for sparse matrices and treated them as NaNs. This was changed and the tests have been updated.
Force-pushed 70354e7 to 6f12808
Not sure where this came from, but it was incorrect. The current assumption is that values not stored in the sparse matrix correspond to zeros, not missing values! Missing values are explicitly stored in sparse matrices as

@nikicc Is there anything left to do with this PR, or do you think it could be merged?
@pavlin-policar sorry that this took so long 🙈
Overall it looks OK, there are only a few things left:
- fix bincount to consider explicit zeros,
- make sure that x is of the correct type whenever you call x.indices (just cast it to csr beforehand)
Also, when discovering the problem of explicit zeros, I was wondering if we could hack the dense_sparse decorator to also insert at least one explicit zero? Doing so, we would have caught the bincount problem.
     [0, 0, 0, 1, 0, 2, np.nan, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1.1, 0, 0, 0, 0, 0, 0]]
)
This test doesn't quite work for explicit zeros. E.g. if you add X[0,0] = 0 here, the test should pass and the results should be the same — but it fails. IMO the problem is in the bincount, check the comment above.
Also, once this is corrected, please add some explicit zeros to this example.
    if weights is not None:
        zero_weights = weights[zero_indices].sum()
        weights = weights[x.indices]
Does x.indices work here for both csc and csr? If not, please cast it to whichever (csc/csr) is required beforehand.
    if weights.ndim == 1:
        n_items = np.prod(x.shape)
        zero_indices = np.setdiff1d(np.arange(n_items), x.indices, assume_unique=True)
Does x.indices work here for both csc and csr? If not, please cast it to whichever (csc/csr) is required beforehand.
Orange/statistics/util.py
Outdated
    # Since `csr_matrix.values` only contain non-zero values, we must count
    # those separately and set the appropriate bin
    if sp.issparse(x_original):
        bc[0] = zero_weights
This should probably be bc[0] = bc[0] + zero_weights to account for explicit zeros stored in x.data.
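To illustrate why, here is a hedged sketch for a single sparse row with per-element weights (`zero_bin_weight` is a hypothetical helper, not the PR's code): the zero bin must combine the weights of explicit zeros kept in x.data with the weights of implicit, unstored zeros.

```python
import numpy as np
import scipy.sparse as sp

def zero_bin_weight(x, weights):
    # Sketch for a (1, n) sparse row: total weight of the zero bin is
    # the weight of explicitly stored zeros plus the weight of entries
    # that are not stored at all.
    x = sp.csr_matrix(x)
    weights = np.asarray(weights, dtype=float)
    explicit = weights[x.indices[x.data == 0]].sum()
    implicit_cols = np.setdiff1d(np.arange(x.shape[1]), x.indices,
                                 assume_unique=True)
    return explicit + weights[implicit_cols].sum()
```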
Orange/statistics/util.py
Outdated
    return np.fromiter((np.isnan(row.data).sum() for row in X), dtype=dtype)


def sparse_count_zeros(x):
Should we perhaps rename this to sparse_count_implicit_zeros to make sure explicit zeros aren't counted?
Orange/statistics/util.py
Outdated
    return np.prod(x.shape) - x.nnz


def sparse_has_zeros(x):
Should we perhaps rename this to sparse_has_implicit_zeros to make sure explicit zeros aren't considered?
Orange/statistics/util.py
Outdated
def bincount(X, max_val=None, weights=None, minlength=None):
def sparse_zero_weights(x, weights):
Is this only meant to be used when both x and weights are of the same shape, i.e. one-dimensional? If so, should we add a check for this?
Adding a check here causes more problems than it's worth, because this can be called in two ways that are valid:
- x has shape (1, n)
- x has shape (n, 1)
These are basically equivalent since they are both "1d", but this would make the check more complicated.
Force-pushed b5b4cdc to e4206e2
These commits were moved to PR #2698.
Issue

The statistics module implementation of countnans did not support the axis argument. Moreover, it appeared to have been computing the number of NaNs incorrectly on sparse data in general.

Description of changes

Add an implementation of countnans which correctly counts NaNs and supports the axis keyword.

Includes