[FIX] Impute: sparse #2357
@jerneju I already did some debugging about this on Friday and IMO the problem is in the file
Orange/tests/test_util.py
Outdated
    """
    x = np.array([[0], [np.nan], [9]])
    x = sp.csr_matrix(x)
    self.assertEqual(stats(x)[0][2], 3.)
Orange/statistics/util.py
Outdated
    n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
    return np.nansum(x.data) / n_values
    x.data = np.nan_to_num(x.data)
nan_to_num converts np.nans to zeros, which causes mean to also treat them as zeros. E.g. for the sparse array of [np.nan, np.nan, 1] this implementation returns 0.33 instead of 1.
What's wrong with the previous implementation?
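The difference can be reproduced outside Orange. The sketch below is only an illustration: the rest of the function under review is not shown in the diff, so it assumes the zero-filled values are averaged over every cell of the matrix. The variable names are made up for this example.

```python
import numpy as np
import scipy.sparse as sp

# Sparse row [nan, nan, 1]. NaN is non-zero, so all three entries are stored.
x = sp.csr_matrix(np.array([[np.nan, np.nan, 1.0]]))

# Assumed behaviour of the change under review: NaNs become zeros and are
# then averaged together with the real values.
zero_filled = np.nan_to_num(x.data)
mean_zero_filled = zero_filled.sum() / np.prod(x.shape)

# Previous implementation: NaNs are left out of both the sum and the count.
n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
mean_nan_aware = np.nansum(x.data) / n_values

print(mean_zero_filled)  # one third
print(mean_nan_aware)    # 1.0
```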
Orange/preprocess/impute.py
Outdated
    if not sp.issparse(c):
        c = np.array(c, copy=True)
    else:
        c = c.copy()
Why do we need a copy? Doesn't toarray() already take care of this?
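A quick way to check this, as a sketch on a made-up column rather than the actual Orange data: toarray() allocates a new dense array, so writing to the result leaves the sparse matrix untouched.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical sparse column, similar to the one used in the test above.
c = sp.csc_matrix(np.array([[0.0], [np.nan], [9.0]]))

dense = c.toarray()   # allocates a new dense array
dense[2, 0] = -1.0    # modify the dense result only

shares_memory = np.shares_memory(dense, c.data)
still_nine = c[2, 0] == 9.0

print(shares_memory)  # False: the dense array has its own buffer
print(still_nine)     # True: the sparse matrix is unchanged
```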
Orange/preprocess/impute.py
Outdated
        c = np.array(c, copy=True)
    else:
        c = c.copy()
        c = c.toarray().flatten()
Should we use ravel instead, which doesn't necessarily make another copy?
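For reference, a small sketch of the difference on a made-up column: flatten() always returns a copy, while ravel() returns a view whenever the memory layout allows it, which is the case for the contiguous array produced by toarray().

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical sparse column, densified into a (3, 1) contiguous array.
dense = sp.csc_matrix(np.array([[0.0], [np.nan], [9.0]])).toarray()

flat_copy = dense.flatten()  # always a new array
flat_view = dense.ravel()    # a view here, because dense is contiguous

copy_shares = np.shares_memory(dense, flat_copy)
view_shares = np.shares_memory(dense, flat_view)

print(copy_shares)  # False
print(view_shares)  # True
```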
Codecov Report
@@            Coverage Diff             @@
##           master    #2357      +/-   ##
==========================================
- Coverage   73.41%   73.38%   -0.04%
==========================================
  Files         317      317
  Lines       55653    55664      +11
==========================================
- Hits        40859    40850       -9
- Misses      14794    14814      +20
    def nanmean(x):
    def nanmean(x, axis=None):
What about:
    def nanmean(x, axis=None):
        """ Equivalent of np.nanmean that supports sparse or dense matrices. """
        def nanmean_sparse(x):
            n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
            return np.nansum(x.data) / n_values

        if not sp.issparse(x):
            return np.nanmean(x, axis=axis)
        if axis is None:
            return nanmean_sparse(x)
        if axis in [0, 1]:
            arr = x if axis == 1 else x.T
            return np.array([nanmean_sparse(row) for row in arr])
        else:
            raise NotImplementedError
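One way to sanity-check this suggestion is to compare it with np.nanmean on the equivalent dense array. The sketch below repeats the function so that it runs on its own; the sample matrix is made up and deliberately has no all-NaN row or column, since those would lead to a division by zero.

```python
import numpy as np
import scipy.sparse as sp


def nanmean(x, axis=None):
    """Equivalent of np.nanmean that supports sparse or dense matrices."""
    def nanmean_sparse(x):
        # Implicit zeros count as values; stored NaNs do not.
        n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
        return np.nansum(x.data) / n_values

    if not sp.issparse(x):
        return np.nanmean(x, axis=axis)
    if axis is None:
        return nanmean_sparse(x)
    if axis in [0, 1]:
        # Iterating a sparse matrix yields one 1 x n sparse row at a time.
        arr = x if axis == 1 else x.T
        return np.array([nanmean_sparse(row) for row in arr])
    raise NotImplementedError


# Made-up sample data for the comparison.
dense = np.array([[1.0, np.nan, 0.0],
                  [np.nan, np.nan, 4.0],
                  [3.0, 2.0, 0.0]])
sparse = sp.csr_matrix(dense)

matches_all = np.isclose(nanmean(sparse), np.nanmean(dense))
matches_axis0 = np.allclose(nanmean(sparse, axis=0), np.nanmean(dense, axis=0))
matches_axis1 = np.allclose(nanmean(sparse, axis=1), np.nanmean(dense, axis=1))
print(matches_all, matches_axis0, matches_axis1)
```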
Well, I did some speed testing. The results are interesting and are listed below:
Ratio for axis 0 : 1.558
Ratio for axis 1 : 0.664
    if not sp.issparse(c):
        c = np.array(c, copy=True)
    else:
        c = c.toarray().ravel()
What if we took only c.data here, so that we would not need to densify the whole column? Consequently, we would only need to set c.data in L314.
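A rough sketch of that idea, not the code in impute.py: the column, the fill value, and the variable names are all assumptions made for illustration. Only the explicitly stored entries are touched, so the column is never densified for the imputation itself.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical sparse column [0, nan, 9, nan]; the zero is implicit.
col = sp.csc_matrix(np.array([[0.0], [np.nan], [9.0], [np.nan]]))

# Assumed fill value: the NaN-aware mean, counting implicit zeros as values.
n_values = np.prod(col.shape) - np.isnan(col.data).sum()
fill_value = np.nansum(col.data) / n_values

# Replace NaNs in place among the stored entries only.
col.data[np.isnan(col.data)] = fill_value

print(col.data)               # stored entries after imputation
print(col.toarray().ravel())  # densified here only to display the result
```

Whether this fits depends on how the imputed column is used afterwards, which the diff does not show.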

Issue
Fixes #2349.
Description of changes
Work in progress.
Includes