[ENH] Improved Sparsity Handling#2341
Conversation
4b65c4e to
80475e7
Compare
|
That PR should fix #2359 . |
jerneju
left a comment
There was a problem hiding this comment.
When someone wants to transform two columns from sparse X to dense Y she or he gets only one Y column which has got double the number of rows.
table = Table("iris")
table.X = sp.csr_matrix(table.X)
attributes = table.domain.attributes[:2] + table.domain.class_vars
class_vars = table.domain.attributes[2:4]
newtable = table.transform(Domain(attributes=attributes, class_vars=class_vars))
self.assertEqual(newtable.Y.shape[0], table.Y.shape[0]) |
125a142 to
71fc394
Compare
|
@jerneju nice catch. Should be fixed not, please check. |
|
|
||
| self.assertLessEqual(time() - start, 1) | ||
|
|
||
| def test_domain_conversion_sparsity(self): |
There was a problem hiding this comment.
I usually add something like this:
"""
Description.
GH-2341
"""942f527 to
120f92e
Compare
Orange/data/variable.py
Outdated
| .. attribute:: sparse | ||
|
|
||
| A flag about sparsity of the variable. When set, the variable suggests | ||
| it is should be stored in a sparse matrix. |
4d20017 to
c23bf8d
Compare
Codecov Report
@@ Coverage Diff @@
## master #2341 +/- ##
=========================================
+ Coverage 76.09% 76.1% +0.01%
=========================================
Files 338 338
Lines 59723 59760 +37
=========================================
+ Hits 45444 45479 +35
- Misses 14279 14281 +2 |
8488448 to
68594d1
Compare
3daedd2 to
0584b67
Compare
ac4ae07 to
34f78a7
Compare
Orange/data/domain.py
Outdated
| source.index(var) if var in source | ||
| else var.compute_value for var in destination.metas] | ||
|
|
||
| def is_sparse(feats): |
There was a problem hiding this comment.
That shoulds better 👍 Fixed.
Orange/data/table.py
Outdated
|
|
||
| def get_columns(row_indices, src_cols, n_rows, dtype=np.float64, | ||
| is_sparse=False): | ||
| def array_transform(x, to_sparse): |
There was a problem hiding this comment.
Could these functions be moved someplace else? They seem like general functions that ensure array/column is sparse/dense. Maybe change to a more expressive name.
Orange/data/domain.py
Outdated
| def is_sparse(feats): | ||
| """ For a matrix to be stored in sparse, more than 2/3 of columns | ||
| should be marked as sparse and there should be no string columns | ||
| since Scipy's sparse matrices don't support dtype=object.""" |
There was a problem hiding this comment.
At least the closing triple quotes of a multi-line docstring should be in their own line (PEP257).
(for opening quotes try to mimic other docstrings nearby)
There was a problem hiding this comment.
Changed to match the style as in nerby docstrings.
Orange/data/domain.py
Outdated
| """ For a matrix to be stored in sparse, more than 2/3 of columns | ||
| should be marked as sparse and there should be no string columns | ||
| since Scipy's sparse matrices don't support dtype=object.""" | ||
| fraction_sparse = sum(f.sparse for f in feats)/max(len(feats), 1) |
|
@astaric, @markotoplak I split the functionality of I haven't moved the functions yet, though. I will do this once we settle on the best approach. Perhaps to some util library? |
DomainConversion now has three more attributes `sparse_X`, `sparse_Y` and `sparse_metas`, which suggest whether the resulting matrix should be sparse or dense.
Table.from_table now sets the sparsity of X, Y and metas as determined by DomainConversion.
When we return subarryas, the flag `is_sparse` wasn't considered, but we simpy returned the subarray in it's original format. Also, make sure subarrays aren't flattened to 1d, as it is required for columns.
Calling len on scipy sparse matrices causes the error: `TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]`.
... to alleviate interdependencies between tests in differente files. Before that, one tests could mark some attributes of iris as sparse which could cause tests in different files to crash due to variable "reusing".
Issue
Fixes #2322.
Blocked on #2734. Blocked on #2736.
Description of changes
DomainConversionnow has three extra attributes telling whether X, Y, or metas should be sparse according to the features inside of them. Metas can only be sparse if no StringVariables are present.Table.from_tabledetermines the density of X, Y, and metas according to domain conversion.to_sparseandto_densemethods toTable.Includes