[FIX] Scatter Plot: dealing with scipy sparse matrix#2152
[FIX] Scatter Plot: dealing with scipy sparse matrix#2152nikicc merged 1 commit intobiolab:masterfrom jerneju:value-scatterplot
Conversation
Orange/widgets/utils/scaling.py
Outdated
| Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T | ||
| self.original_data = np.hstack((data.X, Y)).T | ||
| self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X)) | ||
| else data.X.toarray(), |
There was a problem hiding this comment.
We shouldn't transform sparse matrix to dense here but should rather adapt the method to work on sparse matrices.
If the matrix is sparse, we can use scipy's hstack instead of numpy's and then whoever uses self.original_data should expect that it can also be sparse.
There was a problem hiding this comment.
Yes, firstly I used scipy's hstack. But as you warned me, then there are problems with self.original_data.
Orange/widgets/utils/scaling.py
Outdated
|
|
||
| Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T | ||
| self.original_data = np.hstack((data.X, Y)).T | ||
| self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X)) |
There was a problem hiding this comment.
To check the type it's better to use issinstance than checking if string is inside type's string representation. However, in this case checking for csr_matrix is not sufficient since this doesn't work for other sparse matrix formats (like csc or dok).
The best way to check for sparsity is to use scipy's issparse For example:
import scipy as sp
sp.issparse(data.X)
nikicc
left a comment
There was a problem hiding this comment.
Please refactor to that sparse matrices wont'd be transformed to dense.
Codecov Report
@@ Coverage Diff @@
## master #2152 +/- ##
=========================================
+ Coverage 72.37% 72.4% +0.02%
=========================================
Files 319 319
Lines 55014 55055 +41
=========================================
+ Hits 39819 39862 +43
+ Misses 15195 15193 -2 |
|
@jerneju I haven't checked much of the code but this still fails for me on sparse: |
|
@nikicc: I will write the code again and that time I am planning to use new methods added in util.py. |
| self.unconditional_commit() | ||
| self.corpus_to_table() | ||
|
|
||
| def corpus_to_table(self): |
There was a problem hiding this comment.
Could you rename this to something else? In theory we could have sparse table also in Orange, nut just in Text.
nikicc
left a comment
There was a problem hiding this comment.
Two things still don't work for me:
- colouring doesn't work if I want to colour by a discrete variable stored in metas.
- giving a subset of the data points doesn't colour the selection on the graph.
| # called when all signals are received, so the graph is updated only once | ||
| def handleNewSignals(self): | ||
| self.graph.new_data(self.data_metas_X, self.subset_data) | ||
| if self.data is None or (self.data and not sp.issparse(self.data.X)):# table.issparse() !!! |
There was a problem hiding this comment.
Couldn't this simply be if self.data is None or not sp.issparse(self.data.X)?
There was a problem hiding this comment.
Or maybe even better:
if self.data is not None and self.data.is_sparse():
self.sparse_to_dense()
else:
self.graph.new_data(self.data_metas_X, self.subset_data)
Doing so you also don't need that if at the beginning of sparse_to_dense method.
| self.attr_x = self.attribute_selection_list[0] | ||
| self.attr_y = self.attribute_selection_list[1] | ||
| self.attribute_selection_list = None | ||
| #if not sp.issparse(self.data.X): |
There was a problem hiding this comment.
This is probably leftover from some debugging. Can you please remove it?
| return | ||
| keys = [] | ||
| for i, attr in enumerate(self.data.domain): | ||
| if attr in set([self.attr_x, self.attr_y, self.graph.attr_color]): |
There was a problem hiding this comment.
I'm probably nitpicking here, but could you move this outside the loop into a variable so you do not create a new set in each iteration?
| if self.data is None or not self.data.is_sparse(): | ||
| return | ||
| keys = [] | ||
| for i, attr in enumerate(self.data.domain): |
There was a problem hiding this comment.
This only iterates over features and class attributes, but not metas. What happens if we color by discrete variable inside metas?
| dmx = Table.from_table(new_domain, self.data) | ||
| dmx = self.move_primitive_metas_to_X(dmx) | ||
| dmx.X = dmx.X.toarray() | ||
| if sp.issparse(dmx.Y): |
There was a problem hiding this comment.
Can you please add a todo here? Something like:
#TODO: remove once we make sure Y is always dense.
| dmx.X = dmx.X.toarray() | ||
| if sp.issparse(dmx.Y): | ||
| dmx.Y = dmx.Y.toarray() | ||
| self.subset_data = None |
There was a problem hiding this comment.
What if someone provide sparse subset data?
|
Code updated. |
| self.update_graph() | ||
|
|
||
| def sparse_to_dense(self, input_data=None): | ||
| if input_data is None: |
There was a problem hiding this comment.
This can be simplified to:
if input_data is None or not input_data.is_sparse():
return input_data
| self.graph.new_data(self.data_metas_X, self.subset_data) | ||
| self.graph.new_data(self.sparse_to_dense(self.data_metas_X), | ||
| self.sparse_to_dense(self.subset_data) | ||
| ) |
There was a problem hiding this comment.
Can you move this ) at the end of previous line?
| self.graph.new_data(self.sparse_to_dense(self.data_metas_X), | ||
| self.sparse_to_dense(self.subset_data), | ||
| new=False | ||
| ) |
There was a problem hiding this comment.
Can you move this ) at the end of previous line?
| self.sparse_to_dense(self.subset_data), | ||
| new=False | ||
| ) | ||
| self.update_graph() |
There was a problem hiding this comment.
Is this update graph really needed?
| return input_data | ||
| self.vizrank_button.setVisible(False) # not for subset data? | ||
| keys = [] | ||
| attrs = set([self.attr_x, |
There was a problem hiding this comment.
My PyCharm suggests: function call can be replaces with set literal.
| return None | ||
| if not input_data.is_sparse(): | ||
| return input_data | ||
| self.vizrank_button.setVisible(False) # not for subset data? |
There was a problem hiding this comment.
Maybe it's better to use setEnabled here.
There was a problem hiding this comment.
Also, once this gets disabled it is not re-enabled if dense data comes in. I suggest we add something like:
self.vizrank_button.setEnabled(not (self.data and self.data.is_sparse()))
| return shape_data | ||
|
|
||
| def update_shapes(self): | ||
| self.master.prepare_data() |
There was a problem hiding this comment.
Could we also add something like if self.attr_shape not in self.data.domain?
| table.Y = sp.csr_matrix(table._Y) # pylint: disable=protected-access | ||
| self.assertTrue(sp.issparse(table.Y)) | ||
| self.send_signal("Data", table) | ||
| self.widget.set_subset_data(table[0:30]) |
There was a problem hiding this comment.
This can be just table[:30]
|
Fixed. |

Issue
Orange add-on "Text mining" sends "Corpus" instead of "Table". "Corpus" uses sparse matrices and numpy hstack does not know how to handle them. That is why scipy sparse matrix is converted into numpy array.
https://sentry.io/biolab/orange3/issues/243775748/
Fixes #2157.
Description of changes
Includes