[FIX] Scatter Plot: dealing with scipy sparse matrix by jerneju · Pull Request #2152 · biolab/orange3

jerneju · 2017-03-29T15:19:06Z

Issue

The Orange add-on "Text mining" sends a "Corpus" instead of a "Table". A "Corpus" stores its data in sparse matrices, which numpy's hstack cannot handle. For that reason, this change converts the scipy sparse matrix into a numpy array.
https://sentry.io/biolab/orange3/issues/243775748/
Fixes #2157.

Description of changes

Includes

Code changes
Tests
Documentation

jerneju · 2017-03-29T15:19:57Z

https://sentry.io/biolab/orange3/issues/243775748/

nikicc · 2017-03-29T18:04:35Z

Orange/widgets/utils/scaling.py

        Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T
-        self.original_data = np.hstack((data.X, Y)).T
+        self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X))
+                                        else data.X.toarray(),


We shouldn't transform sparse matrix to dense here but should rather adapt the method to work on sparse matrices.

If the matrix is sparse, we can use scipy's hstack instead of numpy's and then whoever uses self.original_data should expect that it can also be sparse.

Yes, at first I used scipy's hstack. But, as you warned, that causes problems wherever self.original_data is used.
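
A minimal sketch of the reviewer's suggestion, assuming only NumPy and SciPy: use scipy's hstack when either block is sparse and numpy's otherwise. The helper name `hstack_any` is hypothetical and is not part of Orange.

```python
import numpy as np
import scipy.sparse as sp


def hstack_any(X, Y):
    """Stack X and Y column-wise; the result stays sparse if any input is sparse.

    Hypothetical helper for illustration only; not part of Orange.
    """
    if sp.issparse(X) or sp.issparse(Y):
        # Convert both blocks to CSR first so the stacking call only sees
        # sparse inputs, then ask for a CSR result.
        return sp.hstack((sp.csr_matrix(X), sp.csr_matrix(Y)), format="csr")
    return np.hstack((X, Y))


X_sparse = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
Y = np.array([[0.0], [1.0]])

stacked = hstack_any(X_sparse, Y)                  # sparse result
dense_stacked = hstack_any(X_sparse.toarray(), Y)  # plain ndarray
```

As the thread notes, the trade-off is that every consumer of the stacked result must then cope with a sparse matrix.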

nikicc · 2017-03-29T18:14:10Z

Orange/widgets/utils/scaling.py


        Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T
-        self.original_data = np.hstack((data.X, Y)).T
+        self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X))


To check the type, it's better to use isinstance than to check whether a string occurs in the type's string representation. However, checking for csr_matrix is not sufficient in this case, since it misses other sparse matrix formats (like csc or dok).

The best way to check for sparsity is to use scipy's issparse. For example:

    import scipy.sparse as sp
    sp.issparse(data.X)
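
To illustrate the point about formats, here is a small sketch (assuming NumPy and SciPy are installed; the variable names are illustrative) that compares `scipy.sparse.issparse` with an isinstance check against csr_matrix alone.

```python
import numpy as np
import scipy.sparse as sp

dense = np.eye(3)
formats = {
    "csr": sp.csr_matrix(dense),
    "csc": sp.csc_matrix(dense),
    "dok": sp.dok_matrix(dense),
}

# issparse is True for every sparse format and False for a plain ndarray.
issparse_results = {name: sp.issparse(m) for name, m in formats.items()}
issparse_dense = sp.issparse(dense)

# An isinstance check against csr_matrix alone misses the other formats.
csr_only_results = {
    name: isinstance(m, sp.csr_matrix) for name, m in formats.items()
}
```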

nikicc

Please refactor so that sparse matrices won't be transformed to dense.

codecov-io · 2017-03-30T08:11:56Z

Codecov Report

Merging #2152 into master will increase coverage by 0.02%.
The diff coverage is 92.59%.

@@            Coverage Diff            @@
##           master   #2152      +/-   ##
=========================================
+ Coverage   72.37%   72.4%   +0.02%     
=========================================
  Files         319     319              
  Lines       55014   55055      +41     
=========================================
+ Hits        39819   39862      +43     
+ Misses      15195   15193       -2

nikicc · 2017-04-07T08:20:50Z

@jerneju I haven't checked much of the code but this still fails for me on sparse:

jerneju · 2017-04-07T08:26:07Z

@nikicc: I will rewrite the code, and this time I plan to use the new methods added in util.py.

nikicc · 2017-04-07T09:11:59Z

Orange/widgets/visualize/owscatterplot.py

        self.unconditional_commit()
+        self.corpus_to_table()
+
+    def corpus_to_table(self):


Could you rename this to something else? In theory we could also have a sparse table in Orange, not just in Text.

nikicc

Two things still don't work for me:

colouring doesn't work if I want to colour by a discrete variable stored in metas.
giving a subset of the data points doesn't colour the selection on the graph.

nikicc · 2017-04-12T14:20:45Z

Orange/widgets/visualize/owscatterplot.py

    # called when all signals are received, so the graph is updated only once
    def handleNewSignals(self):
-        self.graph.new_data(self.data_metas_X, self.subset_data)
+        if self.data is None or (self.data and not sp.issparse(self.data.X)):#  table.issparse() !!!


Couldn't this simply be if self.data is None or not sp.issparse(self.data.X)?

Or maybe even better:

    if self.data is not None and self.data.is_sparse():
        self.sparse_to_dense()
    else:
        self.graph.new_data(self.data_metas_X, self.subset_data)

That way you also don't need the if at the beginning of the sparse_to_dense method.

nikicc · 2017-04-12T14:21:46Z

Orange/widgets/visualize/owscatterplot.py

            self.attr_x = self.attribute_selection_list[0]
            self.attr_y = self.attribute_selection_list[1]
        self.attribute_selection_list = None
+        #if not sp.issparse(self.data.X):


This is probably leftover from some debugging. Can you please remove it?

nikicc · 2017-04-12T14:28:43Z

Orange/widgets/visualize/owscatterplot.py

+            return
+        keys = []
+        for i, attr in enumerate(self.data.domain):
+            if attr in set([self.attr_x, self.attr_y, self.graph.attr_color]):


I'm probably nitpicking here, but could you move this outside the loop into a variable so you do not create a new set in each iteration?
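
A small sketch of the hoisting being asked for, using plain strings as stand-ins for Orange variables. The function name `matching_indices` and the sample names are hypothetical.

```python
def matching_indices(domain, wanted):
    """Return the indices of items in `domain` that also appear in `wanted`."""
    wanted_set = set(wanted)  # built once, outside the loop
    return [i for i, attr in enumerate(domain) if attr in wanted_set]


# Strings stand in for the x, y and colour attributes of the widget.
domain = ["sepal length", "sepal width", "petal length", "iris"]
keys = matching_indices(domain, ["sepal width", "iris", "iris"])
```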

nikicc · 2017-04-12T14:30:24Z

Orange/widgets/visualize/owscatterplot.py

+        if self.data is None or not self.data.is_sparse():
+            return
+        keys = []
+        for i, attr in enumerate(self.data.domain):


This only iterates over features and class attributes, but not metas. What happens if we color by discrete variable inside metas?

nikicc · 2017-04-12T14:32:38Z

Orange/widgets/visualize/owscatterplot.py

+        dmx = Table.from_table(new_domain, self.data)
+        dmx = self.move_primitive_metas_to_X(dmx)
+        dmx.X = dmx.X.toarray()
+        if sp.issparse(dmx.Y):


Can you please add a todo here? Something like:
#TODO: remove once we make sure Y is always dense.

nikicc · 2017-04-12T14:35:40Z

Orange/widgets/visualize/owscatterplot.py

+        dmx.X = dmx.X.toarray()
+        if sp.issparse(dmx.Y):
+            dmx.Y = dmx.Y.toarray()
+        self.subset_data = None


What if someone provides sparse subset data?

jerneju · 2017-04-21T11:33:17Z

Code updated.

nikicc · 2017-04-21T12:16:09Z

Orange/widgets/visualize/owscatterplot.py

+        self.update_graph()
+
+    def sparse_to_dense(self, input_data=None):
+        if input_data is None:


This can be simplified to:

    if input_data is None or not input_data.is_sparse():
        return input_data
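
The combined guard can be tried out on raw matrices. This is only a sketch under that assumption: the widget method operates on Orange tables, and `to_dense` is a hypothetical stand-in, not the PR's code.

```python
import numpy as np
import scipy.sparse as sp


def to_dense(matrix):
    """Return `matrix` unchanged if it is None or already dense, else densify it.

    Hypothetical stand-in working on raw matrices rather than Orange tables.
    """
    if matrix is None or not sp.issparse(matrix):
        return matrix
    return matrix.toarray()


dense_in = np.array([[1.0, 0.0], [0.0, 2.0]])
sparse_in = sp.csr_matrix(dense_in)

none_out = to_dense(None)         # passes through untouched
dense_out = to_dense(dense_in)    # same object, no copy made
sparse_out = to_dense(sparse_in)  # converted to an ndarray
```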

nikicc · 2017-04-21T12:24:02Z

Orange/widgets/visualize/owscatterplot.py

-        self.graph.new_data(self.data_metas_X, self.subset_data)
+        self.graph.new_data(self.sparse_to_dense(self.data_metas_X),
+                            self.sparse_to_dense(self.subset_data)
+                           )


Can you move this ) to the end of the previous line?

nikicc · 2017-04-21T12:29:27Z

Orange/widgets/visualize/owscatterplot.py

+        self.graph.new_data(self.sparse_to_dense(self.data_metas_X),
+                            self.sparse_to_dense(self.subset_data),
+                            new=False
+                           )


Can you move this ) to the end of the previous line?

nikicc · 2017-04-21T12:31:42Z

Orange/widgets/visualize/owscatterplot.py

+                            self.sparse_to_dense(self.subset_data),
+                            new=False
+                           )
+        self.update_graph()


Is this update graph really needed?

Probably not.

nikicc · 2017-04-21T12:47:10Z

Orange/widgets/visualize/owscatterplot.py

+            return input_data
+        self.vizrank_button.setVisible(False)  # not for subset data?
+        keys = []
+        attrs = set([self.attr_x,


My PyCharm suggests: function call can be replaced with set literal.
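
For reference, the two spellings build the same set. A tiny sketch, with placeholder strings standing in for the widget's attributes:

```python
# Placeholder values; in the widget these would be Orange variables.
attr_x, attr_y, attr_color = "x", "y", "x"

via_call = set([attr_x, attr_y, attr_color])  # builds a list first, then a set
via_literal = {attr_x, attr_y, attr_color}    # set literal, no intermediate list
```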

nikicc · 2017-04-21T12:52:33Z

Orange/widgets/visualize/owscatterplot.py

+            return None
+        if not input_data.is_sparse():
+            return input_data
+        self.vizrank_button.setVisible(False)  # not for subset data?


Maybe it's better to use setEnabled here.

Also, once this gets disabled it is not re-enabled if dense data comes in. I suggest we add something like:

self.vizrank_button.setEnabled(not (self.data and self.data.is_sparse()))
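
The boolean expression can be checked in isolation. In this sketch `FakeTable` and `vizrank_enabled` are hypothetical stand-ins (no Qt or Orange involved); only the expression itself comes from the comment above.

```python
class FakeTable:
    """Hypothetical stand-in for an Orange table; only is_sparse() is modelled."""

    def __init__(self, sparse):
        self._sparse = sparse

    def is_sparse(self):
        return self._sparse


def vizrank_enabled(data):
    # Same expression as suggested for setEnabled: the button is enabled
    # unless data is present and sparse.
    return not (data and data.is_sparse())


enabled_for_none = vizrank_enabled(None)
enabled_for_dense = vizrank_enabled(FakeTable(sparse=False))
enabled_for_sparse = vizrank_enabled(FakeTable(sparse=True))
```

Because the state is recomputed from the current data each time, the button is re-enabled when dense data arrives after sparse data.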

nikicc · 2017-04-21T12:57:39Z

Orange/widgets/visualize/owscatterplotgraph.py

        return shape_data

    def update_shapes(self):
+        self.master.prepare_data()


Could we also add something like if self.attr_shape not in self.data.domain?

nikicc · 2017-04-21T12:58:57Z

Orange/widgets/visualize/tests/test_owscatterplot.py

+        table.Y = sp.csr_matrix(table._Y)  # pylint: disable=protected-access
+        self.assertTrue(sp.issparse(table.Y))
+        self.send_signal("Data", table)
+        self.widget.set_subset_data(table[0:30])


This can be just table[:30]

jerneju · 2017-04-21T13:13:10Z

Fixed.

https://sentry.io/biolab/orange3/issues/243775748/

nikicc reviewed Mar 29, 2017

nikicc suggested changes Mar 29, 2017

astaric added this to the 3.4.2 milestone Apr 7, 2017

astaric assigned nikicc Apr 7, 2017

nikicc suggested changes Apr 7, 2017

nikicc modified the milestones: future, 3.4.2 Apr 7, 2017

nikicc suggested changes Apr 12, 2017

jerneju changed the title ~~[FIX] Scatter Plot: dealing with scipy sparse matrix~~ [WIP][FIX] Scatter Plot: dealing with scipy sparse matrix Apr 14, 2017

astaric modified the milestone: future Apr 19, 2017

nikicc suggested changes Apr 21, 2017

nikicc approved these changes Apr 21, 2017

nikicc changed the title ~~[WIP][FIX] Scatter Plot: dealing with scipy sparse matrix~~ [FIX] Scatter Plot: dealing with scipy sparse matrix Apr 21, 2017

[FIX] Scatter Plot: dealing with scipy sparse matrix

cceeee3

https://sentry.io/biolab/orange3/issues/243775748/

nikicc merged commit 9274a20 into biolab:master Apr 21, 2017

jerneju deleted the value-scatterplot branch April 24, 2017 07:12
