[FIX] Distances: prevent inf numbers by jerneju · Pull Request #2380 · biolab/orange3

jerneju · 2017-06-06T09:17:49Z

Issue

---------------------------- ValueError Exception -----------------------------
Traceback (most recent call last):
File "pyqtgraph/graphicsItems/AxisItem.py", line 525, in paint specs = self.generateDrawSpecs(painter)
File "pyqtgraph/graphicsItems/AxisItem.py", line 819, in generateDrawSpecs tickLevels = self.tickValues(self.range[0], self.range[1], lengthInPixels)
File "pyqtgraph/graphicsItems/AxisItem.py", line 691, in tickValues num = int((maxVal-start) / spacing) + 1
ValueError: cannot convert float NaN to integer

Description of changes

Includes

Code changes
Tests
Documentation

jerneju · 2017-06-06T09:25:54Z

https://sentry.io/biolab/orange3/issues/280206866/

codecov-io · 2017-06-09T08:22:10Z

Codecov Report

❗ No coverage uploaded for pull request base (master@942877a). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #2380   +/-   ##
=========================================
  Coverage          ?   74.25%           
=========================================
  Files             ?      320           
  Lines             ?    55836           
  Branches          ?        0           
=========================================
  Hits              ?    41459           
  Misses            ?    14377           
  Partials          ?        0

jerneju · 2017-06-09T08:22:16Z

Rebased.

kernc · 2017-06-09T09:13:00Z

Orange/widgets/unsupervised/owdistances.py

            return
+
+        if np.isinf(met).any():
+            met = np.nan_to_num(met)


Without the conditional, this would make one linear pass instead of three.

janezd · 2017-06-09T15:55:40Z

Orange/widgets/unsupervised/owdistances.py

            self.Error.distances_memory_error()
            return
+
+        met = np.nan_to_num(met)


No. This is just a quick local patch, not a solution.

It reminds me of a guy (whom we all know and) who - years ago - programmed some machine learning algorithm. He occasionally got probabilities above 1, so he fixed the problem by adding

if (p > 1) { p = 1; }

If we get an error report, we shouldn't find the simplest way to just fix it somewhere. If some distances are infinite, they are infinite. We can't just set them to 42 and pretend everything is cool. If we believe that distance matrix should never contain infinite values, we should raise an exception. If we decide that infinite values can arise and are legal, other functions/widgets should be able to handle those.

In this case, I'd say that infinite values are allowed and hierarchical clustering should do something about them. Replacing them with large values (perhaps len(data) * np.nanmax(data)) could make sense within the hierarchical clustering. Or it may be even more complex since it depends upon linkage.

So: if there are infinite distances, we need to think what would we like the clustering to do with them, not what is the easiest way to get rid of them.

perhaps len(data) * np.nanmax(data)

If you opt for this suggestion, please, just don't use np.nanmax(data). We should never use np.nanmax(data) since it doesn't work on sparse data! We have our own implementation of nanmax in Orange/statistics/util.py.

I was about to write that while I totally agree with @janezd, most solutions for inf distances will probably be equivalent to substitution with large numbers. In which case maybe we should consider if we want every widget to duplicate this substitution.
But there is always the possibility that somewhere inf numbers would be illegal (and we would want a warning). And even worse - this substitution will probably not allow downstream methods to ignore the problem, since you still cannot multiply the new "large values", or add two of them etc. which could happen in average linkage and so on.

@janezd, Distance Transformation widget could have a check box where user could select to impute infinite distances. @lanzagar suggested something like that.

This sounds more like clipping. With lower and upper range?

janezd · 2017-06-23T14:05:23Z

Orange/widgets/unsupervised/owhierarchicalclustering.py

    basic_annotations = ["None", "Enumeration"]

+    class Error(widget.OWWidget.Error):
+        not_finite_distances = Msg("Not all distances are finite")


Rephrase: Some distances are infinite.

Why some error messages have dot at the end and some have not?

You're right, we have to decide. @BlazZupan, @astaric, @lanzagar, @ajdapretnar? I don't have a clear preference.

I vote for dots in sentences (verb present).

Good point. No verb, no dot. Verb, dot. Period. :)

However, some error messages are not sentences, like "Out of memory" or "Error during optimization". What about those?

Message without verb bad, message with verb good? Makes them longer, though. Sometimes long not good, when widget narrow.

Oh no, I would keep those dot-less. Short, too.

janezd · 2017-06-23T14:09:31Z

Orange/widgets/unsupervised/owhierarchicalclustering.py

        if matrix is not None:
+            if not np.all(np.isfinite(matrix)):
+                self.Error.not_finite_distances()
+                return


I may have overlooked something, but I think this is wrong: it aborts the function without reseting the items (and related models). Note that the check above sets matrix to None. In line with this, you should probably have something like:

if matrix is not None: if not np.all(np.isfinite(matrix)): self.Error.not_finite_distances() matrix = None self.matrix = matrix if matrix is not None: self._set_items(matrix.row_items, matrix.axis) else: self._set_items(None)

The line self.matrix = matrix is moved from above to the place after the check for infinite values.

janezd · 2017-06-23T14:35:14Z

Orange/widgets/unsupervised/owhierarchicalclustering.py

            if N < 2:
                self.error("Empty distance matrix")
                matrix = None
+            if not np.all(np.isfinite(matrix)):


No, not like that. If it comes here, matrix can be set to None in the line above. Another if matrix is not None is needed here.

This is Ok now.

janezd · 2017-06-23T15:08:12Z

Orange/widgets/unsupervised/owhierarchicalclustering.py


    def _invalidate_clustering(self):
+        if self.matrix is not None and not np.all(np.isfinite(self.matrix)):
+            return


First, this check may be redundant since np.all(np.isfinite(self.matrix)) is already checked above and self.matrix is set to None in this case. Hence, the situation in which this conjunction is true cannot arise, I guess.

Second -- if this condition can be fulfilled -- is skipping these calls really OK?

_update() clears the plot and then performs further cleanup if self.matrix is None. I guess this must be done if the widget receives invalid data.

_update_labels() checks for self.root (which is set to None in _update()) and works correctly when matrix is None.

_invalidate_output() has to be called pass None to outputs.

I believe that these two lines have to be removed. But please check whether there are any other side-effects.

These two lines are now removed.

janezd · 2017-06-30T09:56:13Z

Orange/widgets/unsupervised/owhierarchicalclustering.py

            if N < 2:
                self.error("Empty distance matrix")
                matrix = None
+            if not np.all(np.isfinite(matrix)):


This is Ok now.

lanzagar self-assigned this Jun 9, 2017

kernc reviewed Jun 9, 2017

View reviewed changes

janezd requested changes Jun 9, 2017

View reviewed changes

jerneju changed the title ~~[FIX] Distances: prevent inf numbers~~ [WIP][FIX] Distances: prevent inf numbers Jun 12, 2017

lanzagar removed their assignment Jun 13, 2017

Hierarchical Clustering: numpy -> np

60ca105

janezd requested changes Jun 23, 2017

View reviewed changes

janezd reviewed Jun 23, 2017

View reviewed changes

[FIX] Hierarchical Clustering: infinite distances

510901e

jerneju changed the title ~~[WIP][FIX] Distances: prevent inf numbers~~ [FIX] Distances: prevent inf numbers Jun 28, 2017

janezd approved these changes Jun 30, 2017

View reviewed changes

janezd merged commit 47647c2 into biolab:master Jun 30, 2017

jerneju deleted the inf-distances branch June 30, 2017 09:58

Uh oh!

Conversation

jerneju commented Jun 6, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue

Description of changes

Includes

Uh oh!

jerneju commented Jun 6, 2017

Uh oh!

codecov-io commented Jun 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jerneju commented Jun 9, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

jerneju commented Jun 6, 2017 •

edited

Loading

codecov-io commented Jun 9, 2017 •

edited

Loading