[FIX] Distances: prevent inf numbers#2380
[FIX] Distances: prevent inf numbers#2380janezd merged 2 commits intobiolab:masterfrom jerneju:inf-distances
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2380 +/- ##
=========================================
Coverage ? 74.25%
=========================================
Files ? 320
Lines ? 55836
Branches ? 0
=========================================
Hits ? 41459
Misses ? 14377
Partials ? 0 |
|
Rebased. |
| return | ||
|
|
||
| if np.isinf(met).any(): | ||
| met = np.nan_to_num(met) |
There was a problem hiding this comment.
Without the conditional, this would make one linear pass instead of three.
| self.Error.distances_memory_error() | ||
| return | ||
|
|
||
| met = np.nan_to_num(met) |
There was a problem hiding this comment.
No. This is just a quick local patch, not a solution.
It reminds me of a guy (whom we all know and) who - years ago - programmed some machine learning algorithm. He occasionally got probabilities above 1, so he fixed the problem by adding
if (p > 1) {
p = 1;
}
If we get an error report, we shouldn't find the simplest way to just fix it somewhere. If some distances are infinite, they are infinite. We can't just set them to 42 and pretend everything is cool. If we believe that distance matrix should never contain infinite values, we should raise an exception. If we decide that infinite values can arise and are legal, other functions/widgets should be able to handle those.
In this case, I'd say that infinite values are allowed and hierarchical clustering should do something about them. Replacing them with large values (perhaps len(data) * np.nanmax(data)) could make sense within the hierarchical clustering. Or it may be even more complex since it depends upon linkage.
So: if there are infinite distances, we need to think what would we like the clustering to do with them, not what is the easiest way to get rid of them.
There was a problem hiding this comment.
perhaps len(data) * np.nanmax(data)
If you opt for this suggestion, please, just don't use np.nanmax(data). We should never use np.nanmax(data) since it doesn't work on sparse data! We have our own implementation of nanmax in Orange/statistics/util.py.
There was a problem hiding this comment.
I was about to write that while I totally agree with @janezd, most solutions for inf distances will probably be equivalent to substitution with large numbers. In which case maybe we should consider if we want every widget to duplicate this substitution.
But there is always the possibility that somewhere inf numbers would be illegal (and we would want a warning). And even worse - this substitution will probably not allow downstream methods to ignore the problem, since you still cannot multiply the new "large values", or add two of them etc. which could happen in average linkage and so on.
There was a problem hiding this comment.
This sounds more like clipping. With lower and upper range?
| basic_annotations = ["None", "Enumeration"] | ||
|
|
||
| class Error(widget.OWWidget.Error): | ||
| not_finite_distances = Msg("Not all distances are finite") |
There was a problem hiding this comment.
Rephrase: Some distances are infinite.
There was a problem hiding this comment.
Why some error messages have dot at the end and some have not?
There was a problem hiding this comment.
You're right, we have to decide. @BlazZupan, @astaric, @lanzagar, @ajdapretnar? I don't have a clear preference.
There was a problem hiding this comment.
I vote for dots in sentences (verb present).
There was a problem hiding this comment.
Good point. No verb, no dot. Verb, dot. Period. :)
However, some error messages are not sentences, like "Out of memory" or "Error during optimization". What about those?
Message without verb bad, message with verb good? Makes them longer, though. Sometimes long not good, when widget narrow.
There was a problem hiding this comment.
Oh no, I would keep those dot-less. Short, too.
| if matrix is not None: | ||
| if not np.all(np.isfinite(matrix)): | ||
| self.Error.not_finite_distances() | ||
| return |
There was a problem hiding this comment.
I may have overlooked something, but I think this is wrong: it aborts the function without reseting the items (and related models). Note that the check above sets matrix to None. In line with this, you should probably have something like:
if matrix is not None:
if not np.all(np.isfinite(matrix)):
self.Error.not_finite_distances()
matrix = None
self.matrix = matrix
if matrix is not None:
self._set_items(matrix.row_items, matrix.axis)
else:
self._set_items(None)
The line self.matrix = matrix is moved from above to the place after the check for infinite values.
| if N < 2: | ||
| self.error("Empty distance matrix") | ||
| matrix = None | ||
| if not np.all(np.isfinite(matrix)): |
There was a problem hiding this comment.
No, not like that. If it comes here, matrix can be set to None in the line above. Another if matrix is not None is needed here.
|
|
||
| def _invalidate_clustering(self): | ||
| if self.matrix is not None and not np.all(np.isfinite(self.matrix)): | ||
| return |
There was a problem hiding this comment.
First, this check may be redundant since np.all(np.isfinite(self.matrix)) is already checked above and self.matrix is set to None in this case. Hence, the situation in which this conjunction is true cannot arise, I guess.
Second -- if this condition can be fulfilled -- is skipping these calls really OK?
_update()clears the plot and then performs further cleanup ifself.matrix is None. I guess this must be done if the widget receives invalid data._update_labels()checks forself.root(which is set toNonein_update()) and works correctly whenmatrixisNone._invalidate_output()has to be called passNoneto outputs.
I believe that these two lines have to be removed. But please check whether there are any other side-effects.
There was a problem hiding this comment.
These two lines are now removed.
| if N < 2: | ||
| self.error("Empty distance matrix") | ||
| matrix = None | ||
| if not np.all(np.isfinite(matrix)): |
Issue
Description of changes
Includes