[FIX] normalize: Adjust number_of_decimals after scaling #4779
lanzagar merged 4 commits into biolab:master
Conversation
Actually, I don't think zero centering should impact the number of decimals. Imagine small values (e.g. below 1e-5) that are just centered: they are shifted around 0, but they are still small and need more decimals. And this can be done in the widget as well!
If we want to be fancier and complicate things: @markotoplak, @janezd, what do you think about just setting it to 3 after rescaling, or computing the needed number of decimals?
(force-pushed from aebe793 to 8188592)
I've added this line here, which automatically adjusts the number of decimals based on the standard deviation. I've also had to add to this, because otherwise, on standardized iris, I would get a mix of 0.000 and 0.0000 (4 zeros), which looked strange.
(force-pushed from f3536d0 to 4a8fbfc)
Let me know if you like this solution, or in what other way this could be solved. If we want to go with this, I'll fix the tests.
lanzagar left a comment:
I like the solution, but it needs some more thought and polishing. I made a few line comments, but my suggestions still don't produce the results I would be completely satisfied with.
You can take a look at feature statistics on housing + preprocessing to immediately see some problems that still need solving.
Orange/preprocess/normalize.py (outdated diff)
```python
if self.center:
    compute_val = Norm(var, avg, 1 / sd)
    num_decimals = 3
else:
    compute_val = Norm(var, 0, 1 / sd)
return var.copy(compute_value=compute_val)
num_decimals = None
num_decimals += int(-np.floor(np.log10(sd)))
```
As I said previously, I don't think the if else about centering should have anything to do with num_decimals.
So probably the computation should just be:
```python
num_decimals = var.number_of_decimals + correction
```
And it also looks like your correction of `int(-np.floor(np.log10(sd)))` was not right: for sd=100 it should increase the number of decimals by 2, not decrease it...
I think the correct formula is (and someone should double-check this!):
```python
num_decimals = var.number_of_decimals + int(np.log10(sd))
```
Currently, you didn't change `normalize_by_span`, but I expect the final solution should have the same correction there (just using `diff` instead of `sd`).
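A quick sanity check of the proposed correction. This is a sketch, not code from the PR: `adjusted_decimals` is a hypothetical helper, and it uses the stdlib's `math.log10` in place of `np.log10` to stay self-contained.

```python
import math

def adjusted_decimals(number_of_decimals, sd):
    # Dividing values by sd shrinks them by a factor of sd, so roughly
    # log10(sd) extra decimals are needed to keep the same precision.
    return number_of_decimals + int(math.log10(sd))

print(adjusted_decimals(1, 100))   # sd=100: values 100x smaller, 2 more decimals -> 3
print(adjusted_decimals(3, 0.01))  # sd=0.01: values 100x larger, 2 fewer decimals -> 1
```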
```python
    var = self.normalize_by_sd(dist, var)
elif self.norm_type == Normalize.NormalizeBySpan:
    var = self.normalize_by_span(dist, var)
var.number_of_decimals = None
```
After removing the reset of num. dec. here, you can delete the next line and just return directly in the elifs above...
... is what I wanted to write, then I saw that there is no else and lint would probably complain :/
(so I guess just leave it as is)
(force-pushed from 4a8fbfc to cf5e816)
Codecov Report
```
@@            Coverage Diff             @@
##           master    #4779      +/-   ##
==========================================
+ Coverage   83.84%   84.17%   +0.32%
==========================================
  Files         281      277       -4
  Lines       56745    56500     -245
==========================================
- Hits        47577    47558      -19
+ Misses       9168     8942     -226
```
(force-pushed from 64c5b5e to f7ff577)
Issue
Data, when explicitly centered through the Preprocess widget, would show up as having a mean of about 1e-16 in the Feature Statistics widget (and potentially elsewhere). E.g. File (iris) > Preprocess (normalize to have mean=0) > Feature Statistics.
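The residue comes from floating-point arithmetic: subtracting the computed mean rarely leaves a mean of exactly zero. A minimal sketch with plain Python floats (the values are arbitrary, not taken from iris):

```python
vals = [4.3, 5.8, 6.1, 7.9]
mean = sum(vals) / len(vals)
centered = [v - mean for v in vals]
residual = sum(centered) / len(centered)
# residual is typically a tiny value on the order of 1e-16 rather than
# exactly 0.0, which Feature Statistics then displays as the mean
print(residual)
```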
Description of changes

- Adjust number_of_decimals after scaling instead of always setting number_of_decimals=3.
- Fixed values displaying as -0.000 because of rounding a tiny negative number, e.g. 1e-16, to zero. Zeros are now always positive.

Possible issues
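The -0.000 artifact is easy to reproduce. This sketch shows the symptom and one way to force a positive zero (the PR's actual fix may differ):

```python
x = -1e-16  # tiny negative residue left over after centering
print("%.3f" % x)                    # formats as "-0.000"
# Adding positive zero after rounding turns -0.0 into +0.0:
print("%.3f" % (round(x, 3) + 0.0))  # formats as "0.000"
```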
This is impossible to do through the UI, but someone could potentially call `normalize` with an `sd` >> 1, so all the values would be scaled to something tiny, and then `str_val` would print 0 for everything. Again, this cannot happen in Orange, because there is no way to manually specify the standard deviation. Let me know if you think this is worth fixing.

Includes