[FIX] Gini impurity: formula and docstring fixed.#1495
[FIX] Gini impurity: formula and docstring fixed.#1495lanzagar merged 1 commit intobiolab:masterfrom
Conversation
57c4b71 to
e518b73
Compare
|
Tests are now fixed as well. |
Current coverage is 88.86% (diff: 100%)@@ master #1495 diff @@
==========================================
Files 78 78
Lines 8099 8099
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
Hits 7197 7197
Misses 902 902
Partials 0 0
|
Orange/preprocess/score.py
Outdated
| P = D / np.sum(D, axis=0) | ||
| return sum((np.ones(1 if len(D.shape) == 1 else D.shape[1]) - np.sum(np.square(P), axis=0)) | ||
| * 0.5 * np.sum(D, axis=0) / np.sum(D)) | ||
| * np.sum(D, axis=0) / np.sum(D)) |
There was a problem hiding this comment.
Among the several equivalent representations of the proposed equation, this one seems nicest:
return 1 - (P*P).sum()There was a problem hiding this comment.
I agree it looks the nicest and most concise. Can maybe @lanzagar also comment?
There was a problem hiding this comment.
At first glance it looks to me like at least the second line is needed to get the weighted impurity we want.
It is possible that 1 can be used instead of the complications with np.ones (should be tested).
There was a problem hiding this comment.
This is out of my league. Perhaps merge this PR to fix division by two and make another one with improved formula? Otherwise I'll just copy-paste whatever you'll tell me to. 😀
There was a problem hiding this comment.
Copy paste:
P = np.asarray(D / np.sum(D, axis=0))
return np.sum((1 - np.sum(P**2, axis=0)) *
np.sum(D, axis=0) / np.sum(D))
2b4a24e to
81b4360
Compare
81b4360 to
c839687
Compare
Gini coefficient and Gini impurity are not the same. Docstring fixed to proper Wikipedia source.
As observed in many sources, the formula for Gini impurity never contains division by 2. @BlazZupan, please check whether this is the right formula or not.
Also, "Gini" was a misleading attribute name in Rank widget. Here we're actually measuring Gini decrease. So even perhaps "Gini Gain" might not be the right name for this. Comments? @lanzagar @astaric
Probably some fixes to names should be made elsewhere, too.