[ENH] PCA: Preserve f32s & reduce memory footprint when computing means #3582
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3582 +/- ##
==========================================
+ Coverage 83.98% 83.98% +<.01%
==========================================
Files 370 370
Lines 66976 66981 +5
==========================================
+ Hits 56249 56254 +5
Misses 10727 10727
Codecov makes no sense. I added code with no tests, yet still somehow managed to improve coverage. What?
Orange/projection/pca.py
Outdated
if sp.issparse(A):
    means, _ = mean_variance_axis(A, axis=0)
else:
    means = np.mean(A, axis=0)
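For context, a minimal illustration of what this branch computes (assuming scikit-learn's `mean_variance_axis` from `sklearn.utils.sparsefuncs`, which returns per-column means and variances of a sparse matrix without densifying it; the data values here are made up):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

# Small float32 sparse matrix; mean_variance_axis computes column
# statistics directly on the sparse structure.
A = sp.csr_matrix(np.array([[1., 0.], [3., 4.]], dtype=np.float32))
means, variances = mean_variance_axis(A, axis=0)
# means -> [2., 2.], variances (ddof=0) -> [1., 4.]
```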
We already have Orange.statistics.util.mean which is supposed to handle dense and sparse matrices, but it does not have the axis parameter (unlike most other functions in that module). Maybe you could improve that function instead and use it here?
That's a much better idea. Unfortunately, mean_variance_axis just ignores NaNs and provides no feedback if the data had any NaNs. This means that mean wouldn't properly handle NaNs and we'd have no good way of knowing where they occurred.
However, changing nanmean to use this is completely fine since this is exactly the behaviour we want there. So I did that.
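As a sketch of what an axis-aware `Orange.statistics.util.mean` might look like (hypothetical signature, shown only to illustrate the dense/sparse dispatch; the actual implementation may differ):

```python
import numpy as np
import scipy.sparse as sp

def mean(x, axis=None):
    # Hypothetical axis-aware mean handling both dense and sparse input.
    if sp.issparse(x):
        # scipy's sparse .mean returns an np.matrix; flatten to a 1-D array
        return np.asarray(x.mean(axis=axis)).ravel()
    return np.mean(x, axis=axis)

dense = np.array([[1., 0.], [3., 4.]])
sparse = sp.csr_matrix(dense)
```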
If you add new code with no new tests it can still be covered by existing tests, thus increasing the coverage... ;)
force-pushed: e714b48 to 65b43b2
force-pushed: 65b43b2 to d974ef4
Issue
While working towards getting the improved PCA merged into scikit-learn, I've found two improvements.
Description of changes
- np.float32 values are now preserved
- The x.mean method isn't the most memory efficient, and scikit-learn's utility function mean_variance_axis is much better (see benchmarks here)
Includes
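To illustrate the float32 point, here is a small check (an assumed example, not the PR's benchmark) that column means of a sparse float32 matrix can be computed without densifying the data, with the result cast back so the 32-bit dtype is preserved end to end:

```python
import numpy as np
import scipy.sparse as sp

# Sparse float32 data; densifying it just to call np.mean would cost
# n_rows * n_cols * 4 bytes of extra memory.
X = sp.random(1000, 50, density=0.01, format="csr",
              dtype=np.float32, random_state=0)
# Compute column means from the sparse structure and cast the result
# back to the input dtype so float32 is preserved.
means = np.asarray(X.mean(axis=0)).ravel().astype(X.dtype, copy=False)
```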