Skip to content

[ENH] PCA: Preserve f32s & reduce memory footprint when computing means#3582

Merged
lanzagar merged 1 commit intobiolab:masterfrom
pavlin-policar:randomized-pca-improvements
Feb 15, 2019
Merged

[ENH] PCA: Preserve f32s & reduce memory footprint when computing means#3582
lanzagar merged 1 commit intobiolab:masterfrom
pavlin-policar:randomized-pca-improvements

Conversation

@pavlin-policar
Copy link
Collaborator

Issue

While working towards getting the improved PCA merged into scikit-learn, I've found two improvements.

Description of changes
  1. np.float32 are now preserved
  2. Apparently, scipys sparse x.mean method isn't the most memory efficient, and scikit-learn's utility function mean_variance_axis is much better (see benchmarks here)
Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Feb 4, 2019

Codecov Report

Merging #3582 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3582      +/-   ##
==========================================
+ Coverage   83.98%   83.98%   +<.01%     
==========================================
  Files         370      370              
  Lines       66976    66981       +5     
==========================================
+ Hits        56249    56254       +5     
  Misses      10727    10727

@pavlin-policar
Copy link
Collaborator Author

Codecov makes no sense. I added code with no tests, yet still somehow managed to improve coverage. What?

if sp.issparse(A):
means, _ = mean_variance_axis(A, axis=0)
else:
means = np.mean(A, axis=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have Orange.statistics.util.mean which is supposed to handle dense and sparse matrices, but it does not have the axis parameter (unlike most other functions in that module). Maybe you could improve that function instead and use it here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a much better idea. Unfortunately, mean_variance_axis just ignores NaNs and provides no feedback if the data had any NaNs. This means that mean wouldn't properly handle NaNs and we'd have no good way of knowing where they occurred.

However, changing nanmean to use this is completely fine since this is exactly the behaviour we want there. So I did that.

@lanzagar
Copy link
Contributor

lanzagar commented Feb 5, 2019

If you add new code with no new tests it can still be covered by existing tests, thus increasing the coverage... ;)

@pavlin-policar pavlin-policar force-pushed the randomized-pca-improvements branch 2 times, most recently from e714b48 to 65b43b2 Compare February 11, 2019 12:23
@pavlin-policar pavlin-policar force-pushed the randomized-pca-improvements branch from 65b43b2 to d974ef4 Compare February 11, 2019 12:55
@lanzagar lanzagar merged commit 0430bae into biolab:master Feb 15, 2019
@pavlin-policar pavlin-policar deleted the randomized-pca-improvements branch February 15, 2019 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants