PERF/REF: groupby sample #42233

mzeitlin11 · 2021-06-25T18:22:21Z

closes REF: Move sampling logic into algorithms.py #34483
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Benchmarks:


       before           after         ratio
     [d61ace50]       [7fb839ce]
     <master~4>       <gb_sample>
-         116±5ms         29.8±3ms     0.26  groupby.Sample.time_sample
-        808±20ms         75.9±1ms     0.09  groupby.Sample.time_sample_weights
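
For context, the two benchmarks time `GroupBy.sample` without and with weights. The asv setup itself is not shown in this thread, so the following is only a rough illustration of the kind of call being measured; the frame size, column names, and number of groups are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the real asv benchmark setup is not shown in this thread.
rng = np.random.default_rng(0)
n_rows = 1_000
df = pd.DataFrame(
    {
        "group": rng.integers(0, 10, n_rows),  # 10 groups, roughly 100 rows each
        "value": rng.random(n_rows),
    }
)
weights = rng.random(n_rows)  # one non-negative weight per row of df

# Roughly what time_sample exercises: draw n rows from every group.
unweighted = df.groupby("group").sample(n=2, random_state=0)

# Roughly what time_sample_weights exercises: the same draw, with
# per-row weights that are normalized within each group.
weighted = df.groupby("group").sample(n=2, weights=weights, random_state=0)
```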

mzeitlin11 · 2021-06-25T18:23:30Z

pandas/core/algorithms.py

+# ------ #
+
+
+def preprocess_weights(obj: FrameOrSeries, weights, axis: int) -> np.ndarray:


This was moved pretty much as is. The only change is converting to an ndarray earlier, which gives better performance in the later validation steps.

jbrockmendel (Jun 25, 2021): I'm really not a fan of Series/DataFrame methods living in this file. Are there any other natural homes for this?

mzeitlin11 (Jun 25, 2021): util/_validators is almost a good fit, but while this validates, it also returns a modified input. Could it go in core/common? I'm still learning where everything lives, so I'm happy to move it if anyone has a better location.

jbrockmendel (Jun 25, 2021): Looks like this function is the only one that really depends on Series/DataFrame; could it just stay inside the NDFrame method? The others could go in e.g. core.array_algos.sample.

mzeitlin11 (Jun 25, 2021): That would be nicer, but the issue is that the weights processing also needs to be called from groupby sample.

jbrockmendel (Jun 26, 2021): Another option would be to implement something like a core.methods directory for Series/DataFrame methods that have been refactored to their own files (e.g. describe). I think algos.SelectN might make sense in something like that.

mzeitlin11 (Jun 28, 2021): I like that idea. I have moved this to core/sample.py (on the same level as describe) for now. If others like this organization, I can follow up by moving describe and sample (and maybe others like you mention) to core/methods.
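
To illustrate the point about converting to an ndarray early, here is a simplified sketch. It is not the code from this PR: the function name, the `n_obs` parameter, the error messages, and the normalization step are all assumptions made for the example.

```python
import numpy as np


def preprocess_weights_sketch(weights, n_obs: int) -> np.ndarray:
    """Hypothetical helper: validate sampling weights and return probabilities."""
    # Convert once, up front, so every later check is a vectorized ndarray
    # operation. np.array copies, so the caller's data is never modified.
    arr = np.array(weights, dtype="float64")

    if arr.ndim != 1 or len(arr) != n_obs:
        raise ValueError("weights must be 1-D and match the number of rows")
    if np.isinf(arr).any():
        raise ValueError("weights may not include inf values")

    # Treat missing weights as zero probability.
    arr[np.isnan(arr)] = 0.0

    if (arr < 0).any():
        raise ValueError("weights may not include negative values")

    total = arr.sum()
    if total == 0:
        raise ValueError("weights sum to zero")

    # Normalize so the result can be used directly as sampling probabilities.
    return arr / total
```

The returned array could then be passed as `p` to `numpy.random.Generator.choice`, for example `np.random.default_rng(0).choice(3, size=2, replace=False, p=preprocess_weights_sketch([1, 1, 2], 3))`.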

mzeitlin11 · 2021-06-25T18:24:26Z

pandas/core/algorithms.py

+    return weights
+
+
+def process_sampling_size(


Conditionals here were refactored for mypy (but IMO clearer to follow as well)
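
As a rough picture of what mypy-friendly conditionals for the sampling size can look like, consider the sketch below. It is an illustration under assumptions, not the function from the diff: the name, the error messages, and the convention of returning `None` when `frac` is used are all hypothetical.

```python
from typing import Optional


def process_sampling_size_sketch(
    n: Optional[int], frac: Optional[float], replace: bool
) -> Optional[int]:
    """Hypothetical helper: return a row count, or None if frac should be used."""
    if n is None and frac is None:
        # Neither given: default to sampling a single row.
        return 1
    if n is not None and frac is not None:
        raise ValueError("pass either n or frac, not both")
    if n is not None:
        if n < 0:
            raise ValueError("n must be non-negative")
        if n % 1 != 0:
            raise ValueError("n must be an integer")
        return int(n)

    # Only frac can be set at this point; the assert lets mypy narrow
    # Optional[float] to float before the comparisons below.
    assert frac is not None
    if frac < 0:
        raise ValueError("frac must be non-negative")
    if frac > 1 and not replace:
        raise ValueError("frac > 1 requires replace=True")
    return None
```

Handling each combination of `n` and `frac` in its own branch means the type checker can see which value is non-`None` in every branch, and a reader can follow the cases in order.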

pandas/core/groupby/groupby.py

jreback · 2021-07-02T00:05:17Z

very nice @mzeitlin11

mzeitlin11 added 12 commits June 21, 2021 21:21

5f6c210  wip
1ad13c4  Merge remote-tracking branch 'upstream/master' into gb_sample
6dc1485  WIP
45e0fe1  Add asv
e7c7d75  Avoid concat
ca26efb  Clean dead code
f147052  WIP
79e6b61  Merge remote-tracking branch 'upstream/master' into gb_sample
3834f0c  Add whatsnew, fix some typing
a702870  Add docstrings
4ff88c9  Merge remote-tracking branch 'upstream/master' into gb_sample
7fb839c  Improve some variable names

mzeitlin11 added the Groupby, Performance (memory or execution speed performance), and Refactor (internal refactoring of code) labels Jun 25, 2021

mzeitlin11 added 3 commits June 27, 2021 20:37

202c3c1  Merge remote-tracking branch 'upstream/master' into gb_sample
994384d  Move to sample.py
fe9b028  Add module comment

jreback added this to the 1.4 milestone Jul 2, 2021

jreback merged commit fee2b87 into pandas-dev:master Jul 2, 2021

mzeitlin11 deleted the gb_sample branch July 2, 2021 00:07

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

PERF/REF: groupby sample (pandas-dev#42233)

5fca6e4

mzeitlin11 mentioned this pull request Aug 4, 2021: REGR: sample modifying weights inplace #42843 (merged)
