Skip to content

Conversation

mzeitlin11
Copy link
Member

Benchmarks:


       before           after         ratio
     [d61ace50]       [7fb839ce]
     <master~4>       <gb_sample>
-         116±5ms         29.8±3ms     0.26  groupby.Sample.time_sample
-        808±20ms         75.9±1ms     0.09  groupby.Sample.time_sample_weights

@mzeitlin11 mzeitlin11 added Groupby Performance Memory or execution speed performance Refactor Internal refactoring of code labels Jun 25, 2021
# ------ #


def preprocess_weights(obj: FrameOrSeries, weights, axis: int) -> np.ndarray:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was pretty much moved as is, with the only change being to convert to an ndarray earlier for better performance on later validation steps

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im really not a fan of Series/DataFrame methods living in this file. are there any other natural homes for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

util/_validators is almost a good fit, but while this validates, it also returns a modified input. Could go in core/common? But still learning where everything lives, happy to move if anyone has a better location

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this function is the only one that really depends on Series/DataFrame; could it just stay inside the NDFrame method? the others could go in e.g. core.array_algos.sample

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nicer, but the issue is that the weights processing also needs to be called from groupby sample as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to implement something like a core.methods directory for Series/DataFrame methods that have been refactored to their own files (e.g. describe). I think algos.SelectN might make sense in something like that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea - have moved to core/sample.py (on same level as describe) for now. If others like this organization, I can follow up moving describe and sample (and maybe others like you mention) to core/methods.

return weights


def process_sampling_size(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditionals here were refactored for mypy (but IMO clearer to follow as well)

@jreback jreback added this to the 1.4 milestone Jul 2, 2021
@jreback jreback merged commit fee2b87 into pandas-dev:master Jul 2, 2021
@jreback
Copy link
Contributor

jreback commented Jul 2, 2021

very nice @mzeitlin11

@mzeitlin11 mzeitlin11 deleted the gb_sample branch July 2, 2021 00:07
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Groupby Performance Memory or execution speed performance Refactor Internal refactoring of code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

REF: Move sampling logic into algorithms.py

3 participants