Skip to content

Conversation

KevsterAmp
Copy link
Contributor

@KevsterAmp KevsterAmp commented Aug 5, 2024

I just copied the sort docstring of Dataframe.value_counts since the latter is mirrored to behaved like it.

image

@KevsterAmp KevsterAmp requested a review from rhshadrach as a code owner August 5, 2024 04:33
@KevsterAmp
Copy link
Contributor Author

Seems like I have changed the wrong value_count() function. I have changed pandas.core.groupby.SeriesGroupBy.value_counts rather than pandas.core.groupby.DataFrameGroupBy.value_counts. Applying fix in a bit

@KevsterAmp
Copy link
Contributor Author

Fixed it
image

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Return proportions rather than frequencies.
sort : bool, default True
Sort by frequencies.
Sort by frequencies when True. Sort by DataFrame column values when False.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also want to modify the docstring on SeriesGroupBy.value_counts. Can you make the same adjustment there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing

@rhshadrach rhshadrach added Docs Groupby Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 5, 2024
@KevsterAmp
Copy link
Contributor Author

@rhshadrach - are there any extra changes needed for this PR? Thanks

@sfc-gh-joshi
Copy link

sfc-gh-joshi commented Aug 8, 2024

@KevsterAmp I left an additional ? on the parent issue: #59307 (comment)

It seems like when the groupby's sort flag and value_counts sort flag are both false, no sorting is done. Is this a behavioral bug, or just something else to add to documentation?

edit: It seems also that when value_counts has sort=True, the result frame is still being sorted on the DataFrame column values:

>>> data0 = {
    "by": ["c", "b", "a", "a", "b", "b", "c", "a"],
    "value1": ["ee", "aa", "bb", "aa", "bb", "cc", "dd", "aa"],
    "value2": [1, 2, 3, 1, 1, 3, 2, 1],
}
>>> pd.DataFrame(data0).groupby(by=["by"]).value_counts(sort=True)
by  value1  value2
a   aa      1         2
    bb      3         1
b   aa      2         1
    bb      1         1
    cc      3         1
c   dd      2         1
    ee      1         1

In the input frame, the row c ee 1 (which appears once) appears before c dd 1 (also appearing once), but the result ends up being sorted on value1 as well as the counts. I think it's worth clarifying whether value_counts(sort=True) will also sort on the DataFrame's values, or if it's just that the algorithm used to sort the counts column is unstable.

@KevsterAmp
Copy link
Contributor Author

Fixed the doc, thanks.

image

@KevsterAmp KevsterAmp requested a review from rhshadrach August 12, 2024 01:53
@KevsterAmp
Copy link
Contributor Author

@KevsterAmp I left an additional ? on the parent issue: #59307 (comment)

It seems like when the groupby's sort flag and value_counts sort flag are both false, no sorting is done. Is this a behavioral bug, or just something else to add to documentation?

edit: It seems also that when value_counts has sort=True, the result frame is still being sorted on the DataFrame column values:

>>> data0 = {
    "by": ["c", "b", "a", "a", "b", "b", "c", "a"],
    "value1": ["ee", "aa", "bb", "aa", "bb", "cc", "dd", "aa"],
    "value2": [1, 2, 3, 1, 1, 3, 2, 1],
}
>>> pd.DataFrame(data0).groupby(by=["by"]).value_counts(sort=True)
by  value1  value2
a   aa      1         2
    bb      3         1
b   aa      2         1
    bb      1         1
    cc      3         1
c   dd      2         1
    ee      1         1

In the input frame, the row c ee 1 (which appears once) appears before c dd 1 (also appearing once), but the result ends up being sorted on value1 as well as the counts. I think it's worth clarifying whether value_counts(sort=True) will also sort on the DataFrame's values, or if it's just that the algorithm used to sort the counts column is unstable.

Not sure if I will address this in this PR, since the scope of this one is to just modify the doc. Thanks

@KevsterAmp
Copy link
Contributor Author

@rhshadrach - are there any extra changes needed for this PR? Thanks

@rhshadrach
Copy link
Member

@KevsterAmp - I am hesitant to make this change since certain cases that are correct now become incorrect. When you do df.groupby(..., sort=False).value_counts(sort=False), pandas is preserving the order of the input. See #59307 (comment)

@rhshadrach
Copy link
Member

@KevsterAmp - I've opened #59307 based on the discussion in the linked issue. Unfortunately that supersedes this PR. Going to close here - but thanks for working on this!

@rhshadrach rhshadrach closed this Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Docs Groupby

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: GroupBy.value_counts doesn't preserve original order for non-grouping rows

4 participants