3 changes: 2 additions & 1 deletion doc/source/whatsnew/v2.3.0.rst
@@ -36,7 +36,8 @@ Other enhancements
when using ``np.array()`` or ``np.asarray()`` on pandas objects) has been
updated to work correctly with NumPy >= 2 (:issue:`57739`)
- The :meth:`~Series.sum` reduction is now implemented for ``StringDtype`` columns (:issue:`59853`)
- -
+ - Median percentile is only included in :meth:`~Series.describe` when a blank
+   list is passed (:issue:`60550`).

.. ---------------------------------------------------------------------------
.. _whatsnew_230.notable_bug_fixes:
3 changes: 2 additions & 1 deletion pandas/core/generic.py
@@ -10795,7 +10795,8 @@ def describe(
The percentiles to include in the output. All should
fall between 0 and 1. The default is
``[.25, .5, .75]``, which returns the 25th, 50th, and
- 75th percentiles.
+ 75th percentiles. If a blank list is passed, then returns
+ only the 50th percentile value.
Member

I do not think this should be the behavior. If I pass an empty list, I should not get any percentiles.

Author

I was trying to preserve the default percentile behavior. I will change this so that percentiles are omitted if a blank list is passed.

Member (@asishm, Jan 11, 2025)

I don't think we should remove the 50th percentile without at least a deprecation warning. Reporting the median by default is incredibly helpful.

Member

Agreed we should not be changing the default behavior. Recommend setting no_default as the default value and maintaining the default behavior when it is not changed.

Author

Okay, so we will add an argument no_default : bool = False and use it to control whether the current behavior is preserved.

Member

@rhshadrach I got confused. The default for percentiles is already None, so I'm not sure it's worth changing to lib.no_default.

@ZenithClown I believe what @rhshadrach means is to change the behavior so that the 50th percentile is removed only when a user explicitly passes percentiles=[], and not when percentiles is left at its default of None (or no_default, if we adopt that suggestion).

Member

@asishm - me too! Agreed leaving the default as None is fine.
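
The distinction drawn in this thread, between a caller who omits `percentiles` and one who explicitly passes `[]`, can be sketched in plain Python. This is only an illustration of the behavior the reviewers appear to converge on, not the code in this PR: `resolve_percentiles` and `DEFAULT_PERCENTILES` are hypothetical names, and `None` acts as the "not passed" marker, as it already does for `describe`.

```python
from typing import Optional, Sequence

# Assumption: mirrors the documented default of [.25, .5, .75].
DEFAULT_PERCENTILES = [0.25, 0.5, 0.75]


def resolve_percentiles(percentiles: Optional[Sequence[float]] = None) -> list:
    """Hypothetical helper: decide which percentiles to report."""
    if percentiles is None:
        # Argument omitted: keep the documented default, median included.
        return list(DEFAULT_PERCENTILES)
    # Argument given explicitly: report exactly what was asked for,
    # so an empty list means "no percentiles at all".
    return sorted(percentiles)


print(resolve_percentiles())       # argument omitted
print(resolve_percentiles([]))     # explicit empty list
print(resolve_percentiles([0.2]))  # explicit single percentile
```

A dedicated sentinel such as `lib.no_default` would only be needed if `None` were itself a meaningful value for callers to pass; since it is not here, keeping `None` as the default is enough to tell the two cases apart.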

include : 'all', list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored
for ``Series``. Here are the options:
11 changes: 6 additions & 5 deletions pandas/core/methods/describe.py
@@ -74,7 +74,8 @@ def describe_ndframe(
percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1.
The default is ``[.25, .5, .75]``, which returns the 25th, 50th, and
- 75th percentiles.
+ 75th percentiles. If a blank list is passed, then returns only the
+ 50th percentile value.

Returns
-------
@@ -351,13 +352,13 @@ def _refine_percentiles(
     # explicit conversion of `percentiles` to list
     percentiles = list(percentiles)

+    # median should be included only if blank iterable is passed
+    if len(percentiles) == 0:
+        return np.array([0.5])
+
     # get them all to be in [0, 1]
     validate_percentile(percentiles)

-    # median should always be included
-    if 0.5 not in percentiles:
-        percentiles.append(0.5)
-
     percentiles = np.asarray(percentiles)

     # sort and check for duplicates
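
To try the control flow of this revision without a pandas checkout, here is a plain-Python sketch of the hunk above. `refine_percentiles_sketch` is a hypothetical stand-in, the range and duplicate checks are simplified assumptions rather than the library's `validate_percentile`, and the `None` branch is inferred from the documented default of `[.25, .5, .75]`, not from lines shown in this diff.

```python
def refine_percentiles_sketch(percentiles=None):
    """Hypothetical pure-Python rendering of the logic in this revision."""
    if percentiles is None:
        # Assumption: the documented default when nothing is passed.
        return [0.25, 0.5, 0.75]

    percentiles = list(percentiles)

    # As in the hunk: an empty iterable falls back to the median only.
    if len(percentiles) == 0:
        return [0.5]

    # Simplified stand-in for the range validation.
    for p in percentiles:
        if not 0 <= p <= 1:
            raise ValueError("percentiles should all be in the interval [0, 1]")

    # The median is no longer appended implicitly; just reject duplicates and sort.
    if len(set(percentiles)) < len(percentiles):
        raise ValueError("percentiles cannot contain duplicates")
    return sorted(percentiles)


print(refine_percentiles_sketch([]))     # median only in this revision
print(refine_percentiles_sketch([0.2]))  # no implicit 0.5
```

Note that this mirrors the revision under review, where `[]` yields the median; the review thread asks for `[]` to yield no percentiles instead.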
31 changes: 31 additions & 0 deletions pandas/tests/frame/methods/test_describe.py
@@ -413,3 +413,34 @@ def test_describe_exclude_pa_dtype(self):
             dtype=pd.ArrowDtype(pa.float64()),
         )
         tm.assert_frame_equal(result, expected)
+
+    def test_refine_percentiles(self):
+        # GH#60550
+        df = DataFrame({"a" : np.arange(0, 10, 1)})
+
+        # the default behavior is to return [0.25, 0.5, 0.75]
+        result = df.describe()
+        expected = DataFrame(
+            {"a" : [10, df.a.mean(), df.a.std(), 0, 2.25, 4.5, 6.75, 9]},
+            index=["count", "mean", "std", "min", "25%", "50%", "75%", "max"]
+        )
+
+        tm.assert_frame_equal(result, expected)
+
+        # if an empty list is passed, it should return [0.5]
+        result = df.describe(percentiles=[])
+        expected = DataFrame(
+            {"a" : [10, df.a.mean(), df.a.std(), 0, 4.5, 9]},
+            index=["count", "mean", "std", "min", "50%", "max"]
+        )
+
+        tm.assert_frame_equal(result, expected)
+
+        # if a list is passed, it should return with the same values
+        result = df.describe(percentiles=[0.2])
+        expected = DataFrame(
+            {"a" : [10, df.a.mean(), df.a.std(), 0, 1.8, 9]},
+            index=["count", "mean", "std", "min", "20%", "max"]
+        )
+
+        tm.assert_frame_equal(result, expected)
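
The expected percentile values in this test (1.8, 2.25, 4.5 and 6.75 for the integers 0 through 9) are what linear interpolation between neighbouring sorted values gives, which is the default interpolation for pandas quantiles. The following is a minimal sketch of that arithmetic; `linear_percentile` is a hypothetical helper written for illustration, not a pandas function.

```python
import math


def linear_percentile(sorted_values, q):
    """Hypothetical helper: quantile q of pre-sorted data, linear interpolation."""
    # Fractional index of the quantile within the sorted data.
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate between the two neighbouring observations.
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


data = list(range(10))  # same values as np.arange(0, 10, 1) in the test
for q in (0.2, 0.25, 0.5, 0.75):
    print(f"{q:.0%}: {linear_percentile(data, q):.2f}")
```

For ten evenly spaced values the fractional index is simply `q * 9`, so the 20th percentile sits at index 1.8 and therefore has the value 1.8.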