Commit 7c5f8c7

PDEP-11: Change default of dropna to False
1 parent fbcbdaf commit 7c5f8c7

1 file changed

+78
-0
lines changed

@@ -0,0 +1,78 @@
# PDEP-11: dropna default in pandas

- Created: 4 May 2023
- Status: Under discussion
- Discussion: [PR ??](https://github.com/pandas-dev/pandas/pull/??)
- Authors: [Richard Shadrach](https://github.com/rhshadrach)
- Revision: 1

## Abstract

Throughout pandas, almost all of the methods that have a `dropna` argument default
to `True`. Being the default, this can cause NA values to be silently dropped.
This PDEP proposes to deprecate the current default value of `True` and change it
to `False` in the next major release of pandas.

## Motivation and Scope

Upon seeing the output for a Series `ser`:

```python
print(ser.value_counts())

1    3
2    1
dtype: Int64
```

users may be surprised to learn that the Series can contain NA values. Operating
on the data under the assumption that NA values are not present can then produce
erroneous results. The same issue can occur with `groupby`, which can also be
used to produce detailed summary statistics of data. We think it is not
unreasonable that an experienced pandas user seeing the code

    df[["a", "b"]].groupby("a").sum()

would describe this operation as something like the following.

> For each unique value in column `a`, compute the sum of corresponding values
> in column `b` and return the results in a DataFrame indexed by the unique
> values of `a`.

This is correct, except that rows with NA values in column `a` will be dropped
from the computation. That pandas is taking this additional step in the
computation is not apparent from the code, and can surprise users.
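
The two cases above can be reproduced with a small sketch. The data here is hypothetical (it is not taken from the proposal), and the names `default_counts`, `explicit_counts`, `default_sum`, and `explicit_sum` are introduced only for illustration. It assumes a pandas version in which both `value_counts` and `groupby` accept `dropna`.

```python
import pandas as pd

# Hypothetical data: an Int64 Series with one missing value.
ser = pd.Series([1, 1, 1, 2, pd.NA], dtype="Int64")

# With the current default (dropna=True) the NA entry is silently left out.
default_counts = ser.value_counts()
# Passing dropna=False keeps NA as its own entry.
explicit_counts = ser.value_counts(dropna=False)

print(default_counts)
print(explicit_counts)

# Hypothetical frame whose key column "a" contains a missing value.
df = pd.DataFrame({"a": [1.0, 1.0, None, 2.0], "b": [10, 20, 30, 40]})

# Current default: the row whose key is NaN does not contribute to any group.
default_sum = df[["a", "b"]].groupby("a").sum()
# dropna=False: the NaN key forms its own group.
explicit_sum = df[["a", "b"]].groupby("a", dropna=False).sum()

print(default_sum)
print(explicit_sum)
```

In this sketch the default results cover only four of the five Series values and three of the four DataFrame rows, and nothing in the default output signals that anything was left out.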

## Detailed Description

We propose to deprecate the current default of `dropna` and change it to
`False` across all applicable methods. The following methods have a `dropna`
argument; those marked with a `*` already default to `False`.

```python
Series.groupby
Series.mode
Series.nunique
*Series.to_hdf
Series.value_counts
DataFrame.groupby
DataFrame.mode
DataFrame.nunique
DataFrame.pivot_table
DataFrame.stack
*DataFrame.to_hdf
DataFrame.value_counts
SeriesGroupBy.nunique
SeriesGroupBy.value_counts
DataFrameGroupBy.nunique
DataFrameGroupBy.value_counts
```
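
One way to check which default an installed pandas version declares for some of the methods listed above is to read it from the method signature. The following is a minimal sketch: the helper `dropna_default` and the choice of methods are assumptions made for illustration, and the printed values depend on the pandas version in use, so they may not match the list above exactly.

```python
import inspect

import pandas as pd


def dropna_default(method):
    """Return the declared default of ``dropna``, or "absent" if there is no such parameter."""
    param = inspect.signature(method).parameters.get("dropna")
    return "absent" if param is None else param.default


# A hypothetical selection of the methods named in the list above.
methods = {
    "Series.value_counts": pd.Series.value_counts,
    "Series.nunique": pd.Series.nunique,
    "Series.mode": pd.Series.mode,
    "DataFrame.pivot_table": pd.DataFrame.pivot_table,
    "DataFrame.to_hdf": pd.DataFrame.to_hdf,
}

defaults = {name: dropna_default(method) for name, method in methods.items()}

for name, default in defaults.items():
    print(f"{name}: dropna={default!r}")
```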

## Timeline

If accepted, the current `dropna` default would be deprecated as part of pandas
2.x and this deprecation would be enforced in pandas 3.0.
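
A deprecation of a default value is commonly done with a sentinel that distinguishes "argument not passed" from an explicit `True`. The sketch below illustrates that general pattern with a toy function. It is not pandas code, and `count_values`, `_NO_DEFAULT`, and the warning text are all hypothetical.

```python
import warnings
from collections import Counter

# Hypothetical sentinel meaning "the caller did not pass dropna".
_NO_DEFAULT = object()


def count_values(values, dropna=_NO_DEFAULT):
    """Toy stand-in for a method with a ``dropna`` argument (not pandas code)."""
    if dropna is _NO_DEFAULT:
        # Deprecation period: warn, but keep the current behavior.
        warnings.warn(
            "The default of dropna will change from True to False in a future "
            "version; pass dropna explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        dropna = True
    if dropna:
        # None plays the role of a missing value in this toy example.
        values = [v for v in values if v is not None]
    return Counter(values)
```

Callers who pass `dropna=True` or `dropna=False` explicitly see no warning and no change in behavior. Enforcing the deprecation would then amount to replacing the sentinel default with `False` and removing the warning branch.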

## PDEP History

- 4 May 2023: Initial draft
