Commit d7dcbdb

rename file, rephrase main proposal points, temporarily remove most other content
1 parent 3b54a73 commit d7dcbdb

File tree

2 files changed: +107, -212 lines

web/pandas/pdeps/0015-ice-cream-agreement.md
Lines changed: 0 additions & 212 deletions — this file was deleted.

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# PDEP-16: Consistent missing value handling (with a single NA scalar)

- Created: March 2024
- Status: Under discussion
- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
- Authors: [Patrick Hoefler](https://github.com/phofl),
  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1
## Abstract

...
## Background

Currently, pandas handles missing data differently for different data types. We
use different sentinels to indicate that a value is missing: ``np.nan`` for
floating-point data; ``np.nan`` or ``None`` for object-dtype data, typically
strings or booleans; and ``pd.NaT`` for datetimelike data. Some other data
types, such as integer and bool, cannot store missing data at all and are cast
to float or object dtype when a missing value is introduced. In addition,
pandas 1.0 introduced a new missing value sentinel, ``pd.NA``, which is used
for the experimental nullable integer, float, boolean, and string data types,
and more recently also for the pyarrow-backed data types.
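The differences described above can be observed directly in today's pandas; a short illustration (behavior as of recent pandas releases):

```python
import numpy as np
import pandas as pd

# Floating-point data: missing values are represented as np.nan
s_float = pd.Series([1.0, None, 3.0])
assert s_float.dtype == np.dtype("float64")

# Integer data with a missing value is upcast to float64
s_int = pd.Series([1, None, 3])
assert s_int.dtype == np.dtype("float64")

# Datetimelike data uses pd.NaT
s_dt = pd.Series(pd.to_datetime(["2024-03-01", None]))
assert s_dt[1] is pd.NaT

# The nullable extension dtypes use pd.NA instead
s_nullable = pd.Series([1, None, 3], dtype="Int64")
assert s_nullable[1] is pd.NA
```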
These different missing values also behave differently in user-facing
operations. Specifically, the nullable data types introduced different
semantics for certain operations (e.g. ``pd.NA`` propagates in comparison
operations, whereas ``np.nan`` compares as False).
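For example, comparing a NumPy-backed float column treats ``np.nan`` as an ordinary value that compares as False, while the nullable dtypes propagate ``pd.NA``:

```python
import numpy as np
import pandas as pd

# NumPy float64 dtype: a comparison involving np.nan yields False
s_numpy = pd.Series([1.0, np.nan, 3.0])
result_numpy = s_numpy > 2
assert list(result_numpy) == [False, False, True]

# Nullable Float64 dtype: the comparison propagates pd.NA instead
s_nullable = pd.Series([1.0, None, 3.0], dtype="Float64")
result_nullable = s_nullable > 2
assert result_nullable[1] is pd.NA
```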
The nullable extension dtypes and the ``pd.NA`` scalar were originally designed
to solve these problems and to provide consistent missing value behavior across
dtypes. Historically, however, they have been implemented as 1D arrays, which
hinders their use in scenarios that rely on the 2D block structure of the
pandas internals for fast operations (``axis=1`` operations, transposing,
etc.).
Long term, we want to introduce consistent missing data handling for all data
types. This includes consistent behavior in all operations (indexing,
arithmetic, comparisons, etc.) and a single missing value scalar that behaves
consistently everywhere.
## Proposal

This proposal aims to unify missing value handling across all dtypes. It is
not meant to address implementation details, but rather to provide a
high-level way forward.
1. All data types support missing values and use `pd.NA` exclusively as the
   user-facing missing value indicator.

2. All data types implement consistent missing value semantics corresponding
   to those of the current nullable dtypes using `pd.NA` (i.e. regarding
   behaviour in comparisons; see below for details).

3. As a consequence, pandas will move to nullable extension arrays by default
   for all data types, instead of the NumPy dtypes that are currently the
   default. To preserve the default 2D block structure of the DataFrame
   internals, the ExtensionArray interface will be extended to support 2D
   arrays.

4. For backwards compatibility, existing missing value indicators like `NaN`
   and `NaT` will be interpreted as `pd.NA` when introduced in user input, IO,
   or through operations (to ensure they keep being considered missing).
   For floating dtypes specifically, this means that in practice a float
   column can for now only contain NA values. Potentially distinguishing NA
   and NaN is left for a separate discussion.
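The behavior proposed here as the default is already available today by opting into the nullable dtypes; a sketch of points 1, 3 and 4:

```python
import pandas as pd

# Today this is opt-in via the nullable dtypes; under the proposal it
# would become the default.
s = pd.Series([1, 2, 3], dtype="Int64")

# Inserting a missing value does not upcast the column to float64
s[1] = pd.NA
assert str(s.dtype) == "Int64"
assert s[1] is pd.NA

# Point 4: NaN in user input is interpreted as NA for nullable float dtypes
s2 = pd.Series([1.0, float("nan")], dtype="Float64")
assert s2[1] is pd.NA
```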
This will ensure that all dtypes have consistent missing value handling, with
no need to upcast when a missing value is inserted into an integer or boolean
column. These nullability semantics are mostly consistent with how PyArrow
treats nulls, and thus make switching between both sets of dtypes easier.
Additionally, this allows other Arrow dtypes that use the same semantics
(bytes, nested dtypes, ...) to be used by default.
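One example of these PyArrow-compatible semantics is Kleene (three-valued) logic for boolean operations, which pandas' nullable boolean dtype already implements:

```python
import pandas as pd

# Kleene logic: NA | True is True, NA & False is False, NA | False is NA,
# matching how PyArrow treats nulls in boolean kernels.
s = pd.Series([True, False, None], dtype="boolean")

assert (s | True).tolist() == [True, True, True]
assert (s & False).tolist() == [False, False, False]
assert (s | False)[2] is pd.NA
```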
In practice, this means solidifying the existing nullable integer, float,
boolean, and string data types, and implementing (variants of) the
categorical, datetimelike, and interval data types using `pd.NA`. The proposal
leaves the exact implementation details out of scope: whether to use a mask or
a sentinel (where the best strategy might vary by data type depending on
existing code), whether to use byte masks or bitmaps, whether to use PyArrow
under the hood like the string dtype, etc.
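As a rough illustration of the mask-based strategy (a simplified sketch, not pandas internals): values and a validity mask are stored separately, and operations act on the values while the mask propagates alongside:

```python
import numpy as np

# Simplified sketch of a mask-based nullable array, similar in spirit to
# pandas' IntegerArray: the data buffer and the missing-value mask are
# separate arrays.
values = np.array([1, 2, 3], dtype="int64")
mask = np.array([False, True, False])  # True marks a missing entry

# Arithmetic operates on the values buffer; the mask propagates unchanged,
# so no upcast to float64 is needed to represent the missing entry.
result_values = values + 10
result_mask = mask.copy()

assert list(result_values[~result_mask]) == [11, 13]
assert result_mask.tolist() == [False, True, False]
```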
This PDEP also does not define the exact API for dtype constructors or propose
a new consistent interface; that is left for a separate discussion (PDEP-13).
### The `NA` scalar

...

### Missing value semantics

...

## Backward compatibility

...

## Timeline

...
### PDEP History

- March 2024: Initial draft

Note: There is a very long discussion in
[GH-32265](https://github.com/pandas-dev/pandas/issues/32265) that concerns
this topic.
