| 1 | +# PDEP-16: Consistent missing value handling (with a single NA scalar) |
| 2 | + |
| 3 | +- Created: March 2024 |
| 4 | +- Status: Under discussion |
| 5 | +- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265) |
| 6 | +- Author: [Patrick Hoefler](https://github.com/phofl) |
| 7 | + [Joris Van den Bossche](https://github.com/jorisvandenbossche) |
| 8 | +- Revision: 1 |
| 9 | + |
| 10 | +## Abstract |
| 11 | + |
| 12 | +... |
| 13 | + |
| 14 | +## Background |
| 15 | + |
| 16 | +Currently, pandas handles missing data differently for different data types. We |
| 17 | +use different sentinels to indicate that a value is missing: ``np.nan`` for |
| 18 | +floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically |
| 19 | +strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike |
| 20 | +data. Some other data types, such as integer and bool, cannot store missing data |
| 21 | +and are instead cast to float or object dtype. In addition, pandas 1.0 introduced a new |
| 22 | +missing value sentinel, ``pd.NA``, which is being used for the experimental |
| 23 | +nullable integer, float, boolean, and string data types, and more recently also |
| 24 | +for the pyarrow-backed data types. |
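
The differences described above can be seen in a short session. This is a minimal sketch of current behavior, assuming a recent pandas release with the NumPy-backed defaults; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Default NumPy-backed dtypes: the missing value sentinel depends on the dtype.
float_ser = pd.Series([1.5, None])                        # float64, missing is NaN
dt_ser = pd.Series(pd.to_datetime(["2024-03-01", None]))  # datetime64, missing is NaT

# int64 cannot hold a missing value, so introducing one upcasts to float64.
int_ser = pd.Series([1, 2]).reindex([0, 1, 2])

# Nullable extension dtype: the missing value is pd.NA and the dtype is kept.
nullable_int = pd.Series([1, None], dtype="Int64")

print(float_ser.dtype, dt_ser.dtype, int_ser.dtype, nullable_int.dtype)
```

Here `float_ser` holds `NaN`, `dt_ser` holds `NaT`, `int_ser` has silently become `float64`, and only `nullable_int` keeps its integer dtype with `pd.NA`.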
| 25 | + |
| 26 | +These different missing values also have different behaviors in user-facing |
| 27 | +operations. Specifically, we introduced different semantics for the nullable |
| 28 | +data types for certain operations (e.g. propagating in comparison operations |
| 29 | +instead of comparing as False). |
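
As an illustration of the comparison difference mentioned above, the following sketch contrasts a NumPy-backed float column with the existing nullable `Float64` dtype. It reflects current behavior of those dtypes and is not a specification of what this proposal will finalize.

```python
import numpy as np
import pandas as pd

numpy_backed = pd.Series([1.0, np.nan])             # float64 with NaN
nullable = pd.Series([1.0, None], dtype="Float64")  # Float64 with pd.NA

# NaN compares as False, so the result is a plain bool column.
eq_numpy = numpy_backed == 1.0     # [True, False], dtype bool

# pd.NA propagates, so the result is a nullable boolean column.
eq_nullable = nullable == 1.0      # [True, <NA>], dtype boolean

print(eq_numpy.tolist(), eq_nullable.tolist())
```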
| 30 | + |
| 31 | +The nullable extension dtypes and the `pd.NA` scalar were originally designed to |
| 32 | +solve these problems and to provide consistent missing value behavior between |
| 33 | +different dtypes. Historically, those have been used only as 1D arrays, which hinders usage |
| 34 | +of those dtypes in certain scenarios that rely on the 2D block structure of the |
| 35 | +pandas internals for fast operations (``axis=1`` operations, transposing, etc.). |
| 36 | + |
| 37 | +Long term, we want to introduce consistent missing data handling for all data |
| 38 | +types. This includes consistent behavior in all operations (indexing, arithmetic |
| 39 | +operations, comparisons, etc.) and using a missing value scalar that behaves |
| 40 | +consistently. |
| 41 | + |
| 42 | +## Proposal |
| 43 | + |
| 44 | +This proposal aims to unify missing value handling across all dtypes. It is |
| 45 | +not meant to address implementation details, but rather to provide a |
| 46 | +high-level way forward. |
| 47 | + |
| 48 | +1. All data types support missing values and use `pd.NA` exclusively as the |
| 49 | + user-facing missing value indicator. |
| 50 | + |
| 51 | +2. All data types implement consistent missing value "semantics" corresponding |
| 52 | + to the current nullable dtypes using `pd.NA` (i.e. regarding behavior in |
| 53 | + comparisons, see below for details). |
| 54 | + |
| 55 | +3. As a consequence, pandas will move to nullable extension arrays by default |
| 56 | + for all data types, instead of using the NumPy dtypes that are currently the |
| 57 | + default. To preserve the default 2D block structure of the DataFrame internals, |
| 58 | + the ExtensionArray interface will be extended to support 2D arrays. |
| 59 | + |
| 60 | +4. For backwards compatibility, existing missing value indicators like `NaN` and |
| 61 | + `NaT` will be interpreted as `pd.NA` when introduced in user input, IO or |
| 62 | + through operations (to ensure they continue to be treated as missing). |
| 63 | + Specifically for floating dtypes, in practice this means a float column can |
| 64 | + for now only contain NA as its missing value, not NaN. Potentially distinguishing NA and NaN is left |
| 65 | + for a separate discussion. |
| 66 | + |
| 67 | +This will ensure that all dtypes have consistent missing value handling and there |
| 68 | +is no need to upcast if a missing value is inserted into integers or booleans. Those |
| 69 | +nullability semantics will be mostly consistent with how PyArrow treats nulls and thus |
| 70 | +make switching between both sets of dtypes easier. Additionally, it allows the usage of |
| 71 | +other Arrow dtypes by default that use the same semantics (bytes, nested dtypes, ...). |
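
The "no upcast" point and the treatment of `NaN` input can be sketched with the nullable dtypes that exist today. This assumes current pandas behavior with default options, where `NaN` passed to a nullable dtype is stored as `pd.NA`; the final behavior under this proposal may differ in details.

```python
import numpy as np
import pandas as pd

# Inserting a missing value keeps the integer dtype instead of upcasting.
ints = pd.Series([1, 2, 3], dtype="Int64")
ints.iloc[1] = pd.NA

# The same holds for booleans.
flags = pd.Series([True, False], dtype="boolean")
flags.iloc[0] = pd.NA

# NaN in user input is interpreted as a missing value and stored as pd.NA.
from_nan = pd.Series([1.0, np.nan], dtype="Float64")

print(ints.dtype, flags.dtype, from_nan.isna().tolist())
```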
| 72 | + |
| 73 | +In practice, this means solidifying the existing nullable integer, float, |
| 74 | +boolean and string data types, and implementing (variants of) the |
| 75 | +categorical, datetimelike and interval data types using `pd.NA`. The exact |
| 76 | +implementation details are out of scope for this proposal: for example, |
| 77 | +whether to use a mask or a sentinel (the best strategy might vary by data |
| 78 | +type depending on existing code), whether to use byte masks or bitmaps, and |
| 79 | +whether to use PyArrow under the hood, like the string dtype. |
| 80 | + |
| 81 | +This PDEP also does not define the exact API for dtype constructors or |
| 82 | +propose a new consistent interface; this is left for a separate discussion |
| 83 | +(PDEP-13). |
| 84 | + |
| 85 | +### The `NA` scalar |
| 86 | + |
| 87 | +... |
| 88 | + |
| 89 | +### Missing value semantics |
| 90 | + |
| 91 | + |
| 92 | +... |
| 93 | + |
| 94 | +## Backward compatibility |
| 95 | + |
| 96 | +... |
| 97 | + |
| 98 | +## Timeline |
| 99 | + |
| 100 | +... |
| 101 | + |
| 102 | +### PDEP History |
| 103 | + |
| 104 | +- March 2024: Initial draft |
| 105 | + |
| 106 | +Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265) |
| 107 | +that concerns this topic. |