Commit af9da97

Updates for PDEP-14 and PDEP-16
1 parent 258178d commit af9da97


web/pandas/pdeps/0013-logical-type-system.md

Lines changed: 44 additions & 46 deletions
@@ -4,7 +4,7 @@
 - Status: Under discussion
 - Discussion: [#58141](https://github.com/pandas-dev/pandas/issues/58141)
 - Author: [Will Ayd](https://github.com/willayd),
-- Revision: 2
+- Revision: 3

 ## Abstract

@@ -30,22 +30,37 @@ dtype=object
 dtype=str
 dtype="string"
 dtype=pd.StringDtype()
+dtype=pd.StringDtype("python", na_value=np.nan)
+dtype=pd.StringDtype("python", na_value=pd.NA)
 dtype=pd.StringDtype("pyarrow")
 dtype="string[pyarrow]"
-dtype="string[pyarrow_numpy]"
+dtype="string[pyarrow_numpy]" # added in 2.1, deprecated in 2.3
 dtype=pd.ArrowDtype(pa.string())
 dtype=pd.ArrowDtype(pa.large_string())
 ```

-``dtype="string"`` was the first truly new string implementation starting back in pandas 0.23.0, and it is a common pitfall for new users not to understand that there is a huge difference between that and ``dtype=str``. The pyarrow strings have trickled in in more recent memory, but also are very difficult to reason about. The fact that ``dtype="string[pyarrow]"`` is not the same as ``dtype=pd.ArrowDtype(pa.string()`` or ``dtype=pd.ArrowDtype(pa.large_string())`` was a surprise [to the author of this PDEP](https://github.com/pandas-dev/pandas/issues/58321).
+``dtype="string"`` was the first truly new string implementation, introduced back in pandas 1.0.0, and it is a common pitfall for new users not to understand that there is a huge difference between it and ``dtype=str``. The pyarrow strings have trickled in over more recent releases, but are also very difficult to reason about. The fact that ``dtype="string[pyarrow]"`` is not the same as ``dtype=pd.ArrowDtype(pa.string())`` or ``dtype=pd.ArrowDtype(pa.large_string())`` was a surprise [to the author of this PDEP](https://github.com/pandas-dev/pandas/issues/58321).

-While some of these are aliases, the main reason why we have so many different string dtypes is because we have historically used NumPy and created custom missing value solutions around the ``np.nan`` marker, which are incompatible with the ``pd.NA`` sentinel introduced a few years back. Our ``pd.StringDtype()`` uses the pd.NA sentinel, as do our pyarrow based solutions; bridging these into one unified solution has proven challenging.
+While some of these are aliases, the main reason we have so many different string dtypes is that we have historically used NumPy and created custom missing value solutions around the ``np.nan`` marker, which are incompatible with ``pd.NA``. Our ``pd.StringDtype()`` uses the ``pd.NA`` sentinel, as do our pyarrow-based solutions; bridging these into one unified solution has proven challenging.

-To try and smooth over the different missing value semantics and how they affect the underlying type system, the status quo has been to add another string dtype. ``string[pyarrow_numpy]`` was an attempt to use pyarrow strings but adhere to NumPy nullability semantics, under the assumption that the latter offers maximum backwards compatibility. However, being the exclusive data type that uses pyarrow for storage but NumPy for nullability handling, this data type just adds more inconsistency to how we handle missing data, a problem we have been attempting to solve back since discussions around pandas2. The name ``string[pyarrow_numpy]`` is not descriptive to end users, and unless it is inferred requires users to explicitly ``.astype("string[pyarrow_numpy]")``, again putting a burden on end users to know what ``pyarrow_numpy`` means and to understand the missing value semantics of both systems.
+To try and smooth over the different missing value semantics and how they affect the underlying type system, the status quo has always been to add another string dtype. With PDEP-14 we now have a "compatibility" string, ``pd.StringDtype("python|pyarrow", na_value=np.nan)``, that makes a best effort to move users towards all the benefits of PyArrow strings (assuming pyarrow is installed) while retaining backwards-compatible missing value handling with ``np.nan`` as the missing value marker. Using ``pd.StringDtype`` in this manner is a good stepping stone towards the goals of this PDEP, although it is stuck in an "in-between" state without other types following suit.

-PDEP-14 has been proposed to smooth over that and change our ``pd.StringDtype()`` to be an alias for ``string[pyarrow_numpy]``. This would at least offer some abstraction to end users who just want strings, but on the flip side would be breaking behavior for users that have already opted into ``dtype="string"`` or ``dtype=pd.StringDtype()`` and the related pd.NA missing value marker for the prior 4 years of their existence.
+For instance, if a user calls ``Series.value_counts()`` on a Series with ``pd.StringDtype()``, the type of the returned Series can vary wildly, and in non-obvious ways:

-A logical type system can help us abstract all of these issues. At the end of the day, this PDEP assumes a user wants a string data type. If they call ``Series.str.len()`` against a Series of that type with missing data, they should get back a Series with an integer data type.
+```python
+>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=pd.NA)).value_counts().dtype
+Int64Dtype()
+>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=pd.NA)).value_counts().dtype
+int64[pyarrow]
+>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=np.nan)).value_counts().dtype
+Int64Dtype()
+>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).value_counts().dtype
+dtype('int64')
+```
+
+It is also worth noting that different methods will return different data types. For a pyarrow-backed string with pd.NA, ``Series.value_counts()`` returns an ``int64[pyarrow]`` but ``Series.str.len()`` returns a ``pd.Int64Dtype()``.
+
+A logical type system can help us abstract all of these issues. At the end of the day, this PDEP assumes a user wants a string data type. If they call ``Series.value_counts()`` against a Series of that type with missing data, they should get back a Series with an integer data type.
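
To make that note concrete, a minimal illustration (assuming a pyarrow-enabled environment; the dtypes shown follow the behavior described above):

```python
>>> import pandas as pd
>>> s = pd.Series(["x", None], dtype=pd.StringDtype("pyarrow", na_value=pd.NA))
>>> s.value_counts().dtype   # integer result backed by pyarrow
int64[pyarrow]
>>> s.str.len().dtype        # integer result backed by the masked extension array
Int64Dtype()
```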

 ### Problem 2: Inconsistent Constructors

@@ -69,7 +84,7 @@ It would stand to reason in this approach that you could use a ``pd.DatetimeDtyp

 ### Problem 3: Lack of Clarity on Type Support

-The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)).
+The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another example is NumPy fixed-length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).

 ## Assessing the Current Type System(s)

@@ -84,7 +99,7 @@ Derived from the hierarchical visual in the previous section, this PDEP proposes
 - Signed Integer
 - Unsigned Integer
 - Floating Point
-- Fixed Point
+- Decimal
 - Boolean
 - Date
 - Datetime
@@ -93,7 +108,7 @@ Derived from the hierarchical visual in the previous section, this PDEP proposes
 - Period
 - Binary
 - String
-- Map
+- Dict
 - List
 - Struct
 - Interval
@@ -120,10 +135,11 @@ To satisfy all of the types highlighted above, this would require the addition o
 - pd.Duration()
 - pd.CalendarInterval()
 - pd.BinaryDtype()
-- pd.MapDtype() # or pd.DictDtype()
+- pd.DictDtype()
 - pd.ListDtype()
 - pd.StructDtype()
 - pd.ObjectDtype()
+- pd.NullDtype()

 The storage / backend to each of these types is left as an implementation detail. The fact that ``pd.StringDtype()`` may be backed by Arrow while ``pd.PeriodDtype()`` continues to be a custom solution is of no concern to the end user. Over time this will allow us to adopt more Arrow behind the scenes without breaking the front end for our end users, but _still_ giving us the flexibility to produce data types that Arrow will not implement (e.g. ``pd.ObjectDtype()``).

@@ -137,43 +153,24 @@ The methods of each logical type are expected in turn to yield another logical t

 The ``Series.dt.date`` example is worth an extra look - with a PDEP-13 logical type system in place we would theoretically have the ability to keep our default ``pd.DatetimeDtype()`` backed by our current NumPy-based array but leverage pyarrow for the ``Series.dt.date`` solution, rather than having to implement a DateArray ourselves.

-While this PDEP proposes reusing existing extension types, it also necessitates extending those types with extra metadata:
+To implement this PDEP, we expect all of the logical types to have at least the following metadata:

-```python
-class BaseType:
-
-    @property
-    def data_manager -> Literal["numpy", "pyarrow"]:
-        """
-        Who manages the data buffer - NumPy or pyarrow
-        """
-        ...
-
-    @property
-    def physical_type:
-        """
-        For logical types which may have different implementations, what is the
-        actual implementation? For pyarrow strings this may mean pa.string() versus
-        pa.large_string() versrus pa.string_view(); for NumPy this may mean object
-        or their 2.0 string implementation.
-        """
-        ...
-
-    @property
-    def na_marker -> pd.NA|np.nan|pd.NaT:
-        """
-        Sentinel used to denote missing values
-        """
-        ...
-```
+* storage: Either "numpy" or "pyarrow". Describes the library used to create the data buffer.
+* physical_type: Can expose the physical type being used. As an example, a ``StringDtype`` could return ``pa.string_view``.
+* na_value: Either pd.NA, np.nan, or pd.NaT.

-``na_marker`` is expected to be read-only (see next section). For advanced users that have a particular need for a storage type, they may be able to construct the data type via ``pd.StringDtype(data_manager=np)`` to assert NumPy managed storage. While the PDEP allows constructing in this fashion, operations against that data make no guarantees that they will respect the storage backend and are free to convert to whichever storage the internals of pandas considers optimal (Arrow will typically be preferred).
+While these attributes are exposed as construction arguments to end users, users are highly discouraged from trying to control them directly. Put explicitly, this PDEP allows a user to construct ``pd.XXXDtype(storage="numpy")`` to request a NumPy-backed array, if possible. While pandas may respect that during construction, operations against that data make no guarantees that the storage backend will be persisted through, giving pandas the freedom to convert to whichever storage is internally optimal (Arrow will typically be preferred).
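
As a rough sketch, the existing ``pd.StringDtype`` already exposes the ``storage`` and ``na_value`` pieces of this metadata; ``physical_type`` and the extension of this pattern to other logical types remain proposals of this PDEP:

```python
import numpy as np
import pandas as pd

# The existing string dtype already carries the storage and na_value metadata
# described above; under this proposal other logical types would do the same.
dtype = pd.StringDtype(storage="pyarrow", na_value=pd.NA)
print(dtype.storage)   # pyarrow
print(dtype.na_value)  # <NA>

# Requesting a particular storage is a hint, not a guarantee; pandas remains free
# to convert to whichever backend it considers optimal for later operations.
compat_dtype = pd.StringDtype(storage="python", na_value=np.nan)
print(compat_dtype.storage, compat_dtype.na_value)  # python nan
```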

 ### Missing Value Handling

-Missing value handling is a tricky area as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear.
+Missing value handling is a tricky area, as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear. This PDEP does not aim to "solve" that issue per se (for that discussion, please refer to PDEP-16), but aims to provide a go-forward path that strikes a reasonable balance between backwards compatibility and a consistent missing value approach in the future.
+
+This PDEP proposes that the default missing value for logical types is ``pd.NA``. The reasoning is two-fold:
+
+1. We are in many cases re-using extension types as logical types, which mostly use pd.NA (``StringDtype`` and datetimes are the exceptions)
+2. For new logical types that have nothing to do with NumPy, using np.nan as a missing value marker is an odd fit

-Because this PDEP proposes reuse of the existing pandas extension type system, the default missing value marker will consistently be ``pd.NA``. However, to help with backwards compatibility for users that heavily rely on the equality semantics of np.nan, an option of ``pd.na_marker = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be:
+However, to help with backwards compatibility for users that heavily rely on the semantics of ``np.nan`` or ``pd.NaT``, an option of ``pd.na_value = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be:

 | Logical Type | Default Missing Value | Legacy Missing Value |
 | pd.BooleanDtype() | pd.NA | np.nan |
@@ -182,17 +179,18 @@ Because this PDEP proposes reuse of the existing pandas extension type system, t
 | pd.StringDtype() | pd.NA | np.nan |
 | pd.DatetimeType() | pd.NA | pd.NaT |

-However, all data types for which there is no legacy NumPy-backed equivalent will continue to use ``pd.NA``, even in "legacy" mode. Legacy is provided only for backwards compatibility, but pd.NA usage is encouraged going forward to give users one exclusive missing value indicator.
+However, all data types for which there is no legacy NumPy-backed equivalent will continue to use ``pd.NA``, even in "legacy" mode. Legacy is provided only for backwards compatibility, but ``pd.NA`` usage is encouraged going forward to give users one exclusive missing value indicator and better align with the goals of PDEP-16.
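
A minimal sketch of the proposed opt-in; ``pd.na_value = "legacy"`` is taken from the text above and is not an existing pandas setting, so the last line is illustrative only:

```python
import pandas as pd

# Proposed default: logical types consistently expose pd.NA for missing values.
s = pd.Series([1, None], dtype=pd.Int64Dtype())
assert s[1] is pd.NA

# Hypothetical opt-in to the legacy markers from the table above; after this,
# types with a legacy NumPy equivalent would surface np.nan / pd.NaT instead.
pd.na_value = "legacy"
```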

 ### Transitioning from Current Constructors

-To maintain a consistent path forward, _all_ constructors with the implementation of this PDEP are expected to map to the logical types. This means that providing ``np.int64`` as the data type argument makes no guarantee that you actually get a NumPy managed storage buffer; pandas reserves the right to optimize as it sees fit and may decide instead to just pyarrow.
+To maintain a consistent path forward, _all_ constructors with the implementation of this PDEP are expected to map to the logical types. This means that providing ``np.int64`` as the data type argument makes no guarantee that you actually get a NumPy-managed storage buffer; pandas reserves the right to optimize as it sees fit and may decide instead to use PyArrow.

 The theory behind this is that the majority of users are not expecting anything particular from NumPy to happen when they say ``dtype=np.int64``. The expectation is that a user just wants _integer_ data, and the ``np.int64`` specification owes to the legacy of pandas' evolution.
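
A short sketch of that expectation; the comments describe the proposed behavior, not current pandas:

```python
import numpy as np
import pandas as pd

# Today this is guaranteed to give a NumPy int64 buffer.  Under this proposal the
# same spelling would only guarantee the *logical* integer type, leaving the
# storage backend (NumPy or PyArrow) up to pandas.
s = pd.Series([1, 2, 3], dtype=np.int64)
print(s.dtype)  # int64 today; a logical integer dtype under this proposal
```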

-This PDEP makes no guarantee that we will stay that way forever; it is certainly reasonable that a few years down the road we deprecate and fully stop support for backend-specifc constructors like ``np.int64`` or ``pd.ArrowDtype(pa.int64())``. However, for the execution of this PDEP, such an initiative is not in scope.
+This PDEP makes no guarantee that we will stay that way forever; it is certainly reasonable that, in the future, we deprecate and fully stop supporting backend-specific constructors like ``np.int64`` or ``pd.ArrowDtype(pa.int64())``. However, for the execution of this PDEP, such an initiative is not in scope.

-## PDEP-11 History
+## PDEP-13 History

 - 27 April 2024: Initial version
 - 10 May 2024: First revision
+- 01 Aug 2024: Revisions for PDEP-14 and PDEP-16
