Commit fbeb69d

PDEP: Dedicated string data type for pandas 3.0
1 parent a2bce66 commit fbeb69d

1 file changed

+245
-0
lines changed

web/pandas/pdeps/00xx-string-dtype.md

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
1+
# PDEP-XX: Dedicated string data type for pandas 3.0
2+
3+
- Created: May 3, 2024
4+
- Status: Under discussion
5+
- Discussion:
6+
- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
7+
- Revision: 1
8+
9+
## Abstract
10+
11+
This PDEP proposes to introduce a dedicated string dtype that will be used by
12+
default in pandas 3.0:
13+
14+
* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
15+
or otherwise the numpy object-dtype alternative.
16+
* The default string dtype will use missing value semantics using NaN consistent
17+
with the other default data types.
18+
19+
This will give users a long-awaited proper string dtype for 3.0, while 1) not
20+
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
21+
and 2) leaving room for future improvements (different missing value semantics,
22+
using NumPy 2.0, etc).
23+
24+
# Dedicated string data type for pandas 3.0
25+
26+
## Background
27+
28+
Currently, pandas by default stores text data in an `object`-dtype NumPy array.
29+
The current implementation has two primary drawbacks: First, `object`-dtype is
30+
not specific to strings: any Python object can be stored in an `object`-dtype
31+
array, not just strings, and seeing `object` as the dtype for a column with
32+
strings is confusing for users. Second: this is not efficient (all string
33+
methods on a Series are eventually done by calling Python methods on the
34+
individual string objects).
35+
36+
To solve the first issue, a dedicated extension dtype for string data has
37+
already been
38+
[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type).
39+
So far this has been opt-in, requiring users to explicitly request the
40+
dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing
41+
this string dtype was initially almost the same as the default implementation,
42+
i.e. an `object`-dtype NumPy array of Python strings.
43+
44+
To solve the second issue (performance), pandas contributed to the development
45+
of string kernels in the PyArrow package, and a variant of the string dtype
46+
backed by PyArrow was
47+
[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type).
48+
This could be specified with the `storage` keyword in the opt-in string dtype
49+
(`pd.StringDtype(storage="pyarrow")`).
50+
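As a minimal sketch of the opt-in spellings mentioned above, the snippet below requests the string dtype explicitly and, where PyArrow is usable, the PyArrow-backed storage. The `try`/`except` guard is an assumption added here so the example also runs without PyArrow; it is not part of the proposal.

```python
import pandas as pd

# Explicit opt-in through the "string" alias.
ser = pd.Series(["a", "b", None], dtype="string")
print(type(ser.dtype).__name__, ser.isna().tolist())

# PyArrow-backed storage, requested through the `storage` keyword.
# Guarded because PyArrow is an optional dependency.
try:
    ser_pa = pd.Series(["a", "b", None], dtype=pd.StringDtype(storage="pyarrow"))
    pa_storage = ser_pa.dtype.storage
except ImportError:
    pa_storage = None  # PyArrow is missing or too old for this pandas version
print(pa_storage)
```

Without PyArrow, only the first series is created and `pa_storage` stays `None`.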
51+
Since its introduction, the `StringDtype` has always been opt-in, and has used
52+
the experimental `pd.NA` sentinel for missing values (which was also [introduced
53+
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
54+
However, to date, pandas has not yet taken the step of using `pd.NA` by
55+
default.
56+
57+
In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
58+
proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
59+
(i.e. infer this type for string data instead of object dtype). To ensure we
60+
could use the variant of `StringDtype` backed by PyArrow instead of Python
61+
objects (for better performance), it proposed to make `pyarrow` a new required
62+
runtime dependency of pandas.
63+
64+
In the meantime, NumPy has also been working on a native variable-width string
65+
data type, which will be available [starting with NumPy
66+
2.0](https://numpy.org/devdocs/release/2.0.0-notes.html#stringdtype-has-been-added-to-numpy).
67+
This can provide a potential alternative to PyArrow for implementing a string
68+
data type in pandas that is not backed by Python objects.
69+
70+
After acceptance of PDEP-10, two aspects of the proposal have been under
71+
reconsideration:
72+
73+
- Based on user feedback, we have considered relaxing the new `pyarrow`
74+
requirement so that it is not a _hard_ runtime dependency. In addition, NumPy 2.0 can
75+
potentially reduce the need to make PyArrow a required dependency specifically
76+
for a dedicated pandas string dtype.
77+
- The PDEP did not consider the usage of the experimental `pd.NA` as a
78+
consequence of adopting one of the existing implementations of the
79+
`StringDtype`.
80+
81+
For the second aspect, another variant of the `StringDtype` was
82+
[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings)
83+
that is still backed by PyArrow but follows the default missing values semantics
84+
pandas uses for all other default data types (and using `NaN` as the missing
85+
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).
86+
At the time, the `storage` option for this new variant was called
87+
`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using `pd.NA`.
88+
89+
This last dtype variant is what you currently (pandas 2.2) get for string data
90+
when enabling the ``future.infer_string`` option (to enable the behaviour which
91+
is intended to become the default in pandas 3.0).
92+
93+
## Proposal
94+
95+
To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:
96+
97+
1. For pandas 3.0, we enable a "string" dtype by default, which will use PyArrow
98+
if installed, and otherwise falls back to an in-house functionally-equivalent
99+
(but slower) version.
100+
2. This default "string" dtype will follow the same behaviour for missing values
101+
as our other default data types, and use `NaN` as the missing value sentinel.
102+
3. The version that is not backed by PyArrow can reuse the existing numpy
103+
object-dtype backed StringArray for its implementation.
104+
4. We update installation guidelines to clearly encourage users to install
105+
pyarrow for the default user experience.
106+
107+
### Default inference of a string dtype
108+
109+
By default, pandas will infer this new string dtype for string data (when
110+
creating pandas objects, such as in constructors or IO functions).
111+
112+
The existing `future.infer_string` option can be used to opt in to the future
113+
default behaviour:
114+
115+
```python
116+
>>> pd.options.future.infer_string = True
117+
>>> pd.Series(["a", "b", None])
118+
0      a
119+
1      b
120+
2    NaN
121+
dtype: string
122+
```
123+
124+
This option will be expanded to also work when PyArrow is not installed.
125+
126+
### Missing value semantics
127+
128+
Given that all other default data types use NaN semantics for missing values,
129+
this proposal says that a new default string dtype should still use the same
130+
default semantics. Further, it should result in default data types when doing
131+
operations on the string column that result in a boolean or numeric data type
132+
(e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison
133+
operators like `==`, should result in default `int64` and `bool` data types).
134+
135+
Because the existing `StringDtype` implementations already use `pd.NA`
136+
and return masked integer and boolean arrays in operations, a new variant of the
137+
existing dtypes that uses `NaN` and default data types is needed.
138+
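To make the difference concrete, the following sketch compares the result dtypes of the existing opt-in `StringDtype` (using `pd.NA`) with those of an `object`-dtype column. It reflects behaviour as we understand it in pandas 2.x; the proposed new variant is not constructed here because its spelling is still under discussion.

```python
import pandas as pd

values = ["apple", "kiwi", None]

# Existing opt-in StringDtype (pd.NA semantics): results use masked dtypes.
na_ser = pd.Series(values, dtype="string")
na_len = na_ser.str.len()
na_starts = na_ser.str.startswith("a")
print(na_len.dtype, na_starts.dtype)

# Object dtype (NaN semantics): the length result is a NumPy float64 array,
# because the missing entry is represented as NaN.
obj_ser = pd.Series(values, dtype=object)
obj_len = obj_ser.str.len()
print(obj_len.dtype)
```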
139+
### Object-dtype "fallback" implementation
140+
141+
To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
142+
a "fallback" option in case PyArrow is not installed. The original `StringDtype`
143+
backed by a numpy object-dtype array of Python strings can be used for this, and
144+
only needs minor updates to follow the above-mentioned missing value semantics
145+
([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).
146+
147+
For pandas 3.0, this is the most realistic option given this implementation has
148+
been available for a long time. Beyond 3.0, we can still explore further
149+
improvements such as using nanoarrow or NumPy 2.0, but at that point that is an
150+
implementation detail that should not have a direct impact on users (except for
151+
performance).
152+
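The fallback idea can be sketched as a simple availability check. The helper `pick_string_storage` below is hypothetical and only illustrates the selection rule; the actual logic inside pandas may differ, for example by also checking the installed PyArrow version.

```python
import importlib.util


def pick_string_storage() -> str:
    """Return "pyarrow" if the package can be found, otherwise "python".

    Hypothetical helper for illustration; not part of the pandas API.
    """
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    return "python"


print(pick_string_storage())
```

The two return values match the existing `storage` options of `pd.StringDtype`, so the result could be passed on as `pd.StringDtype(storage=...)`.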
153+
### Naming
154+
155+
Given the long history of this topic, the naming of the dtypes is a difficult
156+
topic.
157+
158+
In the first place, we need to acknowledge that most users should not need to
159+
use storage-specific options. Users are expected to specify `pd.StringDtype()`
160+
or `"string"`, and that will give them their default string dtype (which
161+
depends on whether PyArrow is installed or not).
162+
163+
But for testing purposes and advanced use cases that want control over this, we
164+
need some way to specify this and distinguish them from the other string dtypes.
165+
Currently, the `StringDtype(storage="pyarrow_numpy")` is used, where
166+
"pyarrow_numpy" is a rather confusing option.
167+
168+
TODO see if we can come up with a better naming scheme
169+
170+
## Alternatives
171+
172+
### Why not delay introducing a default string dtype?
173+
174+
To avoid introducing a new string dtype while other discussions and changes are
175+
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
176+
the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
177+
could also delay introducing a default string dtype until there is more clarity
178+
for those other discussions.
179+
180+
However:
181+
182+
1. Delaying has a cost: it further postpones introducing a dedicated string
183+
dtype that has massive benefits for our users, both in usability and (for the
184+
significant part of the user base that has PyArrow installed) in performance.
185+
2. In case we eventually transition to use `pd.NA` as the default missing value
186+
sentinel, we will need a migration path for _all_ our data types, and thus
187+
the challenges around this will not be unique to the string dtype.
188+
189+
### Why not use the existing StringDtype with `pd.NA`?
190+
191+
Because adding even more variants of the string dtype will only make things more
192+
confusing? Indeed, this proposal unfortunately introduces more variants of the
193+
string dtype. However, the reason for this is to ensure the actual default user
194+
experience is _less_ confusing, and the new string dtype fits better with the
195+
other default data types.
196+
197+
If the new default string data type would use `pd.NA`, then after some
198+
operations, a user can easily end up with a DataFrame that mixes columns using
199+
`NaN` semantics and columns using `NA` semantics (and thus a DataFrame that
200+
could have columns with two different int64, two different float64, two different
201+
bool, etc dtypes). This would lead to a very confusing default experience.
202+
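A small sketch of how such mixing can arise with the existing `pd.NA`-based dtype, assuming pandas 2.x behaviour: a DataFrame that starts with a default integer column gains a second, masked integer dtype after one string operation.

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3]})
df["name"] = pd.Series(["ann", "bob", None], dtype="string")

# A string method on the pd.NA-based column returns a masked integer column.
df["name_len"] = df["name"].str.len()

dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtypes["n"], dtypes["name_len"])
```

Here `n` keeps the default NumPy-backed integer dtype while `name_len` uses the masked one, so the two integer columns follow different missing value semantics.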
203+
The proposed new variant of the StringDtype ensures that for the
204+
_default_ experience, a user will see only one kind of integer dtype, only one
205+
kind of bool dtype, etc. For now, a user should only get columns with an
206+
`ArrowDtype` and/or using `pd.NA` when explicitly opting into this.
207+
208+
## Backward compatibility
209+
210+
The most visible backwards incompatible change will be that columns with string
211+
data will no longer have an `object` dtype. Therefore, code that assumes
212+
`object` dtype (such as `ser.dtype == object`) will need to be updated.
213+
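One possible way to update such a check, shown as a sketch: accept both the legacy `object` dtype and any `StringDtype` instance. The helper name `is_text_column` is hypothetical and not part of pandas.

```python
import pandas as pd


def is_text_column(ser: pd.Series) -> bool:
    """Hypothetical helper: True for object dtype or a dedicated string dtype.

    Note that object dtype can hold any Python object, so the first branch
    is only a heuristic for "this column probably holds strings".
    """
    return ser.dtype == object or isinstance(ser.dtype, pd.StringDtype)


print(is_text_column(pd.Series(["a", "b"], dtype=object)))
print(is_text_column(pd.Series(["a", "b"], dtype="string")))
print(is_text_column(pd.Series([1, 2])))
```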
214+
To allow testing your code in advance, the
215+
`pd.options.future.infer_string = True` option is available.
216+
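For testing, the option can also be scoped with `pd.option_context` so that it does not leak into unrelated code. The sketch below is an illustration only: the helper `inferred_string_dtype` is hypothetical, and the `except` clause assumes that a missing PyArrow raises `ImportError` and that an unknown option raises an error derived from `KeyError`.

```python
import pandas as pd


def inferred_string_dtype(values):
    """Return the dtype inferred with future.infer_string enabled.

    Hypothetical helper; returns None when the option or PyArrow is
    unavailable in the installed pandas version.
    """
    try:
        with pd.option_context("future.infer_string", True):
            return pd.Series(values).dtype
    except (ImportError, KeyError):
        return None


print(inferred_string_dtype(["a", "b", None]))
```

The option is restored to its previous value when the `with` block exits, even if an exception is raised inside it.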
217+
Otherwise, the actual string-specific functionality (such as the `.str` accessor
218+
methods) should all keep working as is. By preserving the current missing value
219+
semantics, this proposal is also backwards compatible on this aspect.
220+
221+
One other backwards incompatible change is present for early adopters of the
222+
existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start
223+
returning the new default string dtype, while up to now this returned the
224+
experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users
225+
will need to start specifying a keyword in the dtype constructor if they want to
226+
keep using `pd.NA` (but if they just want to have a dedicated string dtype, they
227+
don't need to change their code).
228+
229+
## Timeline
230+
231+
The future PyArrow-backed string dtype was already made available behind a feature
232+
flag in pandas 2.1 (by `pd.options.future.infer_string = True`).
233+
234+
Some small enhancements or fixes (or naming changes) might still be needed and
235+
can be backported to pandas 2.2.x.
236+
237+
The variant using numpy object-dtype could potentially also be backported to
238+
2.2.x to allow easier testing.
239+
240+
For pandas 3.0, this flag becomes enabled by default.
241+
242+
243+
## PDEP-XX History
244+
245+
- 3 May 2024: Initial version
