@@ -19,7 +19,7 @@ default in pandas 3.0:
This will give users a long-awaited proper string dtype for 3.0, while 1) not
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
and 2) leaving room for future improvements (different missing value semantics,
- using NumPy 2.0, etc).
+ using NumPy 2.0 strings, etc).

## Background

@@ -74,7 +74,7 @@ reconsideration:
runtime dependency. In addition, NumPy 2.0 could in the future potentially
reduce the need to make PyArrow a required dependency specifically for a
dedicated pandas string dtype.
- - The PDEP did not consider the usage of the experimental `pd.NA` as a
+ - PDEP-10 did not consider the usage of the experimental `pd.NA` as a
consequence of adopting one of the existing implementations of the
`StringDtype`.

@@ -88,23 +88,23 @@ At the time, the `storage` option for this new variant was called
`pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming"
subsection below).

- This last dtype variant is what you currently (pandas 2.2) get for string data
+ This last dtype variant is what users currently (pandas 2.2) get for string data
when enabling the ``future.infer_string`` option (to enable the behaviour which
is intended to become the default in pandas 3.0).

## Proposal

To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

- 1. For pandas 3.0, we enable a "string" dtype by default, which will use PyArrow
+ 1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow
if installed, and otherwise falls back to an in-house functionally-equivalent
(but slower) version.
2. This default "string" dtype will follow the same behaviour for missing values
- as our other default data types, and use `NaN` as the missing value sentinel.
+ as other default data types, and use `NaN` as the missing value sentinel.
3. The version that is not backed by PyArrow can reuse (with minor code
additions) the existing numpy object-dtype backed StringArray for its
implementation.
- 4. We update installation guidelines to clearly encourage users to install
+ 4. Installation guidelines are updated to clearly encourage users to install
pyarrow for the default user experience.

Those string dtypes enabled by default will then no longer be considered as
@@ -145,7 +145,7 @@ that:
nullable `'Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
dtype (or `float64` in case of missing values)).

- However, up to this date, all other default data types still use NaN semantics
+ However, up to this date, all other default data types still use `NaN` semantics
for missing values. Therefore, this proposal says that a new default string
dtype should also still use the same default missing value semantics and return
default data types when doing operations on the string column, to be consistent
@@ -176,9 +176,10 @@ needs minor changes to follow the above-mentioned missing value semantics
([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).

For pandas 3.0, this is the most realistic option given this implementation has
- already been available for a long time. Beyond 3.0, we can still explore further
+ already been available for a long time. Beyond 3.0, further
improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503))
- or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)),
+ or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552))
+ can still be explored,
but at that point that is an implementation detail that should not have a
direct impact on users (except for performance).

@@ -187,7 +188,7 @@ direct impact on users (except for performance).
Given the long history of this topic, the naming of the dtypes is a difficult
topic.

- In the first place, we need to acknowledge that most users should not need to
+ In the first place, it should be acknowledged that most users should not need to
use storage-specific options. Users are expected to specify `pd.StringDtype()`
or `"string"`, and that will give them their default string dtype (which
depends on whether PyArrow is installed or not).
@@ -201,8 +202,8 @@ Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
the `"pyarrow_numpy"` storage was used to disambiguate from the existing
`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather
confusing option and doesn't generalize well. Therefore, this PDEP proposes
- a new naming scheme as outlined below, and we will deprecate and remove
- "pyarrow_numpy" before pandas 3.0.
+ a new naming scheme as outlined below, and
+ "pyarrow_numpy" will be deprecated and removed before pandas 3.0.

The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
storage of the string data (using pyarrow or python objects), but an additional
@@ -227,12 +228,12 @@ Notes:

- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
- (2) Those three rows are backwards incompatible (i.e. they work now but give
- you the NA-variant), see the "Backward compatibility" section below.
+ the NA-variant), see the "Backward compatibility" section below.
- (3) "pyarrow_numpy" is kept temporarily because this is already in a released
version, but we can deprecate it in 2.2.x and have it removed for 3.0.

For the new default string dtype, only the `"string"` alias can be used to
- specify the dtype as a string, i.e. we would not provide a way to make the
+ specify the dtype as a string, i.e. a way would not be provided to make the
underlying storage (pyarrow or python) explicit through the string alias. This
string alias is only a convenience shortcut and for most users `"string"` is
sufficient (they don't need to specify the storage), and the explicit
@@ -245,23 +246,23 @@ sufficient (they don't need to specify the storage), and the explicit
To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
the default missing value sentinel? using the new NumPy 2.0 capabilities?
- overhauling all our dtypes to use a logical data type system?), we could also
- delay introducing a default string dtype until there is more clarity in those
+ overhauling all our dtypes to use a logical data type system?),
+ introducing a default string dtype could also be delayed until there is more clarity in those
other discussions.

However:

1. Delaying has a cost: it further postpones introducing a dedicated string
- dtype that has massive benefits for our users, both in usability as (for the
+ dtype that has massive benefits for users, both in usability as (for the
significant part of the user base that has PyArrow installed) in performance.
- 2. In case we eventually transition to use `pd.NA` as the default missing value
- sentinel, we will need a migration path for _all_ our data types, and thus
+ 2. In case pandas eventually transitions to use `pd.NA` as the default missing value
+ sentinel, a migration path for _all_ our data types will be needed, and thus
the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

- Making this change now for 3.0 will benefit the majority of our users, while
+ Making this change now for 3.0 will benefit the majority of users, while
coming at a cost for a part of the users who already started using the
- `"string"` dtype (they will have to update their code to continue to the variant
+ `"string"` or `pd.StringDtype()` dtype (they will have to update their code to continue to the variant
using `pd.NA`, see the "Backward compatibility" section below).

### Why not use the existing StringDtype with `pd.NA`?
@@ -302,10 +303,10 @@ The most visible backwards incompatible change will be that columns with string
data will no longer have an `object` dtype. Therefore, code that assumes
`object` dtype (such as `ser.dtype == object`) will need to be updated. This
change is done as a hard break in a major release, as warning in advance for the
- changed inference is deemed to noisy.
+ changed inference is deemed too noisy.

- To allow testing your code in advance, the
- `pd.options.future.infer_string = True` option is available.
+ To allow testing code in advance, the
+ `pd.options.future.infer_string = True` option is available for users.

Otherwise, the actual string-specific functionality (such as the `.str` accessor
methods) should generally all keep working as is. By preserving the current
@@ -339,12 +340,12 @@ numpy dtypes (see the "Missing value semantics" section above).
While this change will be transparent in many cases (e.g. checking for missing
values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
a string predicate method keeps working regardless of the sentinel), this can be
- a breaking change if you relied on the exact sentinel or resulting dtype. Since
+ a breaking change if users relied on the exact sentinel or resulting dtype. Since
pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
that many users already have started using this dtype, even though officially
still labeled as "experimental".

- To smooth the upgrade experience for those users, we propose to add a
+ To smooth the upgrade experience for those users, it is proposed to add a
deprecation warning before 3.0 when such dtype is created, giving them two
options:

@@ -368,7 +369,7 @@ Some small enhancements or fixes might still be needed and can continue to be
backported to pandas 2.2.x.

The variant using numpy object-dtype can also be backported to the 2.2.x branch
- to allow easier testing. It is proposed to release this as 2.3.0 (created from
+ to allow easier testing. It is proposed to release this as 2.3.0 (created from
the 2.2.x branch, given that the main branch already includes many other changes
targeted for 3.0), together with the deprecation warning when creating a dtype
from `"string"` / `pd.StringDtype()`.
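
For readers of this diff, the fallback rule stated in the Proposal hunk (use PyArrow if it is installed, otherwise the slower implementation backed by Python objects) can be illustrated with a minimal sketch. This is not pandas' actual code: `default_string_storage` is a hypothetical helper name, and approximating "installed" by whether the `pyarrow` package can be located is an assumption made for the example.

```python
import importlib.util


def default_string_storage() -> str:
    """Hypothetical helper: name the storage a default "string" dtype would pick."""
    # Assumption: "PyArrow is installed" is approximated by whether the
    # `pyarrow` package can be located, without actually importing it.
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    # Fallback described in the proposal: the slower variant backed by
    # Python objects.
    return "python"


print(default_string_storage())
```

The two return values mirror the "pyarrow" and "python" storage names used in the Naming hunk; which one is printed depends on the local environment.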