@@ -12,7 +12,7 @@ This PDEP proposes to introduce a dedicated string dtype that will be used by
 default in pandas 3.0:

 * In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
-  or otherwise the numpy object-dtype alternative.
+  or otherwise a string dtype using numpy object-dtype under the hood as fallback.
 * The default string dtype will use missing value semantics (using NaN) consistent
   with the other default data types.

@@ -69,11 +69,11 @@ data type in pandas that is not backed by Python objects.
 After acceptance of PDEP-10, two aspects of the proposal have been under
 reconsideration:

-- Based on user feedback (mostly around installation complexity and size), it
-  has been considered to relax the new `pyarrow` requirement to not be a _hard_
-  runtime dependency. In addition, NumPy 2.0 could in the future potentially
-  reduce the need to make PyArrow a required dependency specifically for a
-  dedicated pandas string dtype.
+- Based on feedback from users and maintainers from other packages (mostly
+  around installation complexity and size), it has been considered to relax the
+  new `pyarrow` requirement to not be a _hard_ runtime dependency. In addition,
+  NumPy 2.0 could in the future potentially reduce the need to make PyArrow a
+  required dependency specifically for a dedicated pandas string dtype.
 - PDEP-10 did not consider the usage of the experimental `pd.NA` as a
   consequence of adopting one of the existing implementations of the
   `StringDtype`.
@@ -250,22 +250,24 @@ in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
 the default missing value sentinel? using the new NumPy 2.0 capabilities?
 overhauling all our dtypes to use a logical data type system?), introducing a
 default string dtype could also be delayed until there is more clarity in those
-other discussions.
+other discussions. Specifically, it would avoid temporarily switching to use
+`NaN` for the string dtype, while in a future version we might switch back
+to `pd.NA` by default.

 However:

 1. Delaying has a cost: it further postpones introducing a dedicated string
    dtype that has massive benefits for users, both in usability as (for the
-   significant part of the user base that has PyArrow installed) in performance.
+   part of the user base that has PyArrow installed) in performance.
 2. In case pandas eventually transitions to use `pd.NA` as the default missing value
    sentinel, a migration path for _all_ pandas data types will be needed, and thus
    the challenges around this will not be unique to the string dtype and
    therefore not a reason to delay this.

 Making this change now for 3.0 will benefit the majority of users, while coming
 at a cost for a part of the users who already started using the `"string"` or
-`pd.StringDtype()` dtype (they will have to update their code to continue to the
-variant using `pd.NA`, see the "Backward compatibility" section below).
+`pd.StringDtype()` dtype (they will have to update their code to continue to use
+the variant using `pd.NA`, see the "Backward compatibility" section below).

 ### Why not use the existing StringDtype with `pd.NA`?

@@ -311,9 +313,14 @@ To allow testing code in advance, the
 `pd.options.future.infer_string = True` option is available for users.

 Otherwise, the actual string-specific functionality (such as the `.str` accessor
-methods) should generally all keep working as is. By preserving the current
-missing value semantics, this proposal is also backwards compatible on this
-aspect.
+methods) should generally all keep working as is.
+
+By preserving the current missing value semantics, this proposal is also mostly
+backwards compatible on this aspect. When storing strings in object dtype, pandas
+however did allow using `None` as the missing value indicator as well (and in
+certain cases such as the `shift` method, pandas even introduced this itself).
+For all the cases where currently `None` was used as the missing value sentinel,
+this will change to use `NaN` consistently.

 ### For existing users of `StringDtype`

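The hunk above notes that object-dtype columns currently accept both `None` and `NaN` as missing-value markers. A minimal sketch of that starting point, assuming only long-standing pandas behavior (`isna()` and the existing `"string"` alias); the variable names are illustrative, and the sentinel used after conversion depends on the installed pandas version, so the sketch only checks `isna()` rather than a specific sentinel:

```python
import numpy as np
import pandas as pd

# Object-dtype column that mixes the two missing-value markers.
s = pd.Series(["a", None, np.nan], dtype=object)

# Both markers are reported as missing.
missing_mask = s.isna().tolist()

# Both sentinel objects are stored as-is in the object array.
n_none = sum(v is None for v in s)
n_nan = sum(isinstance(v, float) and np.isnan(v) for v in s)

# Converting to a dedicated string dtype keeps the same positions missing;
# which sentinel represents them (pd.NA or NaN) depends on the pandas version.
converted = s.astype("string")
converted_mask = converted.isna().tolist()
```

Code that tests `value is None` on string columns is the kind of code affected by the change described in the hunk; checks based on `isna()` are not sensitive to which sentinel is stored.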
@@ -359,17 +366,18 @@ options:
 - If the user specifically wants the variant of the string dtype that uses
   `pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
   have to update their dtype specification from `"string"` / `pd.StringDtype()`
-  to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep
-  their code running as is.
+  to `"String"` / `pd.StringDtype(na_value=pd.NA)` to suppress the warning and
+  further keep their code running as is.
+
+A `"String"` alias (capitalized) would be added to make it easier for users to
+continue using the variant using `pd.NA`, and such capitalized string alias is
+consistent with other nullable dtypes (`"float64"` vs `"Float64"`).

 ## Timeline

 The future PyArrow-backed string dtype was already made available behind a feature
 flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`).

-Some small enhancements or fixes might still be needed and can continue to be
-backported to pandas 2.2.x.
-
 The variant using numpy object-dtype can also be backported to the 2.2.x branch
 to allow easier testing. It is proposed to release this as 2.3.0 (created from
 the 2.2.x branch, given that the main branch already includes many other changes
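The patch motivates the capitalized `"String"` alias by analogy with existing nullable dtypes. A short sketch of that existing convention, using only the `"float64"` and `"Float64"` aliases that pandas already ships; the `"String"` alias itself is only proposed in this diff and is deliberately not used here:

```python
import math
import pandas as pd

# Lowercase alias: NumPy-backed dtype, missing values are stored as NaN.
numpy_backed = pd.Series([1.5, None], dtype="float64")

# Capitalized alias: nullable extension dtype, missing values are pd.NA.
nullable = pd.Series([1.5, None], dtype="Float64")

numpy_missing = numpy_backed.iloc[1]
nullable_missing = nullable.iloc[1]

print(numpy_backed.dtype, numpy_missing)
print(nullable.dtype, nullable_missing)
```

Under the proposal, `"string"` versus `"String"` would follow the same lowercase/capitalized split between the NaN-based and `pd.NA`-based variants.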