Commit 8974c5b

next round of updates (small text updates, add capitalized String alias)
1 parent 0a68504 commit 8974c5b


web/pandas/pdeps/0014-string-dtype.md

Lines changed: 26 additions & 18 deletions

```diff
@@ -12,7 +12,7 @@ This PDEP proposes to introduce a dedicated string dtype that will be used by
 default in pandas 3.0:
 
 * In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
-  or otherwise the numpy object-dtype alternative.
+  or otherwise a string dtype using numpy object-dtype under the hood as fallback.
 * The default string dtype will use missing value semantics (using NaN) consistent
   with the other default data types.
 
```

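The hunk above describes the proposed default. As a rough, non-authoritative sketch of what that would mean for end users (using the `pd.options.future.infer_string` flag that pandas 2.1+ already exposes; the exact dtype repr varies by pandas version and by whether PyArrow is installed):

```python
import pandas as pd

# Preview the proposed default string dtype (flag available since pandas 2.1).
pd.options.future.infer_string = True

s = pd.Series(["a", "b", None])
print(s.dtype)   # a PyArrow-backed string dtype when pyarrow is installed;
                 # under this proposal, a numpy object-dtype-backed fallback otherwise
print(s.isna())  # the missing entry follows NaN-based missing value semantics
```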
```diff
@@ -69,11 +69,11 @@ data type in pandas that is not backed by Python objects.
 After acceptance of PDEP-10, two aspects of the proposal have been under
 reconsideration:
 
-- Based on user feedback (mostly around installation complexity and size), it
-  has been considered to relax the new `pyarrow` requirement to not be a _hard_
-  runtime dependency. In addition, NumPy 2.0 could in the future potentially
-  reduce the need to make PyArrow a required dependency specifically for a
-  dedicated pandas string dtype.
+- Based on feedback from users and maintainers from other packages (mostly
+  around installation complexity and size), it has been considered to relax the
+  new `pyarrow` requirement to not be a _hard_ runtime dependency. In addition,
+  NumPy 2.0 could in the future potentially reduce the need to make PyArrow a
+  required dependency specifically for a dedicated pandas string dtype.
 - PDEP-10 did not consider the usage of the experimental `pd.NA` as a
   consequence of adopting one of the existing implementations of the
   `StringDtype`.
```

```diff
@@ -250,22 +250,24 @@ in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
 the default missing value sentinel? using the new NumPy 2.0 capabilities?
 overhauling all our dtypes to use a logical data type system?), introducing a
 default string dtype could also be delayed until there is more clarity in those
-other discussions.
+other discussions. Specifically, it would avoid temporarily switching to use
+`NaN` for the string dtype, while in a future version we might switch back
+to `pd.NA` by default.
 
 However:
 
 1. Delaying has a cost: it further postpones introducing a dedicated string
    dtype that has massive benefits for users, both in usability as (for the
-   significant part of the user base that has PyArrow installed) in performance.
+   part of the user base that has PyArrow installed) in performance.
 2. In case pandas eventually transitions to use `pd.NA` as the default missing value
    sentinel, a migration path for _all_ pandas data types will be needed, and thus
    the challenges around this will not be unique to the string dtype and
    therefore not a reason to delay this.
 
 Making this change now for 3.0 will benefit the majority of users, while coming
 at a cost for a part of the users who already started using the `"string"` or
-`pd.StringDtype()` dtype (they will have to update their code to continue to the
-variant using `pd.NA`, see the "Backward compatibility" section below).
+`pd.StringDtype()` dtype (they will have to update their code to continue to use
+the variant using `pd.NA`, see the "Backward compatibility" section below).
 
 ### Why not use the existing StringDtype with `pd.NA`?
```

```diff
@@ -311,9 +313,14 @@ To allow testing code in advance, the
 `pd.options.future.infer_string = True` option is available for users.
 
 Otherwise, the actual string-specific functionality (such as the `.str` accessor
-methods) should generally all keep working as is. By preserving the current
-missing value semantics, this proposal is also backwards compatible on this
-aspect.
+methods) should generally all keep working as is.
+
+By preserving the current missing value semantics, this proposal is also mostly
+backwards compatible on this aspect. When storing strings in object dtype, pandas
+however did allow using `None` as the missing value indicator as well (and in
+certain cases such as the `shift` method, pandas even introduced this itself).
+For all the cases where currently `None` was used as the missing value sentinel,
+this will change to use `NaN` consistently.
 
 ### For existing users of `StringDtype`
 
```

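To make the `None` → `NaN` point in the hunk above concrete, here is a small hedged sketch (not part of the diff); exact reprs depend on the pandas version, but the idea is that identity checks against `None` need to become `pd.isna` checks:

```python
import pandas as pd

pd.options.future.infer_string = True  # preview the proposed default (pandas 2.1+)

s = pd.Series(["a", "b", "c"])
shifted = s.shift(1)

# With object-dtype strings, shift() introduced None; with the new default string
# dtype the missing value is NaN, so `shifted.iloc[0] is None` no longer holds.
print(shifted.iloc[0])           # nan rather than None
print(pd.isna(shifted.iloc[0]))  # True -- the portable way to test for missingness
```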
```diff
@@ -359,17 +366,18 @@ options:
 - If the user specifically wants the variant of the string dtype that uses
   `pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
   have to update their dtype specification from `"string"` / `pd.StringDtype()`
-  to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep
-  their code running as is.
+  to `"String"` / `pd.StringDtype(na_value=pd.NA)` to suppress the warning and
+  further keep their code running as is.
+
+A `"String"` alias (capitalized) would be added to make it easier for users to
+continue using the variant using `pd.NA`, and such a capitalized string alias is
+consistent with other nullable dtypes (`"float64"` vs `"Float64"`).
 
 ## Timeline
 
 The future PyArrow-backed string dtype was already made available behind a feature
 flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`).
 
-Some small enhancements or fixes might still be needed and can continue to be
-backported to pandas 2.2.x.
-
 The variant using numpy object-dtype can also be backported to the 2.2.x branch
 to allow easier testing. It is proposed to release this as 2.3.0 (created from
 the 2.2.x branch, given that the main branch already includes many other changes
```

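For existing `StringDtype` users, the migration described in the last hunk could look roughly like the sketch below. Note that both the `na_value` keyword of `pd.StringDtype` and the capitalized `"String"` alias are part of this proposal and may not exist in the pandas version you have installed:

```python
import pandas as pd

# Existing spelling: under this proposal it would emit a warning, and users who
# want to keep pd.NA semantics are asked to be explicit instead.
s_old = pd.Series(["a", None], dtype="string")

# Proposed explicit spelling that keeps the pd.NA-based variant (the na_value
# keyword is part of this proposal, not necessarily available in older pandas):
s_new = pd.Series(["a", None], dtype=pd.StringDtype(na_value=pd.NA))

# Proposed capitalized alias, equivalent to the line above once implemented:
# s_new = pd.Series(["a", None], dtype="String")
```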