Commit 9c5342a

New revision: keep back compat for 'string', introduce 'str' for the new default dtype
1 parent d24a80a commit 9c5342a

1 file changed: +76 -99 lines changed

web/pandas/pdeps/0014-string-dtype.md

Lines changed: 76 additions & 99 deletions
@@ -11,7 +11,7 @@
 This PDEP proposes to introduce a dedicated string dtype that will be used by
 default in pandas 3.0:

-* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
+* In pandas 3.0, enable a string dtype (`"str"`) by default, using PyArrow if available
 or otherwise a string dtype using numpy object-dtype under the hood as fallback.
 * The default string dtype will use missing value semantics (using NaN) consistent
 with the other default data types.
@@ -96,10 +96,10 @@ is intended to become the default in pandas 3.0).

 To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

-1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow
+1. For pandas 3.0, a `"str"` string dtype is enabled by default, which will use PyArrow
 if installed, and otherwise falls back to an in-house functionally-equivalent
 (but slower) version.
-2. This default "string" dtype will follow the same behaviour for missing values
+2. This default string dtype will follow the same behaviour for missing values
 as other default data types, and use `NaN` as the missing value sentinel.
 3. The version that is not backed by PyArrow can reuse (with minor code
 additions) the existing numpy object-dtype backed StringArray for its
@@ -135,10 +135,9 @@ that case.

 ### Missing value semantics

-As mentioned in the background section, the original `StringDtype` has used
-the experimental `pd.NA` sentinel for missing values. In addition to using
-`pd.NA` as the scalar for a missing value, this essentially means
-that:
+As mentioned in the background section, the original `StringDtype` has always
+used the experimental `pd.NA` sentinel for missing values. In addition to using
+`pd.NA` as the scalar for a missing value, this essentially means that:

 - String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics)
 for missing values, where `NA` propagates in boolean operations such as
@@ -154,7 +153,7 @@ dtype should also still use the same default missing value semantics and return
 default data types when doing operations on the string column, to be consistent
 with the other default dtypes at this point.

-In practice, this means that the default `"string"` dtype will use `NaN` as
+In practice, this means that the default string dtype will use `NaN` as
 the missing value sentinel, and:

 - String columns will follow NaN-semantics for missing values, where `NaN` gives
@@ -165,9 +164,8 @@ the missing value sentinel, and:
 Because the original `StringDtype` implementations already use `pd.NA` and
 return masked integer and boolean arrays in operations, a new variant of the
 existing dtypes that uses `NaN` and default data types was needed. The original
-variant of `StringDtype` using `pd.NA` will still be available for those who
-want to keep using it (see below in the "Naming" subsection for how to specify
-this).
+variant of `StringDtype` using `pd.NA` will continue to be available for those
+who were already using it.

 ### Object-dtype "fallback" implementation

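To illustrate the two missing-value behaviours contrasted in the hunks above, here is a toy model in plain Python. It makes several assumptions: `NA` is a hypothetical stand-in sentinel (it is not `pd.NA`), and `eq_nan_semantics` / `eq_na_semantics` are illustrative helper names, not pandas functions. The sketch relies only on the fact that a floating-point NaN compares unequal to everything, while an NA-style sentinel propagates through the comparison.

```python
import math


class _NAType:
    """Toy propagating missing-value sentinel (hypothetical, not pandas' pd.NA)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def eq_nan_semantics(value, other):
    """Compare under NaN semantics: NaN == anything evaluates to False."""
    return value == other


def eq_na_semantics(value, other):
    """Compare under NA semantics: a missing operand propagates as NA."""
    if value is NA or other is NA:
        return NA
    return value == other


nan_column = ["a", "b", math.nan]  # missing value stored as NaN
na_column = ["a", "b", NA]         # missing value stored as the toy NA

# Plain booleans everywhere: usable directly as a filter mask.
nan_mask = [eq_nan_semantics(v, "a") for v in nan_column]

# Three-valued result: the missing row stays "unknown".
na_mask = [eq_na_semantics(v, "a") for v in na_column]
```

In this model the NaN-based mask contains only `True`/`False`, which loosely mirrors why the new default variant can return default (non-nullable) result types, whereas the NA-based mask needs a result type that can itself hold a missing value.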
@@ -185,23 +183,35 @@ or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) can
 still be explored, but at that point that is an implementation detail that
 should not have a direct impact on users (except for performance).

+For the original variant of `StringDtype` using `pd.NA`, currently the default
+storage is `"python"` (the object-dtype based implementation). Also for this
+variant, it is proposed to follow the same logic for determining the default
+storage, i.e. to default to `"pyarrow"` if available, and otherwise
+fall back to `"python"`.
+
 ### Naming

 Given the long history of this topic, the naming of the dtypes is a difficult
 topic.

 In the first place, it should be acknowledged that most users should not need to
-use storage-specific options. Users are expected to specify `pd.StringDtype()`
-or `"string"`, and that will give them their default string dtype (which
-depends on whether PyArrow is installed or not).
-
-But for testing purposes and advanced use cases that want control over this, we
-need some way to specify this and distinguish them from the other string dtypes.
-In addition, users that want to continue using the original NA-variant of the
-dtype need a way to specify this.
-
-Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
-the `"pyarrow_numpy"` storage was used to disambiguate from the existing
+use storage-specific options. Users are expected to specify a generic name (such
+as `"str"` or `"string"`), and that will give them their default string dtype
+(which depends on whether PyArrow is installed or not).
+
+For the generic string alias to specify the dtype, `"string"` is already used
+for the `StringDtype` using `pd.NA`. This PDEP proposes to use `"str"` for the
+new default `StringDtype` using `NaN`. This ensures backwards compatibility for
+code using `dtype="string"`, and was also chosen because `dtype="str"` or
+`dtype=str` currently already works to ensure your data is converted to
+strings (only using object dtype for the result).
+
+But for testing purposes and advanced use cases that want control over the exact
+variant of the `StringDtype`, we need some way to specify this and distinguish
+them from the other string dtypes.
+
+Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used for the new variant using `NaN`,
+where the `"pyarrow_numpy"` storage was used to disambiguate from the existing
 `"pyarrow"` option using `pd.NA`. However, `"pyarrow_numpy"` is a rather confusing
 option and doesn't generalize well. Therefore, this PDEP proposes a new naming
 scheme as outlined below, and `"pyarrow_numpy"` will be deprecated and removed
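The fallback rule described in this hunk (prefer `"pyarrow"` when it is available, otherwise use `"python"`) could be sketched roughly as follows. This is a minimal illustration, not the pandas implementation: `default_string_storage` is a hypothetical helper name, and checking availability with `importlib.util.find_spec` is an assumption about how one might detect PyArrow.

```python
from importlib.util import find_spec
from typing import Optional


def default_string_storage(has_pyarrow: Optional[bool] = None) -> str:
    """Pick a default storage name following the rule in the hunk above.

    Hypothetical helper for illustration only. When `has_pyarrow` is None,
    availability is probed by looking up the top-level `pyarrow` module
    without importing it.
    """
    if has_pyarrow is None:
        has_pyarrow = find_spec("pyarrow") is not None
    return "pyarrow" if has_pyarrow else "python"


# The result depends on the environment the code runs in.
chosen = default_string_storage()
```

Passing `has_pyarrow` explicitly makes the rule easy to exercise in tests for both cases, regardless of what is installed.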
@@ -217,29 +227,31 @@ dtype of the data:

 | User specification | Concrete dtype | String alias | Note |
 |---------------------------------------------|---------------------------------------------------------------|---------------------------------------|----------|
-| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) |
-| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow" \| "python", na_value=np.nan)` | "string" | (1), (2) |
-| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) |
-| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) |
-| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "String[pyarrow]" | |
-| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "String[python]" | |
-| `StringDtype(na_value=pd.NA)` or `"String"` | `StringDtype(storage="pyarrow" \| "python", na_value=pd.NA)` | "String[pyarrow]" or "String[python]" | (1) |
-| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) |
+| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) |
+| `"str"` or `StringDtype(na_value=np.nan)` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) |
+| `StringDtype("pyarrow", na_value=np.nan)` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "str" | |
+| `StringDtype("python", na_value=np.nan)` | `StringDtype(storage="python", na_value=np.nan)` | "str" | |
+| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | |
+| `StringDtype("python")` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | |
+| `"string"` or `StringDtype()` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) |
+| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (2) |

 Notes:

 - (1) You get "pyarrow" or "python" depending on pyarrow being installed.
-- (2) Those three rows are backwards incompatible (i.e. they work now but give
-the NA-variant), see the "Backward compatibility" section below.
-- (3) "pyarrow_numpy" is kept temporarily because this is already in a released
+- (2) "pyarrow_numpy" is kept temporarily because this is already in a released
 version, but we can deprecate it in 2.x and have it removed for 3.0.

-For the new default string dtype, only the `"string"` alias can be used to
-specify the dtype as a string, i.e. a way would not be provided to make the
+For the new default string dtype, only the `"str"` alias can be used to
+specify the dtype as a string, i.e. pandas would not provide a way to make the
 underlying storage (pyarrow or python) explicit through the string alias. This
-string alias is only a convenience shortcut and for most users `"string"` is
+string alias is only a convenience shortcut and for most users `"str"` is
 sufficient (they don't need to specify the storage), and the explicit
-`pd.StringDtype(...)` is still available for more fine-grained control.
+`pd.StringDtype(storage=..., na_value=np.nan)` is still available for more
+fine-grained control.
+
+Also for the existing variant using `pd.NA`, specifying the storage through the
+string alias could be deprecated, but that is left for a separate decision.

 ## Alternatives

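The added rows of the specification table in the hunk above can be read as a small lookup rule. The sketch below models that rule in plain Python so the mapping is easier to check by hand. It is a toy model under stated assumptions, not pandas code: `ResolvedStringDtype`, `resolve_string_dtype` and `resolve_alias` are hypothetical names, missing-value sentinels are represented by the strings `"nan"` and `"NA"`, and PyArrow availability is passed in explicitly instead of being detected.

```python
from typing import NamedTuple, Optional


class ResolvedStringDtype(NamedTuple):
    storage: str   # "pyarrow" or "python"
    na_value: str  # "nan" (NaN variant) or "NA" (pd.NA variant)
    alias: str     # string alias listed in the table


def resolve_string_dtype(
    storage: Optional[str] = None,
    na_value: str = "NA",
    *,
    pyarrow_available: bool,
) -> ResolvedStringDtype:
    """Model of the constructor rows; defaults mirror `StringDtype()` -> NA variant."""
    if storage == "pyarrow_numpy":
        # Legacy spelling kept temporarily, see note (2) of the table.
        return ResolvedStringDtype("pyarrow", "nan", "string[pyarrow_numpy]")
    if storage is None:
        # Note (1): the default storage depends on PyArrow being installed.
        storage = "pyarrow" if pyarrow_available else "python"
    if storage not in ("pyarrow", "python"):
        raise ValueError(f"unknown storage: {storage!r}")
    if na_value == "nan":
        alias = "str"
    elif na_value == "NA":
        alias = f"string[{storage}]"
    else:
        raise ValueError(f"unknown na_value: {na_value!r}")
    # The model does not check that an explicitly requested storage is installed.
    return ResolvedStringDtype(storage, na_value, alias)


def resolve_alias(alias: str, *, pyarrow_available: bool) -> ResolvedStringDtype:
    """Model of the two generic string aliases in the table."""
    if alias == "str":
        return resolve_string_dtype(na_value="nan", pyarrow_available=pyarrow_available)
    if alias == "string":
        return resolve_string_dtype(pyarrow_available=pyarrow_available)
    raise ValueError(f"unknown alias: {alias!r}")


new_default = resolve_alias("str", pyarrow_available=True)
existing = resolve_alias("string", pyarrow_available=False)
```

In this model `"string"` keeps resolving to the `pd.NA` variant, which is the backwards-compatibility property this revision is after, while `"str"` always reports the alias `"str"` regardless of the storage that was picked.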
@@ -257,17 +269,16 @@ to `pd.NA` by default.
 However:

 1. Delaying has a cost: it further postpones introducing a dedicated string
-dtype that has massive benefits for users, both in usability as (for the
+dtype that has significant benefits for users, both in usability and (for the
 part of the user base that has PyArrow installed) in performance.
 2. In case pandas eventually transitions to use `pd.NA` as the default missing value
-sentinel, a migration path for _all_ pandas data types will be needed, and thus
+sentinel, a migration path for _all_ pandas data types will be needed, and thus
 the challenges around this will not be unique to the string dtype and
 therefore not a reason to delay this.

-Making this change now for 3.0 will benefit the majority of users, while coming
-at a cost for a part of the users who already started using the `"string"` or
-`pd.StringDtype()` dtype (they will have to update their code to continue to use
-the variant using `pd.NA`, see the "Backward compatibility" section below).
+Making this change now for 3.0 will benefit the majority of users, and the PDEP
+author believes this is worth the cost of the added complexity around "yet
+another dtype" (also for other data types we already have multiple variants).

 ### Why not use the existing StringDtype with `pd.NA`?

@@ -290,17 +301,26 @@ when explicitly opting into this.

 ### Naming alternatives

-This PDEP now keeps the `pd.StringDtype` class constructor with the existing
-`storage` keyword and with an additional `na_value` keyword.
+An initial version of this PDEP proposed to use the `"string"` alias and the
+default `pd.StringDtype()` class constructor for the new default dtype.
+However, that caused a lot of discussion around backwards compatibility for
+existing users of the `StringDtype` using `pd.NA`.

 During the discussion, several alternatives have been brought up. Both
-alternative keyword names as using a different constructor. This PDEP opted to
-keep using the existing `pd.StringDtype()` for now to keep the changes as
+alternative keyword names as using a different constructor. In the end,
+this PDEP proposes to use a different string alias (`"str"`) but to keep
+using the existing `pd.StringDtype` (with the existing `storage` keyword but
+with an additional `na_value` keyword) for now to keep the changes as
 minimal as possible, leaving a larger overhaul of the dtype system (potentially
 including different constructor functions or namespace) for a future discussion.
 See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full
 discussion.

+One consequence is that when using the class constructor for the default dtype,
+it has to be used with non-default arguments, i.e. a user needs to specify
+`pd.StringDtype(na_value=np.nan)` to get the default dtype using `NaN`.
+Therefore, the pandas documentation will focus on the usage of `dtype="str"`.
+
 ## Backward compatibility

 The most visible backwards incompatible change will be that columns with string
@@ -324,54 +344,14 @@ this will change to use `NaN` consistently.

 ### For existing users of `StringDtype`

-Users of the existing `StringDtype` will see more backwards incompatible
-changes, though. In pandas 3.0, calling `pd.StringDtype()` (or specifying
-`dtype="string"`) will start returning the new default string dtype using `NaN`,
-while up to now this returned the string dtype using `pd.NA` introduced in
-pandas 1.0.
-
-For example, this code snippet returned the NA-variant of `StringDtype` with
-pandas 1.x and 2.x:
-
-```python
->>> pd.Series(["a", "b", None], dtype="string")
-0       a
-1       b
-2    <NA>
-dtype: string
-```
+Existing code that already opted in to use the `StringDtype` using `pd.NA`
+should generally keep working as is. The latest version of this PDEP preserves
+the behaviour of `dtype="string"` or `dtype=pd.StringDtype()` to mean the
+`pd.NA` variant of the dtype.

-but will start returning the new default NaN-variant of `StringDtype` with
-pandas 3.0. This means that the missing value sentinel will change from `pd.NA`
-to `NaN`, and that operations will no longer return nullable dtypes but default
-numpy dtypes (see the "Missing value semantics" section above).
-
-While this change will be transparent in many cases (e.g. checking for missing
-values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
-a string predicate method keeps working regardless of the sentinel), this can be
-a breaking change if users relied on the exact sentinel or resulting dtype. Since
-pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
-that many users already have started using this dtype, even though officially
-still labeled as "experimental".
-
-To smooth the upgrade experience for those users, it is proposed to add a
-deprecation warning before 3.0 when such dtype is created, giving them two
-options:
-
-- If the user just wants to have a dedicated "string" dtype (or the better
-performance when using pyarrow) but is fine with using the default NaN
-semantics, they can add `pd.options.future.infer_string = True` to their code
-to suppress the warning and already opt-in to the future behaviour of pandas
-3.0.
-- If the user specifically wants the variant of the string dtype that uses
-`pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
-have to update their dtype specification from `"string"` / `pd.StringDtype()`
-to `"String"` / `pd.StringDtype(na_value=pd.NA)` to suppress the warning and
-further keep their code running as is.
-
-A `"String"` alias (capitalized) would be added to make it easier for users to
-continue using the variant using `pd.NA`, and such capitalized string alias is
-consistent with other nullable dtypes (`"float64`" vs `"Float64"`).
+It does propose to change the default storage to `"pyarrow"` (if available) for
+the opt-in `pd.NA` variant as well, but this should not have much user-visible
+impact.

 ## Timeline

@@ -381,13 +361,10 @@ flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`).
 The variant using numpy object-dtype can also be backported to the 2.2.x branch
 to allow easier testing. It is proposed to release this as 2.3.0 (created from
 the 2.2.x branch, given that the main branch already includes many other changes
-targeted for 3.0), together with the deprecation warning when creating a dtype
-from `"string"` / `pd.StringDtype()`.
+targeted for 3.0), together with the changes to the naming scheme.

 The 2.3.0 release would then have all future string functionality available
-(both the pyarrow and object-dtype based variants of the default string dtype),
-and warn existing users of the `StringDtype` in advance of 3.0 about how to
-update their code.
+(both the pyarrow and object-dtype based variants of the default string dtype).

 For pandas 3.0, this `future.infer_string` flag becomes enabled by default.
