Commit 9ede2e6

Expand Backward compatibility section + add proposal for deprecation
1 parent 5b5835b commit 9ede2e6

1 file changed

+62
-14
lines changed

web/pandas/pdeps/0014-string-dtype.md

Lines changed: 62 additions & 14 deletions
@@ -244,9 +244,10 @@ sufficient (they don't need to specify the storage), and the explicit
244244

245245
To avoid introducing a new string dtype while other discussions and changes are
246246
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
247-
the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
248-
could also delay introducing a default string dtype until there is more clarity
249-
in those other discussions.
247+
the default missing value sentinel? using the new NumPy 2.0 capabilities?
248+
overhauling all our dtypes to use a logical data type system?), we could also
249+
delay introducing a default string dtype until there is more clarity in those
250+
other discussions.
250251

251252
However:
252253

@@ -258,6 +259,11 @@ However:
258259
the challenges around this will not be unique to the string dtype and
259260
therefore not a reason to delay this.
260261

262+
Making this change now for 3.0 will benefit the majority of our users, while
263+
coming at a cost for a part of the users who already started using the
264+
`"string"` dtype (they will have to update their code to continue with the variant
265+
using `pd.NA`, see the "Backward compatibility" section below).
266+
261267
### Why not use the existing StringDtype with `pd.NA`?
262268

263269
Wouldn't adding even more variants of the string dtype make things only more
@@ -294,22 +300,64 @@ discussion.
294300

295301
The most visible backwards incompatible change will be that columns with string
296302
data will no longer have an `object` dtype. Therefore, code that assumes
297-
`object` dtype (such as `ser.dtype == object`) will need to be updated.
303+
`object` dtype (such as `ser.dtype == object`) will need to be updated. This
304+
change is done as a hard break in a major release, as warning in advance for the
305+
changed inference is deemed too noisy.
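
As an illustration of the kind of update this requires, the sketch below replaces an `ser.dtype == object` check with `pd.api.types.is_string_dtype`. The helper name `holds_strings` is hypothetical. Using `is_string_dtype` here is one possible approach, not something this proposal prescribes.

```python
import pandas as pd


def holds_strings(ser: pd.Series) -> bool:
    """Return True if the Series holds string data.

    Hypothetical helper: avoids comparing against `object`, so it does not
    depend on whether strings are inferred as object or as a string dtype.
    """
    return bool(pd.api.types.is_string_dtype(ser))


strings = pd.Series(["a", "b"])
numbers = pd.Series([1, 2])

print(holds_strings(strings))
print(holds_strings(numbers))
```

A check written this way should give the same answer for an all-string column whether it is stored as `object` or as a dedicated string dtype.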
298306

299307
To allow testing your code in advance, the
300308
`pd.options.future.infer_string = True` option is available.
301309

302310
Otherwise, the actual string-specific functionality (such as the `.str` accessor
303-
methods) should all keep working as is. By preserving the current missing value
304-
semantics, this proposal is also backwards compatible on this aspect.
305-
306-
One other backwards incompatible change is present for early adopters of the
307-
existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start
308-
returning the new default string dtype, while up to now this returned the
309-
experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users
310-
will need to start specifying a keyword in the dtype constructor if they want to
311-
keep using `pd.NA` (but if they just want to have a dedicated string dtype, they
312-
don't need to change their code).
311+
methods) should generally all keep working as is. By preserving the current
312+
missing value semantics, this proposal is also backwards compatible on this
313+
aspect.
314+
315+
### For existing users of `StringDtype`
316+
317+
Users of the existing `StringDtype` will see more backwards incompatible
318+
changes, though. In pandas 3.0, calling `pd.StringDtype()` (or specifying
319+
`dtype="string"`) will start returning the new default string dtype using `NaN`,
320+
while up to now this returned the string dtype using `pd.NA` introduced in
321+
pandas 1.0.
322+
323+
For example, this code snippet returned the NA-variant of `StringDtype` with
324+
pandas 1.x and 2.x:
325+
326+
```python
327+
>>> pd.Series(["a", "b", None], dtype="string")
328+
0 a
329+
1 b
330+
2 <NA>
331+
dtype: string
332+
```
333+
334+
but will start returning the new default NaN-variant of `StringDtype` with
335+
pandas 3.0. This means that the missing value sentinel will change from `pd.NA`
336+
to `NaN`, and that operations will no longer return nullable dtypes but default
337+
numpy dtypes (see the "Missing value semantics" section above).
338+
339+
While this change will be transparent in many cases (e.g. checking for missing
340+
values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
341+
a string predicate method keeps working regardless of the sentinel), this can be
342+
a breaking change if you relied on the exact sentinel or resulting dtype. Since
343+
pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
344+
that many users have already started using this dtype, even though it is officially
345+
still labeled as "experimental".
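
A minimal sketch of the sentinel-agnostic patterns mentioned above. The example data is made up. The calls shown are expected to behave the same whether `dtype="string"` resolves to the `pd.NA` variant or the `NaN` variant.

```python
import pandas as pd

ser = pd.Series(["apple", "banana", None], dtype="string")

# These checks do not depend on whether the sentinel is pd.NA or NaN.
missing_mask = ser.isna().tolist()
present = ser.dropna().tolist()
filled = ser.fillna("unknown").tolist()

# By contrast, an identity check such as `ser.iloc[2] is pd.NA` depends on
# the exact sentinel and is the kind of code that may need updating.
print(missing_mask, present, filled)
```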
346+
347+
To smooth the upgrade experience for those users, we propose to add a
348+
deprecation warning before 3.0 when such a dtype is created, giving them two
349+
options:
350+
351+
- If the user just wants to have a dedicated "string" dtype (or the better
352+
performance when using pyarrow) but is fine with using the default NaN
353+
semantics, they can add `pd.options.future.infer_string = True` to their code
354+
to suppress the warning and already opt in to the future behaviour of pandas
355+
3.0.
356+
- If the user specifically wants the variant of the string dtype that uses
357+
`pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
358+
have to update their dtype specification from `"string"` / `pd.StringDtype()`
359+
to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep
360+
their code running as is.
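
The second option could look roughly like the sketch below. The `try`/`except` fallback is an assumption added for illustration: it covers pandas versions where `pd.StringDtype` does not yet accept the `na_value` keyword, in which case `pd.StringDtype()` already is the `pd.NA` variant.

```python
import pandas as pd

try:
    # Explicitly request the pd.NA variant, as proposed above.
    na_string_dtype = pd.StringDtype(na_value=pd.NA)
except TypeError:
    # Assumption: older pandas without the `na_value` keyword, where
    # StringDtype() is already the pd.NA variant.
    na_string_dtype = pd.StringDtype()

ser = pd.Series(["a", None], dtype=na_string_dtype)

# With the NA variant, the missing element is pd.NA.
print(ser.iloc[1] is pd.NA)
```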
313361

314362
## Timeline
315363
