@@ -244,9 +244,10 @@ sufficient (they don't need to specify the storage), and the explicit

To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
- the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
- could also delay introducing a default string dtype until there is more clarity
- in those other discussions.
+ the default missing value sentinel? using the new NumPy 2.0 capabilities?
+ overhauling all our dtypes to use a logical data type system?), we could also
+ delay introducing a default string dtype until there is more clarity in those
+ other discussions.

However:

@@ -258,6 +259,11 @@ However:
the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

+ Making this change now for 3.0 will benefit the majority of our users, while
+ coming at a cost for those users who already started using the
+ `"string"` dtype (they will have to update their code to continue using the
+ variant with `pd.NA`, see the "Backward compatibility" section below).
+
### Why not use the existing StringDtype with `pd.NA`?

Wouldn't adding even more variants of the string dtype make things only more
@@ -294,22 +300,64 @@ discussion.

The most visible backwards incompatible change will be that columns with string
data will no longer have an `object` dtype. Therefore, code that assumes
- `object` dtype (such as `ser.dtype == object`) will need to be updated.
+ `object` dtype (such as `ser.dtype == object`) will need to be updated. This
+ change is done as a hard break in a major release, as warning in advance for the
+ changed inference is deemed too noisy.

To allow testing your code in advance, the
`pd.options.future.infer_string = True` option is available.

Otherwise, the actual string-specific functionality (such as the `.str` accessor
- methods) should all keep working as is. By preserving the current missing value
- semantics, this proposal is also backwards compatible on this aspect.
-
- One other backwards incompatible change is present for early adopters of the
- existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start
- returning the new default string dtype, while up to now this returned the
- experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users
- will need to start specifying a keyword in the dtype constructor if they want to
- keep using `pd.NA` (but if they just want to have a dedicated string dtype, they
- don't need to change their code).
+ methods) should generally all keep working as is. By preserving the current
+ missing value semantics, this proposal is also backwards compatible on this
+ aspect.
+
+ ### For existing users of `StringDtype`
+
+ Users of the existing `StringDtype` will see more backwards incompatible
+ changes, though. In pandas 3.0, calling `pd.StringDtype()` (or specifying
+ `dtype="string"`) will start returning the new default string dtype using `NaN`,
+ while up to now this returned the string dtype using `pd.NA` introduced in
+ pandas 1.0.
+
+ For example, this code snippet returned the NA-variant of `StringDtype` with
+ pandas 1.x and 2.x:
+
+ ```python
+ >>> pd.Series(["a", "b", None], dtype="string")
+ 0       a
+ 1       b
+ 2    <NA>
+ dtype: string
+ ```
+
+ but will start returning the new default NaN-variant of `StringDtype` with
+ pandas 3.0. This means that the missing value sentinel will change from `pd.NA`
+ to `NaN`, and that operations will no longer return nullable dtypes but default
+ numpy dtypes (see the "Missing value semantics" section above).
+
+ While this change will be transparent in many cases (e.g. checking for missing
+ values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
+ a string predicate method keeps working regardless of the sentinel), this can be
+ a breaking change if you relied on the exact sentinel or resulting dtype. Since
+ pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
+ that many users already have started using this dtype, even though it is officially
+ still labeled as "experimental".
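
To make the "exact sentinel" caveat concrete, here is a minimal sketch contrasting an identity check against `pd.NA` with a sentinel-agnostic check. It uses an `object`-dtype Series as a stand-in that mixes the sentinels, so it does not depend on which string dtype variant the installed pandas provides. The helper `is_missing` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def is_missing(value) -> bool:
    """Hypothetical helper: scalar missing-value check for any sentinel."""
    # pd.isna treats None, NaN and pd.NA alike.
    return bool(pd.isna(value))


# Stand-in data holding all three sentinels side by side.
ser = pd.Series(["a", None, np.nan, pd.NA], dtype=object)

fragile = [v is pd.NA for v in ser]    # only recognises pd.NA
robust = [is_missing(v) for v in ser]  # recognises every sentinel

print(fragile)
print(robust)
print(ser.isna().tolist())
```

Code written in the `robust` style should behave the same before and after the sentinel changes, whereas the `fragile` identity check is the kind of pattern the paragraph above warns about.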
+
+ To smooth the upgrade experience for those users, we propose to add a
+ deprecation warning before 3.0 when such a dtype is created, giving them two
+ options:
+
+ - If the user just wants to have a dedicated "string" dtype (or the better
+ performance when using pyarrow) but is fine with using the default NaN
+ semantics, they can add `pd.options.future.infer_string = True` to their code
+ to suppress the warning and already opt in to the future behaviour of pandas
+ 3.0.
+ - If the user specifically wants the variant of the string dtype that uses
+ `pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
+ have to update their dtype specification from `"string"` / `pd.StringDtype()`
+ to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep
+ their code running as is.
## Timeline