@@ -86,7 +86,7 @@ It can also be specified explicitly using the ``"str"`` alias:
    2    NaN
    dtype: str

- Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and otherwise
+ Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and others
will now use the new string dtype when reading string data.

In contrast to the current object dtype, the new string dtype will only store
@@ -268,6 +268,118 @@ the :meth:`~pandas.Series.astype` method:
This ``astype("object")`` call will be redundant when using pandas 2.x, but
this code will work for all versions.

+ Invalid unicode input
+ ~~~~~~~~~~~~~~~~~~~~~
+
+ Python allows a built-in ``str`` object to represent invalid unicode data
+ (for example, a lone surrogate). Since the ``object`` dtype can hold any
+ Python object, a pandas Series can contain such invalid unicode data:
+
+ .. code-block:: python
+
+     >>> ser = pd.Series(["\u2600", "\ud83d"], dtype=object)
+     >>> ser
+     0         ☀
+     1    \ud83d
+     dtype: object
+
+ However, the string dtype that uses ``pyarrow`` under the hood can only store
+ valid unicode data, and raises an error otherwise:
+
+ .. code-block:: python
+
+     >>> ser = pd.Series(["\u2600", "\ud83d"])
+     ---------------------------------------------------------------------------
+     UnicodeEncodeError                        Traceback (most recent call last)
+     ...
+     UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
+
+ If you want to keep the previous behaviour, you can explicitly specify
+ ``dtype=object`` to keep working with object dtype.
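
The error above comes from UTF-8 encoding, which rejects lone surrogates such as ``"\ud83d"``. As a rough sketch in plain Python (no pandas involved), the check below flags values that cannot be encoded. ``is_valid_unicode`` is a hypothetical helper name used only for illustration; it is not a pandas function.

```python
def is_valid_unicode(value: str) -> bool:
    """Return True if `value` can be encoded as strict UTF-8 (hypothetical helper)."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates such as "\ud83d" end up here.
        return False
    return True


values = ["\u2600", "\ud83d"]
flags = [is_valid_unicode(v) for v in values]
print(flags)  # [True, False]
```

A check along these lines could help identify data that needs ``dtype=object`` before switching to the new string dtype.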
+
+ When you have byte data that you want to convert to strings using ``decode()``,
+ the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter that
+ lets you specify object dtype instead of the default string dtype for this use
+ case.
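
To illustrate why decoded bytes may need object dtype, the plain-Python sketch below decodes bytes with the ``surrogatepass`` error handler. This is an assumed example of such a use case, not one taken from the pandas source. The result is a ``str`` containing a lone surrogate, which strict UTF-8 encoding then rejects.

```python
# Bytes that represent the lone surrogate U+D83D; strict UTF-8 decoding rejects them.
raw = b"\xed\xa0\xbd"

# The "surrogatepass" error handler lets the surrogate through.
decoded = raw.decode("utf-8", errors="surrogatepass")

# The resulting str is not valid unicode: strict encoding fails.
try:
    decoded.encode("utf-8")
    round_trips = True
except UnicodeEncodeError:
    round_trips = False

print(round_trips)  # False
```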
+
+ Notable bug fixes
+ ~~~~~~~~~~~~~~~~~
+
+ ``astype(str)`` preserving missing values
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ This is a long-standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.
+
+ With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
+ ``astype("str")``!), the operation would convert every element to a string,
+ including the missing values:
+
+ .. code-block:: python
+
+     # OLD behavior in pandas < 3
+     >>> ser = pd.Series(["a", np.nan], dtype=object)
+     >>> ser
+     0      a
+     1    NaN
+     dtype: object
+     >>> ser.astype(str)
+     0      a
+     1    nan
+     dtype: object
+     >>> ser.astype(str).to_numpy()
+     array(['a', 'nan'], dtype=object)
+
+ Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
+ not the intended behavior, and it was inconsistent with how other dtypes handled
+ missing values.
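
The ``"nan"`` string is what the built-in ``str`` returns for a float NaN. The plain-Python sketch below mimics that element-wise effect and contrasts it with a conversion that leaves missing values alone. It is only an analogy, not how pandas implements ``astype``, and ``to_str_keep_missing`` is a hypothetical helper.

```python
import math

values = ["a", float("nan")]

# Element-wise str(): the missing value becomes the literal string "nan".
stringified = [str(v) for v in values]
print(stringified)  # ['a', 'nan']


def to_str_keep_missing(value):
    """Convert to str, but pass NaN through unchanged (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return value
    return str(value)


preserved = [to_str_keep_missing(v) for v in values]
print(preserved)  # ['a', nan]
```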
+
+ With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias
+ for ``astype("str")``, i.e. casting to the new string dtype, which will preserve
+ the missing values:
+
+ .. code-block:: python
+
+     # NEW behavior in pandas 3
+     >>> pd.options.future.infer_string = True
+     >>> ser = pd.Series(["a", np.nan], dtype=object)
+     >>> ser.astype(str)
+     0      a
+     1    NaN
+     dtype: str
+     >>> ser.astype(str).values
+     array(['a', nan], dtype=object)
+
+ If you want to preserve the old behaviour of converting every object to a
+ string, you can use ``ser.map(str)`` instead.
+
+
+ ``prod()`` raising for string data
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ In pandas < 3, calling the :meth:`~pandas.Series.prod` method on a Series with
+ string data would generally raise an error, except when the Series was empty or
+ contained only a single string (potentially with missing values):
+
+ .. code-block:: python
+
+     >>> ser = pd.Series(["a", None], dtype=object)
+     >>> ser.prod()
+     'a'
+
+ When the Series contains multiple strings, it will raise a ``TypeError``. This
+ behaviour stays the same in pandas 3 when using the flexible ``object`` dtype.
+ With the new string dtype, however, ``prod()`` consistently raises an error
+ regardless of the number of strings:
+
+ .. code-block:: python
+
+     >>> ser = pd.Series(["a", None], dtype="str")
+     >>> ser.prod()
+     ---------------------------------------------------------------------------
+     TypeError                                 Traceback (most recent call last)
+     ...
+     TypeError: Cannot perform reduction 'prod' with string dtype
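
One plausible way to read the ``object``-dtype behaviour: a product over a single string never has to multiply two strings together, whereas a second string forces ``"a" * "b"``, which Python rejects. The plain-Python sketch below illustrates that reading with ``functools.reduce``. It is an analogy under that assumption, not pandas' actual reduction code.

```python
from functools import reduce
from operator import mul

# A single element: reduce returns it without calling mul at all.
single = reduce(mul, ["a"])
print(single)  # a

# Two strings: "a" * "b" is attempted and raises TypeError.
try:
    reduce(mul, ["a", "b"])
    raised = False
except TypeError:
    raised = True

print(raised)  # True
```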
+
For existing users of the nullable ``StringDtype``
--------------------------------------------------