@@ -86,7 +86,7 @@ It can also be specified explicitly using the ``"str"`` alias:
86 86 2 NaN
87 87 dtype: str
88 88
89- Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and otherwise
89+ Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and others
90 90 will now use the new string dtype when reading string data.
91 91
92 92 In contrast to the current object dtype, the new string dtype will only store
@@ -268,6 +268,118 @@ the :meth:`~pandas.Series.astype` method:
268 268 This ``astype("object")`` call will be redundant when using pandas 2.x, but
269 269 this code will work for all versions.
270 270
271+ Invalid unicode input
272+ ~~~~~~~~~~~~~~~~~~~~~
273+
274+ Python allows a built-in ``str`` object to hold invalid Unicode data, such as a
275+ lone surrogate code point. Since the ``object`` dtype can hold any Python object,
276+ you can have a pandas Series with such invalid Unicode data:
277+
278+ .. code-block:: python
279+
280+     >>> ser = pd.Series(["\u2600", "\ud83d"], dtype=object)
281+     >>> ser
282+     0         ☀
283+     1    \ud83d
284+     dtype: object
285+
286+ However, the string dtype backed by ``pyarrow`` can only store valid Unicode
287+ data, and raises an error otherwise:
288+
289+ .. code-block:: python
290+
291+     >>> ser = pd.Series(["\u2600", "\ud83d"])
292+     ---------------------------------------------------------------------------
293+     UnicodeEncodeError                        Traceback (most recent call last)
294+     ...
295+     UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
296+
297+ To keep the previous behaviour, explicitly specify ``dtype=object`` so that
298+ the data continues to be stored with object dtype.
299+
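The underlying restriction can be reproduced without pandas. The following is a
minimal sketch in plain Python, assuming (as the error message above suggests)
that the failure comes from encoding the value as UTF-8. The helper name
``can_encode_utf8`` is hypothetical and only used for illustration.

```python
def can_encode_utf8(value: str) -> bool:
    """Return True if the string can be encoded as UTF-8."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


valid = "\u2600"    # a regular code point (the sun symbol)
invalid = "\ud83d"  # a lone high surrogate, which is not valid on its own

print(can_encode_utf8(valid))    # True
print(can_encode_utf8(invalid))  # False
```

A check like this could be used to find the offending values in an object-dtype
Series before converting it to the string dtype.
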
300+ When you have byte data that you want to convert to strings using ``decode()``,
301+ the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter that
302+ lets you specify object dtype instead of the default string dtype for this use
303+ case.
304+
305+ Notable bug fixes
306+ ~~~~~~~~~~~~~~~~~
307+
308+ ``astype(str)`` preserving missing values
309+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
310+
311+ This is a long-standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.
312+
313+ With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
314+ ``astype("str")``!), the operation would convert every element to a string,
315+ including the missing values:
316+
317+ .. code-block:: python
318+
319+     # OLD behavior in pandas < 3
320+     >>> ser = pd.Series(["a", np.nan], dtype=object)
321+     >>> ser
322+     0      a
323+     1    NaN
324+     dtype: object
325+     >>> ser.astype(str)
326+     0      a
327+     1    nan
328+     dtype: object
329+     >>> ser.astype(str).to_numpy()
330+     array(['a', 'nan'], dtype=object)
331+
332+ Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
333+ not the intended behavior, and it was inconsistent with how other dtypes handled
334+ missing values.
335+
336+ With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias
337+ for ``astype("str")``, i.e. casting to the new string dtype, which will preserve
338+ the missing values:
339+
340+ .. code-block:: python
341+
342+     # NEW behavior in pandas 3
343+     >>> pd.options.future.infer_string = True
344+     >>> ser = pd.Series(["a", np.nan], dtype=object)
345+     >>> ser.astype(str)
346+     0      a
347+     1    NaN
348+     dtype: str
349+     >>> ser.astype(str).values
350+     array(['a', nan], dtype=object)
351+
352+ If you want to preserve the old behaviour of converting every object to a
353+ string, you can use ``ser.map(str)`` instead.
354+
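As a quick check of this workaround, the sketch below applies ``map(str)`` to
the same example Series. It assumes only that pandas and NumPy are installed;
the variable name ``as_text`` is illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", np.nan], dtype=object)

# map(str) calls the built-in str() on every element, including the missing
# value, so NaN is turned into the literal string "nan".
as_text = ser.map(str)

print(as_text.tolist())  # ['a', 'nan']
```

Because the conversion happens element by element, the result matches what
``astype(str)`` produced in pandas < 3.
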
355+
356+ ``prod()`` raising for string data
357+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
358+
359+ In pandas < 3, calling the :meth:`~pandas.Series.prod` method on a Series with
360+ string data would generally raise an error, except when the Series was empty or
361+ contained only a single string (potentially with missing values):
362+
363+ .. code-block:: python
364+
365+     >>> ser = pd.Series(["a", None], dtype=object)
366+     >>> ser.prod()
367+     'a'
368+
369+ When the Series contains multiple strings, it will raise a ``TypeError``. This
370+ behaviour stays the same in pandas 3 when using the flexible ``object`` dtype.
371+ With the new string dtype, however, ``prod()`` consistently raises an error
372+ regardless of the number of strings:
373+
374+ .. code-block:: python
375+
376+     >>> ser = pd.Series(["a", None], dtype="str")
377+     >>> ser.prod()
378+     ---------------------------------------------------------------------------
379+     TypeError                                 Traceback (most recent call last)
380+     ...
381+     TypeError: Cannot perform reduction 'prod' with string dtype
382+
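For comparison, the object-dtype case with several strings described above can
be checked with a small sketch. The two-element Series and the ``raised`` flag
are illustrative assumptions, and the exact error message is deliberately not
shown because it may differ between versions.

```python
import pandas as pd

# Two strings in an object-dtype Series: the product is not defined.
multiple = pd.Series(["a", "b"], dtype=object)

try:
    multiple.prod()
    raised = False
except TypeError:
    raised = True

print(raised)
```

Catching ``TypeError`` in this way works for both dtypes, so code that must
handle string data defensively does not need to special-case the single-string
situation.
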
271 383 For existing users of the nullable ``StringDtype``
272 384 --------------------------------------------------
273 385