Commit 9760fee

add sections about invalid unicode, astype(str) and prod()
1 parent e4a764d commit 9760fee

1 file changed

+113
-1
lines changed

doc/source/user_guide/migration-3-strings.rst

Lines changed: 113 additions & 1 deletion
@@ -86,7 +86,7 @@ It can also be specified explicitly using the ``"str"`` alias:
8686
2 NaN
8787
dtype: str
8888
89-
Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and otherwise
89+
Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and others
9090
will now use the new string dtype when reading string data.
9191

9292
In contrast to the current object dtype, the new string dtype will only store
@@ -268,6 +268,118 @@ the :meth:`~pandas.Series.astype` method:
268268
This ``astype("object")`` call will be redundant when using pandas 2.x, but
269269
this code will work for all versions.
270270

271+
Invalid unicode input
272+
~~~~~~~~~~~~~~~~~~~~~
273+
274+
Python allows a built-in ``str`` object to represent invalid unicode
275+
data. And since the ``object`` dtype can hold any Python object, you can have a
276+
pandas Series with such invalid unicode data:
277+
278+
.. code-block:: python
279+
280+
>>> ser = pd.Series(["\u2600", "\ud83d"], dtype=object)
281+
>>> ser
282+
0 ☀
283+
1 \ud83d
284+
dtype: object
285+
286+
However, the string dtype, when backed by ``pyarrow``, can only store valid
287+
unicode data, and raises an error otherwise:
288+
289+
.. code-block:: python
290+
291+
>>> ser = pd.Series(["\u2600", "\ud83d"])
292+
---------------------------------------------------------------------------
293+
UnicodeEncodeError Traceback (most recent call last)
294+
...
295+
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
296+
297+
If you want to keep the previous behaviour, you can explicitly specify
298+
``dtype=object`` to keep working with object dtype.
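
To find out in advance which values would trigger this error, one option is to try encoding each string as UTF-8, which is the step that fails in the traceback above. The following is only a minimal sketch in plain Python: ``is_valid_unicode`` is a hypothetical helper, not a pandas function.

```python
def is_valid_unicode(value):
    """Return True if ``value`` can be encoded as UTF-8 (hypothetical helper)."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


values = ["\u2600", "\ud83d"]  # the second entry is a lone surrogate

# Positions of the entries that fail UTF-8 encoding; these are assumed to be
# the ones the pyarrow-backed string dtype cannot store
invalid_positions = [i for i, v in enumerate(values) if not is_valid_unicode(v)]
print(invalid_positions)  # [1]
```

Values flagged this way can then be cleaned up, or the whole Series can be kept as ``dtype=object``.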
299+
300+
When you have byte data that you want to convert to strings using ``decode()``,
301+
the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter that
302+
lets you request object dtype instead of the default string dtype for this use
303+
case.
304+
305+
Notable bug fixes
306+
~~~~~~~~~~~~~~~~~
307+
308+
``astype(str)`` preserving missing values
309+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
310+
311+
This is a long-standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.
312+
313+
With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
314+
``astype("str")``!), the operation would convert every element to a string,
315+
including the missing values:
316+
317+
.. code-block:: python
318+
319+
# OLD behavior in pandas < 3
320+
>>> ser = pd.Series(["a", np.nan], dtype=object)
321+
>>> ser
322+
0 a
323+
1 NaN
324+
dtype: object
325+
>>> ser.astype(str)
326+
0 a
327+
1 nan
328+
dtype: object
329+
>>> ser.astype(str).to_numpy()
330+
array(['a', 'nan'], dtype=object)
331+
332+
Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
333+
not the intended behavior, and it was inconsistent with how other dtypes handled
334+
missing values.
335+
336+
With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias
337+
for ``astype("str")``, i.e. casting to the new string dtype, which will preserve
338+
the missing values:
339+
340+
.. code-block:: python
341+
342+
# NEW behavior in pandas 3
343+
>>> pd.options.future.infer_string = True
344+
>>> ser = pd.Series(["a", np.nan], dtype=object)
345+
>>> ser.astype(str)
346+
0 a
347+
1 NaN
348+
dtype: str
349+
>>> ser.astype(str).to_numpy()
350+
array(['a', nan], dtype=object)
351+
352+
If you want to preserve the old behaviour of converting every object to a
353+
string, you can use ``ser.map(str)`` instead.
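
As a rough illustration of the ``ser.map(str)`` suggestion, the sketch below should behave the same way on pandas 2.x and pandas 3 as far as the resulting values are concerned. The resulting dtype may differ between versions, so the example only inspects the values.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", np.nan], dtype=object)

# map(str) calls the built-in str() on every element, including the
# missing value, so NaN becomes the literal string "nan"
as_text = ser.map(str)

print(as_text.tolist())         # ['a', 'nan']
print(as_text.isna().tolist())  # [False, False]
```

Because the missing value is turned into an ordinary string, it is no longer reported by ``isna()`` afterwards.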
354+
355+
356+
``prod()`` raising for string data
357+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
358+
359+
In pandas < 3, calling the :meth:`~pandas.Series.prod` method on a Series with
360+
string data would generally raise an error, except when the Series was empty or
361+
contained only a single string (potentially with missing values):
362+
363+
.. code-block:: python
364+
365+
>>> ser = pd.Series(["a", None], dtype=object)
366+
>>> ser.prod()
367+
'a'
368+
369+
When the Series contains multiple strings, it will raise a ``TypeError``. This
370+
behaviour stays the same in pandas 3 when using the flexible ``object`` dtype.
371+
With the new string dtype, however, ``prod()`` consistently raises an error
372+
regardless of the number of strings:
373+
374+
.. code-block:: python
375+
376+
>>> ser = pd.Series(["a", None], dtype="str")
377+
>>> ser.prod()
378+
---------------------------------------------------------------------------
379+
TypeError Traceback (most recent call last)
380+
...
381+
TypeError: Cannot perform reduction 'prod' with string dtype
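
For comparison, the multi-string case with ``object`` dtype described earlier can be sketched as follows. The exact error message is not shown because it may vary between versions; only the exception type is relied upon here.

```python
import pandas as pd

# Two strings in an object-dtype Series: a product of strings is undefined
ser = pd.Series(["a", "b"], dtype=object)

try:
    ser.prod()
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")
else:
    raised = False
```

Code that has to run on both pandas 2.x and pandas 3 can therefore treat ``TypeError`` as the expected outcome of ``prod()`` on string data, rather than depending on the single-string special case.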
382+
271383
For existing users of the nullable ``StringDtype``
272384
--------------------------------------------------
273385
