Commit 975dea1

DOC: add pandas 3.0 migration guide for the string dtype
1 parent 09f7cc0 commit 975dea1

2 files changed

+273
-0
lines changed

doc/source/user_guide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -87,5 +87,6 @@ Guides
     enhancingperf
     scale
     sparse
+    migration-3-strings
     gotchas
     cookbook
Lines changed: 272 additions & 0 deletions
@@ -0,0 +1,272 @@
{{ header }}

.. _string_migration_guide:

=========================================================
Migration guide for the new string data type (pandas 3.0)
=========================================================

The upcoming pandas 3.0 release introduces a new, default string data type.
Upgrading to pandas 3.0 will therefore most likely require some changes to
existing code. This page gives an overview of the issues you might run into
and guidance on how to address them.

This new dtype is already available in the pandas 2.3 release, and you can
enable it with:

.. code-block:: python

   pd.options.future.infer_string = True

This allows you to test your code before the final 3.0 release.

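If you prefer not to change the option globally, it can also be enabled for a
limited scope. The following is a minimal sketch, assuming pandas 2.3 or later
(on older releases the option may be missing or may require ``pyarrow``). It
uses ``pd.option_context`` to compare the dtype inferred with and without the
option. The variable names are only illustrative.

.. code-block:: python

   import pandas as pd

   values = ["a", "b", None]

   # Dtype inferred with the settings currently active in this session.
   current_dtype = pd.Series(values).dtype

   # Opt in to the future string inference only inside this block; the
   # previous setting is restored when the block exits.
   with pd.option_context("future.infer_string", True):
       future_dtype = pd.Series(values).dtype

   print(current_dtype, future_dtype)

Scoping the option this way can be handy in a test suite, where only selected
tests should run against the new behaviour.
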
Background
----------

Historically, pandas has always used the NumPy ``object`` dtype as the default
to store text data. This has two primary drawbacks. First, ``object`` dtype is
not specific to strings: any Python object can be stored in an ``object``-dtype
array, not just strings, and seeing ``object`` as the dtype for a column with
strings is confusing for users. Second, it is not always very efficient, in
terms of both performance and memory usage.

Since pandas 1.0, an opt-in string data type has been available. It has not
yet been made the default, and it uses the ``pd.NA`` scalar to represent
missing values.

Pandas 3.0 changes the default dtype for strings to a new string data type,
a variant of the existing optional string data type but using ``NaN`` as the
missing value indicator, to be consistent with the other default data types.

To improve performance, the new string data type uses the ``pyarrow`` package
by default if it is installed, and otherwise falls back to using object dtype
under the hood.

See `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
for more background and details.

.. - brief primer on the new dtype

.. - Main characteristics:
..    - inferred by default (Default inference of a string dtype)
..    - only strings (setitem with non string fails)
..    - missing values sentinel is always NaN and uses NaN semantics

.. - Breaking changes:
..    - dtype is no longer object dtype
..    - None gets coerced to NaN
..    - setitem raises an error for non-string data

Brief intro to the new default string dtype
-------------------------------------------

By default, pandas will infer this new string dtype instead of object dtype for
string data. That is, the string dtype is used whenever a constructor or IO
function infers the dtype and the input is inferred to be string data:

.. code-block:: python

   >>> pd.Series(["a", "b", None])
   0      a
   1      b
   2    NaN
   dtype: str

It can also be specified explicitly using the ``"str"`` alias:

.. code-block:: python

   >>> pd.Series(["a", "b", None], dtype="str")
   0      a
   1      b
   2    NaN
   dtype: str

In contrast to the current object dtype, the new string dtype will only store
strings. This also means that it will raise an error if you try to store a
non-string value in it (see below for more details).

Missing values with the new string dtype are always represented as ``NaN``, and
the missing value behaviour is similar to that of the other default dtypes.

Otherwise, this new string dtype should work the same way as string data in
pandas does today. For example, all string-specific methods available through
the ``str`` accessor work the same:

.. code-block:: python

   >>> ser = pd.Series(["a", "b", None], dtype="str")
   >>> ser.str.upper()
   0      A
   1      B
   2    NaN
   dtype: str

.. note::

   The new default string dtype is an instance of the :class:`pandas.StringDtype`
   class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``,
   but for general usage we recommend using the shorter ``"str"`` alias.

Overview of behaviour differences and how to address them
---------------------------------------------------------

The dtype is no longer object dtype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When inferring string data, the data type of the resulting DataFrame column or
Series will silently start being the new ``"str"`` dtype instead of ``"object"``
dtype, and this can have some impact on your code.

Checking the dtype
^^^^^^^^^^^^^^^^^^

Code that checks for columns with string data might currently test whether the
dtype is ``"object"``:

.. code-block:: python

   >>> ser = pd.Series(["a", "b", "c"])
   >>> ser.dtype == "object"

This will no longer work in pandas 3+, since ``ser.dtype`` will now be ``"str"``
with the new default string dtype, and the above check will return ``False``.

To check for columns with string data, you should instead use:

.. code-block:: python

   >>> ser.dtype == "str"

**How to write compatible code?**

For code that should work on both pandas 2.x and 3.x, you can use the
:func:`pandas.api.types.is_string_dtype` function:

.. code-block:: python

   >>> pd.api.types.is_string_dtype(ser.dtype)
   True

This will return ``True`` for both the object dtype and the string dtypes.

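As a minimal sketch of how this can be applied to a whole DataFrame (the
``string_columns`` helper and the example data are hypothetical, not part of
pandas), the check can be used to select the columns holding text:

.. code-block:: python

   import pandas as pd
   from pandas.api.types import is_string_dtype


   def string_columns(df):
       """Return the labels of columns whose dtype is object or a string dtype."""
       return [label for label in df.columns if is_string_dtype(df[label].dtype)]


   df = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.5]})
   text_columns = string_columns(df)  # ["name"] on both pandas 2.x and 3.x

Note that an object-dtype column is matched by this dtype-based check even if
it holds non-string objects, since the dtype alone does not say what the
elements are.
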
Hardcoded use of object dtype
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have code where the dtype is hardcoded in constructors, like

.. code-block:: python

   >>> pd.Series(["a", "b", "c"], dtype="object")

this will keep using the object dtype. You will want to update this code to
ensure you get the benefits of the new string dtype.

**How to write compatible code?**

In many cases it is sufficient to remove the hardcoded data type and let
pandas do the inference. If you want to be explicit, you can specify the
``"str"`` dtype:

.. code-block:: python

   >>> pd.Series(["a", "b", "c"], dtype="str")

This is actually compatible with pandas 2.x as well, since in pandas < 3,
``dtype="str"`` was essentially treated as an alias for object dtype.

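A short sketch of what this looks like in practice, assuming only that pandas
is installed: the same constructor call yields object dtype on pandas 2.x
(with default options) and the new string dtype on pandas 3.x, and string
operations behave the same either way.

.. code-block:: python

   import pandas as pd
   from pandas.api.types import is_string_dtype

   ser = pd.Series(["a", "b", "c"], dtype="str")

   # Holds on both major versions, whichever dtype was actually used.
   assert is_string_dtype(ser.dtype)

   # String methods work identically for object and string dtype.
   upper = ser.str.upper().tolist()  # ["A", "B", "C"]
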
The missing value sentinel is now always NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using object dtype, multiple possible missing value sentinels are
supported, including ``None`` and ``np.nan``. With the new default string dtype,
the missing value sentinel is always NaN (``np.nan``):

.. code-block:: python

   # with object dtype, None is preserved as None and seen as missing
   >>> ser = pd.Series(["a", "b", None], dtype="object")
   >>> ser
   0       a
   1       b
   2    None
   dtype: object
   >>> print(ser[2])
   None

   # with the new string dtype, any missing value like None is coerced to NaN
   >>> ser = pd.Series(["a", "b", None], dtype="str")
   >>> ser
   0      a
   1      b
   2    NaN
   dtype: str
   >>> print(ser[2])
   nan

Generally this should be no problem when relying on missing value behaviour in
pandas methods (for example, ``ser.isna()`` will give the same result as before).
But if you relied on the exact value ``None`` being present, this change can
impact your code.

**How to write compatible code?**

When checking for a missing value, instead of checking for the exact value of
``None`` or ``np.nan``, you should use the :func:`pandas.isna` function. This is
the most robust way to check for missing values, as it will work regardless of
the dtype and the exact missing value sentinel:

.. code-block:: python

   >>> pd.isna(ser[2])
   True

One caveat: this function works both on scalars and on array-likes, and in the
latter case it will return an array of boolean dtype. When using it in a boolean
context (for example, ``if pd.isna(..): ..``) be sure to only pass a scalar to
it.

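The caveat can be illustrated with a small sketch. The ``label`` function below
is a hypothetical helper, not a pandas API: it passes one scalar at a time to
``pd.isna``, and the array-like case is reduced explicitly with ``.any()``.

.. code-block:: python

   import numpy as np
   import pandas as pd


   def label(value):
       """Upper-case a string, or return a placeholder for a missing value."""
       # ``value`` is a scalar here, so ``pd.isna`` returns a single bool.
       if pd.isna(value):
           return "<missing>"
       return value.upper()


   # None, NaN and pd.NA are all recognised as missing.
   labels = [label(value) for value in ["a", None, np.nan, pd.NA]]

   # For array-likes the result is element-wise; reduce it before using
   # it in an ``if`` statement.
   ser = pd.Series(["a", "b", None])
   has_missing = bool(pd.isna(ser).any())
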
"setitem" operations will now raise an error for non-string data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the new string dtype, any attempt to set a non-string value in a Series or
DataFrame will raise an error:

.. code-block:: python

   >>> ser = pd.Series(["a", "b", None], dtype="str")
   >>> ser[1] = 2.5
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   ...
   TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead.

If you relied on the flexible nature of object dtype being able to hold any
Python object, but your initial data was inferred as strings, your code might be
impacted by this change.

**How to write compatible code?**

You can update your code to ensure you only set string values in such columns,
or otherwise you have to explicitly ensure the column has object dtype first.
This can be done by specifying the dtype explicitly in the constructor, or by
using the :meth:`~pandas.Series.astype` method:

.. code-block:: python

   >>> ser = pd.Series(["a", "b", None], dtype="str")
   >>> ser = ser.astype("object")
   >>> ser[1] = 2.5

This ``astype("object")`` call will be redundant when using pandas 2.x, but
this way such code can work for all versions.

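The first option, setting only string values, can be sketched as follows. This
assumes that converting the value with the built-in ``str()`` gives an
acceptable text representation for your use case.

.. code-block:: python

   import pandas as pd

   ser = pd.Series(["a", "b", None], dtype="str")

   # Convert the value to a string before assigning it, so the assignment
   # is valid for both object dtype and the new string dtype.
   ser[1] = str(2.5)

   second = ser[1]  # "2.5"
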
For existing users of the nullable ``StringDtype``
--------------------------------------------------

TODO
