DOC: add pandas 3.0 migration guide for the string dtype

jorisvandenbossche · jorisvandenbossche · commit 975dea1fe7cc · 2025-06-25T16:29:48.000+02:00
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
@@ -87,5 +87,6 @@ Guides
     enhancingperf
     scale
     sparse
+    migration-3-strings
     gotchas
     cookbook
diff --git a/doc/source/user_guide/migration-3-strings.rst b/doc/source/user_guide/migration-3-strings.rst
@@ -0,0 +1,272 @@
+{{ header }}
+
+.. _string_migration_guide:
+
+=========================================================
+Migration guide for the new string data type (pandas 3.0)
+=========================================================
+
+The upcoming pandas 3.0 release introduces a new, default string data type. This
+will most likely cause some work when upgrading to pandas 3.0, and this page
+provides an overview of the issues you might run into and gives guidance on how
+to address them.
+
+This new dtype is already available in the pandas 2.3 release, and you can
+enable it with:
+
+.. code-block:: python
+
+    pd.options.future.infer_string = True
+
+This allows you to test your code before the final 3.0 release.
+
+Background
+----------
+
+Historically, pandas has always used the NumPy ``object`` dtype as the default
+to store text data. This has two primary drawbacks. First, ``object`` dtype is
+not specific to strings: any Python object can be stored in an ``object``-dtype
+array, not just strings, and seeing ``object`` as the dtype for a column with
+strings is confusing for users. Second, this is not always very efficient (both
+in terms of performance and memory usage).
+
+Since pandas 1.0, an opt-in string data type has been available. However, it
+has not been made the default, and it uses the ``pd.NA`` scalar to represent
+missing values.
+
+Pandas 3.0 changes the default dtype for strings to a new string data type,
+a variant of the existing optional string data type but using ``NaN`` as the
+missing value indicator, to be consistent with the other default data types.
+
+To improve performance, the new string data type will use the ``pyarrow``
+package by default, if installed (and otherwise it uses object dtype under the
+hood as a fallback).
+
+See `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
+for more background and details.
+
+.. - brief primer on the new dtype
+
+.. - Main characteristics:
+..    - inferred by default (Default inference of a string dtype)
+..    - only strings (setitem with non string fails)
+..    - missing values sentinel is always NaN and uses NaN semantics
+
+.. - Breaking changes:
+..    - dtype is no longer object dtype
+..    - None gets coerced to NaN
+..    - setitem raises an error for non-string data
+
+Brief intro to the new default string dtype
+-------------------------------------------
+
+By default, pandas will infer this new string dtype instead of object dtype
+whenever the dtype is not specified and the input is inferred to be string
+data, for example in constructors or IO functions:
+
+.. code-block:: python
+
+   >>> pd.Series(["a", "b", None])
+   0      a
+   1      b
+   2    NaN
+   dtype: str
+
+It can also be specified explicitly using the ``"str"`` alias:
+
+.. code-block:: python
+
+   >>> pd.Series(["a", "b", None], dtype="str")
+   0      a
+   1      b
+   2    NaN
+   dtype: str
+
+In contrast to the current object dtype, the new string dtype will only store
+strings. This also means that it will raise an error if you try to store a
+non-string value in it (see below for more details).
+
+Missing values with the new string dtype are always represented as ``NaN``, and
+the missing value behaviour is similar to that of other default dtypes.
+
+Otherwise, the new string dtype should work the same way as string data in
+pandas does today. For example, all string-specific methods available through
+the ``str`` accessor will work the same:
+
+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", None], dtype="str")
+   >>> ser.str.upper()
+   0      A
+   1      B
+   2    NaN
+   dtype: str
+
+.. note::
+
+   The new default string dtype is an instance of the :class:`pandas.StringDtype`
+   class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``,
+   but for general usage we recommend using the shorter ``"str"`` alias.
+
+Overview of behaviour differences and how to address them
+---------------------------------------------------------
+
+The dtype is no longer object dtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When inferring string data, the data type of the resulting DataFrame column or
+Series will silently start being the new ``"str"`` dtype instead of ``"object"``
+dtype, and this can have some impact on your code.
+
+Checking the dtype
+^^^^^^^^^^^^^^^^^^
+
+When checking the dtype, code might currently do something like:
+
+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", "c"])
+   >>> ser.dtype == "object"
+
+to check for columns with string data (by checking for the dtype being
+``"object"``). This will no longer work in pandas 3+, since ``ser.dtype`` will
+now be ``"str"`` with the new default string dtype, and the above check will
+return ``False``.
+
+To check for columns with string data, you should instead use:
+
+.. code-block:: python
+
+   >>> ser.dtype == "str"
+
+**How to write compatible code?**
+
+For code that should work on both pandas 2.x and 3.x, you can use the
+:func:`pandas.api.types.is_string_dtype` function:
+
+.. code-block:: python
+
+   >>> pd.api.types.is_string_dtype(ser.dtype)
+   True
+
+This will return ``True`` for both object dtype and the string dtypes.
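+
+As a rough sketch of how this check might be used in practice, the snippet
+below collects the string columns of a DataFrame. The frame ``df`` and its
+column names are hypothetical and only serve as illustration. Keep in mind that
+when only the dtype is passed, any object-dtype column passes this check on
+pandas 2.x, including columns that hold non-string objects:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   # Hypothetical example data: one text column and one integer column.
+   df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
+
+   # Collect the columns whose dtype is reported as a string dtype. This works
+   # whether "name" is stored as object dtype (pandas 2.x) or "str" (pandas 3.x).
+   string_cols = [
+       col for col in df.columns if pd.api.types.is_string_dtype(df[col].dtype)
+   ]
+   print(string_cols)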
+
+Hardcoded use of object dtype
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you have code where the dtype is hardcoded in constructors, like
+
+.. code-block:: python
+
+   >>> pd.Series(["a", "b", "c"], dtype="object")
+
+this will keep using the object dtype. You will want to update this code to
+ensure you get the benefits of the new string dtype.
+
+**How to write compatible code?**
+
+First, in many cases it can be sufficient to remove the specific data type, and
+let pandas do the inference. But if you want to be specific, you can specify the
+``"str"`` dtype:
+
+.. code-block:: python
+
+   >>> pd.Series(["a", "b", "c"], dtype="str")
+
+This is actually compatible with pandas 2.x as well, since in pandas < 3,
+``dtype="str"`` was essentially treated as an alias for object dtype.
+
+The missing value sentinel is now always NaN
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using object dtype, multiple possible missing value sentinels are
+supported, including ``None`` and ``np.nan``. With the new default string dtype,
+the missing value sentinel is always NaN (``np.nan``):
+
+.. code-block:: python
+
+   # with object dtype, None is preserved as None and seen as missing
+   >>> ser = pd.Series(["a", "b", None], dtype="object")
+   >>> ser
+   0       a
+   1       b
+   2    None
+   dtype: object
+   >>> print(ser[2])
+   None
+
+   # with the new string dtype, any missing value like None is coerced to NaN
+   >>> ser = pd.Series(["a", "b", None], dtype="str")
+   >>> ser
+   0      a
+   1      b
+   2    NaN
+   dtype: str
+   >>> print(ser[2])
+   nan
+
+Generally this should be no problem when relying on missing value behaviour in
+pandas methods (for example, ``ser.isna()`` will give the same result as before).
+But if you relied on the exact value ``None`` being present, this change can
+affect your code.
+
+**How to write compatible code?**
+
+When checking for a missing value, instead of checking for the exact value of
+``None`` or ``np.nan``, you should use the :func:`pandas.isna` function. This is
+the most robust way to check for missing values, as it will work regardless of
+the dtype and the exact missing value sentinel:
+
+.. code-block:: python
+
+   >>> pd.isna(ser[2])
+   True
+
+One caveat: this function works both on scalars and on array-likes, and in the
+latter case it will return an array of boolean dtype. When using it in a boolean
+context (for example, ``if pd.isna(..): ..``) be sure to only pass a scalar to
+it.
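+
+The difference between the two cases can be sketched as follows. The variable
+names and the ``"<missing>"`` placeholder are made up for this illustration:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   ser = pd.Series(["a", None, "c"])
+
+   # Passing the whole Series returns a boolean Series, which cannot be used
+   # directly in an ``if`` statement; reduce it to a single value first.
+   mask = pd.isna(ser)
+   has_missing = bool(mask.any())
+
+   # Passing one scalar at a time returns a plain boolean, which is safe to
+   # use in a condition.
+   labels = ["<missing>" if pd.isna(value) else value for value in ser]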
+
+"setitem" operations will now raise an error for non-string data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+With the new string dtype, any attempt to set a non-string value in a Series or
+DataFrame will raise an error:
+
+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", None], dtype="str")
+   >>> ser[1] = 2.5
+   ---------------------------------------------------------------------------
+   TypeError                                 Traceback (most recent call last)
+   ...
+   TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead.
+
+If you relied on the flexible nature of object dtype being able to hold any
+Python object, but your initial data was inferred as strings, your code might be
+impacted by this change.
+
+**How to write compatible code?**
+
+You can update your code to ensure you only set string values in such columns,
+or otherwise explicitly ensure the column has object dtype first. This
+can be done by specifying the dtype explicitly in the constructor, or by using
+the :meth:`~pandas.Series.astype` method:
+
+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", None], dtype="str")
+   >>> ser = ser.astype("object")
+   >>> ser[1] = 2.5
+
+This ``astype("object")`` call will be redundant when using pandas 2.x, but
+this way such code can work for all versions.
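+
+The constructor-based alternative mentioned above might look like the following
+minimal sketch, which should behave the same on pandas 2.x and 3.x:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   # Request object dtype up front so the Series can hold arbitrary objects.
+   ser = pd.Series(["a", "b", None], dtype="object")
+
+   # No error is raised: object dtype accepts non-string values.
+   ser[1] = 2.5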
+
+For existing users of the nullable ``StringDtype``
+--------------------------------------------------
+
+TODO