|
| 1 | +{{ header }} |
| 2 | + |
| 3 | +.. _string_migration_guide: |
| 4 | + |
| 5 | +========================================================= |
| 6 | +Migration guide for the new string data type (pandas 3.0) |
| 7 | +========================================================= |
| 8 | + |
| 9 | +The upcoming pandas 3.0 release introduces a new, default string data type. This |
| 10 | +will most likely cause some work when upgrading to pandas 3.0, and this page |
| 11 | +provides an overview of the issues you might run into and gives guidance on how |
| 12 | +to address them. |
| 13 | + |
| 14 | +This new dtype is already available in the pandas 2.3 release, and you can |
| 15 | +enable it with: |
| 16 | + |
| 17 | +.. code-block:: python |
| 18 | +
|
| 19 | + pd.options.future.infer_string = True |
| 20 | +
|
| 21 | +This allows you to test your code before the final 3.0 release. |
| 22 | + |
| 23 | +Background |
| 24 | +---------- |
| 25 | + |
| 26 | +Historically, pandas has always used the NumPy ``object`` dtype as the default |
| 27 | +to store text data. This has two primary drawbacks. First, ``object`` dtype is |
| 28 | +not specific to strings: any Python object can be stored in an ``object``-dtype |
| 29 | +array, not just strings, and seeing ``object`` as the dtype for a column with |
| 30 | +strings is confusing for users. Second, this is not always very efficient (both |
| 31 | +in terms of performance and memory usage). |
| 32 | + |
| 33 | +Since pandas 1.0, an opt-in string data type has been available, but this has |
| 34 | +not yet been made the default, and uses the ``pd.NA`` scalar to represent |
| 35 | +missing values. |
| 36 | + |
| 37 | +Pandas 3.0 changes the default dtype for strings to a new string data type, |
| 38 | +a variant of the existing optional string data type but using ``NaN`` as the |
| 39 | +missing value indicator, to be consistent with the other default data types. |
| 40 | + |
| 41 | +To improve performance, the new string data type will use the ``pyarrow`` |
| 42 | +package by default, if installed (and otherwise it uses object dtype under the |
| 43 | +hood as a fallback). |
| 44 | + |
| 45 | +See `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__ |
| 46 | +for more background and details. |
| 47 | + |
| 48 | +.. - brief primer on the new dtype |
| 49 | +
|
| 50 | +.. - Main characteristics: |
| 51 | +.. - inferred by default (Default inference of a string dtype) |
| 52 | +.. - only strings (setitem with non string fails) |
| 53 | +.. - missing values sentinel is always NaN and uses NaN semantics |
| 54 | +
|
| 55 | +.. - Breaking changes: |
| 56 | +.. - dtype is no longer object dtype |
| 57 | +.. - None gets coerced to NaN |
| 58 | +.. - setitem raises an error for non-string data |
| 59 | +
|
| 60 | +Brief intro to the new default string dtype |
| 61 | +------------------------------------------- |
| 62 | + |
| 63 | +By default, pandas will infer this new string dtype instead of object dtype for |
| 64 | +string data (when creating pandas objects, such as in constructors or IO |
| 65 | +functions). |
| 66 | + |
| 67 | +Being a default dtype means that the string dtype will be used in IO methods or |
| 68 | +constructors when the dtype is being inferred and the input is inferred to be |
| 69 | +string data: |
| 70 | + |
| 71 | +.. code-block:: python |
| 72 | +
|
| 73 | + >>> pd.Series(["a", "b", None]) |
| 74 | + 0 a |
| 75 | + 1 b |
| 76 | + 2 NaN |
| 77 | + dtype: str |
| 78 | +
|
| 79 | +It can also be specified explicitly using the ``"str"`` alias: |
| 80 | + |
| 81 | +.. code-block:: python |
| 82 | +
|
| 83 | + >>> pd.Series(["a", "b", None], dtype="str") |
| 84 | + 0 a |
| 85 | + 1 b |
| 86 | + 2 NaN |
| 87 | + dtype: str |
| 88 | +
|
| 89 | +In contrast to the current object dtype, the new string dtype will only store |
| 90 | +strings. This also means that it will raise an error if you try to store a |
| 91 | +non-string value in it (see below for more details). |
| 92 | + |
| 93 | +Missing values with the new string dtype are always represented as ``NaN``, and |
| 94 | +the missing value behaviour is similar to that of other default dtypes. |
| 95 | + |
| 96 | +For the rest, this new string dtype should work the same as how you have been |
| 97 | +using pandas with string data today. For example, all string-specific methods |
| 98 | +through the ``str`` accessor will work the same: |
| 99 | + |
| 100 | +.. code-block:: python |
| 101 | +
|
| 102 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 103 | + >>> ser.str.upper() |
| 104 | + 0 A |
| 105 | + 1 B |
| 106 | + 2 NaN |
| 107 | + dtype: str |
| 108 | +
|
| 109 | +.. note:: |
| 110 | + |
| 111 | + The new default string dtype is an instance of the :class:`pandas.StringDtype` |
| 112 | + class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``, |
| 113 | + but for general usage we recommend using the shorter ``"str"`` alias. |
| 114 | + |
| 115 | +Overview of behaviour differences and how to address them |
| 116 | +--------------------------------------------------------- |
| 117 | + |
| 118 | +The dtype is no longer object dtype |
| 119 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 120 | + |
| 121 | +When inferring string data, the data type of the resulting DataFrame column or |
| 122 | +Series will silently start being the new ``"str"`` dtype instead of ``"object"`` |
| 123 | +dtype, and this can have some impact on your code. |
| 124 | + |
| 125 | +Checking the dtype |
| 126 | +^^^^^^^^^^^^^^^^^^ |
| 127 | + |
| 128 | +When checking the dtype, code might currently do something like: |
| 129 | + |
| 130 | +.. code-block:: python |
| 131 | +
|
| 132 | + >>> ser = pd.Series(["a", "b", "c"]) |
| 133 | + >>> ser.dtype == "object" |
| 134 | +
|
| 135 | +to check for columns with string data (by checking for the dtype being |
| 136 | +``"object"``). This will no longer work in pandas 3+, since ``ser.dtype`` will |
| 137 | +now be ``"str"`` with the new default string dtype, and the above check will |
| 138 | +return ``False``. |
| 139 | + |
| 140 | +To check for columns with string data, you should instead use: |
| 141 | + |
| 142 | +.. code-block:: python |
| 143 | +
|
| 144 | + >>> ser.dtype == "str" |
| 145 | +
|
| 146 | +**How to write compatible code?** |
| 147 | + |
| 148 | +For code that should work on both pandas 2.x and 3.x, you can use the |
| 149 | +:func:`pandas.api.types.is_string_dtype` function: |
| 150 | + |
| 151 | +.. code-block:: python |
| 152 | +
|
| 153 | + >>> pd.api.types.is_string_dtype(ser.dtype) |
| 154 | + True |
| 155 | +
|
| 156 | +This will return ``True`` for both the object dtype and the string dtypes. |
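As a rough illustration of this version-independent check, the sketch below wraps ``is_string_dtype`` in a small helper. The ``looks_like_strings`` name is hypothetical and not part of pandas; it only passes the Series dtype to the pandas function.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def looks_like_strings(ser):
    """Hypothetical helper: True if the Series has object or a string dtype."""
    return is_string_dtype(ser.dtype)


text = pd.Series(["a", "b", "c"])  # object dtype on 2.x, str dtype on 3.x
numbers = pd.Series([1, 2, 3])     # integer dtype on both versions

text_is_string = looks_like_strings(text)
numbers_is_string = looks_like_strings(numbers)

# The plain Python ``object`` type is also accepted and counts as string-like.
object_is_string = is_string_dtype(object)
```

Because the helper only inspects the dtype, the same call should behave the same way before and after the upgrade.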
| 157 | + |
| 158 | +Hardcoded use of object dtype |
| 159 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 160 | + |
| 161 | +If you have code where the dtype is hardcoded in constructors, like |
| 162 | + |
| 163 | +.. code-block:: python |
| 164 | +
|
| 165 | + >>> pd.Series(["a", "b", "c"], dtype="object") |
| 166 | +
|
| 167 | +this will keep using the object dtype. You will want to update this code to |
| 168 | +ensure you get the benefits of the new string dtype. |
| 169 | + |
| 170 | +**How to write compatible code?** |
| 171 | + |
| 172 | +First, in many cases it can be sufficient to remove the specific data type, and |
| 173 | +let pandas do the inference. But if you want to be specific, you can specify the |
| 174 | +``"str"`` dtype: |
| 175 | + |
| 176 | +.. code-block:: python |
| 177 | +
|
| 178 | + >>> pd.Series(["a", "b", "c"], dtype="str") |
| 179 | +
|
| 180 | +This is actually compatible with pandas 2.x as well, since in pandas < 3, |
| 181 | +``dtype="str"`` was essentially treated as an alias for object dtype. |
| 182 | + |
| 183 | +The missing value sentinel is now always NaN |
| 184 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 185 | + |
| 186 | +When using object dtype, multiple possible missing value sentinels are |
| 187 | +supported, including ``None`` and ``np.nan``. With the new default string dtype, |
| 188 | +the missing value sentinel is always NaN (``np.nan``): |
| 189 | + |
| 190 | +.. code-block:: python |
| 191 | +
|
| 192 | + # with object dtype, None is preserved as None and seen as missing |
| 193 | + >>> ser = pd.Series(["a", "b", None], dtype="object") |
| 194 | + >>> ser |
| 195 | + 0 a |
| 196 | + 1 b |
| 197 | + 2 None |
| 198 | + dtype: object |
| 199 | + >>> print(ser[2]) |
| 200 | + None |
| 201 | +
|
| 202 | + # with the new string dtype, any missing value like None is coerced to NaN |
| 203 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 204 | + >>> ser |
| 205 | + 0 a |
| 206 | + 1 b |
| 207 | + 2 NaN |
| 208 | + dtype: str |
| 209 | + >>> print(ser[2]) |
| 210 | + nan |
| 211 | +
|
| 212 | +Generally this should be no problem when relying on missing value behaviour in |
| 213 | +pandas methods (for example, ``ser.isna()`` will give the same result as before). |
| 214 | +But if you relied on the exact value of ``None`` being present, this change can |
| 215 | +impact your code. |
| 216 | + |
| 217 | +**How to write compatible code?** |
| 218 | + |
| 219 | +When checking for a missing value, instead of checking for the exact value of |
| 220 | +``None`` or ``np.nan``, you should use the :func:`pandas.isna` function. This is |
| 221 | +the most robust way to check for missing values, as it will work regardless of |
| 222 | +the dtype and the exact missing value sentinel: |
| 223 | + |
| 224 | +.. code-block:: python |
| 225 | +
|
| 226 | + >>> pd.isna(ser[2]) |
| 227 | + True |
| 228 | +
|
| 229 | +One caveat: this function works both on scalars and on array-likes, and in the |
| 230 | +latter case it will return an array of boolean dtype. When using it in a boolean |
| 231 | +context (for example, ``if pd.isna(..): ..``) be sure to only pass a scalar to |
| 232 | +it. |
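The caveat can be sketched as follows. This is a minimal example, assuming a small Series with one missing value; the variable names are illustrative only.

```python
import pandas as pd

ser = pd.Series(["a", "b", None])

# Scalar input: a single boolean, safe to use in an ``if`` statement.
last_is_missing = bool(pd.isna(ser.iloc[2]))

# Array-like input: an element-wise boolean result, one entry per value.
mask = pd.isna(ser)

# Using the element-wise result directly in a boolean context is ambiguous
# and raises a ValueError; reduce it explicitly with .any() or .all() instead.
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

any_missing = bool(mask.any())
```

Reducing the mask with ``.any()`` or ``.all()`` makes the intent explicit when a whole Series or array has to be checked.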
| 233 | + |
| 234 | +"setitem" operations will now raise an error for non-string data |
| 235 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 236 | + |
| 237 | +With the new string dtype, any attempt to set a non-string value in a Series or |
| 238 | +DataFrame will raise an error: |
| 239 | + |
| 240 | +.. code-block:: python |
| 241 | +
|
| 242 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 243 | + >>> ser[1] = 2.5 |
| 244 | + --------------------------------------------------------------------------- |
| 245 | + TypeError Traceback (most recent call last) |
| 246 | + ... |
| 247 | + TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead. |
| 248 | +
|
| 249 | +If you relied on the flexible nature of object dtype being able to hold any |
| 250 | +Python object, but your initial data was inferred as strings, your code might be |
| 251 | +impacted by this change. |
| 252 | +
|
| 253 | +**How to write compatible code?** |
| 254 | +
|
| 255 | +You can update your code to ensure you only set string values in such columns, |
| 256 | +or otherwise you have to explicitly ensure the column has object dtype first. This |
| 257 | +can be done by specifying the dtype explicitly in the constructor, or by using |
| 258 | +the :meth:`~pandas.Series.astype` method: |
| 259 | + |
| 260 | +.. code-block:: python |
| 261 | +
|
| 262 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 263 | + >>> ser = ser.astype("object") |
| 264 | + >>> ser[1] = 2.5 |
| 265 | +
|
| 266 | +This ``astype("object")`` call will be redundant when using pandas 2.x, but |
| 267 | +this way such code can work for all versions. |
| 268 | + |
| 269 | +For existing users of the nullable ``StringDtype`` |
| 270 | +-------------------------------------------------- |
| 271 | + |
| 272 | +TODO |