# PDEP-XX: Dedicated string data type for pandas 3.0

- Created: May 3, 2024
- Status: Under discussion
- Discussion:
- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1

## Abstract

This PDEP proposes to introduce a dedicated string dtype that will be used by
default in pandas 3.0:

* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
  or otherwise the numpy object-dtype alternative.
* The default string dtype will use NaN-based missing value semantics,
  consistent with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
and 2) leaving room for future improvements (different missing value semantics,
using NumPy 2.0, etc.).

## Background

Currently, pandas by default stores text data in an `object`-dtype NumPy array.
The current implementation has two primary drawbacks. First, `object` dtype is
not specific to strings: any Python object can be stored in an `object`-dtype
array, not just strings, and seeing `object` as the dtype for a column with
strings is confusing for users. Second, it is not efficient: all string
methods on a Series are eventually done by calling Python methods on the
individual string objects.

To solve the first issue, a dedicated extension dtype for string data has
already been
[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type).
This dtype has so far remained opt-in, requiring users to explicitly request the
dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing
this string dtype was initially almost the same as the default implementation,
i.e. an `object`-dtype NumPy array of Python strings.

To solve the second issue (performance), pandas contributed to the development
of string kernels in the PyArrow package, and a variant of the string dtype
backed by PyArrow was
[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type).
This could be specified with the `storage` keyword in the opt-in string dtype
(`pd.StringDtype(storage="pyarrow")`).

Since its introduction, the `StringDtype` has always been opt-in, and has used
the experimental `pd.NA` sentinel for missing values (which was also [introduced
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
However, up to this date, pandas has not yet taken the step to use `pd.NA` by
default.

In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
(i.e. infer this type for string data instead of object dtype). To ensure we
could use the variant of `StringDtype` backed by PyArrow instead of Python
objects (for better performance), it proposed to make `pyarrow` a new required
runtime dependency of pandas.

In the meantime, NumPy has also been working on a native variable-width string
data type, which will be available [starting with NumPy
2.0](https://numpy.org/devdocs/release/2.0.0-notes.html#stringdtype-has-been-added-to-numpy).
This can provide a potential alternative to PyArrow for implementing a string
data type in pandas that is not backed by Python objects.

After acceptance of PDEP-10, two aspects of the proposal have been under
reconsideration:

- Based on user feedback, it has been considered to relax the new `pyarrow`
  requirement to not be a _hard_ runtime dependency. In addition, NumPy 2.0 can
  potentially reduce the need to make PyArrow a required dependency specifically
  for a dedicated pandas string dtype.
- The PDEP did not consider that adopting one of the existing implementations of
  the `StringDtype` would, as a consequence, also introduce the experimental
  `pd.NA` into the default behaviour.

For the second aspect, another variant of the `StringDtype` was
[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings)
that is still backed by PyArrow but follows the default missing value semantics
pandas uses for all other default data types (using `NaN` as the missing
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).
At the time, the `storage` option for this new variant was called
`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using `pd.NA`.

This last dtype variant is what you currently (pandas 2.2) get for string data
when enabling the `future.infer_string` option (which enables the behaviour that
is intended to become the default in pandas 3.0).

## Proposal

To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

1. For pandas 3.0, we enable a "string" dtype by default, which will use PyArrow
   if installed, and otherwise fall back to an in-house functionally-equivalent
   (but slower) version (see the sketch after this list).
2. This default "string" dtype will follow the same behaviour for missing values
   as our other default data types, and use `NaN` as the missing value sentinel.
3. The version that is not backed by PyArrow can reuse the existing numpy
   object-dtype backed StringArray for its implementation.
4. We update installation guidelines to clearly encourage users to install
   pyarrow for the default user experience.
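
The backend selection is meant to be automatic and transparent to users. As a
rough sketch of the intended resolution logic (the helper below is hypothetical
and for illustration only, not an actual pandas API):

```python
import importlib.util


def _default_string_storage() -> str:
    """Illustrative helper (not a pandas API): pick the default string backend.

    PyArrow is used when it can be imported; otherwise the (slower) numpy
    object-dtype based implementation is used.
    """
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    return "python"  # the numpy object-dtype fallback
```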

### Default inference of a string dtype

By default, pandas will infer this new string dtype for string data (when
creating pandas objects, such as in constructors or IO functions).

The existing `future.infer_string` option can be used to opt in to the future
default behaviour:

```python
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string
```

This option will be expanded to also work when PyArrow is not installed.
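
The same inference also applies to IO functions. For example, reading a CSV
file with a text column is then expected to produce a "string" column instead
of `object` (output as expected with the option enabled on pandas 2.2; the
exact dtype repr may differ):

```python
>>> import io
>>> pd.options.future.infer_string = True
>>> df = pd.read_csv(io.StringIO("name,value\nalice,1\nbob,2"))
>>> df.dtypes
name     string
value     int64
dtype: object
```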

### Missing value semantics

Given that all other default data types use NaN semantics for missing values,
this proposal argues that a new default string dtype should also use those same
default semantics. Further, operations on a string column that produce a
boolean or numeric result (e.g., methods like `.str.startswith(..)` or
`.str.len(..)`, or comparison operators like `==`) should return the default
`int64` and `bool` data types.
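
As a small illustration of the intended behaviour (output as expected with the
`future.infer_string` option enabled; the exact repr may differ):

```python
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "bc", "def"])
>>> ser.str.len()  # plain numpy int64, not a masked/nullable integer dtype
0    1
1    2
2    3
dtype: int64
>>> ser == "a"  # plain numpy bool, not a masked/nullable boolean dtype
0     True
1    False
2    False
dtype: bool
```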

Because the existing `StringDtype` implementations already use `pd.NA` and
return masked integer and boolean arrays in operations, a new variant of the
existing dtypes that uses `NaN` and the default data types is needed.

### Object-dtype "fallback" implementation

To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
a "fallback" option in case PyArrow is not installed. The original `StringDtype`
backed by a numpy object-dtype array of Python strings can be used for this, and
only needs minor updates to follow the above-mentioned missing value semantics
([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).

For pandas 3.0, this is the most realistic option given this implementation has
already been available for a long time. Beyond 3.0, we can still explore further
improvements such as using nanoarrow or NumPy 2.0, but at that point that is an
implementation detail that should not have a direct impact on users (except for
performance).

### Naming

Given the long history of this topic, the naming of the dtypes is a difficult
question.

In the first place, we need to acknowledge that most users should not need to
use storage-specific options. Users are expected to specify `pd.StringDtype()`
or `"string"`, and that will give them their default string dtype (which
depends on whether PyArrow is installed or not).
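
For example, under this proposal both spellings below are intended to resolve
to the same default (NaN-based) string dtype in pandas 3.0, whichever backend
is installed (a sketch of the intended behaviour, not of current releases):

```python
>>> pd.Series(["a", "b", None], dtype="string")
0      a
1      b
2    NaN
dtype: string
>>> pd.Series(["a", "b", None], dtype=pd.StringDtype())  # equivalent spelling
0      a
1      b
2    NaN
dtype: string
```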

But for testing purposes and for advanced use cases that want control over this,
we need some way to specify these variants and distinguish them from the other
string dtypes. Currently, `StringDtype(storage="pyarrow_numpy")` is used, where
"pyarrow_numpy" is a rather confusing option.

TODO see if we can come up with a better naming scheme

## Alternatives

### Why not delay introducing a default string dtype?

To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
could also delay introducing a default string dtype until there is more clarity
on those other discussions.

However:

1. Delaying has a cost: it further postpones introducing a dedicated string
   dtype that has massive benefits for our users, both in usability and (for the
   significant part of the user base that has PyArrow installed) in performance.
2. In case we eventually transition to using `pd.NA` as the default missing
   value sentinel, we will need a migration path for _all_ our data types, and
   thus the challenges around this will not be unique to the string dtype.

### Why not use the existing StringDtype with `pd.NA`?

Wouldn't adding even more variants of the string dtype only make things more
confusing? Indeed, this proposal unfortunately introduces more variants of the
string dtype. However, the reason for this is to ensure the actual default user
experience is _less_ confusing, and the new string dtype fits better with the
other default data types.

If the new default string data type used `pd.NA`, then after some operations a
user could easily end up with a DataFrame that mixes columns using `NaN`
semantics and columns using `NA` semantics (and thus a DataFrame that could
have two different int64 dtypes, two different float64 dtypes, two different
bool dtypes, etc.). This would lead to a very confusing default experience.

With the proposed new variant of the StringDtype, the _default_ experience will
only ever show one kind of integer dtype, one kind of bool dtype, etc. For now,
a user should only get columns with an `ArrowDtype` and/or using `pd.NA` when
explicitly opting into this.

## Backward compatibility

The most visible backwards incompatible change will be that columns with string
data will no longer have an `object` dtype. Therefore, code that assumes
`object` dtype (such as `ser.dtype == object`) will need to be updated.
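
For illustration, one possible way to update such a check (assuming the new
behaviour is enabled, e.g. via `pd.options.future.infer_string = True` on
pandas 2.2; note that `is_string_dtype` also matches `object` columns, so it
is a looser check):

```python
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"])
>>> ser.dtype == object  # old check: no longer matches string columns
False
>>> ser.dtype == "string"  # matches the new default string dtype
True
>>> pd.api.types.is_string_dtype(ser.dtype)  # looser: also True for object dtype
True
```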

To allow testing your code in advance, the
`pd.options.future.infer_string = True` option is available.

Otherwise, the actual string-specific functionality (such as the `.str` accessor
methods) should all keep working as is. By preserving the current missing value
semantics, this proposal is also backwards compatible on this aspect.

One other backwards incompatible change affects early adopters of the existing
`StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start returning
the new default string dtype, while up to now this returned the experimental
`pd.NA`-based string dtype introduced in pandas 1.0. Those users will need to
start specifying a keyword in the dtype constructor if they want to keep using
`pd.NA` (but if they just want a dedicated string dtype, they don't need to
change their code).

## Timeline

The future PyArrow-backed string dtype was already made available behind a feature
flag in pandas 2.1 (by `pd.options.future.infer_string = True`).

Some small enhancements or fixes (or naming changes) might still be needed and
can be backported to pandas 2.2.x.

The variant using numpy object-dtype could potentially also be backported to
2.2.x to allow easier testing.

For pandas 3.0, this flag becomes enabled by default.


## PDEP-XX History

- 3 May 2024: Initial version