@@ -21,8 +21,6 @@ This will give users a long-awaited proper string dtype for 3.0, while 1) not
and 2) leaving room for future improvements (different missing value semantics,
using NumPy 2.0, etc).

- # Dedicated string data type for pandas 3.0
-

## Background

Currently, pandas by default stores text data in an `object`-dtype NumPy array.
@@ -86,7 +84,9 @@ that is still backed by PyArrow but follows the default missing values semantics
pandas uses for all other default data types (and using `NaN` as the missing
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).
At the time, the `storage` option for this new variant was called
- `"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using `pd.NA`.
+ `"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using
+ `pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming"
+ subsection below).

This last dtype variant is what you currently (pandas 2.2) get for string data
when enabling the ``future.infer_string`` option (to enable the behaviour which
@@ -194,10 +194,49 @@ depends on whether PyArrow is installed or not).

But for testing purposes and advanced use cases that want control over this, we
need some way to specify this and distinguish them from the other string dtypes.
- Currently, the `StringDtype(storage="pyarrow_numpy")` is used, where
- "pyarrow_numpy" is a rather confusing option.
-
- TODO see if we can come up with a better naming scheme
+ In addition, users that want to continue using the original NA-variant of the
+ dtype need a way to specify this.
+
+ Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
+ the `"pyarrow_numpy"` storage was used to disambiguate from the existing
+ `"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather
+ confusing option and doesn't generalize well. Therefore, this PDEP proposes
+ a new naming scheme as outlined below, and we will deprecate and remove
+ "pyarrow_numpy" before pandas 3.0.
+
+ The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
+ storage of the string data (using pyarrow or python objects), but an additional
+ `na_value` is introduced to disambiguate the variants using NA semantics
+ and NaN semantics.
+
+ Overview of the different ways to specify a dtype and the resulting concrete
+ dtype of the data:
+
+ | User specification                       | Concrete dtype                                              | String alias                          | Note     |
+ |------------------------------------------|-------------------------------------------------------------|---------------------------------------|----------|
+ | Unspecified (inference)                  | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string"                              | (1)      |
+ | `StringDtype()` or `"string"`            | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string"                              | (1), (2) |
+ | `StringDtype("pyarrow")`                 | `StringDtype(storage="pyarrow", na_value=np.nan)`           | "string"                              | (2)      |
+ | `StringDtype("python")`                  | `StringDtype(storage="python", na_value=np.nan)`            | "string"                              | (2)      |
+ | `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)`            | "string[pyarrow]"                     |          |
+ | `StringDtype("python", na_value=pd.NA)`  | `StringDtype(storage="python", na_value=pd.NA)`             | "string[python]"                      |          |
+ | `StringDtype(na_value=pd.NA)`            | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)`  | "string[pyarrow]" or "string[python]" | (1)      |
+ | `StringDtype("pyarrow_numpy")`           | `StringDtype(storage="pyarrow", na_value=np.nan)`           | "string[pyarrow_numpy]"               | (3)      |
+
+ Notes:
+
+ - (1) You get "pyarrow" or "python" depending on pyarrow being installed.
+ - (2) Those three rows are backwards incompatible (i.e. they work now but give
+   you the NA-variant), see the "Backward compatibility" section below.
+ - (3) "pyarrow_numpy" is kept temporarily because this is already in a released
+   version, but we can deprecate it in 2.2.x and have it removed for 3.0.
+
+ For the new default string dtype, only the `"string"` alias can be used to
+ specify the dtype as a string, i.e. we would not provide a way to make the
+ underlying storage (pyarrow or python) explicit through the string alias. This
+ string alias is only a convenience shortcut and for most users `"string"` is
+ sufficient (they don't need to specify the storage), and the explicit
+ `pd.StringDtype(...)` is still available for more fine-grained control.
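As an illustration of the overview table in this hunk, the following minimal sketch models how a user specification could resolve to a concrete dtype and string alias. It is plain Python and does not call pandas. The names `resolve_string_dtype`, `ResolvedStringDtype`, the `pyarrow_installed` flag, and the `NAN`/`NA` string stand-ins are hypothetical and introduced only for this example. Treating `"pyarrow_numpy"` as always resolving to the NaN variant is likewise an assumption read off row (3).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for np.nan and pd.NA, so the sketch needs no third-party imports.
NAN = "np.nan"
NA = "pd.NA"


@dataclass(frozen=True)
class ResolvedStringDtype:
    """Hypothetical record of one row of the overview table."""
    storage: str   # "pyarrow" or "python"
    na_value: str  # NAN or NA
    alias: str     # string alias that would name this dtype


def resolve_string_dtype(
    storage: Optional[str] = None,
    na_value: str = NAN,
    *,
    pyarrow_installed: bool = True,
) -> ResolvedStringDtype:
    """Map a user specification to a concrete dtype, following the table."""
    if na_value not in (NAN, NA):
        raise ValueError(f"unsupported na_value: {na_value!r}")

    # Row (3): the legacy storage name keeps its own alias.
    # Assumption: it always implies NaN semantics, so na_value is ignored here.
    if storage == "pyarrow_numpy":
        return ResolvedStringDtype("pyarrow", NAN, "string[pyarrow_numpy]")

    # Note (1): an unspecified storage depends on pyarrow being installed.
    if storage is None:
        storage = "pyarrow" if pyarrow_installed else "python"
    if storage not in ("pyarrow", "python"):
        raise ValueError(f"unsupported storage: {storage!r}")

    # Only the NA variants expose the storage through the string alias.
    alias = "string" if na_value == NAN else f"string[{storage}]"
    return ResolvedStringDtype(storage, na_value, alias)
```

Under these assumptions, `resolve_string_dtype()` and `resolve_string_dtype("pyarrow")` both resolve to the `"string"` alias, while `resolve_string_dtype("python", NA)` resolves to `"string[python]"`.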

## Alternatives

@@ -238,6 +277,19 @@ _default_ experience, a user will only see only 1 kind of integer dtype, only
kind of 1 bool dtype, etc. For now, a user should only get columns using `pd.NA`
when explicitly opting into this.

+ ### Naming alternatives
+
+ This PDEP now keeps the `pd.StringDtype` class constructor with the existing
+ `storage` keyword and with an additional `na_value` keyword.
+
+ During the discussion, several alternatives have been brought up, both
+ alternative keyword names and using a different constructor. This PDEP opted to
+ keep using the existing `pd.StringDtype()` for now to keep the changes as
+ minimal as possible, leaving a larger overhaul of the dtype system (potentially
+ including different constructor functions or namespace) for a future discussion.
+ See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full
+ discussion.
+
## Backward compatibility

The most visible backwards incompatible change will be that columns with string