This PDEP proposes to introduce a dedicated string dtype that will be used by
default in pandas 3.0:

- * In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
+ * In pandas 3.0, enable a string dtype (`"str"`) by default, using PyArrow if available
or otherwise a string dtype using numpy object-dtype under the hood as fallback.
* The default string dtype will use missing value semantics (using NaN) consistent
with the other default data types.
@@ -96,10 +96,10 @@ is intended to become the default in pandas 3.0).
To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

- 1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow
+ 1. For pandas 3.0, a `"str"` string dtype is enabled by default, which will use PyArrow
if installed, and otherwise falls back to an in-house functionally-equivalent
(but slower) version.
- 2. This default "string" dtype will follow the same behaviour for missing values
+ 2. This default string dtype will follow the same behaviour for missing values
as other default data types, and use `NaN` as the missing value sentinel.
3. The version that is not backed by PyArrow can reuse (with minor code
additions) the existing numpy object-dtype backed StringArray for its
@@ -135,10 +135,9 @@ that case.
### Missing value semantics

- As mentioned in the background section, the original `StringDtype` has used
- the experimental `pd.NA` sentinel for missing values. In addition to using
- `pd.NA` as the scalar for a missing value, this essentially means
- that:
+ As mentioned in the background section, the original `StringDtype` has always
+ used the experimental `pd.NA` sentinel for missing values. In addition to using
+ `pd.NA` as the scalar for a missing value, this essentially means that:

- String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics)
for missing values, where `NA` propagates in boolean operations such as
@@ -154,7 +153,7 @@ dtype should also still use the same default missing value semantics and return
default data types when doing operations on the string column, to be consistent
with the other default dtypes at this point.

- In practice, this means that the default `"string"` dtype will use `NaN` as
+ In practice, this means that the default string dtype will use `NaN` as
the missing value sentinel, and:

- String columns will follow NaN-semantics for missing values, where `NaN` gives
@@ -165,9 +164,8 @@ the missing value sentinel, and:
Because the original `StringDtype` implementations already use `pd.NA` and
return masked integer and boolean arrays in operations, a new variant of the
existing dtypes that uses `NaN` and default data types was needed. The original
- variant of `StringDtype` using `pd.NA` will still be available for those who
- want to keep using it (see below in the "Naming" subsection for how to specify
- this).
+ variant of `StringDtype` using `pd.NA` will continue to be available for those
+ who were already using it.
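
The difference between the two missing value semantics can be sketched without pandas. The snippet below is only an illustration: `_MissingType` and `NA` are hypothetical stand-ins for a propagating sentinel (they are not `pd.NA`), while the `NaN` half relies on standard floating-point comparison behaviour.

```python
class _MissingType:
    """Hypothetical stand-in for a propagating sentinel (not pandas' pd.NA)."""

    def __eq__(self, other):
        # A comparison involving a missing value is itself "missing".
        return self

    def __repr__(self):
        return "<NA>"


NA = _MissingType()
NAN = float("nan")

# NaN-semantics: comparing against NaN yields a plain False.
eq_nan = [value == "a" for value in ["a", NAN, "b"]]

# NA-semantics: the missing value propagates into the result.
eq_na = [value == "a" for value in ["a", NA, "b"]]

print(eq_nan)  # [True, False, False]
print(eq_na)   # [True, <NA>, False]
```

With NaN-semantics the result contains only plain booleans, so it can be stored in a default boolean array. With NA-semantics the result needs a container that can itself hold a missing value, which is why the original variant returns masked arrays.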
### Object-dtype "fallback" implementation
@@ -185,23 +183,35 @@ or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) can
still be explored, but at that point that is an implementation detail that
should not have a direct impact on users (except for performance).

+ For the original variant of `StringDtype` using `pd.NA`, currently the default
+ storage is `"python"` (the object-dtype based implementation). Also for this
+ variant, it is proposed to follow the same logic for determining the default
+ storage, i.e. to default to `"pyarrow"` if available, and otherwise
+ fall back to `"python"`.
+
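
The "use PyArrow if available, otherwise fall back" rule described above could look roughly like the following. This is a minimal sketch and not the pandas implementation: `default_string_storage` is a hypothetical helper, and it only checks whether the `pyarrow` package is importable (a real implementation might also check for a minimum version).

```python
import importlib.util


def default_string_storage() -> str:
    """Pick the default storage: "pyarrow" if importable, else "python".

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    return "python"


storage = default_string_storage()
print(storage)
```

Because the decision depends only on the environment, the same user code yields the same behaviour with either storage, differing only in performance.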
### Naming

Given the long history of this topic, the naming of the dtypes is a difficult
topic.

In the first place, it should be acknowledged that most users should not need to
- use storage-specific options. Users are expected to specify `pd.StringDtype()`
- or `"string"`, and that will give them their default string dtype (which
- depends on whether PyArrow is installed or not).
-
- But for testing purposes and advanced use cases that want control over this, we
- need some way to specify this and distinguish them from the other string dtypes.
- In addition, users that want to continue using the original NA-variant of the
- dtype need a way to specify this.
-
- Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
- the `"pyarrow_numpy"` storage was used to disambiguate from the existing
+ use storage-specific options. Users are expected to specify a generic name (such
+ as `"str"` or `"string"`), and that will give them their default string dtype
+ (which depends on whether PyArrow is installed or not).
+
+ For the generic string alias to specify the dtype, `"string"` is already used
+ for the `StringDtype` using `pd.NA`. This PDEP proposes to use `"str"` for the
+ new default `StringDtype` using `NaN`. This ensures backwards compatibility for
+ code using `dtype="string"`, and was also chosen because `dtype="str"` or
+ `dtype=str` currently already works to ensure your data is converted to
+ strings (only using object dtype for the result).
+
+ But for testing purposes and advanced use cases that want control over the exact
+ variant of the `StringDtype`, we need some way to specify this and distinguish
+ them from the other string dtypes.
+
+ Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used for the new variant using `NaN`,
+ where the `"pyarrow_numpy"` storage was used to disambiguate from the existing
`"pyarrow"` option using `pd.NA`. However, `"pyarrow_numpy"` is a rather confusing
option and doesn't generalize well. Therefore, this PDEP proposes a new naming
scheme as outlined below, and `"pyarrow_numpy"` will be deprecated and removed
@@ -217,29 +227,31 @@ dtype of the data:
| User specification | Concrete dtype | String alias | Note |
| ---------------------------------------------| ---------------------------------------------------------------| ---------------------------------------| ----------|
- | Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) |
- | `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1), (2) |
- | `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) |
- | `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) |
- | `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "String[pyarrow]" | |
- | `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "String[python]" | |
- | `StringDtype(na_value=pd.NA)` or `"String"` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "String[pyarrow]" or "String[python]" | (1) |
- | `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) |
+ | Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) |
+ | `"str"` or `StringDtype(na_value=np.nan)` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) |
+ | `StringDtype("pyarrow", na_value=np.nan)` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "str" | |
+ | `StringDtype("python", na_value=np.nan)` | `StringDtype(storage="python", na_value=np.nan)` | "str" | |
+ | `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | |
+ | `StringDtype("python")` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | |
+ | `"string"` or `StringDtype()` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) |
+ | `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (2) |

Notes:

- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
- - (2) Those three rows are backwards incompatible (i.e. they work now but give
- the NA-variant), see the "Backward compatibility" section below.
- - (3) "pyarrow_numpy" is kept temporarily because this is already in a released
+ - (2) "pyarrow_numpy" is kept temporarily because this is already in a released
version, but we can deprecate it in 2.x and have it removed for 3.0.
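
The "String alias" column of the updated table can be summarised as a small mapping. The function below is a hypothetical illustration of that column only, not pandas code: `string_alias` and its `na_is_nan` flag are made-up names, and the temporary `"pyarrow_numpy"` row is deliberately left out.

```python
def string_alias(storage: str, na_is_nan: bool) -> str:
    """Return the string alias for a concrete dtype, following the table above.

    Hypothetical helper: `na_is_nan=True` models `na_value=np.nan`,
    `na_is_nan=False` models `na_value=pd.NA`.
    """
    if storage not in ("pyarrow", "python"):
        raise ValueError(f"unknown storage: {storage!r}")
    if na_is_nan:
        # The new default variant never exposes its storage in the alias.
        return "str"
    # The existing NA-variant keeps the storage in its alias.
    return f"string[{storage}]"


print(string_alias("pyarrow", na_is_nan=True))   # str
print(string_alias("python", na_is_nan=True))    # str
print(string_alias("pyarrow", na_is_nan=False))  # string[pyarrow]
print(string_alias("python", na_is_nan=False))   # string[python]
```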
- For the new default string dtype, only the `"string"` alias can be used to
- specify the dtype as a string, i.e. a way would not be provided to make the
+ For the new default string dtype, only the `"str"` alias can be used to
+ specify the dtype as a string, i.e. pandas would not provide a way to make the
underlying storage (pyarrow or python) explicit through the string alias. This
- string alias is only a convenience shortcut and for most users `"string"` is
+ string alias is only a convenience shortcut and for most users `"str"` is
sufficient (they don't need to specify the storage), and the explicit
- `pd.StringDtype(...)` is still available for more fine-grained control.
+ `pd.StringDtype(storage=..., na_value=np.nan)` is still available for more
+ fine-grained control.
+
+ Also for the existing variant using `pd.NA`, specifying the storage through the
+ string alias could be deprecated, but that is left for a separate decision.

## Alternatives
@@ -257,17 +269,16 @@ to `pd.NA` by default.
However:

1. Delaying has a cost: it further postpones introducing a dedicated string
- dtype that has massive benefits for users, both in usability as (for the
+ dtype that has significant benefits for users, both in usability as (for the
part of the user base that has PyArrow installed) in performance.
2. In case pandas eventually transitions to use `pd.NA` as the default missing value
- sentinel, a migration path for _all_ pandas data types will be needed, and thus
+ sentinel, a migration path for _all_ pandas data types will be needed, and thus
the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

- Making this change now for 3.0 will benefit the majority of users, while coming
- at a cost for a part of the users who already started using the `"string"` or
- `pd.StringDtype()` dtype (they will have to update their code to continue to use
- the variant using `pd.NA`, see the "Backward compatibility" section below).
+ Making this change now for 3.0 will benefit the majority of users, and the PDEP
+ author believes this is worth the cost of the added complexity around "yet
+ another dtype" (also for other data types we already have multiple variants).

### Why not use the existing StringDtype with `pd.NA`?
@@ -290,17 +301,26 @@ when explicitly opting into this.
### Naming alternatives

- This PDEP now keeps the `pd.StringDtype` class constructor with the existing
- `storage` keyword and with an additional `na_value` keyword.
+ An initial version of this PDEP proposed to use the `"string"` alias and the
+ default `pd.StringDtype()` class constructor for the new default dtype.
+ However, that caused a lot of discussion around backwards compatibility for
+ existing users of the `StringDtype` using `pd.NA`.

During the discussion, several alternatives have been brought up. Both
- alternative keyword names as using a different constructor. This PDEP opted to
- keep using the existing `pd.StringDtype()` for now to keep the changes as
+ alternative keyword names as using a different constructor. In the end,
+ this PDEP proposes to use a different string alias (`"str"`) but to keep
+ using the existing `pd.StringDtype` (with the existing `storage` keyword but
+ with an additional `na_value` keyword) for now to keep the changes as
minimal as possible, leaving a larger overhaul of the dtype system (potentially
including different constructor functions or namespace) for a future discussion.
See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full
discussion.

+ One consequence is that when using the class constructor for the default dtype,
+ it has to be used with non-default arguments, i.e. a user needs to specify
+ `pd.StringDtype(na_value=np.nan)` to get the default dtype using `NaN`.
+ Therefore, the pandas documentation will focus on the usage of `dtype="str"`.
+
## Backward compatibility

The most visible backwards incompatible change will be that columns with string
@@ -324,54 +344,14 @@ this will change to use `NaN` consistently.
### For existing users of `StringDtype`

- Users of the existing `StringDtype` will see more backwards incompatible
- changes, though. In pandas 3.0, calling `pd.StringDtype()` (or specifying
- `dtype="string"`) will start returning the new default string dtype using `NaN`,
- while up to now this returned the string dtype using `pd.NA` introduced in
- pandas 1.0.
-
- For example, this code snippet returned the NA-variant of `StringDtype` with
- pandas 1.x and 2.x:
-
- ```python
- >>> pd.Series(["a", "b", None], dtype="string")
- 0       a
- 1       b
- 2    <NA>
- dtype: string
- ```
+ Existing code that already opted in to use the `StringDtype` using `pd.NA`
+ should generally keep working as is. The latest version of this PDEP preserves
+ the behaviour of `dtype="string"` or `dtype=pd.StringDtype()` to mean the
+ `pd.NA` variant of the dtype.

- but will start returning the new default NaN-variant of `StringDtype` with
- pandas 3.0. This means that the missing value sentinel will change from `pd.NA`
- to `NaN`, and that operations will no longer return nullable dtypes but default
- numpy dtypes (see the "Missing value semantics" section above).
-
- While this change will be transparent in many cases (e.g. checking for missing
- values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
- a string predicate method keeps working regardless of the sentinel), this can be
- a breaking change if users relied on the exact sentinel or resulting dtype. Since
- pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
- that many users already have started using this dtype, even though officially
- still labeled as "experimental".
-
- To smooth the upgrade experience for those users, it is proposed to add a
- deprecation warning before 3.0 when such dtype is created, giving them two
- options:
-
- - If the user just wants to have a dedicated "string" dtype (or the better
- performance when using pyarrow) but is fine with using the default NaN
- semantics, they can add `pd.options.future.infer_string = True` to their code
- to suppress the warning and already opt-in to the future behaviour of pandas
- 3.0.
- - If the user specifically wants the variant of the string dtype that uses
- `pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
- have to update their dtype specification from `"string"` / `pd.StringDtype()`
- to `"String"` / `pd.StringDtype(na_value=pd.NA)` to suppress the warning and
- further keep their code running as is.
-
- A `"String"` alias (capitalized) would be added to make it easier for users to
- continue using the variant using `pd.NA`, and such capitalized string alias is
- consistent with other nullable dtypes (`"float64"` vs `"Float64"`).
+ It does propose to change the default storage to `"pyarrow"` (if available) for
+ the opt-in `pd.NA` variant as well, but this should not have much user-visible
+ impact.
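
As an illustration of the compatibility statement above, the following sketch shows the behaviour that existing `dtype="string"` code relies on. It assumes pandas (1.0 or later) is installed; which storage backs the result depends on the environment and is not checked here.

```python
import pandas as pd

# Explicitly opting in to the existing StringDtype variant that uses pd.NA.
s = pd.Series(["a", "b", None], dtype="string")

# The missing value is represented by the pd.NA sentinel.
print(s[2] is pd.NA)

# Operations return nullable (masked) dtypes rather than default numpy dtypes.
lengths = s.str.len()
print(lengths.dtype)
```

Code that only uses `isna()`, `dropna()` or `fillna()` does not depend on the exact sentinel, so it behaves the same regardless of the storage.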
## Timeline
@@ -381,13 +361,10 @@ flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`).
The variant using numpy object-dtype can also be backported to the 2.2.x branch
to allow easier testing. It is proposed to release this as 2.3.0 (created from
the 2.2.x branch, given that the main branch already includes many other changes
- targeted for 3.0), together with the deprecation warning when creating a dtype
- from `"string"` / `pd.StringDtype()`.
+ targeted for 3.0), together with the changes to the naming scheme.

The 2.3.0 release would then have all future string functionality available
- (both the pyarrow and object-dtype based variants of the default string dtype),
- and warn existing users of the `StringDtype` in advance of 3.0 about how to
- update their code.
+ (both the pyarrow and object-dtype based variants of the default string dtype).

For pandas 3.0, this `future.infer_string` flag becomes enabled by default.