Commit f554909

Apply suggestions from code review
Co-authored-by: Irv Lustig <[email protected]>
1 parent f5faf4e commit f554909

1 file changed

+28
-27
lines changed

web/pandas/pdeps/0014-string-dtype.md

Lines changed: 28 additions & 27 deletions
@@ -19,7 +19,7 @@ default in pandas 3.0:
 This will give users a long-awaited proper string dtype for 3.0, while 1) not
 (yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
 and 2) leaving room for future improvements (different missing value semantics,
-using NumPy 2.0, etc).
+using NumPy 2.0 strings, etc).

 ## Background

@@ -74,7 +74,7 @@ reconsideration:
 runtime dependency. In addition, NumPy 2.0 could in the future potentially
 reduce the need to make PyArrow a required dependency specifically for a
 dedicated pandas string dtype.
-- The PDEP did not consider the usage of the experimental `pd.NA` as a
+- PDEP-10 did not consider the usage of the experimental `pd.NA` as a
 consequence of adopting one of the existing implementations of the
 `StringDtype`.

@@ -88,23 +88,23 @@ At the time, the `storage` option for this new variant was called
 `pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming"
 subsection below).

-This last dtype variant is what you currently (pandas 2.2) get for string data
+This last dtype variant is what users currently (pandas 2.2) get for string data
 when enabling the ``future.infer_string`` option (to enable the behaviour which
 is intended to become the default in pandas 3.0).

 ## Proposal

 To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

-1. For pandas 3.0, we enable a "string" dtype by default, which will use PyArrow
+1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow
 if installed, and otherwise falls back to an in-house functionally-equivalent
 (but slower) version.
 2. This default "string" dtype will follow the same behaviour for missing values
-as our other default data types, and use `NaN` as the missing value sentinel.
+as other default data types, and use `NaN` as the missing value sentinel.
 3. The version that is not backed by PyArrow can reuse (with minor code
 additions) the existing numpy object-dtype backed StringArray for its
 implementation.
-4. We update installation guidelines to clearly encourage users to install
+4. Installation guidelines are updated to clearly encourage users to install
 pyarrow for the default user experience.

 Those string dtypes enabled by default will then no longer be considered as
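
Reviewer note (not part of the commit): item 1 of the proposal in the hunk above describes a fallback rule, PyArrow if installed and an in-house version otherwise. A minimal sketch of that rule, using only the standard library, might look like the following. The helper name `default_string_storage` is hypothetical and is not a pandas function; the return values mirror the "pyarrow" and "python" storage names used elsewhere in the PDEP.

```python
from importlib.util import find_spec


def default_string_storage() -> str:
    """Return the storage the default "string" dtype would be expected to use.

    Hypothetical helper for illustration only; pandas' real selection logic
    lives inside the library and may differ in detail.
    """
    # PyArrow is preferred whenever it can be imported ...
    if find_spec("pyarrow") is not None:
        return "pyarrow"
    # ... otherwise fall back to the slower object-dtype backed variant.
    return "python"


print(default_string_storage())
```

The result depends on the environment it runs in, which is the point of the proposal: the same user code gets a functionally equivalent dtype either way, with only performance differing.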
@@ -145,7 +145,7 @@ that:
 nullable `'Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
 dtype (or `float64` in case of missing values)).

-However, up to this date, all other default data types still use NaN semantics
+However, up to this date, all other default data types still use `NaN` semantics
 for missing values. Therefore, this proposal says that a new default string
 dtype should also still use the same default missing value semantics and return
 default data types when doing operations on the string column, to be consistent
@@ -176,9 +176,10 @@ needs minor changes to follow the above-mentioned missing value semantics
 ([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).

 For pandas 3.0, this is the most realistic option given this implementation has
-already been available for a long time. Beyond 3.0, we can still explore further
+already been available for a long time. Beyond 3.0, further
 improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503))
-or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)),
+or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552))
+can still be explored,
 but at that point that is an implementation detail that should not have a
 direct impact on users (except for performance).

@@ -187,7 +188,7 @@ direct impact on users (except for performance).
 Given the long history of this topic, the naming of the dtypes is a difficult
 topic.

-In the first place, we need to acknowledge that most users should not need to
+In the first place, it should be acknowledged that most users should not need to
 use storage-specific options. Users are expected to specify `pd.StringDtype()`
 or `"string"`, and that will give them their default string dtype (which
 depends on whether PyArrow is installed or not).
@@ -201,8 +202,8 @@ Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
 the `"pyarrow_numpy"` storage was used to disambiguate from the existing
 `"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather
 confusing option and doesn't generalize well. Therefore, this PDEP proposes
-a new naming scheme as outlined below, and we will deprecate and remove
-"pyarrow_numpy" before pandas 3.0.
+a new naming scheme as outlined below, and
+"pyarrow_numpy" will be deprecated and removed before pandas 3.0.

 The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
 storage of the string data (using pyarrow or python objects), but an additional
@@ -227,12 +228,12 @@ Notes:

 - (1) You get "pyarrow" or "python" depending on pyarrow being installed.
 - (2) Those three rows are backwards incompatible (i.e. they work now but give
-you the NA-variant), see the "Backward compatibility" section below.
+the NA-variant), see the "Backward compatibility" section below.
 - (3) "pyarrow_numpy" is kept temporarily because this is already in a released
 version, but we can deprecate it in 2.2.x and have it removed for 3.0.

 For the new default string dtype, only the `"string"` alias can be used to
-specify the dtype as a string, i.e. we would not provide a way to make the
+specify the dtype as a string, i.e. a way would not be provided to make the
 underlying storage (pyarrow or python) explicit through the string alias. This
 string alias is only a convenience shortcut and for most users `"string"` is
 sufficient (they don't need to specify the storage), and the explicit
@@ -245,23 +246,23 @@ sufficient (they don't need to specify the storage), and the explicit
 To avoid introducing a new string dtype while other discussions and changes are
 in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
 the default missing value sentinel? using the new NumPy 2.0 capabilities?
-overhauling all our dtypes to use a logical data type system?), we could also
-delay introducing a default string dtype until there is more clarity in those
+overhauling all our dtypes to use a logical data type system?),
+introducing a default string dtype could also be delayed until there is more clarity in those
 other discussions.

 However:

 1. Delaying has a cost: it further postpones introducing a dedicated string
-dtype that has massive benefits for our users, both in usability as (for the
+dtype that has massive benefits for users, both in usability as (for the
 significant part of the user base that has PyArrow installed) in performance.
-2. In case we eventually transition to use `pd.NA` as the default missing value
-sentinel, we will need a migration path for _all_ our data types, and thus
+2. In case pandas eventually transitions to use `pd.NA` as the default missing value
+sentinel, a migration path for _all_ our data types will be needed, and thus
 the challenges around this will not be unique to the string dtype and
 therefore not a reason to delay this.

-Making this change now for 3.0 will benefit the majority of our users, while
+Making this change now for 3.0 will benefit the majority of users, while
 coming at a cost for a part of the users who already started using the
-`"string"` dtype (they will have to update their code to continue to the variant
+`"string"` or `pd.StringDtype()` dtype (they will have to update their code to continue to the variant
 using `pd.NA`, see the "Backward compatibility" section below).

 ### Why not use the existing StringDtype with `pd.NA`?
@@ -302,10 +303,10 @@ The most visible backwards incompatible change will be that columns with string
 data will no longer have an `object` dtype. Therefore, code that assumes
 `object` dtype (such as `ser.dtype == object`) will need to be updated. This
 change is done as a hard break in a major release, as warning in advance for the
-changed inference is deemed to noisy.
+changed inference is deemed too noisy.

-To allow testing your code in advance, the
-`pd.options.future.infer_string = True` option is available.
+To allow testing code in advance, the
+`pd.options.future.infer_string = True` option is available for users.

 Otherwise, the actual string-specific functionality (such as the `.str` accessor
 methods) should generally all keep working as is. By preserving the current
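
Reviewer note (not part of the commit): the hunk above says that code assuming `object` dtype, such as `ser.dtype == object`, will need updating. One possible way to write a check that holds both before and after the change is sketched below. The helper name `holds_strings` is hypothetical, and the sketch assumes the new default dtype is an instance of `pd.StringDtype`, as the naming section of the PDEP describes.

```python
import pandas as pd


def holds_strings(ser: pd.Series) -> bool:
    """Heuristic check meant to survive the change of the inferred dtype.

    Hypothetical helper; note that object dtype can also hold non-string
    values, so a True result for object dtype is only an approximation.
    """
    # Before the change, strings are inferred as object dtype; the dedicated
    # string dtypes are instances of pd.StringDtype.
    return ser.dtype == object or isinstance(ser.dtype, pd.StringDtype)


ser = pd.Series(["a", "b", None])
print(ser.dtype, holds_strings(ser))
```

The printed dtype depends on the installed pandas version and on whether `future.infer_string` is enabled, while the boolean is expected to be the same in both cases.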
@@ -339,12 +340,12 @@ numpy dtypes (see the "Missing value semantics" section above).
 While this change will be transparent in many cases (e.g. checking for missing
 values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
 a string predicate method keeps working regardless of the sentinel), this can be
-a breaking change if you relied on the exact sentinel or resulting dtype. Since
+a breaking change if users relied on the exact sentinel or resulting dtype. Since
 pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
 that many users already have started using this dtype, even though officially
 still labeled as "experimental".

-To smooth the upgrade experience for those users, we propose to add a
+To smooth the upgrade experience for those users, it is proposed to add a
 deprecation warning before 3.0 when such dtype is created, giving them two
 options:
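
Reviewer note (not part of the commit): the first paragraph of the hunk above claims that `isna()`/`dropna()`/`fillna()` keep working regardless of the sentinel. A small sketch of what sentinel-agnostic code could look like follows. Here an object-dtype Series holding `NaN` stands in for the NaN-based variant, and `pd.StringDtype()` is assumed to give a `pd.NA`-based variant; both choices are illustrative assumptions rather than the exact dtypes the PDEP introduces.

```python
import numpy as np
import pandas as pd

# Stand-in for the NaN-based variant: object dtype holding a NaN sentinel.
nan_based = pd.Series(["a", np.nan, "c"], dtype=object)
# Assumed to be a pd.NA-based variant of the string dtype.
na_based = pd.Series(["a", None, "c"], dtype=pd.StringDtype())

for ser in (nan_based, na_based):
    # isna()/dropna()/fillna() do not depend on which sentinel is stored.
    print(ser.isna().tolist(), len(ser.dropna()), ser.fillna("?").tolist())
```

Code that instead compares against a specific sentinel (for example an identity check against `pd.NA`) is the kind of usage the paragraph flags as potentially breaking.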

@@ -368,7 +369,7 @@ Some small enhancements or fixes might still be needed and can continue to be
 backported to pandas 2.2.x.

 The variant using numpy object-dtype can also be backported to the 2.2.x branch
-to allow easier testing. We would propose to release this as 2.3.0 (created from
+to allow easier testing. It is proposed to release this as 2.3.0 (created from
 the 2.2.x branch, given that the main branch already includes many other changes
 targeted for 3.0), together with the deprecation warning when creating a dtype
 from `"string"` / `pd.StringDtype()`.
