Commit 5b5835b

expand Naming subsection with storage+na_value proposal
1 parent 54a43b3 commit 5b5835b

1 file changed

+59
-7
lines changed

web/pandas/pdeps/0014-string-dtype.md

Lines changed: 59 additions & 7 deletions
@@ -21,8 +21,6 @@ This will give users a long-awaited proper string dtype for 3.0, while 1) not
2121
and 2) leaving room for future improvements (different missing value semantics,
2222
using NumPy 2.0, etc).
2323

24-
# Dedicated string data type for pandas 3.0
25-
2624
## Background
2725

2826
Currently, pandas by default stores text data in an `object`-dtype NumPy array.
@@ -86,7 +84,9 @@ that is still backed by PyArrow but follows the default missing values semantics
8684
pandas uses for all other default data types (and using `NaN` as the missing
8785
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).
8886
At the time, the `storage` option for this new variant was called
89-
`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using `pd.NA`.
87+
`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using
88+
`pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming"
89+
subsection below).
9090

9191
This last dtype variant is what you currently (pandas 2.2) get for string data
9292
when enabling the ``future.infer_string`` option (to enable the behaviour which
@@ -194,10 +194,49 @@ depends on whether PyArrow is installed or not).
194194

195195
But for testing purposes and advanced use cases that want control over this, we
196196
need some way to specify this and distinguish them from the other string dtypes.
197-
Currently, the `StringDtype(storage="pyarrow_numpy")` is used, where
198-
"pyarrow_numpy" is a rather confusing option.
199-
200-
TODO see if we can come up with a better naming scheme
197+
In addition, users that want to continue using the original NA-variant of the
198+
dtype need a way to specify this.
199+
200+
Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
201+
the `"pyarrow_numpy"` storage was used to disambiguate from the existing
202+
`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather
203+
confusing option and doesn't generalize well. Therefore, this PDEP proposes
204+
a new naming scheme as outlined below, and we will deprecate and remove
205+
"pyarrow_numpy" before pandas 3.0.
206+
207+
The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
208+
storage of the string data (using pyarrow or python objects), but an additional
209+
`na_value` is introduced to disambiguate the variants using NA semantics
210+
and NaN semantics.
211+
212+
Overview of the different ways to specify a dtype and the resulting concrete
213+
dtype of the data:
214+
215+
| User specification | Concrete dtype | String alias | Note |
216+
|----------------------------------------|---------------------------------------------------|-------------------------|------|
217+
| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1) |
218+
| `StringDtype()` or `"string"` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "string" | (1), (2) |
219+
| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string" | (2) |
220+
| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) |
221+
| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | |
222+
| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | |
223+
| `StringDtype(na_value=pd.NA)` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) |
224+
| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (3) |
225+
226+
Notes:
227+
228+
- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
229+
- (2) Those three rows are backwards incompatible (i.e. they work now but give
230+
you the NA-variant), see the "Backward compatibility" section below.
231+
- (3) "pyarrow_numpy" is kept temporarily because this is already in a released
232+
version, but we can deprecate it in 2.2.x and have it removed for 3.0.
233+
234+
For the new default string dtype, only the `"string"` alias can be used to
235+
specify the dtype as a string, i.e. we would not provide a way to make the
236+
underlying storage (pyarrow or python) explicit through the string alias. This
237+
string alias is only a convenience shortcut and for most users `"string"` is
238+
sufficient (they don't need to specify the storage), and the explicit
239+
`pd.StringDtype(...)` is still available for more fine-grained control.
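
The resolution rules in the overview table can be sketched in plain Python. This is only an illustration of the proposal, not pandas code: `resolve_string_dtype`, `ResolvedDtype`, the `pyarrow_installed` flag, and the string stand-ins `NAN` and `NA` are all hypothetical names introduced here, so the sketch runs without pandas installed.

```python
from typing import NamedTuple, Optional

# Plain-string stand-ins for the two missing-value sentinels (assumption:
# used only so this sketch has no pandas or NumPy dependency).
NAN = "np.nan"
NA = "pd.NA"


class ResolvedDtype(NamedTuple):
    storage: str
    na_value: str
    alias: str


def resolve_string_dtype(
    storage: Optional[str] = None,
    na_value: str = NAN,
    *,
    pyarrow_installed: bool = True,
) -> ResolvedDtype:
    """Hypothetical helper mirroring the rows of the overview table."""
    if storage == "pyarrow_numpy":
        # Note (3): legacy spelling, kept temporarily; it is the NaN-variant.
        return ResolvedDtype("pyarrow", NAN, "string[pyarrow_numpy]")
    if storage is None:
        # Note (1): the storage depends on whether pyarrow is installed.
        storage = "pyarrow" if pyarrow_installed else "python"
    if storage not in ("pyarrow", "python"):
        raise ValueError(f"unknown storage: {storage!r}")
    if na_value not in (NAN, NA):
        raise ValueError(f"unknown na_value: {na_value!r}")
    # NaN-variants share the plain "string" alias; NA-variants keep the
    # storage in the alias.
    alias = "string" if na_value == NAN else f"string[{storage}]"
    return ResolvedDtype(storage, na_value, alias)


print(resolve_string_dtype())                          # default, pyarrow present
print(resolve_string_dtype(pyarrow_installed=False))   # default, no pyarrow
print(resolve_string_dtype("python", NA))              # explicit NA-variant
```

Under these assumptions, only the NA-variants expose their storage through the alias, which matches the statement above that the new default dtype is reachable only through `"string"`.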
201240

202241
## Alternatives
203242

@@ -238,6 +277,19 @@ _default_ experience, a user will only see only 1 kind of integer dtype, only
238277
1 kind of bool dtype, etc. For now, a user should only get columns using `pd.NA`
239278
when explicitly opting into this.
240279

280+
### Naming alternatives
281+
282+
This PDEP now keeps the `pd.StringDtype` class constructor with the existing
283+
`storage` keyword and with an additional `na_value` keyword.
284+
285+
During the discussion, several alternatives were brought up, including both
286+
alternative keyword names and a different constructor. This PDEP opted to
287+
keep using the existing `pd.StringDtype()` for now to keep the changes as
288+
minimal as possible, leaving a larger overhaul of the dtype system (potentially
289+
including different constructor functions or namespace) for a future discussion.
290+
See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full
291+
discussion.
292+
241293
## Backward compatibility
242294

243295
The most visible backwards incompatible change will be that columns with string
