PDEP-14: Dedicated string data type for pandas 3.0 #58551
@@ -0,0 +1,388 @@
# PDEP-14: Dedicated string data type for pandas 3.0

- Created: May 3, 2024
- Status: Under discussion
- Discussion: https://github.com/pandas-dev/pandas/pull/58551
- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1

## Abstract

This PDEP proposes to introduce a dedicated string dtype that will be used by
default in pandas 3.0:

* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
  or otherwise the numpy object-dtype alternative.
* The default string dtype will use missing value semantics (using NaN) consistent
  with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
and 2) leaving room for future improvements (different missing value semantics,
using NumPy 2.0 strings, etc).

## Background

Currently, pandas by default stores text data in an `object`-dtype NumPy array.
The current implementation has two primary drawbacks. First, `object` dtype is
not specific to strings: any Python object can be stored in an `object`-dtype
array, not just strings, and seeing `object` as the dtype for a column with
strings is confusing for users. Second: this is not efficient (all string
methods on a Series are eventually calling Python methods on the individual
string objects).
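
The two drawbacks can be illustrated with a minimal sketch. It is not part of the PDEP text; the sample values and variable names are made up for illustration, and `dtype=object` is passed explicitly so that the snippet does not depend on which pandas version picks which default.

```python
import pandas as pd

# Explicit dtype=object so the sketch behaves the same on any pandas version.
mixed = pd.Series(["apple", 42, 3.5], dtype=object)
text = pd.Series(["apple", "banana"], dtype=object)

# Both report the same dtype, although only one of them holds pure text.
print(mixed.dtype, text.dtype)

# String methods are applied to each Python object individually;
# elements that are not strings come back as missing values.
upper = mixed.str.upper()
print(upper.tolist())
```

The dtype alone therefore does not tell a reader whether a column is safe to treat as text.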

To solve the first issue, a dedicated extension dtype for string data has
already been
[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type).
This has always been opt-in for now, requiring users to explicitly request the
dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing
this string dtype was initially almost the same as the default implementation,
i.e. an `object`-dtype NumPy array of Python strings.

To solve the second issue (performance), pandas contributed to the development
of string kernels in the PyArrow package, and a variant of the string dtype
backed by PyArrow was
[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type).
This could be specified with the `storage` keyword in the opt-in string dtype
(`pd.StringDtype(storage="pyarrow")`).
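
As a rough sketch of the existing opt-in dtype described above (the sample data is made up, and the PyArrow-backed variant is only constructed when the optional `pyarrow` package happens to be installed):

```python
import pandas as pd

# Opt in explicitly with the "string" alias.
s = pd.Series(["a", None, "c"], dtype="string")
print(s.dtype)            # a pd.StringDtype instance
print(s[1] is pd.NA)      # missing values use the pd.NA sentinel

# Comparisons propagate the missing value instead of returning False for it.
eq = s == "a"
print(eq)

# The storage keyword selects the backing array. "python" is always
# available; "pyarrow" needs the optional pyarrow package.
py_dtype = pd.StringDtype(storage="python")
try:
    arrow_dtype = pd.StringDtype(storage="pyarrow")
except ImportError:
    arrow_dtype = None  # pyarrow is not installed
```

The propagating comparison is one place where this dtype's missing value behaviour differs from the `object`-dtype default.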

Since its introduction, the `StringDtype` has always been opt-in, and has used
the experimental `pd.NA` sentinel for missing values (which was also [introduced
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
However, up to this date, pandas has not yet taken the step to use `pd.NA` by
default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared

Suggested change:

    - default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared
    + default for all dtypes, and thus the `StringDtype` deviates in missing value behaviour compared
I'm still -1 on changing this behavior; I do not want to revert "string" back to NumPy nullability semantics; that is a breaking change for anyone that has been using our extension type system to "solve" this issue for the past 5-6 years

I think the PDEP should also be clear about long term expectations. I still think right now we are assuming:

- 2.x release - `dtype="string"` uses `pd.NA` as a missing value marker
- 3.x release - `dtype="string"` uses `np.nan` as a missing value marker by default, user setting to change to `pd.NA`
- 4.x release - `dtype="string"` changes back to the 2.x behavior

> I'm still -1 on changing this behavior; I do not want to revert "string" back to NumPy nullability semantics

For clarification, do you mean "-1 on using NaN semantics for the default string dtype, regardless of how we name it", or only "-1 on using NaN semantics for the dtype created as `dtype="string"`"?

Because it is only the latter that causes the breaking change for anyone already using the nullable string dtype. Assume we would use a different name or different string alias than "string", we could still have a default string dtype (which everyone that was not yet using the nullable `StringDtype` would get by default) that uses the proposed NaN semantics, while not causing a breaking change for the existing users of `dtype="string"` / `dtype=pd.StringDtype()`.

(it's another question whether there is enough support for using a different name, I personally think "string" is the best choice which we should reserve for the default dtype, but first to get a good understanding of your position)

> For clarification, do you mean "-1 on using NaN semantics for the default string dtype, regardless of how we name it", or only "-1 on using NaN semantics for the dtype created as `dtype="string"`"?

Definitely the latter, maybe the former. My expectation with PDEP-10 was that the default pyarrow string would be using `pd.NA`. If that is too difficult then yea there probably is a compromise on the former, but I do not want to take away the `dtype="string"` functionality from users that has been working all of this time.

Not that it is ideal, but we already have `dtype=str` today and `dtype="string"`; maybe the former becomes the new name for what is being proposed here instead of `string[pyarrow_numpy]` and only change `dtype="string"` to be pyarrow backed without changing nullability semantics?

That doesn't solve the str/"string" discrepancy but I don't think introduces any new problems either

> I do not want to take away the `dtype="string"` functionality from users that has been working all of this time.

We are not "taking away" that functionality, the current revision of the PDEP only asks users to use `dtype=pd.StringDtype(na_value=pd.NA)` instead to continue using the same functionality, and Irv's suggestion would minimize the required code change to use `dtype="String"`

> Not that it is ideal, but we already have `dtype=str` today and `dtype="string"`; maybe the former becomes the new name for what is being proposed here instead of `string[pyarrow_numpy]`

Then what would you propose to show in the `df.dtypes` output? (i.e. the string repr of the dtype) Also "str" instead of "string"?

That would be an option. In that case, we could also use a separate `StrDtype()` class for those NaN-variants (which also solves the back compat issue for `dtype=pd.StringDtype()`).

This is something we can discuss further in PDEP-13 but the current revision of it proposes consistently having the format of `<TYPE>Dtype(na_marker=pd.NA|"legacy")`, the idea being that long term `pd.NA` can be used consistently, but we will have a compatibility period of "legacy" where you get the mix of `np.nan` / `pd.NaT` for NumPy-based types (and still probably `pd.NA` for any new types like `ListDtype`).

So `StringDtype(na_marker=np.nan)` is slightly different from that. Maybe asking users to explicitly say `np.nan` instead of `"legacy"` has some downsides from a UI perspective, but my gut feeling is that we can solve that over time

I do want to be wary though of users expecting complete control over the `na_marker` field. I don't see a value-add in trying to support `DatetimeDtype(na_marker=np.nan)` alongside `DatetimeDtype(na_marker=pd.NaT)` nor do I think there would ever be value in `ListDtype(na_marker=np.nan)`

> I do want to be wary though of users expecting complete control over the `na_marker` field.

I definitely agree users shouldn't expect complete control over the `na_marker` or `na_value` field. This is not a custom value that can be anything, and we indeed don't want to generalize that to other dtypes in the future.

Because of that, I have been thinking to not actually allow a user to specify `StringDtype(na_marker=np.nan)` explicitly, but only allow the implicit default of `StringDtype()` (using NaN) or the explicit choice to not have the default with `StringDtype(na_value=pd.NA)`.

Just to avoid users actually doing `StringDtype(na_marker=np.nan)` (while that is not necessary, when it is possible users will still do it), and thinking this will generalize to other dtypes.

> I think the PDEP should also be clear about long term expectations.

(on your original comment in this thread) I will mention something about this changing again in the future, but I don't want to make it that explicit because at this point we don't know for sure that this will happen and what the timeline would be.

I am still overall -1 on changing the default from `pd.NA` back to `np.nan`. The latter is not generalizable (not even in our current design) so I don't see how we can consider that the long term solution

Interestingly, I don't think the current `object` behavior always uses `NaN`:

    >>> sn = pd.Series(["abc", "defg", "hijkl"])
    >>> sn
    0      abc
    1     defg
    2    hijkl
    dtype: object
    >>> sn.shift(1)
    0    None
    1     abc
    2    defg
    dtype: object
    >>> sn.shift(1).iloc[0] is None
    True

So at least with the `shift()` operation, the "missing value" is `None`, not `np.nan`

Yeah, we are not really consistent with this. When created, we will consider any null-like (None, NaN, NaT, NA) as a missing value in an object-dtype column, but for methods introducing them I would have hoped we were consistent. But apparently not:

    >>> ser = pd.Series(["a", "b", "c"], dtype=object)
    >>> ser.shift(1)
    0    None
    1       a
    2       b
    dtype: object
    >>> ser.reindex([1, 2, 3])
    1      b
    2      c
    3    NaN
    dtype: object

And to be clear, this will all be NaN with the future string dtype (whether converted to NaN upon construction, or ensuring we use NaN for missing values introduced in methods)
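
For code that has to keep working while the sentinel differs between methods, a possible sketch (not taken from the PDEP; it reuses the example Series above) is to test for missingness with `isna()` rather than comparing against `None` or `np.nan` directly:

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype=object)

shifted = ser.shift(1)               # introduces a missing value at position 0
reindexed = ser.reindex([1, 2, 3])   # introduces a missing value at label 3

# isna() treats None, NaN, NaT and pd.NA alike, so this check does not
# depend on which sentinel a given method or pandas version inserts.
print(shifted.isna().tolist())
print(reindexed.isna().tolist())
```

Written this way, the check should be unaffected by a switch from `None` to `NaN` in `shift()`.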

The fact that we're not consistent today does suggest that with this proposal (using `np.nan` as the default missing value indicator for strings), there will be a behavior change for people using `dtype=object` today with strings, because we'd replace `None` with `np.nan` in `shift()` (and maybe elsewhere)
Mentioned this in the backwards compatibility section
I think a user should be able to specify `dtype="String"`, and then they get the equivalent of `StringDtype(na_value=pd.NA)`

I updated the proposal to mention the addition of a `"String"` string alias for the NA-variant (it's mentioned below in the backwards compatibility section)
I don't know if you can say "significant" yet. I would delete that word.

> I don't know if you can say "significant" yet. I would delete that word.

Deleted it.
I think those fixes should be in a 2.3
While it seems we haven't had any fixes yet in 2.2.x, we merged several fixes for the future default string dtype mode in 2.1.x (after the initial 2.1.0 release). I would think we can continue doing that for fixes, but can also just leave out this sentence if there is disagreement.
(I think the general rule of this being discussed on a PR basis whether it should be backported or not, depending on how critical the fix is, would apply here, and so that maybe doesn't require explicit mentioning)

> I think those fixes should be in a 2.3

Removed this sentence.

For clarification, isn't the alternative not the "numpy object-dtype alternative", but rather an extension array using numpy objects as strings, with `np.nan` missing value semantics? You're not proposing that you still get a `numpy` backed array with `object` dtype, right?

Right, definitely not proposing that. I meant the alternative ExtensionArray using numpy object-dtype under the hood. Will need to clarify that.