-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
PDEP-14: Dedicated string data type for pandas 3.0 #58551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 20 commits
fbeb69d
f03f54d
561de87
86f4e51
30c7b43
54a43b3
5b5835b
9ede2e6
f5faf4e
f554909
ac2d21a
82027d2
5b24c24
f9c55f4
2c58c4c
0a68504
8974c5b
cca3a7f
d24a80a
9c5342a
b5663cc
1c4c2d9
c44bfb5
af5ad3c
bd52f39
f8fbc61
d78462d
4de20d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,373 @@ | ||||||
# PDEP-14: Dedicated string data type for pandas 3.0 | ||||||
|
||||||
- Created: May 3, 2024 | ||||||
- Status: Under discussion | ||||||
- Discussion: https://github.com/pandas-dev/pandas/pull/58551 | ||||||
- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche) | ||||||
- Revision: 1 | ||||||
|
||||||
## Abstract | ||||||
|
||||||
This PDEP proposes to introduce a dedicated string dtype that will be used by | ||||||
default in pandas 3.0: | ||||||
|
||||||
* In pandas 3.0, enable a string dtype (`"str"`) by default, using PyArrow if available | ||||||
or otherwise a string dtype using numpy object-dtype under the hood as fallback. | ||||||
* The default string dtype will use missing value semantics (using NaN) consistent | ||||||
with the other default data types. | ||||||
|
||||||
This will give users a long-awaited proper string dtype for 3.0, while 1) not | ||||||
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default, | ||||||
and 2) leaving room for future improvements (different missing value semantics, | ||||||
using NumPy 2.0 strings, etc). | ||||||
|
||||||
## Background | ||||||
|
||||||
Currently, pandas by default stores text data in an `object`-dtype NumPy array. | ||||||
The current implementation has two primary drawbacks. First, `object` dtype is | ||||||
not specific to strings: any Python object can be stored in an `object`-dtype | ||||||
array, not just strings, and seeing `object` as the dtype for a column with | ||||||
strings is confusing for users. Second: this is not efficient (all string | ||||||
methods on a Series are eventually calling Python methods on the individual | ||||||
string objects). | ||||||
|
||||||
To solve the first issue, a dedicated extension dtype for string data has | ||||||
already been | ||||||
[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type). | ||||||
This has always been opt-in for now, requiring users to explicitly request the | ||||||
dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing | ||||||
this string dtype was initially almost the same as the default implementation, | ||||||
i.e. an `object`-dtype NumPy array of Python strings. | ||||||
|
||||||
To solve the second issue (performance), pandas contributed to the development | ||||||
of string kernels in the PyArrow package, and a variant of the string dtype | ||||||
backed by PyArrow was | ||||||
[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type). | ||||||
This could be specified with the `storage` keyword in the opt-in string dtype | ||||||
(`pd.StringDtype(storage="pyarrow")`). | ||||||
|
||||||
Since its introduction, the `StringDtype` has always been opt-in, and has used | ||||||
the experimental `pd.NA` sentinel for missing values (which was also [introduced | ||||||
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)). | ||||||
However, up to this date, pandas has not yet taken the step to use `pd.NA` by | ||||||
default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared | ||||||
|
default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared | |
default for all dtypes, and thus the `StringDtype` deviates in missing value behaviour compared |
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
jorisvandenbossche marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
jorisvandenbossche marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
jorisvandenbossche marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
WillAyd marked this conversation as resolved.
Show resolved
Hide resolved
mroeschke marked this conversation as resolved.
Show resolved
Hide resolved
WillAyd marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a user specify "string[pyarrow]"
and "string[python]"
as the dtype
as it works in 2.2 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, nothing changes there (it's mentioned in one of the paragraphs below the table that in theory we could also stop supporting the storage in the string alias for those, but that's out of scope for the PDEP). The fact that they are listed in the "String alias" column is meant to indicate that you can use that string as a string alias when specifying a dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am confused by the "string alias" aspect. There are string aliases for specifying the dtype, but also string aliases for representing the dtype (i.e., what you see if you do Series.dtype
. I'm not sure that we are completely bidirectional in all of our dtypes with respect to the strings.
Also, if we are saying that "str"
will not indicate the storage type, should we not then deprecate the usage of "string[python]"
and "string[pyarrow]"
and just make it "string"
? The asymmetry between "str"
not specifying storage and "string"
doesn't feel right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am confused by the "string alias" aspect. There are string aliases for specifying the dtype, but also string aliases for representing the dtype (i.e., what you see if you do
Series.dtype
. I'm not sure that we are completely bidirectional in all of our dtypes with respect to the strings.
I know, and I had been going back and forth in my draft calling this columm "String alias" vs "Dtype repr", because both are partly overlapping concepts, but then also "dtype repr" is confusing because there is both __repr__
and __str__
(eg for __str__
(used in df.dtypes
) we always just "string" without the storage suffix).
In the end, this column actually shows the dtype __repr__
, but which in practice is the most specific string alias one can use to specify the dtype (i.e. for "string[pyarrow]" you can also use "string" if pyarrow is the default storage, but the table already includes too much content).
Also, if we are saying that
"str"
will not indicate the storage type, should we not then deprecate the usage of"string[python]"
and"string[pyarrow]"
and just make it"string"
?
As mentioned in my previous comment, a paragraph is included to say this is left for a separate discussion (this would introduce a backwards incompatible change for existing users, while otherwise are not any on this front, and the exact way to indicate backends in string aliases, I would like to defer any final decision about that to the PDEP on logical types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I find this really unfortunate but agree with Joris that we shouldn't rock the boat too much on this PDEP, especially since it is just about strings. The larger discussion on resolving those aliases should be had in PDEP-13
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end, this column actually shows the dtype
__repr__
, but which in practice is the most specific string alias one can use to specify the dtype (i.e. for "string[pyarrow]" you can also use "string" if pyarrow is the default storage, but the table already includes too much content).
Maybe relabel "String alias" to "String alias for dtype input" ?
Part of the confusion is this with 2.2:
>>> pd.Series(["abc"], dtype="string[pyarrow]")
0 abc
dtype: string
>>> pd.Series(["abc"], dtype="string[pyarrow]").dtype
string[pyarrow]
>>> pd.Series(["abc"], dtype="string")
0 abc
dtype: string
>>> pd.Series(["abc"], dtype="string").dtype
string[python]
When you print a Series
, it shows the dtype
as "string"
, but if you look at the dtype
, it shows "string[python]"
or "string[pyarrow]"
This is where my confusion is coming from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you print a
Series
, it shows thedtype
as"string"
, but if you look at thedtype
, it shows"string[python]"
or"string[pyarrow]"
This is where my confusion is coming from.
Yes, that's the difference between __repr__
and __str__
that I mentioned. Now, this is all existing behaviour, but so for the new dtype I am proposing to not have any difference here (both repr and str would show "str"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe showing the storage in both __repr__
and __str__
is very important. When you work of large Series, you have two main constraints: storage and computation efficiency. python backed and pyarrow backed are much different in these aspects. Whereas pyarrow is the best option most of the time, if you need mutability, python backed strings could be a better option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arnaudlegout it's definitely true that there are trade-offs in deciding which option is best (and for now, you will always be able to specify the storage and inspect it through an attribute).
I think the intent at the moment, though, is to consider the default of pyarrow (if installed) as the general default that almost everyone is expected to use, and that we don't incentivize too much to switch to the non-default storage (i.e. that it is more considered as an implementation detail which storage is used, for most users).
At least that's the sense I got from the discussion here. But more real-world feedback once we actually switch the default in 3.0 will probably give better insights in how big the drawbacks are in workflows heavy on mutability, and to what extent we should more clearly document those trade-offs (at some point we will also have to decide whether to keep the object-dtype based fallback long term, for example when making pyarrow required or when having a numpy string dtype based one).
My personal take on showing the storage in the repr/str is that for most users, I expect that they shouldn't have to care about the exact storage (or even be aware of the concept), and therefore I think it is more confusing to show this by default in the df.dtypes
or ser
repr (which both uses the dtype's __str__
).
If we do want to show it in the dtype's __repr__
(to make it more discoverable), I would personally advocate for making the __repr__
more like a class repr instead of a string alias (eg printing something like <pandas.StringDtype(storage=...)>
instead of string[storage]
, to avoid the []
use for this).
What exactly the __repr__
should look like for this new dtype is not included exactly in the PDEP, so feel free to open a new issue to argue for including the storage where we can continue this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done #59342
Uh oh!
There was an error while loading. Please reload this page.