Commit 315f743
add section about string dtype
1 parent 35b0d1d commit 315f743

1 file changed

+50
-3
lines changed

doc/source/whatsnew/v3.0.0.rst

Lines changed: 50 additions & 3 deletions
@@ -14,10 +14,57 @@ including other versions of pandas.
1414
Enhancements
1515
~~~~~~~~~~~~
1616

17-
.. _whatsnew_300.enhancements.enhancement1:
17+
.. _whatsnew_300.enhancements.string_dtype:
18+
19+
Dedicated string data type by default
20+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
21+
22+
Historically, pandas represented string columns with the NumPy ``object`` data type.
23+
This representation has numerous problems: it is not specific to strings (any
24+
Python object can be stored in an ``object``-dtype array, not just strings) and
25+
it is often inefficient (in terms of both performance and memory usage).
26+
27+
Starting with pandas 3.0, a dedicated string data type is enabled by default
28+
(backed by PyArrow under the hood, if installed, otherwise falling back to
29+
NumPy). This means that pandas will start inferring columns containing string
30+
data as the new ``str`` data type when creating pandas objects, such as in
31+
constructors or IO functions.
32+
33+
Old behavior:
34+
35+
.. code-block:: python
36+
37+
>>> pd.Series(["a", "b"])
38+
0 a
39+
1 b
40+
dtype: object
41+
42+
New behavior:
43+
44+
.. code-block:: python
45+
46+
>>> pd.Series(["a", "b"])
47+
0 a
48+
1 b
49+
dtype: str
50+
51+
The string data type that is used in these scenarios will mostly behave as NumPy
52+
``object`` dtype would, including missing value semantics and general operations on these
53+
columns.
54+
55+
The main characteristics of the new string data type are:
56+
57+
- It is inferred by default for string data (instead of ``object`` dtype).
58+
- The ``str`` dtype can only hold strings (or missing values), in contrast to
59+
``object`` dtype. Assigning a non-string value fails.
60+
- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
61+
missing value semantics as the other default dtypes.
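
The characteristics listed above can be probed on a given installation with a
small sketch like the following. It is an illustration only: the helper name
``probe_string_dtype`` is hypothetical, and the exception types caught for a
failed assignment are an assumption about how the ``str`` dtype rejects
non-string values.

```python
import pandas as pd


def probe_string_dtype():
    """Report how this pandas installation treats string data (hypothetical helper)."""
    ser = pd.Series(["a", "b", None])
    report = {
        # "str" when the dedicated string dtype is inferred, "object" otherwise
        "dtype": str(ser.dtype),
        # isna() detects the missing entry under either dtype
        "n_missing": int(ser.isna().sum()),
    }
    try:
        ser.iloc[0] = 1  # try to store a non-string value
        report["accepts_non_string"] = True
    except (TypeError, ValueError):
        # assumed failure mode when the column can only hold strings
        report["accepts_non_string"] = False
    return report


print(probe_string_dtype())
```

With ``object`` dtype the assignment is expected to succeed; with the ``str``
dtype it should be rejected.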
62+
63+
These intentional changes can have breaking consequences, for example when checking
64+
for the ``.dtype`` being object dtype or checking the exact missing value sentinel.
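
One way to make such checks less dependent on the exact dtype and sentinel is
sketched below. The helper names are hypothetical, and the sketch assumes that
``pandas.api.types.is_string_dtype`` accepts both ``object`` dtype and the new
``str`` dtype; note that for a plain ``object`` dtype it cannot tell whether
the values are actually strings.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def may_hold_strings(ser):
    """True for ``object`` dtype and for string dtypes (hypothetical helper)."""
    # Fragile alternative: ``ser.dtype == object`` only matches the old default.
    return bool(is_string_dtype(ser.dtype))


def missing_mask(ser):
    """Boolean mask of missing entries, whichever sentinel is stored."""
    # Fragile alternative: ``value is None`` depends on the exact sentinel.
    return ser.isna()


print(may_hold_strings(pd.Series(["a", "b"])))
print(missing_mask(pd.Series(["a", None, "c"])).tolist())
```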
65+
66+
TODO add link to migration guide for more details
1867

19-
Enhancement1
20-
^^^^^^^^^^^^
2168

2269
.. _whatsnew_300.enhancements.enhancement2:
2370
