Commit 1bd75cc

DOC: add sections about big new features (CoW, string dtype) to 3.0.0 whatsnew notes (#61724)
1 parent 23aae9f commit 1bd75cc

doc/source/whatsnew/v3.0.0.rst

Lines changed: 101 additions & 3 deletions
@@ -14,10 +14,108 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

-.. _whatsnew_300.enhancements.enhancement1:
+.. _whatsnew_300.enhancements.string_dtype:

-Enhancement1
-^^^^^^^^^^^^
+Dedicated string data type by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Historically, pandas represented string columns with the NumPy ``object`` data type.
This representation has several problems: it is not specific to strings (any
Python object can be stored in an ``object``-dtype array, not just strings), and
it is often inefficient in terms of both performance and memory usage.

Starting with pandas 3.0, a dedicated string data type is enabled by default.
It is backed by PyArrow if PyArrow is installed, and otherwise falls back to
NumPy ``object``-dtype storage. As a result, pandas now infers columns containing
string data as the new ``str`` data type when creating pandas objects, such as in
constructors or IO functions.

Old behavior:

.. code-block:: python

    >>> ser = pd.Series(["a", "b"])
    >>> ser
    0    a
    1    b
    dtype: object

New behavior:

.. code-block:: python

    >>> ser = pd.Series(["a", "b"])
    >>> ser
    0    a
    1    b
    dtype: str

The string data type used in these scenarios mostly behaves as NumPy ``object``
dtype would, including missing value semantics and general operations on these
columns.

The main characteristics of the new string data type are:

- It is inferred by default for string data (instead of ``object`` dtype).
- The ``str`` dtype can only hold strings (or missing values), in contrast to
  ``object`` dtype. Setting a non-string value with setitem fails.
- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
  missing value semantics as the other default dtypes.

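As a rough illustration of the characteristics listed above, the following
sketch creates a string Series and inspects it. The printed dtype depends on the
installed pandas version, and the exception types caught below are an assumption
about how the failing setitem surfaces in pandas 3.0.

.. code-block:: python

    import pandas as pd

    ser = pd.Series(["a", "b", None])

    # Expected to be ``str`` on pandas 3.0 and ``object`` on older versions.
    print(ser.dtype)

    # Missing values are detected the same way on both versions.
    missing = ser.isna().tolist()
    print(missing)

    # On pandas 3.0, assigning a non-string value is expected to fail
    # (assumed here to raise TypeError or ValueError); on older versions
    # with ``object`` dtype the assignment succeeds.
    attempt = ser.copy()
    try:
        attempt.iloc[0] = 1
        rejected = False
    except (TypeError, ValueError):
        rejected = True
    print(rejected)

Code that needs to store mixed Python objects in a column can still request
``object`` dtype explicitly, for example with ``dtype=object`` in the constructor.
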
These intentional changes can have breaking consequences, for example when code
checks whether ``.dtype`` is ``object`` dtype or checks for the exact missing value
sentinel. See the :ref:`string_migration_guide` for more details on the behavior
changes and how to adapt your code to the new default.

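One way to write checks that depend neither on ``object`` dtype nor on a specific
missing value sentinel is sketched below. The helper ``text_columns`` is a
hypothetical name used for illustration only; it relies on
``pandas.api.types.is_string_dtype`` and ``pandas.isna``. Note that on older
pandas versions ``is_string_dtype`` also returns ``True`` when given a plain
``object`` dtype, even if the column holds non-string objects.

.. code-block:: python

    import pandas as pd


    def text_columns(df):
        """Return the names of columns with a string-like dtype (hypothetical helper)."""
        return [
            name
            for name in df.columns
            if pd.api.types.is_string_dtype(df[name].dtype)
        ]


    df = pd.DataFrame({"name": ["x", None], "count": [1, 2]})

    # Instead of ``df["name"].dtype == object``, use a dtype-agnostic check.
    names = text_columns(df)
    print(names)

    # Instead of ``value is None`` or ``value is np.nan``, use ``pd.isna``,
    # which treats both sentinels as missing.
    second_is_missing = bool(pd.isna(df["name"].iloc[1]))
    print(second_is_missing)
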
.. seealso::

    `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__


.. _whatsnew_300.enhancements.copy_on_write:

Copy-on-Write
^^^^^^^^^^^^^

The new "copy-on-write" behavior in pandas 3.0 changes how pandas operates with
respect to copies and views. A summary of the changes:

1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
   including accessing a DataFrame column as a Series) or any method returning a
   new DataFrame or Series always *behaves as if* it were a copy in terms of the user
   API.
2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
   to do this is to directly modify that object itself.

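The two rules above can be sketched as follows. The version check and the
``mode.copy_on_write`` option are only there so that the example also opts in to
Copy-on-Write on pandas 2.x; the DataFrame and its column names are made up for
illustration.

.. code-block:: python

    import pandas as pd

    # Opt in on older versions; pandas 3.0 behaves this way by default.
    if int(pd.__version__.split(".")[0]) < 3:
        pd.options.mode.copy_on_write = True

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Rule 1: a column accessed as a Series behaves as a copy,
    # so modifying it leaves the parent DataFrame unchanged.
    col = df["a"]
    col.iloc[0] = 100
    after_subset_edit = df["a"].tolist()
    print(after_subset_edit)

    # Rule 2: to change ``df``, modify ``df`` itself.
    df.loc[0, "a"] = 100
    after_direct_edit = df["a"].tolist()
    print(after_direct_edit)
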
The main goal of this change is to make the user API more consistent and
predictable. There is now a clear rule: *any* subset or returned
Series/DataFrame **always** behaves as a copy of the original, and thus never
modifies the original. Before pandas 3.0, whether a derived object would be a
copy or a view depended on the exact operation performed, which was often
confusing.

Because every single indexing step now behaves as a copy, "chained assignment"
(updating a DataFrame with multiple setitem steps) stops working. Because
chained assignment now consistently never works, the ``SettingWithCopyWarning``
is removed.

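A minimal sketch of replacing a chained assignment with a single ``.loc`` call,
using a made-up DataFrame. The chained form is left as a comment because it does
not update ``df`` under Copy-on-Write.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

    # Chained assignment uses two separate setitem steps. Under Copy-on-Write
    # the intermediate ``df["foo"]`` behaves as a copy, so ``df`` is not updated:
    # df["foo"][df["bar"] > 5] = 100

    # Select rows and column in one step on the object itself.
    df.loc[df["bar"] > 5, "foo"] = 100
    result = df["foo"].tolist()
    print(result)  # [1, 2, 100]
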
The new behavioral semantics are explained in more detail in the
:ref:`user guide about Copy-on-Write <copy_on_write>`.

A secondary goal is to improve performance by avoiding unnecessary copies. As
mentioned above, every new DataFrame or Series returned from an indexing
operation or method *behaves* as a copy, but under the hood pandas uses
views as much as possible and only copies when needed to guarantee the "behaves
as a copy" behavior. This is the actual "copy-on-write" mechanism, used as an
implementation detail.

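The lazy-copy mechanism can be observed with ``numpy.shares_memory``, as in the
sketch below. This is an illustration under assumptions: whether memory is shared
at a given point is an implementation detail, so the two ``shares_memory`` results
are printed rather than relied upon, and the version check only opts in to
Copy-on-Write on pandas 2.x.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Opt in on older versions; pandas 3.0 behaves this way by default.
    if int(pd.__version__.split(".")[0]) < 3:
        pd.options.mode.copy_on_write = True

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df2 = df.reset_index(drop=True)

    # Before any write, the two objects are expected to share their data.
    shared_before = np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy())

    # Writing to ``df2`` is what triggers the actual copy; ``df`` is unaffected.
    df2.iloc[0, 0] = 100
    shared_after = np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy())

    print(shared_before, shared_after)
    print(df["a"].tolist(), df2["a"].tolist())
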
Some of the behavior changes described above are breaking changes in pandas
3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
2.3 to get deprecation warnings for a subset of those changes. The
:ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
process in more detail.

.. seealso::

    `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__

.. _whatsnew_300.enhancements.enhancement2:
