@@ -14,10 +14,108 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

- .. _whatsnew_300.enhancements.enhancement1:
+ .. _whatsnew_300.enhancements.string_dtype:

- Enhancement1
- ^^^^^^^^^^^^
+ Dedicated string data type by default
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ Historically, pandas represented string columns with the NumPy ``object`` data type.
+ This representation has numerous problems: it is not specific to strings (any
+ Python object can be stored in an ``object``-dtype array, not just strings) and
+ it is often inefficient, in both performance and memory usage.
+
+ Starting with pandas 3.0, a dedicated string data type is enabled by default
+ (backed by PyArrow if it is installed, otherwise falling back to NumPy
+ ``object``-dtype storage). This means that pandas will start inferring
+ columns containing string data as the new ``str`` data type when creating pandas
+ objects, such as in constructors or IO functions.
+
+ Old behavior:
+
+ .. code-block:: python
+
+     >>> ser = pd.Series(["a", "b"])
+     >>> ser
+     0    a
+     1    b
+     dtype: object
+
+ New behavior:
+
+ .. code-block:: python
+
+     >>> ser = pd.Series(["a", "b"])
+     >>> ser
+     0    a
+     1    b
+     dtype: str
+
+ The string data type used in these scenarios will mostly behave as a NumPy
+ ``object``-dtype column would, including missing value semantics and general
+ operations on these columns.
+
+ The main characteristics of the new string data type are:
+
+ - It is inferred by default for string data (instead of ``object`` dtype).
+ - The ``str`` dtype can only hold strings (or missing values), in contrast to
+   ``object`` dtype. Setting a non-string value fails.
+ - The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
+   missing value semantics as the other default dtypes.
+
+ These intentional changes can have breaking consequences, for example when checking
+ whether the ``.dtype`` is ``object`` dtype or checking for the exact missing value sentinel.
+ See the :ref:`string_migration_guide` for more details on the behaviour changes
+ and how to adapt your code to the new default.
+
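As a rough illustration of the kind of adaptation meant here, the sketch below checks for string columns and missing values without depending on the exact dtype or missing value sentinel. It relies only on the public ``pd.api.types.is_string_dtype`` and ``Series.isna``; the helper name ``is_text_column`` and the sample data are hypothetical and not part of pandas.

```python
import pandas as pd


def is_text_column(ser: pd.Series) -> bool:
    """Hypothetical helper: detect string columns without comparing to object dtype."""
    # `ser.dtype == object` stops matching inferred string data once the
    # dedicated string dtype is the default, so use the dtype-agnostic check.
    return pd.api.types.is_string_dtype(ser)


names = pd.Series(["a", "b"])
numbers = pd.Series([1, 2])
with_missing = pd.Series(["a", None])

# Detect missing values with isna() instead of comparing against a
# specific sentinel such as None or np.nan.
missing_mask = with_missing.isna()

print(is_text_column(names), is_text_column(numbers))
print(missing_mask.tolist())
```

Written this way, the checks should give the same answers whether the column was inferred as ``object`` or as the new ``str`` dtype.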
+ .. seealso::
+
+     `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
+
+
+ .. _whatsnew_300.enhancements.copy_on_write:
+
+ Copy-on-Write
+ ^^^^^^^^^^^^^
+
+ The new "copy-on-write" behaviour in pandas 3.0 changes how pandas operates
+ with respect to copies and views. A summary of the changes:
+
+ 1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
+    including accessing a DataFrame column as a Series) or any method returning a
+    new DataFrame or Series always *behaves as if* it were a copy in terms of the user
+    API.
+ 2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
+    to do this is to directly modify that object itself.
+
+ The main goal of this change is to make the user API more consistent and
+ predictable. There is now a clear rule: *any* subset or returned
+ Series/DataFrame **always** behaves as a copy of the original, and thus never
+ modifies the original (before pandas 3.0, whether a derived object would be a
+ copy or a view depended on the exact operation performed, which was often
+ confusing).
+
+ Because every single indexing step now behaves as a copy, this also means that
+ "chained assignment" (updating a DataFrame with multiple setitem steps) will
+ stop working. Because chained assignment now consistently has no effect on the
+ original object, the ``SettingWithCopyWarning`` is removed.
+
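A minimal sketch of what this means in practice, assuming a small hypothetical DataFrame ``df`` with columns ``a`` and ``b``: the chained form is shown only as a comment, and the update is written as a single setitem step with ``.loc`` instead.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Chained assignment: df[df["a"] > 1] produces an intermediate object, and
# the second setitem would only modify that intermediate. Under
# Copy-on-Write this never updates df, so it is left commented out.
# df[df["a"] > 1]["b"] = 0

# Instead, modify df directly in one setitem step.
df.loc[df["a"] > 1, "b"] = 0

print(df["b"].tolist())
```

The ``.loc`` form selects rows and the column in one operation on ``df`` itself, so it does not depend on whether an intermediate result is a copy or a view.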
+ The new behavioral semantics are explained in more detail in the
+ :ref:`user guide about Copy-on-Write <copy_on_write>`.
+
+ A secondary goal is to improve performance by avoiding unnecessary copies. As
+ mentioned above, every new DataFrame or Series returned from an indexing
+ operation or method *behaves* as a copy, but under the hood pandas will use
+ views as much as possible and only copy when needed to guarantee the "behaves
+ as a copy" behaviour (this is the actual "copy-on-write" mechanism, used as an
+ implementation detail).
+
+ Some of the behaviour changes described above are breaking changes in pandas
+ 3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
+ 2.3 to get deprecation warnings for a subset of those changes. The
+ :ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
+ process in more detail.
+
+ .. seealso::
+
+     `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__


.. _whatsnew_300.enhancements.enhancement2: