Commit 2a48068

Merge remote-tracking branch 'upstream/main' into string-dtype-isdigit
2 parents cf26a93 + 3940df8 commit 2a48068

25 files changed

+211
-37
lines changed

.github/workflows/wheels.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -189,7 +189,7 @@ jobs:
189189
# installing wheel here because micromamba step was skipped
190190
if: matrix.buildplat[1] == 'win_arm64'
191191
shell: bash -el {0}
192-
run: python -m pip install wheel
192+
run: python -m pip install wheel anaconda-client
193193

194194
- name: Validate wheel RECORD
195195
shell: bash -el {0}

README.md

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -19,9 +19,9 @@
1919
**pandas** is a Python package that provides fast, flexible, and expressive data
2020
structures designed to make working with "relational" or "labeled" data both
2121
easy and intuitive. It aims to be the fundamental high-level building block for
22-
doing practical, **real world** data analysis in Python. Additionally, it has
23-
the broader goal of becoming **the most powerful and flexible open source data
24-
analysis / manipulation tool available in any language**. It is already well on
22+
doing practical, **real-world** data analysis in Python. Additionally, it has
23+
the broader goal of becoming **the most powerful and flexible open-source data
24+
analysis/manipulation tool available in any language**. It is already well on
2525
its way towards this goal.
2626

2727
## Table of Contents
@@ -64,7 +64,7 @@ Here are just a few of the things that pandas does well:
6464
data sets
6565
- [**Hierarchical**][mi] labeling of axes (possible to have multiple
6666
labels per tick)
67-
- Robust IO tools for loading data from [**flat files**][flat-files]
67+
- Robust I/O tools for loading data from [**flat files**][flat-files]
6868
(CSV and delimited), [**Excel files**][excel], [**databases**][db],
6969
and saving/loading data from the ultrafast [**HDF5 format**][hdfstore]
7070
- [**Time series**][timeseries]-specific functionality: date range
@@ -138,7 +138,7 @@ or for installing in [development mode](https://pip.pypa.io/en/latest/cli/pip_in
138138

139139

140140
```sh
141-
python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true
141+
python -m pip install -ve . --no-build-isolation --config-settings editable-verbose=true
142142
```
143143

144144
See the full instructions for [installing from source](https://pandas.pydata.org/docs/dev/development/contributing_environment.html).
@@ -155,7 +155,7 @@ has been under active development since then.
155155

156156
## Getting Help
157157

158-
For usage questions, the best place to go to is [StackOverflow](https://stackoverflow.com/questions/tagged/pandas).
158+
For usage questions, the best place to go to is [Stack Overflow](https://stackoverflow.com/questions/tagged/pandas).
159159
Further, general questions and discussions can also take place on the [pydata mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata).
160160

161161
## Discussion and Development

doc/source/whatsnew/v2.3.2.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,8 @@ Bug fixes
2828
"string" type in the JSON Table Schema for :class:`StringDtype` columns
2929
(:issue:`61889`)
3030
- Boolean operations (``|``, ``&``, ``^``) with bool-dtype objects on the left and :class:`StringDtype` objects on the right now cast the string to bool, with a deprecation warning (:issue:`60234`)
31+
- Fixed ``~Series.str.match``, ``~Series.str.fullmatch`` and ``~Series.str.contains``
32+
with compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)
3133

3234
.. ---------------------------------------------------------------------------
3335
.. _whatsnew_232.contributors:

doc/source/whatsnew/v3.0.0.rst

Lines changed: 101 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -14,10 +14,108 @@ including other versions of pandas.
1414
Enhancements
1515
~~~~~~~~~~~~
1616

17-
.. _whatsnew_300.enhancements.enhancement1:
17+
.. _whatsnew_300.enhancements.string_dtype:
1818

19-
Enhancement1
20-
^^^^^^^^^^^^
19+
Dedicated string data type by default
20+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
21+
22+
Historically, pandas represented string columns with NumPy ``object`` data type.
23+
This representation has numerous problems: it is not specific to strings (any
24+
Python object can be stored in an ``object``-dtype array, not just strings) and
25+
it is often not very efficient (both performance wise and for memory usage).
26+
27+
Starting with pandas 3.0, a dedicated string data type is enabled by default
28+
(backed by PyArrow under the hood, if installed, otherwise falling back to being
29+
backed by NumPy ``object``-dtype). This means that pandas will start inferring
30+
columns containing string data as the new ``str`` data type when creating pandas
31+
objects, such as in constructors or IO functions.
32+
33+
Old behavior:
34+
35+
.. code-block:: python
36+
37+
>>> ser = pd.Series(["a", "b"])
38+
0 a
39+
1 b
40+
dtype: object
41+
42+
New behavior:
43+
44+
.. code-block:: python
45+
46+
>>> ser = pd.Series(["a", "b"])
47+
0 a
48+
1 b
49+
dtype: str
50+
51+
The string data type that is used in these scenarios will mostly behave as NumPy
52+
object would, including missing value semantics and general operations on these
53+
columns.
54+
55+
The main characteristic of the new string data type:
56+
57+
- Inferred by default for string data (instead of object dtype)
58+
- The ``str`` dtype can only hold strings (or missing values), in contrast to
59+
``object`` dtype. (setitem with non string fails)
60+
- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
61+
missing value semantics as the other default dtypes.
62+
63+
Those intentional changes can have breaking consequences, for example when checking
64+
for the ``.dtype`` being object dtype or checking the exact missing value sentinel.
65+
See the :ref:`string_migration_guide` for more details on the behaviour changes
66+
and how to adapt your code to the new default.
67+
68+
.. seealso::
69+
70+
`PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
71+
72+
73+
.. _whatsnew_300.enhancements.copy_on_write:
74+
75+
Copy-on-Write
76+
^^^^^^^^^^^^^
77+
78+
The new "copy-on-write" behaviour in pandas 3.0 brings changes in behavior in
79+
how pandas operates with respect to copies and views. A summary of the changes:
80+
81+
1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
82+
i.e. including accessing a DataFrame column as a Series) or any method returning a
83+
new DataFrame or Series, always *behaves as if* it were a copy in terms of user
84+
API.
85+
2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
86+
to do this is to directly modify that object itself.
87+
88+
The main goal of this change is to make the user API more consistent and
89+
predictable. There is now a clear rule: *any* subset or returned
90+
series/dataframe **always** behaves as a copy of the original, and thus never
91+
modifies the original (before pandas 3.0, whether a derived object would be a
92+
copy or a view depended on the exact operation performed, which was often
93+
confusing).
94+
95+
Because every single indexing step now behaves as a copy, this also means that
96+
"chained assignment" (updating a DataFrame with multiple setitem steps) will
97+
stop working. Because this now consistently never works, the
98+
``SettingWithCopyWarning`` is removed.
99+
100+
The new behavioral semantics are explained in more detail in the
101+
:ref:`user guide about Copy-on-Write <copy_on_write>`.
102+
103+
A secondary goal is to improve performance by avoiding unnecessary copies. As
104+
mentioned above, every new DataFrame or Series returned from an indexing
105+
operation or method *behaves* as a copy, but under the hood pandas will use
106+
views as much as possible, and only copy when needed to guarantee the "behaves
107+
as a copy" behaviour (this is the actual "copy-on-write" mechanism used as an
108+
implementation detail).
109+
110+
Some of the behaviour changes described above are breaking changes in pandas
111+
3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
112+
2.3 to get deprecation warnings for a subset of those changes. The
113+
:ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
114+
process in more detail.
115+
116+
.. seealso::
117+
118+
`PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__
21119

22120
.. _whatsnew_300.enhancements.enhancement2:
23121

pandas/_config/config.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -693,8 +693,8 @@ def _get_registered_option(key: str):
693693

694694
def _translate_key(key: str) -> str:
695695
"""
696-
if key id deprecated and a replacement key defined, will return the
697-
replacement key, otherwise returns `key` as - is
696+
if `key` is deprecated and a replacement key defined, will return the
697+
replacement key, otherwise returns `key` as-is
698698
"""
699699
d = _get_deprecated_option(key)
700700
if d:
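
The corrected docstring describes a simple lookup rule: a deprecated key with a replacement resolves to the replacement, and anything else is returned unchanged. A minimal sketch of that rule follows. The registry, the `DeprecatedOption` tuple, its `rkey` field, and the option names are all hypothetical and chosen for illustration; they are not taken from the pandas source shown in this hunk.

```python
from typing import NamedTuple, Optional


class DeprecatedOption(NamedTuple):
    """Hypothetical record for a deprecated option (illustration only)."""

    key: str
    rkey: Optional[str] = None  # replacement key, if one was defined


# Hypothetical registry: one deprecated key with a replacement, one without.
_deprecated = {
    "old.option": DeprecatedOption("old.option", rkey="new.option"),
    "gone.option": DeprecatedOption("gone.option"),
}


def translate_key(key: str) -> str:
    """Return the replacement key if `key` is deprecated and has one, else `key` as-is."""
    d = _deprecated.get(key)
    if d and d.rkey:
        return d.rkey
    return key


print(translate_key("old.option"))   # deprecated, with replacement
print(translate_key("gone.option"))  # deprecated, no replacement
print(translate_key("other.option")) # not deprecated
```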

pandas/_version.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -581,7 +581,7 @@ def render_git_describe(pieces):
581581
def render_git_describe_long(pieces):
582582
"""TAG-DISTANCE-gHEX[-dirty].
583583
584-
Like 'git describe --tags --dirty --always -long'.
584+
Like 'git describe --tags --dirty --always --long'.
585585
The distance/hash is unconditional.
586586
587587
Exceptions:

pandas/core/accessor.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -88,7 +88,7 @@ def _add_delegate_accessors(
8888
cls
8989
Class to add the methods/properties to.
9090
delegate
91-
Class to get methods/properties and doc-strings.
91+
Class to get methods/properties and docstrings.
9292
accessors : list of str
9393
List of accessors to add.
9494
typ : {'property', 'method'}
@@ -159,7 +159,7 @@ def delegate_names(
159159
Parameters
160160
----------
161161
delegate : object
162-
The class to get methods/properties & doc-strings.
162+
The class to get methods/properties & docstrings.
163163
accessors : Sequence[str]
164164
List of accessor to add.
165165
typ : {'property', 'method'}

pandas/core/arrays/_arrow_string_mixins.py

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -309,23 +309,29 @@ def _str_contains(
309309

310310
def _str_match(
311311
self,
312-
pat: str,
312+
pat: str | re.Pattern,
313313
case: bool = True,
314314
flags: int = 0,
315315
na: Scalar | lib.NoDefault = lib.no_default,
316316
):
317-
if not pat.startswith("^"):
317+
if isinstance(pat, re.Pattern):
318+
# GH#61952
319+
pat = pat.pattern
320+
if isinstance(pat, str) and not pat.startswith("^"):
318321
pat = f"^{pat}"
319322
return self._str_contains(pat, case, flags, na, regex=True)
320323

321324
def _str_fullmatch(
322325
self,
323-
pat,
326+
pat: str | re.Pattern,
324327
case: bool = True,
325328
flags: int = 0,
326329
na: Scalar | lib.NoDefault = lib.no_default,
327330
):
328-
if not pat.endswith("$") or pat.endswith("\\$"):
331+
if isinstance(pat, re.Pattern):
332+
# GH#61952
333+
pat = pat.pattern
334+
if isinstance(pat, str) and (not pat.endswith("$") or pat.endswith("\\$")):
329335
pat = f"{pat}$"
330336
return self._str_match(pat, case, flags, na)
331337
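
The anchoring logic in this hunk can be tried in isolation with the standard `re` module. The sketch below mirrors the two checks from the diff (unwrap a compiled pattern to its source string, then add `^` or `$` when missing). The helper names `to_match_pattern` and `to_fullmatch_pattern` are hypothetical, and the sketch does not involve PyArrow or pandas at all.

```python
import re


def to_match_pattern(pat):
    """Anchor a pattern at the start, mirroring the check in the diff's _str_match."""
    if isinstance(pat, re.Pattern):
        # Unwrap the compiled regex to its source string. Flags stored on the
        # compiled object (pat.flags) are not part of pat.pattern.
        pat = pat.pattern
    if isinstance(pat, str) and not pat.startswith("^"):
        pat = f"^{pat}"
    return pat


def to_fullmatch_pattern(pat):
    """Anchor a pattern at both ends, mirroring the check in the diff's _str_fullmatch."""
    if isinstance(pat, re.Pattern):
        pat = pat.pattern
    # A trailing "\$" is an escaped literal dollar sign, so an anchor is still needed.
    if isinstance(pat, str) and (not pat.endswith("$") or pat.endswith("\\$")):
        pat = f"{pat}$"
    return to_match_pattern(pat)


compiled = re.compile("a.c")
print(to_match_pattern(compiled))
print(to_fullmatch_pattern(compiled))
print(bool(re.search(to_fullmatch_pattern(compiled), "abc")))
print(bool(re.search(to_fullmatch_pattern(compiled), "abcd")))
```

One thing the sketch makes visible: `re.compile("ab", re.IGNORECASE).pattern` is just `"ab"`, so any flags compiled into the object have to be handled separately from the pattern string.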

pandas/core/arrays/boolean.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -378,7 +378,7 @@ def _logical_method(self, other, op): # type: ignore[override]
378378
elif is_list_like(other):
379379
other = np.asarray(other, dtype="bool")
380380
if other.ndim > 1:
381-
raise NotImplementedError("can only perform ops with 1-d structures")
381+
return NotImplemented
382382
other, mask = coerce_to_array(other, copy=False)
383383
elif isinstance(other, np.bool_):
384384
other = other.item()
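
The switch from raising `NotImplementedError` to returning `NotImplemented` changes how Python dispatches the operator. The commit does not state its motivation, so the following is only a generic illustration of the language mechanism, using hypothetical classes (`Left`, `RaisingLeft`, `Right`) rather than pandas arrays: when a binary dunder method returns `NotImplemented`, Python goes on to try the reflected method of the right-hand operand, whereas a raised exception ends the operation immediately.

```python
class Left:
    """Hypothetical operand whose __or__ declines types it does not understand."""

    def __or__(self, other):
        if isinstance(other, Left):
            return "Left.__or__"
        # Decline politely; Python will then try other.__ror__(self).
        return NotImplemented


class RaisingLeft:
    """Hypothetical operand that raises instead of declining."""

    def __or__(self, other):
        raise NotImplementedError("can only perform ops with 1-d structures")


class Right:
    """Hypothetical operand that knows how to handle the reflected operation."""

    def __ror__(self, other):
        return "Right.__ror__"


# Returning NotImplemented lets the right-hand operand take over.
result = Left() | Right()
print(result)

# Raising prevents the fallback from ever being attempted.
try:
    RaisingLeft() | Right()
    fallback_reached = True
except NotImplementedError:
    fallback_reached = False
print(fallback_reached)
```

If neither operand handles the operation, Python itself raises `TypeError`, so returning `NotImplemented` still produces an error for unsupported operand types.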

pandas/core/arrays/string_arrow.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -346,6 +346,8 @@ def _str_contains(
346346
):
347347
if flags:
348348
return super()._str_contains(pat, case, flags, na, regex)
349+
if isinstance(pat, re.Pattern):
350+
pat = pat.pattern
349351

350352
return ArrowStringArrayMixin._str_contains(self, pat, case, flags, na, regex)
351353
