Commit dc8ec97

Merge remote-tracking branch 'upstream/main' into read-csv-from-directory
2 parents aaa95ae + 1bd75cc commit dc8ec97

9 files changed: +171 -13 lines changed

doc/source/whatsnew/v2.3.2.rst

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ Bug fixes
   "string" type in the JSON Table Schema for :class:`StringDtype` columns
   (:issue:`61889`)
 - Boolean operations (``|``, ``&``, ``^``) with bool-dtype objects on the left and :class:`StringDtype` objects on the right now cast the string to bool, with a deprecation warning (:issue:`60234`)
+- Fixed :meth:`~Series.str.match`, :meth:`~Series.str.fullmatch` and :meth:`~Series.str.contains`
+  with compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)

 .. ---------------------------------------------------------------------------
 .. _whatsnew_232.contributors:

doc/source/whatsnew/v3.0.0.rst

Lines changed: 101 additions & 3 deletions
@@ -14,10 +14,108 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~

-.. _whatsnew_300.enhancements.enhancement1:
+.. _whatsnew_300.enhancements.string_dtype:

-Enhancement1
-^^^^^^^^^^^^
+Dedicated string data type by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Historically, pandas represented string columns with NumPy ``object`` data type.
+This representation has numerous problems: it is not specific to strings (any
+Python object can be stored in an ``object``-dtype array, not just strings) and
+it is often not very efficient (both performance-wise and in memory usage).
+
+Starting with pandas 3.0, a dedicated string data type is enabled by default
+(backed by PyArrow under the hood, if installed, otherwise falling back to being
+backed by NumPy ``object``-dtype). This means that pandas will start inferring
+columns containing string data as the new ``str`` data type when creating pandas
+objects, such as in constructors or IO functions.
+
+Old behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: object
+
+New behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: str
+
+The string data type that is used in these scenarios will mostly behave as NumPy
+object would, including missing value semantics and general operations on these
+columns.
+
+The main characteristics of the new string data type:
+
+- Inferred by default for string data (instead of object dtype)
+- The ``str`` dtype can only hold strings (or missing values), in contrast to
+  ``object`` dtype (setting a non-string value fails).
+- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
+  missing value semantics as the other default dtypes.
+
+These intentional changes can have breaking consequences, for example when checking
+for the ``.dtype`` being object dtype or checking the exact missing value sentinel.
+See the :ref:`string_migration_guide` for more details on the behaviour changes
+and how to adapt your code to the new default.
+
+.. seealso::
+
+    `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
+
+
+.. _whatsnew_300.enhancements.copy_on_write:
+
+Copy-on-Write
+^^^^^^^^^^^^^
+
+The new "copy-on-write" behaviour in pandas 3.0 changes how pandas operates
+with respect to copies and views. A summary of the changes:
+
+1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
+   i.e. including accessing a DataFrame column as a Series) or any method returning a
+   new DataFrame or Series, always *behaves as if* it were a copy in terms of user
+   API.
+2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
+   to do this is to directly modify that object itself.
+
+The main goal of this change is to make the user API more consistent and
+predictable. There is now a clear rule: *any* subset or returned
+Series/DataFrame **always** behaves as a copy of the original, and thus never
+modifies the original (before pandas 3.0, whether a derived object would be a
+copy or a view depended on the exact operation performed, which was often
+confusing).
+
+Because every single indexing step now behaves as a copy, this also means that
+"chained assignment" (updating a DataFrame with multiple setitem steps) will
+stop working. Because this now consistently never works, the
+``SettingWithCopyWarning`` is removed.
+
+The new behavioral semantics are explained in more detail in the
+:ref:`user guide about Copy-on-Write <copy_on_write>`.
+
+A secondary goal is to improve performance by avoiding unnecessary copies. As
+mentioned above, every new DataFrame or Series returned from an indexing
+operation or method *behaves* as a copy, but under the hood pandas will use
+views as much as possible, and only copy when needed to guarantee the "behaves
+as a copy" behaviour (this is the actual "copy-on-write" mechanism used as an
+implementation detail).
+
+Some of the behaviour changes described above are breaking changes in pandas
+3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
+2.3 to get deprecation warnings for a subset of those changes. The
+:ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
+process in more detail.
+
+.. seealso::
+
+    `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__

 .. _whatsnew_300.enhancements.enhancement2:
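
The "behaves as a copy, but only copies when needed" mechanism described in the Copy-on-Write text above can be illustrated with a toy sketch in plain Python. This is not how pandas implements it: the `CowSeries` and `_Buffer` names and the simple holder counter are hypothetical and exist only to show the idea of sharing memory until the first write.

```python
class _Buffer:
    """Hypothetical shared storage with a count of objects holding it."""

    def __init__(self, values):
        self.values = list(values)
        self.holders = 1


class CowSeries:
    """Toy container: derived objects share memory until one of them is written."""

    def __init__(self, values=None, _buffer=None):
        self._buffer = _buffer if _buffer is not None else _Buffer(values)

    def derive(self):
        # No data is copied here; the new object only shares the buffer.
        self._buffer.holders += 1
        return CowSeries(_buffer=self._buffer)

    def __getitem__(self, index):
        return self._buffer.values[index]

    def __setitem__(self, index, value):
        if self._buffer.holders > 1:
            # Another object shares this buffer: copy before writing.
            self._buffer.holders -= 1
            self._buffer = _Buffer(self._buffer.values)
        self._buffer.values[index] = value

    def shares_memory_with(self, other):
        return self._buffer is other._buffer


original = CowSeries([1, 2, 3])
derived = original.derive()
shared_before = original.shares_memory_with(derived)  # still one buffer

derived[0] = 99  # triggers the lazy copy
shared_after = original.shares_memory_with(derived)

print(original[0], derived[0], shared_before, shared_after)
```

Writing to `derived` leaves `original` untouched, which mirrors the rule that a derived object never modifies its parent, while the copy itself is deferred until the write happens.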

pandas/core/arrays/_arrow_string_mixins.py

Lines changed: 10 additions & 4 deletions
@@ -302,23 +302,29 @@ def _str_contains(

     def _str_match(
         self,
-        pat: str,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
-        if not pat.startswith("^"):
+        if isinstance(pat, re.Pattern):
+            # GH#61952
+            pat = pat.pattern
+        if isinstance(pat, str) and not pat.startswith("^"):
             pat = f"^{pat}"
         return self._str_contains(pat, case, flags, na, regex=True)

     def _str_fullmatch(
         self,
-        pat,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
-        if not pat.endswith("$") or pat.endswith("\\$"):
+        if isinstance(pat, re.Pattern):
+            # GH#61952
+            pat = pat.pattern
+        if isinstance(pat, str) and (not pat.endswith("$") or pat.endswith("\\$")):
             pat = f"{pat}$"
         return self._str_match(pat, case, flags, na)
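
The hunk above reduces `match` and `fullmatch` to a "contains" search by taking the pattern text and anchoring it. A minimal stdlib-only sketch of that idea follows. The helper names `to_pattern_text`, `anchor_start` and `anchor_end` are hypothetical and not part of pandas, and `re.search` stands in for the Arrow-backed contains kernel.

```python
import re


def to_pattern_text(pat):
    """Return the pattern text for a str or a compiled regex."""
    return pat.pattern if isinstance(pat, re.Pattern) else pat


def anchor_start(pat):
    """Prefix "^" so that a search behaves like a match at the start."""
    text = to_pattern_text(pat)
    return text if text.startswith("^") else f"^{text}"


def anchor_end(pat):
    """Append "$" unless the pattern already ends with an unescaped "$"."""
    text = to_pattern_text(pat)
    if not text.endswith("$") or text.endswith("\\$"):
        text = f"{text}$"
    return text


compiled = re.compile(r"ab", re.IGNORECASE)
start_anchored = anchor_start(compiled)
both_anchored = anchor_start(anchor_end(compiled))

values = ["ab", "abc", "xab"]
match_like = [re.search(start_anchored, v) is not None for v in values]
fullmatch_like = [re.search(both_anchored, v) is not None for v in values]

# The pattern text does not carry the flags the regex was compiled with.
flags_on_compiled = bool(compiled.flags & re.IGNORECASE)
matches_upper_without_flags = re.search(start_anchored, "AB") is not None

print(start_anchored, both_anchored, match_like, fullmatch_like)
```

One consequence visible in the sketch: `Pattern.pattern` is only the pattern text, so any flags baked into the compiled regex have to be supplied separately (in the diff, through the `case` and `flags` arguments).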

pandas/core/arrays/boolean.py

Lines changed: 1 addition & 1 deletion
@@ -378,7 +378,7 @@ def _logical_method(self, other, op): # type: ignore[override]
         elif is_list_like(other):
             other = np.asarray(other, dtype="bool")
             if other.ndim > 1:
-                raise NotImplementedError("can only perform ops with 1-d structures")
+                return NotImplemented
             other, mask = coerce_to_array(other, copy=False)
         elif isinstance(other, np.bool_):
             other = other.item()
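
The change above swaps a raised `NotImplementedError` for a returned `NotImplemented`. In Python's operator protocol the two are different: returning `NotImplemented` lets the interpreter try the reflected method on the other operand, and only raises `TypeError` if that also fails. A minimal sketch with two hypothetical classes (`Flags` and `Grid`, neither of which exists in pandas):

```python
class Flags:
    """Hypothetical 1-D boolean container."""

    def __init__(self, values):
        self.values = [bool(v) for v in values]

    def __or__(self, other):
        if not isinstance(other, Flags):
            # Signal "I cannot handle this operand" and let Python try
            # the reflected operation on `other`.
            return NotImplemented
        return Flags(a or b for a, b in zip(self.values, other.values))


class Grid:
    """Hypothetical 2-D container that knows how to combine with Flags."""

    def __init__(self, rows):
        self.rows = [[bool(v) for v in row] for row in rows]

    def __ror__(self, left):
        if not isinstance(left, Flags):
            return NotImplemented
        # Apply the 1-D flags to every row.
        return Grid([[a or b for a, b in zip(left.values, row)] for row in self.rows])


# Flags.__or__ returns NotImplemented, so Grid.__ror__ gets a chance.
result = Flags([True, False]) | Grid([[False, False], [False, True]])

# With an operand that has no reflected method, Python raises TypeError.
try:
    Flags([True]) | object()
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(result.rows, raised_type_error)
```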

pandas/core/arrays/string_arrow.py

Lines changed: 2 additions & 0 deletions
@@ -346,6 +346,8 @@ def _str_contains(
     ):
         if flags:
             return super()._str_contains(pat, case, flags, na, regex)
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern

         return ArrowStringArrayMixin._str_contains(self, pat, case, flags, na, regex)

pandas/core/strings/accessor.py

Lines changed: 2 additions & 2 deletions
@@ -1361,8 +1361,8 @@ def match(self, pat: str, case: bool = True, flags: int = 0, na=lib.no_default):

         Parameters
         ----------
-        pat : str
-            Character sequence.
+        pat : str or compiled regex
+            Character sequence or regular expression.
         case : bool, default True
             If True, case sensitive.
         flags : int, default 0 (no flags)

pandas/core/strings/object_array.py

Lines changed: 5 additions & 3 deletions
@@ -248,14 +248,15 @@ def rep(x, r):

     def _str_match(
         self,
-        pat: str,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
         if not case:
             flags |= re.IGNORECASE
-
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
         regex = re.compile(pat, flags=flags)

         f = lambda x: regex.match(x) is not None
@@ -270,7 +271,8 @@ def _str_fullmatch(
     ):
         if not case:
             flags |= re.IGNORECASE
-
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
         regex = re.compile(pat, flags=flags)

         f = lambda x: regex.fullmatch(x) is not None
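
A likely reason for extracting `pat.pattern` before calling `re.compile` in the object-dtype path: the standard library refuses to combine an already compiled pattern with a non-zero `flags` argument. The sketch below uses only the `re` module and reuses the sample values from the tests in this commit; it does not call pandas.

```python
import re

compiled = re.compile(r"ab")

# Passing a compiled pattern together with flags raises ValueError.
error_message = None
try:
    re.compile(compiled, flags=re.IGNORECASE)
except ValueError as exc:
    error_message = str(exc)

# Recompiling from the pattern text accepts the flags.
regex = re.compile(compiled.pattern, flags=re.IGNORECASE)

values = ["ab", "AB", "abc", "ABC"]
match_results = [regex.match(v) is not None for v in values]
fullmatch_results = [regex.fullmatch(v) is not None for v in values]

print(error_message)
print(match_results, fullmatch_results)
```

The two result lists line up with the expectations in `test_match_compiled_regex` and `test_fullmatch_compiled_regex`, which call the accessor with `case=False`.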

pandas/tests/arithmetic/test_numeric.py

Lines changed: 13 additions & 0 deletions
@@ -862,6 +862,19 @@ def test_modulo_zero_int(self):
         expected = Series([np.nan, 0.0])
         tm.assert_series_equal(result, expected)

+    def test_non_1d_ea_raises_notimplementederror(self):
+        # GH#61866
+        ea_array = array([1, 2, 3, 4, 5], dtype="Int64").reshape(5, 1)
+        np_array = np.array([1, 2, 3, 4, 5], dtype=np.int64).reshape(5, 1)
+
+        msg = "can only perform ops with 1-d structures"
+
+        with pytest.raises(NotImplementedError, match=msg):
+            ea_array * np_array
+
+        with pytest.raises(NotImplementedError, match=msg):
+            np_array * ea_array
+

 class TestAdditionSubtraction:
     # __add__, __sub__, __radd__, __rsub__, __iadd__, __isub__

pandas/tests/strings/test_find_replace.py

Lines changed: 35 additions & 0 deletions
@@ -281,6 +281,19 @@ def test_contains_nan(any_string_dtype):
     tm.assert_series_equal(result, expected)


+def test_contains_compiled_regex(any_string_dtype):
+    # GH#61942
+    ser = Series(["foo", "bar", "baz"], dtype=any_string_dtype)
+    pat = re.compile("ba.")
+    result = ser.str.contains(pat)
+
+    expected_dtype = (
+        np.bool_ if is_object_or_nan_string_dtype(any_string_dtype) else "boolean"
+    )
+    expected = Series([False, True, True], dtype=expected_dtype)
+    tm.assert_series_equal(result, expected)
+
+
 # --------------------------------------------------------------------------------------
 # str.startswith
 # --------------------------------------------------------------------------------------
@@ -818,6 +831,17 @@ def test_match_case_kwarg(any_string_dtype):
     tm.assert_series_equal(result, expected)


+def test_match_compiled_regex(any_string_dtype):
+    # GH#61952
+    values = Series(["ab", "AB", "abc", "ABC"], dtype=any_string_dtype)
+    result = values.str.match(re.compile(r"ab"), case=False)
+    expected_dtype = (
+        np.bool_ if is_object_or_nan_string_dtype(any_string_dtype) else "boolean"
+    )
+    expected = Series([True, True, True, True], dtype=expected_dtype)
+    tm.assert_series_equal(result, expected)
+
+
 # --------------------------------------------------------------------------------------
 # str.fullmatch
 # --------------------------------------------------------------------------------------
@@ -887,6 +911,17 @@ def test_fullmatch_case_kwarg(any_string_dtype):
     tm.assert_series_equal(result, expected)


+def test_fullmatch_compiled_regex(any_string_dtype):
+    # GH#61952
+    values = Series(["ab", "AB", "abc", "ABC"], dtype=any_string_dtype)
+    result = values.str.fullmatch(re.compile(r"ab"), case=False)
+    expected_dtype = (
+        np.bool_ if is_object_or_nan_string_dtype(any_string_dtype) else "boolean"
+    )
+    expected = Series([True, True, False, False], dtype=expected_dtype)
+    tm.assert_series_equal(result, expected)
+
+
 # --------------------------------------------------------------------------------------
 # str.findall
 # --------------------------------------------------------------------------------------
