Merge remote-tracking branch 'upstream/main' into read-csv-from-directory

fangchenli · fangchenli · commit dc8ec97831dc · 2025-08-15T09:29:21.000-07:00
diff --git a/doc/source/whatsnew/v2.3.2.rst b/doc/source/whatsnew/v2.3.2.rst
@@ -26,6 +26,8 @@ Bug fixes
   "string" type in the JSON Table Schema for :class:`StringDtype` columns
   (:issue:`61889`)
 - Boolean operations (``|``, ``&``, ``^``) with bool-dtype objects on the left and :class:`StringDtype` objects on the right now cast the string to bool, with a deprecation warning (:issue:`60234`)
+- Fixed :meth:`~Series.str.match`, :meth:`~Series.str.fullmatch` and :meth:`~Series.str.contains`
+  when passed a compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_232.contributors:
diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst
@@ -14,10 +14,108 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~
 
-.. _whatsnew_300.enhancements.enhancement1:
+.. _whatsnew_300.enhancements.string_dtype:
 
-Enhancement1
-^^^^^^^^^^^^
+Dedicated string data type by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Historically, pandas represented string columns with the NumPy ``object`` data type.
+This representation has numerous problems: it is not specific to strings (any
+Python object can be stored in an ``object``-dtype array, not just strings) and
+it is often inefficient, both in performance and in memory usage.
+
+Starting with pandas 3.0, a dedicated string data type is enabled by default
+(backed by PyArrow under the hood, if installed, otherwise falling back to being
+backed by NumPy ``object``-dtype). This means that pandas will start inferring
+columns containing string data as the new ``str`` data type when creating pandas
+objects, such as in constructors or IO functions.
+
+Old behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: object
+
+New behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: str
+
+The string data type used in these scenarios will mostly behave as the NumPy
+``object`` dtype would, including missing value semantics and general operations
+on these columns.
+
+The main characteristics of the new string data type are:
+
+- It is inferred by default for string data (instead of ``object`` dtype).
+- The ``str`` dtype can only hold strings (or missing values), in contrast to
+  ``object`` dtype; setting a non-string value fails.
+- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
+  missing value semantics as the other default dtypes.
+
+These intentional changes can have breaking consequences, for example when checking
+whether ``.dtype`` is ``object`` dtype or checking the exact missing value sentinel.
+See the :ref:`string_migration_guide` for more details on the behaviour changes
+and how to adapt your code to the new default.
+
+.. seealso::
+
+    `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
+
+
+.. _whatsnew_300.enhancements.copy_on_write:
+
+Copy-on-Write
+^^^^^^^^^^^^^
+
+The new "copy-on-write" behaviour in pandas 3.0 changes how pandas operates
+with respect to copies and views. A summary of the changes:
+
+1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
+   i.e. including accessing a DataFrame column as a Series) or any method returning a
+   new DataFrame or Series, always *behaves as if* it were a copy in terms of user
+   API.
+2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
+   to do this is to directly modify that object itself.
+
+The main goal of this change is to make the user API more consistent and
+predictable. There is now a clear rule: *any* subset or returned
+Series/DataFrame **always** behaves as a copy of the original, and thus never
+modifies the original (before pandas 3.0, whether a derived object would be a
+copy or a view depended on the exact operation performed, which was often
+confusing).
+
+Because every single indexing step now behaves as a copy, this also means that
+"chained assignment" (updating a DataFrame with multiple setitem steps) will
+stop working. Since chained assignment now never works, the
+``SettingWithCopyWarning`` is removed.
+
+The new behavioral semantics are explained in more detail in the
+:ref:`user guide about Copy-on-Write <copy_on_write>`.
+
+A secondary goal is to improve performance by avoiding unnecessary copies. As
+mentioned above, every new DataFrame or Series returned from an indexing
+operation or method *behaves* as a copy, but under the hood pandas will use
+views as much as possible, and only copy when needed to guarantee the "behaves
+as a copy" behaviour (this is the actual "copy-on-write" mechanism used as an
+implementation detail).
+
+Some of the behaviour changes described above are breaking changes in pandas
+3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
+2.3 to get deprecation warnings for a subset of those changes. The
+:ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
+process in more detail.
+
+.. seealso::
+
+    `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__
 
 .. _whatsnew_300.enhancements.enhancement2:
 
diff --git a/pandas/core/arrays/_arrow_string_mixins.py b/pandas/core/arrays/_arrow_string_mixins.py
@@ -302,23 +302,29 @@ def _str_contains(
 
     def _str_match(
         self,
-        pat: str,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
-        if not pat.startswith("^"):
+        if isinstance(pat, re.Pattern):
+            # GH#61952
+            pat = pat.pattern
+        if isinstance(pat, str) and not pat.startswith("^"):
             pat = f"^{pat}"
         return self._str_contains(pat, case, flags, na, regex=True)
 
     def _str_fullmatch(
         self,
-        pat,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
-        if not pat.endswith("$") or pat.endswith("\\$"):
+        if isinstance(pat, re.Pattern):
+            # GH#61952
+            pat = pat.pattern
+        if isinstance(pat, str) and (not pat.endswith("$") or pat.endswith("\\$")):
             pat = f"{pat}$"
         return self._str_match(pat, case, flags, na)
 
diff --git a/pandas/core/arrays/boolean.py b/pandas/core/arrays/boolean.py
@@ -378,7 +378,7 @@ def _logical_method(self, other, op):  # type: ignore[override]
         elif is_list_like(other):
             other = np.asarray(other, dtype="bool")
             if other.ndim > 1:
-                raise NotImplementedError("can only perform ops with 1-d structures")
+                return NotImplemented
             other, mask = coerce_to_array(other, copy=False)
         elif isinstance(other, np.bool_):
             other = other.item()
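Review note: the hunk above swaps a raised `NotImplementedError` for `return NotImplemented`. A minimal sketch of the Python binary-operator protocol this appears to rely on, using hypothetical classes `Left` and `Right` that are not part of pandas:

```python
class Left:
    """Hypothetical stand-in for an array type defining a binary operator."""

    def __or__(self, other):
        if not isinstance(other, Left):
            # Signal "this operand is not handled here"; Python then
            # tries the reflected method on the right-hand operand.
            return NotImplemented
        return "Left | Left"


class Right:
    """Hypothetical operand that knows how to combine with Left."""

    def __ror__(self, other):
        return "handled by Right.__ror__"


# Left.__or__ declines, so Python falls back to Right.__ror__.
result = Left() | Right()

# If neither side handles the operation, Python itself raises TypeError.
try:
    Left() | object()
    fallback_error = None
except TypeError as exc:
    fallback_error = type(exc).__name__
```

Returning the `NotImplemented` singleton gives the other operand a chance to handle the operation, whereas raising `NotImplementedError` stops dispatch immediately.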
diff --git a/pandas/core/arrays/string_arrow.py b/pandas/core/arrays/string_arrow.py
@@ -346,6 +346,8 @@ def _str_contains(
     ):
         if flags:
             return super()._str_contains(pat, case, flags, na, regex)
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
 
         return ArrowStringArrayMixin._str_contains(self, pat, case, flags, na, regex)
 
diff --git a/pandas/core/strings/accessor.py b/pandas/core/strings/accessor.py
@@ -1361,8 +1361,8 @@ def match(self, pat: str, case: bool = True, flags: int = 0, na=lib.no_default):
 
         Parameters
         ----------
-        pat : str
-            Character sequence.
+        pat : str or compiled regex
+            Character sequence or regular expression.
         case : bool, default True
             If True, case sensitive.
         flags : int, default 0 (no flags)
diff --git a/pandas/core/strings/object_array.py b/pandas/core/strings/object_array.py
@@ -248,14 +248,15 @@ def rep(x, r):
 
     def _str_match(
         self,
-        pat: str,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
         if not case:
             flags |= re.IGNORECASE
-
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
         regex = re.compile(pat, flags=flags)
 
         f = lambda x: regex.match(x) is not None
@@ -270,7 +271,8 @@ def _str_fullmatch(
     ):
         if not case:
             flags |= re.IGNORECASE
-
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
         regex = re.compile(pat, flags=flags)
 
         f = lambda x: regex.fullmatch(x) is not None
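Review note: the compiled-regex handling added to `_str_match` and `_str_fullmatch` in the hunks above can be sketched with the standard library alone. The helper name `normalize_pattern` and its `anchor_end` parameter are hypothetical, introduced only for illustration; the anchoring conditions are copied from the `_arrow_string_mixins.py` hunk.

```python
import re


def normalize_pattern(pat, anchor_end=False):
    """Hypothetical helper mirroring the pattern handling in the diff."""
    # A compiled pattern is reduced to its source string. Flags compiled
    # into the pattern object are not carried over by this step.
    if isinstance(pat, re.Pattern):
        pat = pat.pattern
    # fullmatch: append "$" unless the pattern already ends with an
    # unescaped "$" (a trailing "\$" is a literal dollar sign).
    if anchor_end and (not pat.endswith("$") or pat.endswith("\\$")):
        pat = f"{pat}$"
    # match: anchor at the start of the string.
    if not pat.startswith("^"):
        pat = f"^{pat}"
    return pat


match_pat = normalize_pattern(re.compile(r"ab"))
fullmatch_pat = normalize_pattern(re.compile(r"ab"), anchor_end=True)
literal_dollar = normalize_pattern(r"cost\$", anchor_end=True)

# The source string of a compiled pattern does not include its flags.
flag_free = re.compile(r"ab", re.IGNORECASE).pattern
```

In this sketch, taking `.pattern` keeps only the regex source, so case-insensitivity has to come from the separate `case`/`flags` arguments, as in the new tests that pass `case=False`.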
diff --git a/pandas/tests/arithmetic/test_numeric.py b/pandas/tests/arithmetic/test_numeric.py
@@ -862,6 +862,19 @@ def test_modulo_zero_int(self):
             expected = Series([np.nan, 0.0])
             tm.assert_series_equal(result, expected)
 
+    def test_non_1d_ea_raises_notimplementederror(self):
+        # GH#61866
+        ea_array = array([1, 2, 3, 4, 5], dtype="Int64").reshape(5, 1)
+        np_array = np.array([1, 2, 3, 4, 5], dtype=np.int64).reshape(5, 1)
+
+        msg = "can only perform ops with 1-d structures"
+
+        with pytest.raises(NotImplementedError, match=msg):
+            ea_array * np_array
+
+        with pytest.raises(NotImplementedError, match=msg):
+            np_array * ea_array
+
 
 class TestAdditionSubtraction:
     # __add__, __sub__, __radd__, __rsub__, __iadd__, __isub__
diff --git a/pandas/tests/strings/test_find_replace.py b/pandas/tests/strings/test_find_replace.py
@@ -281,6 +281,19 @@ def test_contains_nan(any_string_dtype):
     tm.assert_series_equal(result, expected)
 
 
+def test_contains_compiled_regex(any_string_dtype):
+    # GH#61942
+    ser = Series(["foo", "bar", "baz"], dtype=any_string_dtype)
+    pat = re.compile("ba.")
+    result = ser.str.contains(pat)
+
+    expected_dtype = (
+        np.bool_ if is_object_or_nan_string_dtype(any_string_dtype) else "boolean"
+    )
+    expected = Series([False, True, True], dtype=expected_dtype)
+    tm.assert_series_equal(result, expected)
+
+
 # --------------------------------------------------------------------------------------
 # str.startswith
 # --------------------------------------------------------------------------------------
@@ -818,6 +831,17 @@ def test_match_case_kwarg(any_string_dtype):
     tm.assert_series_equal(result, expected)
 
 
+def test_match_compiled_regex(any_string_dtype):
+    # GH#61952
+    values = Series(["ab", "AB", "abc", "ABC"], dtype=any_string_dtype)
+    result = values.str.match(re.compile(r"ab"), case=False)
+    expected_dtype = (
+        np.bool_ if is_object_or_nan_string_dtype(any_string_dtype) else "boolean"
+    )
+    expected = Series([True, True, True, True], dtype=expected_dtype)
+    tm.assert_series_equal(result, expected)
+
+
 # --------------------------------------------------------------------------------------
 # str.fullmatch
 # --------------------------------------------------------------------------------------
@@ -887,6 +911,17 @@ def test_fullmatch_case_kwarg(any_string_dtype):
     tm.assert_series_equal(result, expected)
 
 
+def test_fullmatch_compiled_regex(any_string_dtype):
+    # GH#61952
+    values = Series(["ab", "AB", "abc", "ABC"], dtype=any_string_dtype)
+    result = values.str.fullmatch(re.compile(r"ab"), case=False)
+    expected_dtype = (
+        np.bool_ if is_object_or_nan_string_dtype(any_string_dtype) else "boolean"
+    )
+    expected = Series([True, True, False, False], dtype=expected_dtype)
+    tm.assert_series_equal(result, expected)
+
+
 # --------------------------------------------------------------------------------------
 # str.findall
 # --------------------------------------------------------------------------------------