Merge remote-tracking branch 'upstream/main' into string-dtype-isdigit

jorisvandenbossche · jorisvandenbossche · commit 2a480684f700 · 2025-08-19T16:53:34.000+02:00
diff --git a/.github/workflows/wheels.yml b/.github/workflows/wheels.yml
@@ -189,7 +189,7 @@ jobs:
         # installing wheel here because micromamba step was skipped
         if: matrix.buildplat[1] == 'win_arm64'
         shell: bash -el {0}
-        run: python -m pip install wheel
+        run: python -m pip install wheel anaconda-client
 
       - name: Validate wheel RECORD
         shell: bash -el {0}
diff --git a/README.md b/README.md
@@ -19,9 +19,9 @@
 **pandas** is a Python package that provides fast, flexible, and expressive data
 structures designed to make working with "relational" or "labeled" data both
 easy and intuitive. It aims to be the fundamental high-level building block for
-doing practical, **real world** data analysis in Python. Additionally, it has
-the broader goal of becoming **the most powerful and flexible open source data
-analysis / manipulation tool available in any language**. It is already well on
+doing practical, **real-world** data analysis in Python. Additionally, it has
+the broader goal of becoming **the most powerful and flexible open-source data
+analysis/manipulation tool available in any language**. It is already well on
 its way towards this goal.
 
 ## Table of Contents
@@ -64,7 +64,7 @@ Here are just a few of the things that pandas does well:
     data sets
   - [**Hierarchical**][mi] labeling of axes (possible to have multiple
     labels per tick)
-  - Robust IO tools for loading data from [**flat files**][flat-files]
+  - Robust I/O tools for loading data from [**flat files**][flat-files]
     (CSV and delimited), [**Excel files**][excel], [**databases**][db],
     and saving/loading data from the ultrafast [**HDF5 format**][hdfstore]
   - [**Time series**][timeseries]-specific functionality: date range
@@ -138,7 +138,7 @@ or for installing in [development mode](https://pip.pypa.io/en/latest/cli/pip_in
 
 
 ```sh
-python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true
+python -m pip install -ve . --no-build-isolation --config-settings editable-verbose=true
 ```
 
 See the full instructions for [installing from source](https://pandas.pydata.org/docs/dev/development/contributing_environment.html).
@@ -155,7 +155,7 @@ has been under active development since then.
 
 ## Getting Help
 
-For usage questions, the best place to go to is [StackOverflow](https://stackoverflow.com/questions/tagged/pandas).
+For usage questions, the best place to go to is [Stack Overflow](https://stackoverflow.com/questions/tagged/pandas).
 Further, general questions and discussions can also take place on the [pydata mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata).
 
 ## Discussion and Development
diff --git a/doc/source/whatsnew/v2.3.2.rst b/doc/source/whatsnew/v2.3.2.rst
@@ -28,6 +28,8 @@ Bug fixes
   "string" type in the JSON Table Schema for :class:`StringDtype` columns
   (:issue:`61889`)
 - Boolean operations (``|``, ``&``, ``^``) with bool-dtype objects on the left and :class:`StringDtype` objects on the right now cast the string to bool, with a deprecation warning (:issue:`60234`)
+- Fixed :meth:`~Series.str.match`, :meth:`~Series.str.fullmatch` and :meth:`~Series.str.contains`
+  with compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_232.contributors:
diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst
@@ -14,10 +14,108 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~
 
-.. _whatsnew_300.enhancements.enhancement1:
+.. _whatsnew_300.enhancements.string_dtype:
 
-Enhancement1
-^^^^^^^^^^^^
+Dedicated string data type by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Historically, pandas represented string columns with NumPy ``object`` data type.
+This representation has numerous problems: it is not specific to strings (any
+Python object can be stored in an ``object``-dtype array, not just strings) and
+it is often not very efficient (in terms of both performance and memory usage).
+
+Starting with pandas 3.0, a dedicated string data type is enabled by default
+(backed by PyArrow under the hood, if installed, otherwise falling back to being
+backed by NumPy ``object``-dtype). This means that pandas will start inferring
+columns containing string data as the new ``str`` data type when creating pandas
+objects, such as in constructors or IO functions.
+
+Old behavior:
+
+.. code-block:: python
+
+    >>> ser = pd.Series(["a", "b"])
+    >>> ser
+    0    a
+    1    b
+    dtype: object
+
+New behavior:
+
+.. code-block:: python
+
+    >>> ser = pd.Series(["a", "b"])
+    >>> ser
+    0    a
+    1    b
+    dtype: str
+
+The string data type that is used in these scenarios will mostly behave as NumPy
+object would, including missing value semantics and general operations on these
+columns.
+
+The main characteristics of the new string data type are:
+
+- Inferred by default for string data (instead of object dtype)
+- The ``str`` dtype can only hold strings (or missing values), in contrast to
+  ``object`` dtype (setitem with a non-string value fails).
+- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
+  missing value semantics as the other default dtypes.
+
+Those intentional changes can have breaking consequences, for example when checking
+for the ``.dtype`` being object dtype or checking the exact missing value sentinel.
+See the :ref:`string_migration_guide` for more details on the behaviour changes
+and how to adapt your code to the new default.
+
+.. seealso::
+
+    `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
+
+
+.. _whatsnew_300.enhancements.copy_on_write:
+
+Copy-on-Write
+^^^^^^^^^^^^^
+
+The new "copy-on-write" behavior in pandas 3.0 changes how pandas operates
+with respect to copies and views. A summary of the changes:
+
+1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
+   i.e. including accessing a DataFrame column as a Series) or any method returning a
+   new DataFrame or Series, always *behaves as if* it were a copy in terms of user
+   API.
+2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
+   to do this is to directly modify that object itself.
+
+The main goal of this change is to make the user API more consistent and
+predictable. There is now a clear rule: *any* subset or returned
+Series/DataFrame **always** behaves as a copy of the original, and thus never
+modifies the original (before pandas 3.0, whether a derived object would be a
+copy or a view depended on the exact operation performed, which was often
+confusing).
+
+Because every single indexing step now behaves as a copy, this also means that
+"chained assignment" (updating a DataFrame with multiple setitem steps) will
+stop working. Because this now consistently never works, the
+``SettingWithCopyWarning`` is removed.
+
+The new behavioral semantics are explained in more detail in the
+:ref:`user guide about Copy-on-Write <copy_on_write>`.
+
+A secondary goal is to improve performance by avoiding unnecessary copies. As
+mentioned above, every new DataFrame or Series returned from an indexing
+operation or method *behaves* as a copy, but under the hood pandas will use
+views as much as possible, and only copy when needed to guarantee the "behaves
+as a copy" behaviour (this is the actual "copy-on-write" mechanism used as an
+implementation detail).
+
+Some of the behaviour changes described above are breaking changes in pandas
+3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
+2.3 to get deprecation warnings for a subset of those changes. The
+:ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
+process in more detail.
+
+.. seealso::
+
+    `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__
 
 .. _whatsnew_300.enhancements.enhancement2:
 
diff --git a/pandas/_config/config.py b/pandas/_config/config.py
@@ -693,8 +693,8 @@ def _get_registered_option(key: str):
 
 def _translate_key(key: str) -> str:
     """
-    if key id deprecated and a replacement key defined, will return the
-    replacement key, otherwise returns `key` as - is
+    If `key` is deprecated and a replacement key is defined, return the
+    replacement key; otherwise return `key` as-is.
     """
     d = _get_deprecated_option(key)
     if d:
diff --git a/pandas/_version.py b/pandas/_version.py
@@ -581,7 +581,7 @@ def render_git_describe(pieces):
 def render_git_describe_long(pieces):
     """TAG-DISTANCE-gHEX[-dirty].
 
-    Like 'git describe --tags --dirty --always -long'.
+    Like 'git describe --tags --dirty --always --long'.
     The distance/hash is unconditional.
 
     Exceptions:
diff --git a/pandas/core/accessor.py b/pandas/core/accessor.py
@@ -88,7 +88,7 @@ def _add_delegate_accessors(
         cls
             Class to add the methods/properties to.
         delegate
-            Class to get methods/properties and doc-strings.
+            Class to get methods/properties and docstrings.
         accessors : list of str
             List of accessors to add.
         typ : {'property', 'method'}
@@ -159,7 +159,7 @@ def delegate_names(
     Parameters
     ----------
     delegate : object
-        The class to get methods/properties & doc-strings.
+        The class to get methods/properties & docstrings.
     accessors : Sequence[str]
         List of accessor to add.
     typ : {'property', 'method'}
diff --git a/pandas/core/arrays/_arrow_string_mixins.py b/pandas/core/arrays/_arrow_string_mixins.py
@@ -309,23 +309,29 @@ def _str_contains(
 
     def _str_match(
         self,
-        pat: str,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
-        if not pat.startswith("^"):
+        if isinstance(pat, re.Pattern):
+            # GH#61952
+            pat = pat.pattern
+        if isinstance(pat, str) and not pat.startswith("^"):
             pat = f"^{pat}"
         return self._str_contains(pat, case, flags, na, regex=True)
 
     def _str_fullmatch(
         self,
-        pat,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
-        if not pat.endswith("$") or pat.endswith("\\$"):
+        if isinstance(pat, re.Pattern):
+            # GH#61952
+            pat = pat.pattern
+        if isinstance(pat, str) and (not pat.endswith("$") or pat.endswith("\\$")):
             pat = f"{pat}$"
         return self._str_match(pat, case, flags, na)
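
Note on this hunk: the pattern handling added to `_str_match` and `_str_fullmatch` can be sketched in isolation with only the standard library. The helper names below (`as_pattern_string`, `anchor_for_match`, `anchor_for_fullmatch`) are hypothetical and exist only for illustration; they are not part of pandas.

```python
import re


def as_pattern_string(pat):
    """Return the source text of ``pat`` if it is a compiled regex (hypothetical helper)."""
    if isinstance(pat, re.Pattern):
        return pat.pattern
    return pat


def anchor_for_match(pat):
    """Anchor the pattern at the start, as the ``_str_match`` hunk does."""
    pat = as_pattern_string(pat)
    if not pat.startswith("^"):
        pat = f"^{pat}"
    return pat


def anchor_for_fullmatch(pat):
    """Anchor the pattern at the end, as the ``_str_fullmatch`` hunk does."""
    pat = as_pattern_string(pat)
    # A trailing "\$" is an escaped literal dollar sign, not an anchor,
    # so a real "$" anchor still has to be appended in that case.
    if not pat.endswith("$") or pat.endswith("\\$"):
        pat = f"{pat}$"
    return pat


match_pat = anchor_for_match(re.compile("ab"))
fullmatch_pat = anchor_for_fullmatch(re.compile("ab"))
escaped_pat = anchor_for_fullmatch(r"ab\$")
```

Keep in mind that `re.Pattern.pattern` holds only the source text; flags passed to `re.compile` are stored separately on the pattern object and are not part of that string.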
 
diff --git a/pandas/core/arrays/boolean.py b/pandas/core/arrays/boolean.py
@@ -378,7 +378,7 @@ def _logical_method(self, other, op):  # type: ignore[override]
         elif is_list_like(other):
             other = np.asarray(other, dtype="bool")
             if other.ndim > 1:
-                raise NotImplementedError("can only perform ops with 1-d structures")
+                return NotImplemented
             other, mask = coerce_to_array(other, copy=False)
         elif isinstance(other, np.bool_):
             other = other.item()
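
Note on this hunk: the difference between raising `NotImplementedError` and returning `NotImplemented` comes from Python's binary-operator protocol. The sketch below uses two hypothetical classes (`Left`, `Right`) rather than pandas objects. When the left operand returns `NotImplemented`, Python tries the right operand's reflected method before giving up.

```python
class Left:
    def __or__(self, other):
        # Signal that this operand cannot handle `other`;
        # Python will then try `other.__ror__(self)`.
        return NotImplemented


class Right:
    def __ror__(self, other):
        return "handled by Right.__ror__"


# The reflected method on the right operand gets a chance to run.
result = Left() | Right()

# If no operand handles the operation, Python itself raises TypeError.
try:
    Left() | object()
    fell_through = False
except TypeError:
    fell_through = True
```

Raising an exception inside `__or__` would instead stop the dispatch immediately, so the right-hand operand would never be consulted.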
diff --git a/pandas/core/arrays/string_arrow.py b/pandas/core/arrays/string_arrow.py
@@ -346,6 +346,8 @@ def _str_contains(
     ):
         if flags:
             return super()._str_contains(pat, case, flags, na, regex)
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
 
         return ArrowStringArrayMixin._str_contains(self, pat, case, flags, na, regex)
 
diff --git a/pandas/core/base.py b/pandas/core/base.py
@@ -90,7 +90,7 @@
 
 class PandasObject(DirNamesMixin):
     """
-    Baseclass for various pandas objects.
+    Base class for various pandas objects.
     """
 
     # results from calls to methods decorated with cache_readonly get added to _cache
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
@@ -10216,6 +10216,7 @@ def shift(
         suffix : str, optional
             If str and periods is an iterable, this is added after the column
             name and before the shift value for each shifted column name.
+            For `Series` this parameter is unused and defaults to `None`.
 
         Returns
         -------
diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py
@@ -1926,7 +1926,7 @@ def _setitem_with_indexer(self, indexer, value, name: str = "iloc") -> None:
                     labels = index.insert(len(index), key)
 
                     # We are expanding the Series/DataFrame values to match
-                    #  the length of thenew index `labels`.  GH#40096 ensure
+                    #  the length of the new index `labels`.  GH#40096 ensure
                     #  this is valid even if the index has duplicates.
                     taker = np.arange(len(index) + 1, dtype=np.intp)
                     taker[-1] = -1
diff --git a/pandas/core/strings/accessor.py b/pandas/core/strings/accessor.py
@@ -1361,8 +1361,8 @@ def match(self, pat: str, case: bool = True, flags: int = 0, na=lib.no_default):
 
         Parameters
         ----------
-        pat : str
-            Character sequence.
+        pat : str or compiled regex
+            Character sequence or regular expression.
         case : bool, default True
             If True, case sensitive.
         flags : int, default 0 (no flags)
diff --git a/pandas/core/strings/object_array.py b/pandas/core/strings/object_array.py
@@ -248,14 +248,15 @@ def rep(x, r):
 
     def _str_match(
         self,
-        pat: str,
+        pat: str | re.Pattern,
         case: bool = True,
         flags: int = 0,
         na: Scalar | lib.NoDefault = lib.no_default,
     ):
         if not case:
             flags |= re.IGNORECASE
-
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
         regex = re.compile(pat, flags=flags)
 
         f = lambda x: regex.match(x) is not None
@@ -270,7 +271,8 @@ def _str_fullmatch(
     ):
         if not case:
             flags |= re.IGNORECASE
-
+        if isinstance(pat, re.Pattern):
+            pat = pat.pattern
         regex = re.compile(pat, flags=flags)
 
         f = lambda x: regex.fullmatch(x) is not None
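
Note on this hunk: one reason to extract `pat.pattern` before calling `re.compile(pat, flags=flags)` is that the standard library rejects a compiled pattern combined with non-zero flags, which is the situation when `case=False` adds `re.IGNORECASE`. A minimal standard-library sketch (variable names are illustrative only):

```python
import re

compiled = re.compile(r"ab")

# `re.compile` refuses a compiled pattern when extra flags are supplied.
try:
    re.compile(compiled, flags=re.IGNORECASE)
    raised = False
except ValueError:
    raised = True

# Recompiling from the source string accepts the extra flag.
regex = re.compile(compiled.pattern, flags=re.IGNORECASE)
matches = [regex.match(s) is not None for s in ["ABc", "xab", "abc"]]
```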
diff --git a/pandas/io/api.py b/pandas/io/api.py
@@ -1,5 +1,5 @@
 """
-Data IO api
+Data I/O API
 """
 
 from pandas.io.clipboards import read_clipboard
diff --git a/pandas/io/common.py b/pandas/io/common.py
@@ -1,4 +1,4 @@
-"""Common IO api utilities"""
+"""Common I/O API utilities"""
 
 from __future__ import annotations
 
diff --git a/pandas/io/formats/style_render.py b/pandas/io/formats/style_render.py
@@ -6,6 +6,7 @@
     Sequence,
 )
 from functools import partial
+import pathlib
 import re
 from typing import (
     TYPE_CHECKING,
@@ -70,7 +71,9 @@ class StylerRenderer:
     Base class to process rendering a Styler with a specified jinja2 template.
     """
 
-    loader = jinja2.PackageLoader("pandas", "io/formats/templates")
+    this_dir = pathlib.Path(__file__).parent.resolve()
+    template_dir = this_dir / "templates"
+    loader = jinja2.FileSystemLoader(template_dir)
     env = jinja2.Environment(loader=loader, trim_blocks=True)
     template_html = env.get_template("html.tpl")
     template_html_table = env.get_template("html_table.tpl")
diff --git a/pandas/io/parquet.py b/pandas/io/parquet.py
@@ -464,8 +464,12 @@ def to_parquet(
 
         .. versionadded:: 2.1.0
 
-    kwargs
-        Additional keyword arguments passed to the engine.
+    **kwargs
+        Additional keyword arguments passed to the engine:
+
+        * For ``engine="pyarrow"``: passed to :func:`pyarrow.parquet.write_table`
+          or :func:`pyarrow.parquet.write_to_dataset` (when using ``partition_cols``)
+        * For ``engine="fastparquet"``: passed to :func:`fastparquet.write`
 
     Returns
     -------
@@ -585,7 +589,11 @@ def read_parquet(
         .. versionadded:: 3.0.0
 
     **kwargs
-        Any additional kwargs are passed to the engine.
+        Additional keyword arguments passed to the engine:
+
+        * For ``engine="pyarrow"``: passed to :func:`pyarrow.parquet.read_table`
+        * For ``engine="fastparquet"``: passed to
+          :meth:`fastparquet.ParquetFile.to_pandas`
 
     Returns
     -------
diff --git a/pandas/tests/arithmetic/test_numeric.py b/pandas/tests/arithmetic/test_numeric.py
diff --git a/pandas/tests/io/formats/style/test_html.py b/pandas/tests/io/formats/style/test_html.py
diff --git a/pandas/tests/strings/test_find_replace.py b/pandas/tests/strings/test_find_replace.py
diff --git a/pandas/tests/tseries/holiday/test_holiday.py b/pandas/tests/tseries/holiday/test_holiday.py
diff --git a/pandas/tests/tslibs/test_parsing.py b/pandas/tests/tslibs/test_parsing.py
diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md