
Commit 7dc3ac1

Merge remote-tracking branch 'upstream/main' into bug_boolean_series_with_logical_indexer

2 parents: c4f32d7 + dc1e367

5 files changed: +99 −77 lines

AUTHORS.md

Lines changed: 6 additions & 6 deletions

--- a/AUTHORS.md
+++ b/AUTHORS.md
@@ -7,12 +7,12 @@ About the Copyright Holders
   led by Wes McKinney. AQR released the source under this license in 2009.
 * Copyright (c) 2011-2012, Lambda Foundry, Inc.
 
-  Wes is now an employee of Lambda Foundry, and remains the pandas project
+  Wes became an employee of Lambda Foundry, and remained the pandas project
   lead.
 * Copyright (c) 2011-2012, PyData Development Team
 
   The PyData Development Team is the collection of developers of the PyData
-  project. This includes all of the PyData sub-projects, including pandas. The
+  project. This includes all of the PyData sub-projects, such as pandas. The
   core team that coordinates development on GitHub can be found here:
   https://github.com/pydata.
 
@@ -23,11 +23,11 @@ Our Copyright Policy
 
 PyData uses a shared copyright model. Each contributor maintains copyright
 over their contributions to PyData. However, it is important to note that
-these contributions are typically only changes to the repositories. Thus,
+these contributions are typically limited to changes to the repositories. Thus,
 the PyData source code, in its entirety, is not the copyright of any single
 person or institution. Instead, it is the collective copyright of the
 entire PyData Development Team. If individual contributors want to maintain
-a record of what changes/contributions they have specific copyright on,
+a record of the specific changes or contributions they hold copyright to,
 they should indicate their copyright in the commit message of the change
 when they commit the change to one of the PyData repositories.
 
@@ -50,7 +50,7 @@ Other licenses can be found in the LICENSES directory.
 License
 =======
 
-pandas is distributed under a 3-clause ("Simplified" or "New") BSD
+pandas is distributed under the 3-clause ("Simplified" or "New") BSD
 license. Parts of NumPy, SciPy, numpydoc, bottleneck, which all have
-BSD-compatible licenses, are included. Their licenses follow the pandas
+BSD-compatible licenses, are included. Their licenses are compatible with the pandas
 license.

doc/source/reference/indexing.rst

Lines changed: 1 addition & 0 deletions

--- a/doc/source/reference/indexing.rst
+++ b/doc/source/reference/indexing.rst
@@ -98,6 +98,7 @@ Conversion
    :toctree: api/
 
    Index.astype
+   Index.infer_objects
    Index.item
    Index.map
    Index.ravel
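
For readers scanning this section: ``Index.infer_objects``, the entry added above, soft-converts an object-dtype Index to a more specific dtype, mirroring its Series/DataFrame counterparts. A minimal sketch, assuming a pandas 2.x install where the method is available:

    import pandas as pd

    # An object-dtype Index that actually holds integers.
    idx = pd.Index([1, 2, 3], dtype=object)
    print(idx.dtype)                  # object

    # infer_objects attempts a soft conversion to a better-fitting dtype.
    print(idx.infer_objects().dtype)  # int64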

doc/source/whatsnew/v2.3.0.rst

Lines changed: 0 additions & 35 deletions

--- a/doc/source/whatsnew/v2.3.0.rst
+++ b/doc/source/whatsnew/v2.3.0.rst
@@ -31,39 +31,6 @@ Other enhancements
 - The :meth:`~Series.cumsum`, :meth:`~Series.cummin`, and :meth:`~Series.cummax` reductions are now implemented for :class:`StringDtype` columns (:issue:`60633`)
 - The :meth:`~Series.sum` reduction is now implemented for :class:`StringDtype` columns (:issue:`59853`)
 
-.. ---------------------------------------------------------------------------
-.. _whatsnew_230.notable_bug_fixes:
-
-Notable bug fixes
-~~~~~~~~~~~~~~~~~
-
-These are bug fixes that might have notable behavior changes.
-
-.. _whatsnew_230.notable_bug_fixes.string_comparisons:
-
-Comparisons between different string dtypes
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-In previous versions, comparing :class:`Series` of different string dtypes (e.g. ``pd.StringDtype("pyarrow", na_value=pd.NA)`` against ``pd.StringDtype("python", na_value=np.nan)``) would result in inconsistent resulting dtype or incorrectly raise. pandas will now use the hierarchy
-
-object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)
-
-in determining the result dtype when there are different string dtypes compared. Some examples:
-
-- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``.
-- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("pyarrow", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
-- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("python", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
-
-.. _whatsnew_230.api_changes:
-
-API changes
-~~~~~~~~~~~
-
-- When enabling the ``future.infer_string`` option, :class:`Index` set operations (like
-  union or intersection) will now ignore the dtype of an empty :class:`RangeIndex` or
-  empty :class:`Index` with ``object`` dtype when determining the dtype of the resulting
-  Index (:issue:`60797`)
-
 .. ---------------------------------------------------------------------------
 .. _whatsnew_230.deprecations:
 
@@ -85,8 +52,6 @@ Numeric
 
 Strings
 ^^^^^^^
-- Bug in :meth:`.DataFrameGroupBy.min`, :meth:`.DataFrameGroupBy.max`, :meth:`.Resampler.min`, :meth:`.Resampler.max` where all NA values of string dtype would return float instead of string dtype (:issue:`60810`)
-- Bug in :meth:`DataFrame.sum` with ``axis=1``, :meth:`.DataFrameGroupBy.sum` or :meth:`.SeriesGroupBy.sum` with ``skipna=True``, and :meth:`.Resampler.sum` with all NA values of :class:`StringDtype` resulted in ``0`` instead of the empty string ``""`` (:issue:`60229`)
 - Bug in :meth:`Series.__pos__` and :meth:`DataFrame.__pos__` where an ``Exception`` was not raised for :class:`StringDtype` with ``storage="pyarrow"`` (:issue:`60710`)
 - Bug in :meth:`Series.rank` for :class:`StringDtype` with ``storage="pyarrow"`` that incorrectly returned integer results with ``method="average"`` and raised an error if it would truncate results (:issue:`59768`)
 - Bug in :meth:`Series.replace` with :class:`StringDtype` when replacing with a non-string value was not upcasting to ``object`` dtype (:issue:`60282`)

doc/source/whatsnew/v2.3.1.rst

Lines changed: 51 additions & 5 deletions

--- a/doc/source/whatsnew/v2.3.1.rst
+++ b/doc/source/whatsnew/v2.3.1.rst
@@ -9,11 +9,57 @@ including other versions of pandas.
 {{ header }}
 
 .. ---------------------------------------------------------------------------
-.. _whatsnew_231.enhancements:
+.. _whatsnew_231.string_fixes:
+
+Improvements and fixes for the StringDtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _whatsnew_231.string_fixes.string_comparisons:
+
+Comparisons between different string dtypes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In previous versions, comparing :class:`Series` of different string dtypes (e.g. ``pd.StringDtype("pyarrow", na_value=pd.NA)`` against ``pd.StringDtype("python", na_value=np.nan)``) would result in inconsistent resulting dtype or incorrectly raise. pandas will now use the hierarchy
+
+object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)
+
+in determining the result dtype when there are different string dtypes compared. Some examples:
+
+- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``.
+- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("pyarrow", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
+- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("python", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
+
+.. _whatsnew_231.string_fixes.ignore_empty:
+
+Index set operations ignore empty RangeIndex and object dtype Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When enabling the ``future.infer_string`` option, :class:`Index` set operations (like
+union or intersection) will now ignore the dtype of an empty :class:`RangeIndex` or
+empty :class:`Index` with ``object`` dtype when determining the dtype of the resulting
+Index (:issue:`60797`).
+
+This ensures that combining such empty Index with strings will infer the string dtype
+correctly, rather than defaulting to ``object`` dtype. For example:
+
+.. code-block:: python
+
+    >>> pd.options.mode.infer_string = True
+    >>> df = pd.DataFrame()
+    >>> df.columns.dtype
+    dtype('int64')  # default RangeIndex for empty columns
+    >>> df["a"] = [1, 2, 3]
+    >>> df.columns.dtype
+    <StringDtype(na_value=nan)>  # new columns use string dtype instead of object dtype
+
+.. _whatsnew_231.string_fixes.bugs:
+
+Bug fixes
+^^^^^^^^^
+- Bug in :meth:`.DataFrameGroupBy.min`, :meth:`.DataFrameGroupBy.max`, :meth:`.Resampler.min`, :meth:`.Resampler.max` where all NA values of string dtype would return float instead of string dtype (:issue:`60810`)
+- Bug in :meth:`DataFrame.sum` with ``axis=1``, :meth:`.DataFrameGroupBy.sum` or :meth:`.SeriesGroupBy.sum` with ``skipna=True``, and :meth:`.Resampler.sum` with all NA values of :class:`StringDtype` resulted in ``0`` instead of the empty string ``""`` (:issue:`60229`)
+- Fixed bug in :meth:`DataFrame.explode` and :meth:`Series.explode` where methods would fail with ``dtype="str"`` (:issue:`61623`)
 
-Enhancements
-~~~~~~~~~~~~
--
 
 .. _whatsnew_231.regressions:
 
@@ -26,7 +72,7 @@ Fixed regressions
 
 Bug fixes
 ~~~~~~~~~
-- Fixed bug in :meth:`DataFrame.explode` and :meth:`Series.explode` where methods would fail with ``dtype="str"`` (:issue:`61623`)
+-
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_231.other:
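
To make the comparison hierarchy in the new whatsnew section concrete, here is an illustrative sketch. It is not part of the commit and assumes pandas 2.3+ with pyarrow installed, where ``pd.StringDtype`` accepts the ``na_value`` argument:

    import numpy as np
    import pandas as pd

    # (pyarrow, NA) outranks (python, NaN) in the hierarchy, so the
    # comparison result should use the pyarrow-backed boolean dtype.
    left = pd.Series(["a", "b"], dtype=pd.StringDtype("pyarrow", na_value=pd.NA))
    right = pd.Series(["a", "x"], dtype=pd.StringDtype("python", na_value=np.nan))
    print((left == right).dtype)  # expected: boolean[pyarrow]

    # Two python-backed dtypes: the NA variant outranks the NaN variant,
    # giving the NumPy-backed nullable boolean dtype.
    left = pd.Series(["a", "b"], dtype=pd.StringDtype("python", na_value=pd.NA))
    right = pd.Series(["a", "x"], dtype=pd.StringDtype("python", na_value=np.nan))
    print((left == right).dtype)  # expected: boolean

Exact reprs may differ across versions; the dtype names follow the bullet list in the diff above.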

scripts/validate_docstrings.py

Lines changed: 41 additions & 31 deletions

--- a/scripts/validate_docstrings.py
+++ b/scripts/validate_docstrings.py
@@ -69,8 +69,10 @@
 }
 ALL_ERRORS = set(NUMPYDOC_ERROR_MSGS).union(set(ERROR_MSGS))
 duplicated_errors = set(NUMPYDOC_ERROR_MSGS).intersection(set(ERROR_MSGS))
-assert not duplicated_errors, (f"Errors {duplicated_errors} exist in both pandas "
-                               "and numpydoc, should they be removed from pandas?")
+assert not duplicated_errors, (
+    f"Errors {duplicated_errors} exist in both pandas "
+    "and numpydoc, should they be removed from pandas?"
+)
 
 
 def pandas_error(code, **kwargs):
@@ -245,7 +247,15 @@ def pandas_validate(func_name: str):
     # Some objects are instances, e.g. IndexSlice, which numpydoc can't validate
     doc_obj = get_doc_object(func_obj, doc=func_obj.__doc__)
     doc = PandasDocstring(func_name, doc_obj)
-    result = validate(doc_obj)
+    if func_obj.__doc__ is not None:
+        result = validate(doc_obj)
+    else:
+        result = {
+            "docstring": "",
+            "file": None,
+            "file_line": None,
+            "errors": [("GL08", "The object does not have a docstring")],
+        }
     mentioned_errs = doc.mentioned_private_classes
     if mentioned_errs:
         result["errors"].append(
@@ -257,7 +267,7 @@ def pandas_validate(func_name: str):
             pandas_error(
                 "SA05",
                 reference_name=rel_name,
-                right_reference=rel_name[len("pandas."):],
+                right_reference=rel_name[len("pandas.") :],
             )
             for rel_name in doc.see_also
             if rel_name.startswith("pandas.")
@@ -365,12 +375,13 @@ def print_validate_all_results(
     for func_name, res in result.items():
         error_messages = dict(res["errors"])
         actual_failures = set(error_messages)
-        expected_failures = (ignore_errors.get(func_name, set())
-                             | ignore_errors.get(None, set()))
+        expected_failures = ignore_errors.get(func_name, set()) | ignore_errors.get(
+            None, set()
+        )
         for err_code in actual_failures - expected_failures:
             sys.stdout.write(
                 f'{prefix}{res["file"]}:{res["file_line"]}:'
-                f'{err_code}:{func_name}:{error_messages[err_code]}\n'
+                f"{err_code}:{func_name}:{error_messages[err_code]}\n"
             )
             exit_status += 1
         for err_code in ignore_errors.get(func_name, set()) - actual_failures:
@@ -384,8 +395,9 @@ def print_validate_all_results(
     return exit_status
 
 
-def print_validate_one_results(func_name: str,
-                               ignore_errors: dict[str, set[str]]) -> int:
+def print_validate_one_results(
+    func_name: str, ignore_errors: dict[str, set[str]]
+) -> int:
     def header(title, width=80, char="#") -> str:
         full_line = char * width
         side_len = (width - len(title) - 2) // 2
@@ -396,8 +408,11 @@ def header(title, width=80, char="#") -> str:
 
     result = pandas_validate(func_name)
 
-    result["errors"] = [(code, message) for code, message in result["errors"]
-                        if code not in ignore_errors.get(None, set())]
+    result["errors"] = [
+        (code, message)
+        for code, message in result["errors"]
+        if code not in ignore_errors.get(None, set())
+    ]
 
     sys.stderr.write(header(f"Docstring ({func_name})"))
     sys.stderr.write(f"{result['docstring']}\n")
@@ -431,14 +446,16 @@ def _format_ignore_errors(raw_ignore_errors):
                     raise ValueError(
                         f"Object `{obj_name}` is present in more than one "
                         "--ignore_errors argument. Please use it once and specify "
-                        "the errors separated by commas.")
+                        "the errors separated by commas."
+                    )
                 ignore_errors[obj_name] = set(error_codes.split(","))
 
                 unknown_errors = ignore_errors[obj_name] - ALL_ERRORS
                 if unknown_errors:
                     raise ValueError(
                         f"Object `{obj_name}` is ignoring errors {unknown_errors} "
-                        f"which are not known. Known errors are: {ALL_ERRORS}")
+                        f"which are not known. Known errors are: {ALL_ERRORS}"
+                    )
 
             # global errors "PR02,ES01"
             else:
@@ -448,27 +465,19 @@ def _format_ignore_errors(raw_ignore_errors):
         if unknown_errors:
             raise ValueError(
                 f"Unknown errors {unknown_errors} specified using --ignore_errors "
-                "Known errors are: {ALL_ERRORS}")
+                "Known errors are: {ALL_ERRORS}"
+            )
 
     return ignore_errors
 
 
-def main(
-    func_name,
-    output_format,
-    prefix,
-    ignore_deprecated,
-    ignore_errors
-):
+def main(func_name, output_format, prefix, ignore_deprecated, ignore_errors):
     """
     Main entry point. Call the validation for one or for all docstrings.
     """
     if func_name is None:
         return print_validate_all_results(
-            output_format,
-            prefix,
-            ignore_deprecated,
-            ignore_errors
+            output_format, prefix, ignore_deprecated, ignore_errors
         )
     else:
         return print_validate_one_results(func_name, ignore_errors)
@@ -524,10 +533,11 @@ def main(
    args = argparser.parse_args(sys.argv[1:])
 
     sys.exit(
-        main(args.function,
-             args.format,
-             args.prefix,
-             args.ignore_deprecated,
-             _format_ignore_errors(args.ignore_errors),
-             )
+        main(
+            args.function,
+            args.format,
+            args.prefix,
+            args.ignore_deprecated,
+            _format_ignore_errors(args.ignore_errors),
+        )
     )
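
The substantive change in this file is the fallback in ``pandas_validate``: when an object has no docstring, the script no longer hands it to numpydoc's ``validate`` (which cannot handle a ``None`` docstring) and instead synthesizes a GL08 error result with the same dict shape. A standalone sketch of the same pattern; ``validate_or_gl08`` and ``undocumented`` are hypothetical names for illustration, not part of the script:

    def validate_or_gl08(obj, validate_fn):
        """Return a numpydoc-style result dict, tolerating a missing docstring."""
        if obj.__doc__ is not None:
            return validate_fn(obj)
        # Mirror the dict shape the rest of the script expects.
        return {
            "docstring": "",
            "file": None,
            "file_line": None,
            "errors": [("GL08", "The object does not have a docstring")],
        }

    def undocumented():
        pass

    result = validate_or_gl08(undocumented, lambda obj: {"errors": []})
    print(result["errors"])  # [('GL08', 'The object does not have a docstring')]

This keeps downstream consumers, which index into result["file"] and result["errors"], working without special-casing objects that lack docstrings.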
