
Commit 843539c

Khemkaran authored and committed

Merge remote-tracking branch 'upstream/main' into issue_61917

2 parents e8a7607 + e72c8a1

File tree

4 files changed: +13, -154 lines changed

ci/deps/actions-311-downstream_compat.yaml

Lines changed: 1 addition & 2 deletions

```diff
@@ -50,8 +50,7 @@ dependencies:
   - pytz>=2023.4
   - pyxlsb>=1.0.10
   - s3fs>=2023.12.2
-  # TEMP upper pin for scipy (https://github.com/statsmodels/statsmodels/issues/9584)
-  - scipy>=1.12.0,<1.16
+  - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0
   - xarray>=2024.1.1
```

pandas/core/algorithms.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -751,7 +751,7 @@ def factorize(
     array([0, 0, 1])
     >>> uniques
     ['a', 'c']
-    Categories (3, object): ['a', 'b', 'c']
+    Categories (3, str): [a, b, c]
 
     Notice that ``'b'`` is in ``uniques.categories``, despite not being
     present in ``cat.values``.
@@ -764,7 +764,7 @@ def factorize(
     >>> codes
     array([0, 0, 1])
     >>> uniques
-    Index(['a', 'c'], dtype='object')
+    Index(['a', 'c'], dtype='str')
 
     If NaN is in the values, and we want to include NaN in the uniques of the
     values, it can be achieved by setting ``use_na_sentinel=False``.
```

pandas/core/base.py

Lines changed: 10 additions & 10 deletions

```diff
@@ -323,12 +323,12 @@ def transpose(self, *args, **kwargs) -> Self:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.T
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
 
         For Index:
 
@@ -383,7 +383,7 @@ def ndim(self) -> int:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.ndim
         1
 
@@ -452,9 +452,9 @@ def nbytes(self) -> int:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.nbytes
-        24
+        34
 
         For Index:
 
@@ -487,7 +487,7 @@ def size(self) -> int:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.size
         3
 
@@ -567,7 +567,7 @@ def array(self) -> ExtensionArray:
         >>> ser = pd.Series(pd.Categorical(["a", "b", "a"]))
         >>> ser.array
         ['a', 'b', 'a']
-        Categories (2, object): ['a', 'b']
+        Categories (2, str): [a, b]
         """
         raise AbstractMethodError(self)
 
@@ -1076,15 +1076,15 @@ def value_counts(
 
         >>> df.dtypes
         a    category
-        b      object
+        b         str
         c    category
         d    category
         dtype: object
 
         >>> df.dtypes.value_counts()
         category    2
         category    1
-        object      1
+        str         1
         Name: count, dtype: int64
         """
         return algorithms.value_counts_internal(
@@ -1386,7 +1386,7 @@ def factorize(
         ... )
         >>> ser
         ['apple', 'bread', 'bread', 'cheese', 'milk']
-        Categories (4, object): ['apple' < 'bread' < 'cheese' < 'milk']
+        Categories (4, str): [apple < bread < cheese < milk]
 
         >>> ser.searchsorted('bread')
         np.int64(1)
```
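These docstring edits all track the same underlying change: examples that previously reported `object` dtype now report `str` under the new default string dtype. A rough sketch of what the updated `size`/`nbytes` examples assume (the exact `nbytes` value depends on the string storage backend, hence the 24 → 34 edit):

```python
import pandas as pd

# Assumes a pandas build where str is the default dtype for Python strings.
s = pd.Series(["Ant", "Bear", "Cow"])
print(s.dtype)   # str under the new default; previously object
print(s.size)    # 3
print(s.nbytes)  # 34 in the updated docstring; storage-dependent
print(s.T)       # transposing a Series returns the Series unchanged
```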

web/pandas/about/roadmap.md

Lines changed: 0 additions & 140 deletions

```diff
@@ -34,143 +34,3 @@ For more information about PDEPs, and how to submit one, please refer to
 </ul>
 
 {% endfor %}
-
-## Roadmap points pending a PDEP
-
-<div class="alert alert-warning" role="alert">
-pandas is in the process of moving roadmap points to PDEPs (implemented in
-August 2022). During the transition, some roadmap points will exist as PDEPs,
-while others will exist as sections below.
-</div>
-
-### Extensibility
-
-Pandas `extending.extension-types` allow
-for extending NumPy types with custom data types and array storage.
-Pandas uses extension types internally, and provides an interface for
-3rd-party libraries to define their own custom data types.
-
-Many parts of pandas still unintentionally convert data to a NumPy
-array. These problems are especially pronounced for nested data.
-
-We'd like to improve the handling of extension arrays throughout the
-library, making their behavior more consistent with the handling of
-NumPy arrays. We'll do this by cleaning up pandas' internals and
-adding new methods to the extension array interface.
-
-### Apache Arrow interoperability
-
-[Apache Arrow](https://arrow.apache.org) is a cross-language development
-platform for in-memory data. The Arrow logical types are closely aligned
-with typical pandas use cases.
-
-We'd like to provide better-integrated support for Arrow memory and
-data types within pandas. This will let us take advantage of its I/O
-capabilities and provide for better interoperability with other
-languages and libraries using Arrow.
-
-### Decoupling of indexing and internals
-
-The code for getting and setting values in pandas' data structures
-needs refactoring. In particular, we must clearly separate code that
-converts keys (e.g., the argument to `DataFrame.loc`) to positions from
-code that uses these positions to get or set values. This is related to
-the proposed BlockManager rewrite. Currently, the BlockManager sometimes
-uses label-based, rather than position-based, indexing. We propose that
-it should only work with positional indexing, and the translation of
-keys to positions should be entirely done at a higher level.
-
-Indexing is a complicated API with many subtleties. This refactor will require care
-and attention. The following principles should inspire refactoring of indexing code and
-should result in cleaner, simpler, and more performant code.
-
-1. Label indexing must never involve looking in an axis twice for the same label(s).
-   This implies that any validation step must either:
-
-   * limit validation to general features (e.g. dtype/structure of the key/index), or
-   * reuse the result for the actual indexing.
-
-2. Indexers must never rely on an explicit call to other indexers.
-   For instance, it is OK to have some internal method of `.loc` call some
-   internal method of `__getitem__` (or of their common base class),
-   but never in the code flow of `.loc` should `the_obj[something]` appear.
-
-3. Execution of positional indexing must never involve labels (as currently, sadly, happens).
-   That is, the code flow of a getter call (or a setter call in which the right hand side is non-indexed)
-   to `.iloc` should never involve the axes of the object in any way.
-
-4. Indexing must never involve accessing/modifying values (i.e., act on `._data` or `.values`) more than once.
-   The following steps must hence be clearly decoupled:
-
-   * find positions we need to access/modify on each axis
-   * (if we are accessing) derive the type of object we need to return (dimensionality)
-   * actually access/modify the values
-   * (if we are accessing) construct the return object
-
-5. As a corollary to the decoupling between 4.i and 4.iii, any code which deals with how data is stored
-   (including any combination of handling multiple dtypes, and sparse storage, categoricals, third-party types)
-   must be independent from code that deals with identifying affected rows/columns,
-   and take place only once step 4.i is completed.
-
-   * In particular, such code should most probably not live in `pandas/core/indexing.py`
-   * ... and must not depend in any way on the type(s) of axes (e.g. no `MultiIndex` special cases)
-
-6. As a corollary to point 1.i, `Index` (sub)classes must provide separate methods for any desired validity check of label(s) which does not involve actual lookup,
-   on the one side, and for any required conversion/adaptation/lookup of label(s), on the other.
-
-7. Use of trial and error should be limited, and anyway restricted to catch only exceptions
-   which are actually expected (typically `KeyError`).
-
-   * In particular, code should never (intentionally) raise new exceptions in the `except` portion of a `try... except`
-
-8. Any code portion which is not specific to setters and getters must be shared,
-   and when small differences in behavior are expected (e.g. getting with `.loc` raises for
-   missing labels, setting still doesn't), they can be managed with a specific parameter.
-
-### Numba-accelerated operations
-
-[Numba](https://numba.pydata.org) is a JIT compiler for Python code.
-We'd like to provide ways for users to apply their own Numba-jitted
-functions where pandas accepts user-defined functions (for example,
-`Series.apply`,
-`DataFrame.apply`,
-`DataFrame.applymap`, and in groupby and
-window contexts). This will improve the performance of
-user-defined functions in these operations by staying within compiled
-code.
-
-### Documentation improvements
-
-We'd like to improve the content, structure, and presentation of the
-pandas documentation. Some specific goals include
-
-- Overhaul the HTML theme with a modern, responsive design
-  (`15556`)
-- Improve the "Getting Started" documentation, designing and writing
-  learning paths for users of different backgrounds (e.g. brand new to
-  programming, familiar with other languages like R, already familiar
-  with Python).
-- Improve the overall organization of the documentation and specific
-  subsections of the documentation to make navigation and finding
-  content easier.
-
-### Performance monitoring
-
-Pandas uses [airspeed velocity](https://asv.readthedocs.io/en/stable/)
-to monitor for performance regressions. ASV itself is a fabulous tool,
-but requires some additional work to be integrated into an open source
-project's workflow.
-
-The [asv-runner](https://github.com/asv-runner) organization, currently
-made up of pandas maintainers, provides tools built on top of ASV. We
-have a physical machine for running a number of projects' benchmarks,
-and tools for managing the benchmark runs and reporting on results.
-
-We'd like to fund improvements and maintenance of these tools to
-
-- Be more stable. Currently, they're maintained on the nights and
-  weekends when a maintainer has free time.
-- Tune the system for benchmarks to improve stability, following
-  <https://pyperf.readthedocs.io/en/latest/system.html>
-- Build a GitHub bot to request ASV runs *before* a PR is merged.
-  Currently, the benchmarks are only run nightly.
```
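The removed "Numba-accelerated operations" item has since landed in part in the library itself. As a hedged illustration of the pattern that roadmap text describes, a sketch assuming pandas >= 2.2 with numba installed (`engine="numba"` on `DataFrame.apply` requires `raw=True`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000.0), "b": 2.0})

def row_sum(row):
    # With raw=True the UDF receives a plain ndarray, which Numba can compile.
    return (row * 2.0).sum()

# engine="numba" JIT-compiles the UDF to stay in compiled code;
# engine="python" is the default interpreted path.
result = df.apply(row_sum, axis=1, raw=True, engine="numba")
```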
