ENH: Generalize groupby to better support ExtensionArray #53904

@MichaelTiemannOSC

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I have written changes to add uncertainties to Pint (hgrecco/pint#1615) and Pint-Pandas (hgrecco/pint-pandas#140). New developments in Pint and Pint-Pandas now deeply embrace the ExtensionArray API (which I also encouraged), but it's now causing my changes grief.

The uncertainties package uses wrapping functions to interoperate with floats and NumPy (https://pythonhosted.org/uncertainties/index.html). The uncertainty datatype is <class 'uncertainties.core.AffineScalarFunc'>, which is not hashable. I have largely been able to work around this within the EA framework, but I'm stuck on how to make these values work with groupby and related operations. I wonder whether the groupby functionality can be generalized to work better with unhashable EA types.

Feature Description

Here's an example of a small change that allows my EA type to interoperate with groupby. Specifically, it does not assume that a NaN value is np.nan; instead it returns whatever value isna identifies as NaN. In the case of uncertainties, that is typically ufloat(np.nan, 0), but it could be a UFloat with a np.nan nominal value, a np.nan error value, or both.

diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 1a17fef071..98e9c53c37 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -3080,7 +3080,10 @@ class GroupBy(BaseGroupBy[NDFrameT]):
                 """Helper function for first item that isn't NA."""
                 arr = x.array[notna(x.array)]
                 if not len(arr):
-                    return np.nan
+                    nan_arr = x.array[isna(x.array)]
+                    if not len(nan_arr):
+                        return np.nan
+                    return nan_arr[0]
                 return arr[0]
 
             if isinstance(obj, DataFrame):
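
The fallback in the diff above can be sketched outside pandas. This is only an illustration in plain Python: `Measurement`, `is_na`, and `first_non_na` are hypothetical names invented here, not part of pandas, Pint, or uncertainties. It returns the first non-NA element if there is one, otherwise the first NA element in the type's own representation, and only falls back to a plain float NaN when the input is empty.

```python
import math


class Measurement:
    """Hypothetical stand-in for a value with uncertainty (not the real UFloat)."""

    def __init__(self, nominal, error):
        self.nominal = nominal
        self.error = error

    def __eq__(self, other):
        return (self.nominal, self.error) == (other.nominal, other.error)

    # Defining __eq__ already disables hashing; stated explicitly for clarity.
    __hash__ = None


def is_na(value):
    """Assumed NA rule: a NaN nominal value or a NaN error counts as missing."""
    return math.isnan(value.nominal) or math.isnan(value.error)


def first_non_na(values, is_na, default=math.nan):
    """Return the first non-NA item, else the type's own NA item, else default."""
    present = [v for v in values if not is_na(v)]
    if present:
        return present[0]
    missing = [v for v in values if is_na(v)]
    if missing:
        # Keep the array's own NA representation instead of forcing a float NaN.
        return missing[0]
    return default


all_missing = [Measurement(math.nan, 0.0), Measurement(math.nan, 0.0)]
fallback = first_non_na(all_missing, is_na)
```

Here `fallback` is the first `Measurement` from the input rather than a bare float NaN, which mirrors what the patched helper would return for an all-NA group.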

But here's the really sticky problem:

diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py
index f0e4484f69..8b7f8e1aee 100644
--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -587,7 +587,7 @@ class BaseGrouper:
 
     def get_iterator(
         self, data: NDFrameT, axis: AxisInt = 0
-    ) -> Iterator[tuple[Hashable, NDFrameT]]:
+    ) -> Iterator[tuple[Hashable, NDFrameT]]:  # Does not work with non-hashable EA types
         """
         Groupby iterator

In the PintArray world (the ExtensionArray implemented in Pint-Pandas) I've been able to make factorize work independently of any Pandas changes, but the factorized results don't survive the subsequent groupby steps that come from splitting. And that's where I'm stuck.
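
To make the idea concrete, here is a minimal sketch of factorizing unhashable values by equality and then grouping on the resulting integer codes. It is plain Python, not the PintArray or pandas implementation; `Reading`, `factorize_by_equality`, and `group_positions` are hypothetical names, and the sketch assumes values compare equal with `==` (NA values, which do not equal themselves, would each get their own code here).

```python
class Reading:
    """Hypothetical unhashable value: supports == but not hash()."""

    def __init__(self, nominal, error):
        self.nominal = nominal
        self.error = error

    def __eq__(self, other):
        return (self.nominal, self.error) == (other.nominal, other.error)

    __hash__ = None


def factorize_by_equality(values):
    """Assign an integer code to each value using == instead of hashing.

    This is a quadratic scan, acceptable for a sketch but not for large data.
    """
    uniques = []
    codes = []
    for value in values:
        for index, unique in enumerate(uniques):
            if unique == value:
                codes.append(index)
                break
        else:
            uniques.append(value)
            codes.append(len(uniques) - 1)
    return codes, uniques


def group_positions(codes):
    """Group row positions by integer code, so the group keys stay hashable."""
    groups = {}
    for position, code in enumerate(codes):
        groups.setdefault(code, []).append(position)
    return groups


values = [Reading(1.0, 0.1), Reading(2.0, 0.2), Reading(1.0, 0.1)]
codes, uniques = factorize_by_equality(values)
groups = group_positions(codes)
```

The integer codes are hashable and can serve as group keys; the difficulty described above arises once the original unhashable values are needed again as keys, for example in an iterator typed as yielding `Hashable` keys.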

@andrewgsavage @rhshadrach @lebigot @hgrecco

Alternative Solutions

If the Pandas test framework could xfail unhashable EA types for groupby tests, that might be an acceptable workaround (I need to check with the Pint and Pint-Pandas maintainers).

Additional Context

No response
