Conversation

akj2018
Contributor

@akj2018 akj2018 commented Jan 19, 2025

Issue

Series.isin() raises "TypeError: boolean value of NA is ambiguous" when the Series is large enough (> 1_000_000 elements) and values contains pd.NA.

Reason

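  • For a large series and a small number of values, Series.isin() internally uses np.isin() to improve performance. The dispatch check below only inspects the dtype of the series (comps_array), so it does not handle the case where values has dtype=object and contains NA, and it passes such values on to np.isin: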

# GH16012
# Ensure np.isin doesn't get object types or it *may* throw an exception
# Albeit hashmap has O(1) look-up (vs. O(logn) in sorted array),
# isin is faster for small sizes
if (
    len(comps_array) > _MINIMUM_COMP_ARR_LEN
    and len(values) <= 26
    and comps_array.dtype != object
):
    # If the values include nan we need to check for nan explicitly
    # since np.nan it not equal to np.nan
    if isna(values).any():

        def f(c, v):
            return np.logical_or(np.isin(c, v).ravel(), np.isnan(c))

    else:
        f = lambda a, b: np.isin(a, b).ravel()
else:
    common = np_find_common_type(values.dtype, comps_array.dtype)
    values = values.astype(common, copy=False)
    comps_array = comps_array.astype(common, copy=False)
    f = htable.ismember

return f(comps_array, values)

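  • Inside NumPy, the membership test then runs a comparison loop over the values:

            mask = np.zeros(len(ar1), dtype=bool)
            for a in ar2:
                mask |= (ar1 == a)
  • When a is pd.NA, the statement mask |= (ar1 == a) raises a TypeError because the boolean value of pd.NA is ambiguous. See the pandas documentation on pd.NA.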
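The failure mode described above can be reproduced without pandas or NumPy. The following is only a minimal sketch: `_AmbiguousNA` and `naive_isin` are hypothetical names invented for illustration, with `_AmbiguousNA` standing in for pd.NA (comparison propagates the sentinel, truth-testing is refused) and `naive_isin` mimicking the comparison loop on plain lists.

```python
class _AmbiguousNA:
    """Hypothetical stand-in for pd.NA, not the pandas implementation."""

    def __eq__(self, other):
        # Comparing anything with the sentinel yields the sentinel again.
        return self

    def __hash__(self):
        return id(self)

    def __bool__(self):
        # Forcing the comparison result into a boolean is refused.
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


NA = _AmbiguousNA()


def naive_isin(comps, values):
    """Membership test via a comparison loop, in the style of `mask |= (ar1 == a)`."""
    mask = [False] * len(comps)
    for a in values:
        # bool(c == a) is the step that fails when `a` is the NA stand-in.
        mask = [m | bool(c == a) for m, c in zip(mask, comps)]
    return mask


print(naive_isin([True, False, True], [True]))

try:
    naive_isin([True, False], [NA])
except TypeError as exc:
    print(f"TypeError: {exc}")
```

With ordinary values the loop behaves as expected; once the sentinel appears among the values, the comparison result can no longer be reduced to True or False and the loop aborts.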

Fix Implemented

In algorithms.py, explicitly check whether values contains NA when the series is large and the number of values is small (<= 26), and skip the np.isin path in that case.

from pandas._libs.missing import NA

def isin(comps: ListLike, values: ListLike) -> npt.NDArray[np.bool_]:
    ...

    # GH60678
    # Ensure values don't contain <NA>, otherwise it throws exception with np.in1d
    values_contains_NA = False

    # Only scan values when the np.isin path could be taken at all
    if comps_array.dtype != object and len(values) <= 26:
        values_contains_NA = any(v is NA for v in values)

    if (
        len(comps_array) > _MINIMUM_COMP_ARR_LEN
        and len(values) <= 26
        and comps_array.dtype != object
        and not values_contains_NA
    ):
        ...
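The effect of the extra condition on path selection can be sketched in isolation. This is an illustrative sketch only, not the pandas code: `choose_path`, its parameters, and the returned labels are hypothetical, `NA` is a plain sentinel object standing in for pandas' NA singleton, and the default `min_len` simply reuses the 1_000_000 figure from the issue description.

```python
NA = object()  # hypothetical stand-in for pandas' NA singleton


def choose_path(n_comps, values, comps_is_object, min_len=1_000_000, max_values=26):
    """Return which membership routine the guarded condition would select."""
    # Identity check, as in the fix: `v is NA` never truth-tests a comparison result.
    values_contains_na = any(v is NA for v in values)

    if (
        n_comps > min_len
        and len(values) <= max_values
        and not comps_is_object
        and not values_contains_na
    ):
        return "np.isin"
    return "hashtable"


print(choose_path(2_000_000, [True, False], comps_is_object=False))
print(choose_path(2_000_000, [True, NA], comps_is_object=False))
print(choose_path(10, [True], comps_is_object=False))
```

Using `is` rather than `==` for the scan matters here: an equality comparison against NA would itself produce an NA result and run into the same ambiguity the guard is meant to avoid.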

Testing

All existing test cases in test_isin.py pass. New tests were added for large series with dtype boolean, Int64 and Float64, as follows:

  1. Series dtype==boolean and values contains pd.NA
  2. Series dtype==boolean and values contains mixed data with pd.NA
  3. Series dtype==boolean and values is empty
  4. Series dtype==Int64 and values contains pd.NA
  5. Series dtype==Float64 and values contains pd.NA

@akj2018 akj2018 force-pushed the bugfix/60678-pdNA-error branch from 7a3e501 to cb16826 Compare January 20, 2025 01:14
@akj2018 akj2018 requested a review from mroeschke January 22, 2025 20:21
@mroeschke mroeschke added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) and Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Jan 22, 2025
@mroeschke mroeschke added this to the 3.0 milestone Jan 22, 2025
@mroeschke mroeschke merged commit 1d33e4c into pandas-dev:main Jan 22, 2025
60 checks passed
@mroeschke
Member

Thanks @akj2018

Development

Successfully merging this pull request may close these issues.

BUG: boolean series .isin([pd.NA])] inconsistent for series length

2 participants