NEREUScode

Problem

The MultiIndex.difference method fails to remove entries when the index contains a PyArrow-backed timestamp level (timestamp[ns][pyarrow]). This occurs because membership checks compare tuples containing PyArrow scalar types directly, and those comparisons are unreliable, so entries that should be removed remain in the result.

Example:

# PyArrow timestamp index
df = DataFrame(...).astype({"date": "timestamp[ns][pyarrow]"}).set_index(["id", "date"])
idx_val = df.index[0]
new_index = df.index.difference([idx_val])  # Fails to remove idx_val

Solution
Code Conversion: Map the values in other (the argument passed to difference) to integer codes that are compatible with the original index's levels.

Engine Validation: Use the MultiIndex's internal engine for membership checks, ensuring accurate handling of PyArrow types.

Mask-Based Exclusion: Create a boolean mask to filter out matched entries, then reconstruct the index.
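
The three steps above can be sketched with public pandas API only. This is a minimal illustration, not the code in this PR: the helper name difference_via_codes is hypothetical, a plain Python set of code tuples stands in for the MultiIndex's internal engine, and the demo uses NumPy-backed timestamps so that it runs without PyArrow installed. Unlike MultiIndex.difference, the sketch neither sorts nor de-duplicates the result.

```python
import numpy as np
import pandas as pd


def difference_via_codes(index: pd.MultiIndex, other) -> pd.MultiIndex:
    """Hypothetical helper: drop entries of `index` that appear in `other`."""
    other = pd.MultiIndex.from_tuples(list(other), names=index.names)

    # Step 1 (code conversion): look up each value of `other` in the
    # corresponding level of `index`; values absent from a level map to -1.
    other_codes = np.column_stack(
        [
            level.get_indexer(other.get_level_values(i))
            for i, level in enumerate(index.levels)
        ]
    )
    # A row containing -1 cannot match any entry of `index`, so discard it.
    other_codes = other_codes[(other_codes >= 0).all(axis=1)]

    # Step 2 (membership): compare integer code tuples instead of tuples of
    # scalars. A set is used here in place of the internal engine.
    other_set = set(map(tuple, other_codes.tolist()))
    own_codes = np.column_stack([np.asarray(codes) for codes in index.codes])

    # Step 3 (mask-based exclusion): keep only the rows that were not matched.
    mask = np.array(
        [tuple(row) not in other_set for row in own_codes.tolist()], dtype=bool
    )
    return index[mask]


# Illustrative data (not taken from the PR).
idx = pd.MultiIndex.from_product(
    [[1, 2], pd.to_datetime(["2024-01-01", "2024-01-02"])],
    names=["id", "date"],
)
result = difference_via_codes(idx, [idx[0]])
```

Because the comparison happens on integer codes, it does not depend on how the scalar type of a level implements equality or hashing, which is the property the solution relies on.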

Testing
Added a test in pandas/tests/indexes/multi/test_setops.py that:

Creates a MultiIndex with PyArrow timestamps.

Validates difference correctly removes entries.

Skips the test if PyArrow is not installed.
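
A test along these lines might look like the sketch below. The test name and the data are assumptions made for illustration; the actual test added in pandas/tests/indexes/multi/test_setops.py is not shown in this description.

```python
import pandas as pd
import pytest


def test_difference_pyarrow_timestamp():
    # Skip rather than fail when the optional dependency is missing.
    pytest.importorskip("pyarrow")

    # Hypothetical data; the PR's own fixture is not reproduced here.
    df = pd.DataFrame(
        {
            "id": [1, 1, 2],
            "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
            "value": [10, 20, 30],
        }
    ).astype({"date": "timestamp[ns][pyarrow]"})
    idx = df.set_index(["id", "date"]).index

    result = idx.difference([idx[0]])

    # Exactly one entry should be removed, and it should be idx[0].
    assert len(result) == len(idx) - 1
    assert idx[0] not in list(result)
```

The final assertion compares against list(result) so that the check does not itself go through the MultiIndex membership path under test.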

Use Case Impact
Fixes scenarios where users filter hierarchical datasets with PyArrow timestamps, such as:

# Remove specific timestamps from a time-series index
clean_index = raw_index.difference(unwanted_timestamps)

Closes #61382.

@NEREUScode NEREUScode closed this May 2, 2025
Successfully merging this pull request may close these issues.

BUG: Multindex difference not working on columns with type Timestamp[ns][pyarrow]