Skip to content
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.2.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ including other versions of pandas.
Fixed regressions
~~~~~~~~~~~~~~~~~
- Fixed regression in :func:`merge_ordered` raising ``TypeError`` for ``fill_method="ffill"`` and ``how="left"`` (:issue:`57010`)
- Fixed regression in :meth:`DataFrame.loc` raising ``IndexError`` for non-unique, masked dtype indexes where result has more than 10,000 rows (:issue:`57027`)
- Fixed regression in :meth:`Series.pct_change` raising a ``ValueError`` for an empty :class:`Series` (:issue:`57056`)

.. ---------------------------------------------------------------------------
Expand Down
6 changes: 5 additions & 1 deletion pandas/_libs/index.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1220,6 +1220,7 @@ cdef class MaskedIndexEngine(IndexEngine):
n_alloc *= 2
if n_alloc > max_alloc:
n_alloc = max_alloc
result = np.resize(result, n_alloc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required to resize here? If so can we get rid of the resize that happens after this branch is completed? Might be misreading but it looks duplicative

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is required. The result size isn't known so I believe the resize check needs to be done each time a new value is set. I don't see a way to avoid it with the current algo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yea that makes sense. I think its worth making the conditional resize its own function then so we don't have as much duplication

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea - done


result[count] = na_idx
count += 1
Expand All @@ -1236,14 +1237,17 @@ cdef class MaskedIndexEngine(IndexEngine):
n_alloc *= 2
if n_alloc > max_alloc:
n_alloc = max_alloc
result = np.resize(result, n_alloc)

result[count] = j
count += 1
continue

# value not found
if count >= n_alloc:
n_alloc += 10_000
n_alloc *= 2
if n_alloc > max_alloc:
n_alloc = max_alloc
result = np.resize(result, n_alloc)
result[count] = -1
count += 1
Expand Down
12 changes: 12 additions & 0 deletions pandas/tests/indexing/test_loc.py
Original file line number Diff line number Diff line change
Expand Up @@ -3347,3 +3347,15 @@ def test_getitem_loc_str_periodindex(self):
index = pd.period_range(start="2000", periods=20, freq="B")
series = Series(range(20), index=index)
assert series.loc["2000-01-14"] == 9

def test_loc_nonunique_masked_index(self):
# GH 57027
ids = list(range(11))
index = Index(ids * 1000, dtype="Int64")
df = DataFrame({"val": np.arange(len(index))}, index=index)
result = df.loc[ids]
expected = DataFrame(
{"val": index.argsort(kind="stable")},
index=Index(np.array(ids).repeat(1000), dtype="Int64"),
)
tm.assert_frame_equal(result, expected)