Skip call to .tolist() when creating pd.Index #10619

Merged: 1 commit, Aug 13, 2025

y4n9squared
Contributor

`np.tile` returns an `ndarray`, so there is no need to convert it to a Python `list` before passing it to `pd.MultiIndex`. Pandas only requires the object to be array-like, and one of the first things the constructor does is coerce the list back to an `ndarray`.
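A minimal sketch of the point above, assuming only NumPy and pandas. This is not the actual xarray code: the `product_codes` helper, the level values, and the names `x` and `y` are made up for illustration. It builds the cartesian-product codes with `np.repeat` and `np.tile`, then checks that passing arrays or lists to `pd.MultiIndex` gives the same index.

```python
import numpy as np
import pandas as pd

# Hypothetical levels for two coordinate axes.
levels = [pd.Index([10, 20, 30]), pd.Index(["a", "b"])]
sizes = [len(level) for level in levels]


def product_codes(sizes, as_list):
    """Integer codes for the cartesian product of axes with the given sizes."""
    codes = []
    for i, n in enumerate(sizes):
        repeat = int(np.prod(sizes[i + 1:]))  # product of the sizes of later axes
        tile = int(np.prod(sizes[:i]))        # product of the sizes of earlier axes
        code = np.tile(np.repeat(np.arange(n), repeat), tile)
        # as_list=True mimics the old behaviour of calling .tolist() first
        codes.append(code.tolist() if as_list else code)
    return codes


from_list = pd.MultiIndex(
    levels=levels, codes=product_codes(sizes, as_list=True), names=["x", "y"]
)
from_array = pd.MultiIndex(
    levels=levels, codes=product_codes(sizes, as_list=False), names=["x", "y"]
)

# Both inputs are accepted and produce the same index.
assert from_list.equals(from_array)
```

The only difference between the two calls is the type of the `codes` entries, so skipping `.tolist()` should not change the resulting index.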

For arrays with large coordinate axes, `to_dataframe()` is extremely slow because Pandas has to iterate through a `list` object rather than an array.

For a (1000, 500, 20) array, i.e. 10M rows in the cartesian product, this results in a ~20x speed-up for `xr.Dataset.to_dataframe()` (tested on x86 and Apple Silicon).

import numpy as np
import xarray as xr
da = xr.DataArray(np.ones((1000, 500, 20)), name="foo")
da.to_dataframe()  # builds a MultiIndex over the 10M-row cartesian product
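For anyone who wants to compare the two paths without installing a patched xarray, here is a rough timing sketch using only NumPy and pandas. The `build_index` helper and the sizes are assumptions for illustration, not xarray internals, and the array is much smaller than the one above so it runs quickly. Timings will vary by machine and pandas version, so no particular ratio is claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def build_index(sizes, as_list):
    """Build a cartesian-product MultiIndex over integer levels of the given sizes."""
    levels = [pd.Index(np.arange(n)) for n in sizes]
    codes = []
    for i, n in enumerate(sizes):
        repeat = int(np.prod(sizes[i + 1:]))  # product of the sizes of later axes
        tile = int(np.prod(sizes[:i]))        # product of the sizes of earlier axes
        code = np.tile(np.repeat(np.arange(n), repeat), tile)
        # as_list=True converts the codes to a Python list before pandas sees them
        codes.append(code.tolist() if as_list else code)
    return pd.MultiIndex(levels=levels, codes=codes)


sizes = (100, 50, 20)  # 100k rows, deliberately small for a quick check

# Each timing covers the whole helper, including the .tolist() call itself.
t_list = timeit.timeit(lambda: build_index(sizes, as_list=True), number=3)
t_array = timeit.timeit(lambda: build_index(sizes, as_list=False), number=3)

print(f"list codes:  {t_list:.4f} s")
print(f"array codes: {t_array:.4f} s")
```

Increasing `sizes` toward (1000, 500, 20) should make the gap more visible, at the cost of a longer run.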

welcome bot commented Aug 8, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

Closes pydata#10617
y4n9squared force-pushed the faster-to-dataframe branch from 81a836e to 24bfcc9 on August 8, 2025 14:14
Illviljan added the run-benchmark (Run the ASV benchmark workflow) label on Aug 8, 2025
Contributor

dcherian left a comment

Nice, thank you. Welcome to Xarray!

dcherian merged commit 3c9217e into pydata:main on Aug 13, 2025
53 of 57 checks passed
welcome bot commented Aug 13, 2025

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

Labels: run-benchmark (Run the ASV benchmark workflow), topic-performance
Linked issue: Why call tolist() when constructing pandas MultiIndex?
4 participants