Why call tolist() when constructing pandas MultiIndex? #10617

@y4n9squared

What is your issue?

I have a workload that calls ds.to_dataframe(), which takes several seconds because the output DataFrame has 10M+ rows. On my machine, the vast majority (>99%) of the time spent in xr.Dataset.to_dataframe() goes to constructing the pd.MultiIndex. Within that, >80% of the time is spent calling tolist(), which forces the pd.MultiIndex constructor to iterate through a list rather than an ndarray.

On line 180:

https://github.com/pydata/xarray/blob/main/xarray/core/coordinates.py#L180

Is there a reason to call .tolist() rather than keeping the object as an ndarray? Removing .tolist() results in a significant performance improvement for me.
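For anyone who wants to reproduce the comparison without xarray, here is a minimal sketch that uses pandas alone. It assumes the codes passed to pd.MultiIndex form a cartesian product built with np.repeat and np.tile. The helper `build_codes`, the level sizes, and the level names are made up for illustration and are not taken from the xarray source. Timings will vary with the pandas version and machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd


def build_codes(sizes):
    """Return cartesian-product integer codes (one ndarray per level).

    Hypothetical helper for this sketch; not xarray's implementation.
    """
    codes = []
    repeat = int(np.prod(sizes))
    tile = 1
    for n in sizes:
        repeat //= n
        # Each level's codes are repeated for the inner levels and
        # tiled for the outer levels.
        codes.append(np.tile(np.repeat(np.arange(n), repeat), tile))
        tile *= n
    return codes


sizes = (100, 100, 100)  # 1M rows; increase to approach the 10M+ case
levels = [pd.Index(np.arange(n)) for n in sizes]
names = ["x", "y", "z"]
codes_arr = build_codes(sizes)

# Variant 1: convert each code array to a Python list first.
t0 = time.perf_counter()
idx_from_lists = pd.MultiIndex(
    levels=levels, codes=[c.tolist() for c in codes_arr], names=names
)
t1 = time.perf_counter()

# Variant 2: pass the ndarrays directly.
idx_from_arrays = pd.MultiIndex(levels=levels, codes=codes_arr, names=names)
t2 = time.perf_counter()

print(f"codes as lists:    {t1 - t0:.3f} s")
print(f"codes as ndarrays: {t2 - t1:.3f} s")
print("same index:", idx_from_lists.equals(idx_from_arrays))
```

Both variants should produce the same MultiIndex, so the only thing the sketch measures is the cost of the list conversion plus the constructor's handling of list input.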
