Skip call to .tolist() when creating pd.Index

y4n9squared · y4n9squared · commit 24bfcc9aaf34 · 2025-08-08T10:14:31.000-04:00
`np.tile` returns an NDArray, so there is no need to convert it to a Python `list` before passing it to `pd.MultiIndex`. Pandas only requires the object to be array-like, and one of the first things the constructor does is coerce the list back to an NDArray.

For arrays with large coordinate axes, `to_dataframe()` is extremely slow because Pandas has to iterate through a `list` object rather than an array. For a (1000, 500, 20) array (10M rows in the cartesian product), this results in a ~20x speed-up for `xr.Dataset.to_dataframe()` (tested on x86 and Apple Silicon).

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.ones((1000, 500, 20)), name="foo")
da.to_dataframe()
```

Closes #10617
diff --git a/xarray/core/coordinates.py b/xarray/core/coordinates.py
@@ -177,7 +177,7 @@ def to_index(self, ordered_dims: Sequence[Hashable] | None = None) -> pd.Index:
 
                 # compute the cartesian product
                 code_list += [
-                    np.tile(np.repeat(code, repeat_counts[i]), tile_counts[i]).tolist()
+                    np.tile(np.repeat(code, repeat_counts[i]), tile_counts[i])
                     for code in codes
                 ]
                 level_list += levels
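The hunk above can be illustrated outside xarray with a minimal sketch. It assumes two small, made-up levels and hypothetical variable names (`levels`, `sizes`, `codes_as_arrays`, `codes_as_lists`); it is not the code in `to_index`. It only shows that `pd.MultiIndex` builds the same index whether the cartesian-product codes arrive as NumPy arrays or as Python lists.

```python
import numpy as np
import pandas as pd

# Hypothetical level values for two dimensions (not taken from xarray).
levels = [pd.Index(["a", "b"]), pd.Index([10, 20, 30])]
sizes = [len(level) for level in levels]

# Expand per-dimension integer codes to the cartesian product:
# dimension k is repeated by the product of the later sizes and
# tiled by the product of the earlier sizes.
codes_as_arrays = []
for k, size in enumerate(sizes):
    repeat = int(np.prod(sizes[k + 1:]))
    tile = int(np.prod(sizes[:k]))
    codes_as_arrays.append(np.tile(np.repeat(np.arange(size), repeat), tile))

# The pre-change behaviour: the same codes converted to Python lists.
codes_as_lists = [codes.tolist() for codes in codes_as_arrays]

index_from_arrays = pd.MultiIndex(levels=levels, codes=codes_as_arrays)
index_from_lists = pd.MultiIndex(levels=levels, codes=codes_as_lists)

# Both inputs describe the same six-row index.
print(index_from_arrays.equals(index_from_lists))
```

Since both inputs yield an equal index, dropping `.tolist()` should only change how much work is done, not the result.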
