-
Notifications
You must be signed in to change notification settings - Fork 1k
Vendor Pandas' to_xarray in cudf.pandas #21175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
9467108 to
70a209a
Compare
|
/ok to test 585504c |
| def _to_xarray(self): | ||
| # Call xarray conversion functions directly with self (the proxy object). | ||
| # We must pass the proxy (self), not the slow pandas object, because xarray | ||
| # does isinstance checks against pd.MultiIndex and pd.api.extensions.ExtensionArray. | ||
| # After cudf.pandas.install(), these refer to proxy classes. The slow object | ||
| # contains real pandas types that don't pass isinstance checks against the proxy | ||
| # classes. | ||
| xr = import_optional_dependency("xarray") | ||
| if self.ndim == 1: | ||
| return xr.DataArray.from_series(self) | ||
| else: | ||
| return xr.Dataset.from_dataframe(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it but can't we get around this problem by implementing to_xarray in cudf.DataFrame and cudf.Series and then raise NotImplementedError() instead of this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that at first, but I think I ran into the problem of xarray running an isinstance check between a real and proxy MultiIndex. Let me try it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I think this is correct. If we go with your approach,
DataFrame.to_xarray --> cudf.DataFrame.to_xarray --> Fails and fallsback to slow --> disable module acceleration and then call pd.DataFrame.to_xarray --> This will fail in xarray because of the isinstance checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we could probably patch xarray somehow to be more cudf.pandas friendly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I think this is correct. If we go with your approach,
DataFrame.to_xarray --> cudf.DataFrame.to_xarray --> Fails and fallsback to slow --> disable module acceleration and then call pd.DataFrame.to_xarray --> This will fail in xarray because of the isinstance checks
I see, got it. Can you link me to one such instance check? The current changes look good for me. We can revisit if patching xarray is needed if there are more issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are several others.
| pd.arrays.IntegerArray, | ||
| fast_to_slow=_Unusable(), | ||
| slow_to_fast=_Unusable(), | ||
| bases=(pd.api.extensions.ExtensionArray,), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Matt711 Can you see if we can proxy the ExtensionArray and then use that proxy class as base here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will work (going to let CI run). It also exposed another bug: __array_ufunc__ missing on datetime arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proxying ExtensionArrray seems to break alot things. My guess is we run into #19823. Going to revert for now
|
/ok to test 3562282 |
|
/ok to test efb849b |
Description
Contributes to #18659
xref #19827 The majority of xarray tests should be fixed in pandas 3. See #19827 (comment)
Checklist