feat: DataFrame API for custom obs and var #2328
❌ 26 Tests Failed:
class Dataset2D(Mapping[Hashable, XDataArray | Self]):
class Dataset2D:
> adopt `__dataframe__` protocol in a meaningful way ideally for writing, but potentially as a solution to the above two cases
So will Dataset2D eventually support the __dataframe__ protocol?
> So will Dataset2D eventually support the `__dataframe__` protocol?
Mmmm, that depends on whether I actually implement its use here, unless you have a use case (you probably do?). In theory I don't see why not, but I haven't investigated what it would involve. The repo that hosts the protocol hasn't seen activity for about 2 years, though the protocol is documented on the Arrow website: https://arrow.apache.org/docs/python/interchange_protocol.html
It looks like people in DataFrame land have lately been using PyCapsule for interchange; here it is in pandas.
Specifically, pandas seems to be fully on board with the Arrow PyCapsule Interface:
> For new development, we highly recommend using the Arrow C Data Interface alongside the Arrow PyCapsule Interface instead of the interchange protocol. From pandas 3.0 onwards, from_dataframe uses the PyCapsule Interface, only falling back to the interchange protocol if that fails.
> From pandas 4.0 onwards, that fallback will no longer be available and only the PyCapsule Interface will be used.
I guess with this now, do you think __dataframe__ is still worth your time and effort, if you'll have to add some backwards compatibility or special handling come pandas 4?
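To make the trade-off concrete, here is a minimal sketch of the "prefer PyCapsule, fall back to the interchange protocol" order described in the pandas note above. It only inspects which dunder an object advertises (`__arrow_c_stream__` for the Arrow PyCapsule Interface, `__dataframe__` for the interchange protocol). The helper `pick_interchange` and the two stand-in classes are hypothetical names for illustration; they are not part of this PR, pandas, or pyarrow.

```python
from typing import Any


def pick_interchange(obj: Any) -> str:
    """Report which interchange mechanism `obj` advertises (hypothetical helper)."""
    # Arrow PyCapsule Interface: tabular producers expose `__arrow_c_stream__`.
    if hasattr(obj, "__arrow_c_stream__"):
        return "pycapsule"
    # Older DataFrame interchange protocol.
    if hasattr(obj, "__dataframe__"):
        return "interchange"
    return "unsupported"


class CapsuleOnly:
    """Hypothetical stand-in that only advertises the PyCapsule stream export."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("sketch only: no real Arrow data behind this")


class InterchangeOnly:
    """Hypothetical stand-in that only advertises the interchange protocol."""

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        raise NotImplementedError("sketch only: no real data behind this")


for candidate in (CapsuleOnly(), InterchangeOnly(), object()):
    print(type(candidate).__name__, "->", pick_interchange(candidate))
```

An object that implements only `__dataframe__` lands in the fallback branch, which is the path the quoted note says will disappear in pandas 4.0.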
> I guess with this now, do you think `__dataframe__` is still worth your time and effort, if you'll have to add some backwards compatibility or special handling come pandas 4?
I'm definitely not tied to the idea, so if that is the better interchange, we should use that instead. Thanks for pointing this out!
7b7d16f to a69d374
cc @srivarra @Intron7
possible TODOs:
- `concat` and `merge` API (or disallow and keep pandas/xarray special-cased, would require `equal` implementation as well)
- `__dataframe__` protocol in a meaningful way ideally for writing, but potentially as a solution to the above two cases (i.e., drop into `pandas` and leave the data there): https://data-apis.org/dataframe-protocol/latest/
- `cuDF` or GPU-based `polars`
- `_gen_dataframe` special casing
- `to_numpy` method

This PR is somewhat self-testing because I will have in theory removed all references to both `pandas.DataFrame` and `XDataset`/`Dataset2D` everywhere except concatenation internally.

One downside of this approach is that it puts us somewhat on the hook for the `pandas.DataFrame` API, but the flipside is that if we choose to switch to another library, we could now do that relatively cleanly/more easily.

- `DataFrame` API for `obs` and `var` keys via runtime-checkable `Protocol` #2043, closes Native support for cuPy/cuDF backed Anndata #355
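For readers unfamiliar with the "runtime-checkable `Protocol`" idea referenced above, here is a minimal, stdlib-only sketch. The protocol name `DataFrameLike`, its members, and the `TinyFrame` class are assumptions chosen for illustration; the protocol actually defined in this PR may expose a different surface.

```python
from collections.abc import Hashable
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical minimal surface an obs/var container might need to expose."""

    @property
    def index(self) -> Any: ...

    @property
    def columns(self) -> Any: ...

    @property
    def shape(self) -> tuple[int, int]: ...

    def __getitem__(self, key: Hashable) -> Any: ...


class TinyFrame:
    """Hypothetical non-pandas container that satisfies the protocol structurally."""

    def __init__(self, index: list, data: dict[Hashable, list]) -> None:
        self._index = list(index)
        self._data = dict(data)

    @property
    def index(self) -> list:
        return self._index

    @property
    def columns(self) -> list:
        return list(self._data)

    @property
    def shape(self) -> tuple[int, int]:
        return (len(self._index), len(self._data))

    def __getitem__(self, key: Hashable) -> list:
        return self._data[key]


frame = TinyFrame(["cell_0", "cell_1"], {"n_genes": [10, 20]})
print(isinstance(frame, DataFrameLike))  # structural check, no inheritance needed
print(isinstance({}, DataFrameLike))     # a plain dict lacks index/columns/shape
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the named members exist, not their signatures or return types, so an API built this way would still need tests against each backing library.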