
Support for sparse arrays with the Arrow Sparse Tensor format? #7377

@JulesGM

Feature request

AI in biology is becoming a big thing. One feature that would be a huge benefit to the field, and that Hugging Face Datasets doesn't currently have, is native support for sparse arrays.

Arrow has support for sparse tensors.
https://arrow.apache.org/docs/format/Other.html#sparse-tensor
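For readers unfamiliar with sparse layouts, here is a minimal plain-Python sketch of the coordinate (COO) idea, one of the index layouts described in the Arrow sparse tensor format linked above. It does not use the Arrow or Datasets APIs; the helper names `dense_to_coo` and `coo_to_dense` and the toy matrix are hypothetical, made up purely for illustration.

```python
# Illustrative only: a pure-Python COO (coordinate) layout, not the Arrow API.
# Only non-zero entries are stored, as (row, col) coordinates plus values.

def dense_to_coo(dense):
    """Return (coords, values, shape), keeping only the non-zero entries."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    coords, values = [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                coords.append((i, j))
                values.append(v)
    return coords, values, (n_rows, n_cols)


def coo_to_dense(coords, values, shape):
    """Rebuild the dense nested-list matrix from its COO parts."""
    n_rows, n_cols = shape
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (i, j), v in zip(coords, values):
        dense[i][j] = v
    return dense


# Hypothetical toy expression matrix: 3 cells x 5 genes, mostly zeros.
expression = [
    [0.0, 0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0, 0.0, 2.0],
]

coords, values, shape = dense_to_coo(expression)
print(f"stored {len(values)} of {shape[0] * shape[1]} entries")
```

In this toy case only 3 of the 15 entries need to be stored; real expression matrices are far larger, which is where a native sparse feature type would pay off.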

It would be a big deal if Hugging Face Datasets supported sparse tensors as a feature type, natively.

Motivation

This is important, for example, in the field of transcriptomics (modeling and understanding gene expression), because a large fraction of the genes are not expressed, so most entries are zero. More generally, sparse arrays are very common in science, so adding support for them would be very beneficial: it would make using Hugging Face Dataset objects a lot more straightforward and clean.

Your contribution

We can discuss this further once the team has commented on what they think of the feature, whether there were previous attempts at making it work, and how hard they estimate it would be. My intuition is that it should be fairly straightforward, as the Arrow backend already supports it.


    Labels

    enhancement (New feature or request)
