grnboost2 fails on large datasets #41

@dmalzl

Description

Hi,

I am currently trying to use grnboost2 to infer GRNs from a dataset of around 120k cells, but I am unable to get it to run due to a hard limit imposed by dependencies of the dask distributed package (see here). In brief, dask has a hard limit of 4GB on the size of the dataset (data chunk) it can serialise. Anything above this results in the following error:

distributed.protocol.core - CRITICAL - Failed to Serialize
ValueError: bytes object is too large

To circumvent this, the developers suggest moving data generation into a separate task so that the workers generate their own data locally. A workaround would therefore be to allow passing paths to the data files and to move the read into the worker, so that only a couple of strings have to be serialised instead of the whole dataset.

I know this needs a bit more thought, especially to figure out the best strategy (e.g. a scheme that generates data chunks ahead of time, writes them to files, and then lets the workers read them back in), but it may be worthwhile in order to support larger datasets. A rough sketch of the idea is below.
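For illustration, something along these lines is what I have in mind. This is only a sketch of the pattern, not arboreto's current API: the `expression_chunks` directory, the `read_chunk` helper, and the parquet format are placeholders I made up for the example.

```python
from pathlib import Path

import pandas as pd
from distributed import Client, LocalCluster


def read_chunk(path):
    # Runs on the worker: read one pre-written chunk of the expression
    # matrix from local/shared storage instead of receiving it over the wire.
    return pd.read_parquet(path)


if __name__ == '__main__':
    client = Client(LocalCluster())

    # Hypothetical chunk files written ahead of time by the client.
    chunk_paths = sorted(Path('expression_chunks').glob('chunk_*.parquet'))

    # Only the path strings are serialised; each worker loads its own chunk,
    # so the 4GB serialisation limit is never hit. Downstream GRN-inference
    # tasks could then be submitted against these futures.
    futures = [client.submit(read_chunk, str(p)) for p in chunk_paths]
```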
