grnboost2 fails on large datasets #41

@dmalzl

Description

Hi,

I am currently trying to use grnboost2 to infer GRNs from a dataset of around 120k cells, but I am unable to get it to run due to a hard limit imposed by dependencies of the dask distributed package (see here). In brief, dask has a hard limit of 4GB on the size of the dataset (data chunk) it can serialise. Anything above this results in the following error:

distributed.protocol.core - CRITICAL - Failed to Serialize
ValueError: bytes object is too large

To circumvent this, the developers suggest moving data generation into a separate task so that the workers generate their own data locally. A workaround would therefore be to allow passing paths to the data files and to move the read into the worker, so that only a couple of strings have to be serialised instead of the whole dataset.

I know this needs a bit more thought, especially to figure out the best strategy (e.g. a scheme that generates data chunks ahead of time, writes them to files, and then lets the workers read them back in), but it may be worthwhile in order to support larger datasets. A rough sketch of the idea is below.
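For illustration, something along these lines is what I have in mind. This is only a sketch of the pattern, not arboreto's current API: the `expression_chunks` directory, the `read_chunk` helper, and the parquet format are placeholders I made up for the example.

```python
from pathlib import Path

import pandas as pd
from distributed import Client, LocalCluster


def read_chunk(path):
    # Runs on the worker: read one pre-written chunk of the expression
    # matrix from local/shared storage instead of receiving it over the wire.
    return pd.read_parquet(path)


if __name__ == '__main__':
    client = Client(LocalCluster())

    # Hypothetical chunk files written ahead of time by the client.
    chunk_paths = sorted(Path('expression_chunks').glob('chunk_*.parquet'))

    # Only the path strings are serialised; each worker loads its own chunk,
    # so the 4GB serialisation limit is never hit. Downstream GRN-inference
    # tasks could then be submitted against these futures.
    futures = [client.submit(read_chunk, str(p)) for p in chunk_paths]
```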
