Open
I was interested in testing out TileDB for some in-house single-cell data I've been working with. I created a TileDB datastore in a Google Cloud bucket using tiledbsoma.io.from_anndata (uploading lots of h5ad files after running tiledbsoma.io.register_anndatas).
I then tried querying data from this TileDB datastore backed by a Google Cloud bucket (3 million cells in there).
A query like this (10k cells)
logger.info("starting quick check")
inds = range(10000)
with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
    with exp.axis_query(
        measurement_name,
        obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
    ) as query:
        adata = query.to_anndata(
            X_name=x_layer_name,
            column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
        )
logger.info("quick check done")
ran in 30 seconds, and I was thrilled!
But as soon as I tried to query 10k random cell indices, I ran into a long delay:
logger.info("starting quick shuffled check")
inds = np.arange(3_000_000)
inds_shuffled = np.random.permutation(inds)
inds = [i for i in inds_shuffled[:10000]]
with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
    with exp.axis_query(
        measurement_name,
        obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
    ) as query:
        adata = query.to_anndata(
            X_name=x_layer_name,
            column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
        )
logger.info("quick check done")
The above took an hour to run.
Questions:
- Am I doing something wrong / suboptimal above?
- Is this kind of much longer query time for random access expected? Is it just part of TileDB, where a truly random query forces TileDB to open a ton of tiles, and so it's just gonna take a really long time?
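
To make the second question concrete, here is a rough sketch of the effect I suspect. It is a one-dimensional simplification, not a model of how TileDB actually lays out the sparse X array: it assumes the cell axis is cut into fixed-size tiles and counts how many distinct tiles each selection falls into. `TILE_EXTENT` is a made-up value and `tiles_touched` is a hypothetical helper, not part of tiledbsoma; the real tiling depends on how the arrays were created.

```python
import numpy as np

# Assumption: a fixed tile extent along the cell axis. The value is made up
# for illustration and is NOT read from the actual TileDB array schema.
TILE_EXTENT = 2048
N_CELLS = 3_000_000
N_QUERY = 10_000


def tiles_touched(indices, tile_extent=TILE_EXTENT):
    """Count the distinct tiles (along one axis) containing at least one index."""
    tile_ids = np.asarray(indices) // tile_extent
    return int(np.unique(tile_ids).size)


rng = np.random.default_rng(0)

# First query: the first 10k cells, which sit next to each other.
contiguous = np.arange(N_QUERY)
# Second query: 10k cells drawn at random from all 3 million.
shuffled = rng.permutation(N_CELLS)[:N_QUERY]

n_contig = tiles_touched(contiguous)
n_random = tiles_touched(shuffled)
total_tiles = -(-N_CELLS // TILE_EXTENT)  # ceiling division

print(f"contiguous selection touches {n_contig} of {total_tiles} tiles")
print(f"random selection touches {n_random} of {total_tiles} tiles")
```

Under these assumed numbers the contiguous selection stays within 5 tiles, while the random selection lands in most of the roughly 1,465 tiles, so if read cost scales with the number of tiles fetched from the bucket, a large slowdown for random access would not be surprising.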
Thanks!!