[User question] speed of a random access query #3990

@sjfleming

I'm interested in testing out TileDB on some in-house single-cell data I've been working with. I created a TileDB datastore in a Google Cloud bucket using tiledbsoma.io.from_anndata (uploading lots of h5ad files after running tiledbsoma.io.register_anndatas).

I then tried querying data from this bucket-backed datastore, which holds 3 million cells.

A query like this (10k contiguous cells):

    logger.info("starting quick check")
    inds = range(10000)
    with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
        with exp.axis_query(
            measurement_name,
            obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
        ) as query:
            adata = query.to_anndata(
                X_name=x_layer_name,
                column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
            )
            logger.info("quick check done")

ran in 30 seconds, and I was thrilled!

But as soon as I tried to query 10k random cell indices, I ran into a long delay:

    logger.info("starting quick shuffled check")
    inds = np.arange(3_000_000)
    inds_shuffled = np.random.permutation(inds)
    # keep the first 10k shuffled indices as a plain Python list
    inds = inds_shuffled[:10000].tolist()
    with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
        with exp.axis_query(
            measurement_name,
            obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
        ) as query:
            adata = query.to_anndata(
                X_name=x_layer_name,
                column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
            )
            logger.info("quick shuffled check done")

The above took an hour to run.

Questions:

  1. Am I doing something wrong / suboptimal above?
  2. Is this much longer query time expected for random access? Is it inherent to TileDB, where a truly random query forces it to open a ton of tiles, so it's just going to take a really long time?
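
A rough way to make the intuition in question 2 concrete is to count how many tiles each set of indices would touch along the obs axis. This is only a sketch: the tile extent of 2048 is a hypothetical value (the real one comes from the array schema), `tiles_touched` is an illustrative helper and not part of tiledbsoma, and it models a single dimension while ignoring how the sparse X layer is actually laid out on disk.

```python
import numpy as np


def tiles_touched(indices, tile_extent):
    """Count distinct tiles along one dimension that contain at least one index."""
    indices = np.asarray(indices)
    return np.unique(indices // tile_extent).size


n_cells = 3_000_000   # cells in the datastore
n_query = 10_000      # cells requested per query
tile_extent = 2048    # hypothetical; check the real array schema

# Same two access patterns as the queries above.
contiguous = np.arange(n_query)
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_cells)[:n_query]

total_tiles = -(-n_cells // tile_extent)  # ceiling division
print("total tiles along obs:", total_tiles)
print("contiguous query touches:", tiles_touched(contiguous, tile_extent))
print("random query touches:", tiles_touched(shuffled, tile_extent))
```

Under that assumed extent, the contiguous query touches 5 tiles, while the random one lands in nearly every tile along the axis, so a large gap in read volume would not be surprising. Whether that fully explains 30 seconds versus an hour is a separate question.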

Thanks!!
