[User question] speed of a random access query #3990

I wanted to test TileDB on some in-house single-cell data I've been working with. I created a TileDB datastore in a Google Cloud bucket using `tiledbsoma.io.from_anndata`, uploading lots of h5ad files after first running `tiledbsoma.io.register_anndatas`.

I then tried querying data from this bucket-backed datastore, which holds 3 million cells.

A query like this (the first 10k cells, by contiguous index)
```python
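import logging

import tiledbsoma

logger = logging.getLogger(__name__)

# tiledb_bucket_path, measurement_name, and x_layer_name are defined elsewhere.
logger.info("starting quick check")
inds = range(10000)  # first 10k cells, contiguous indices
with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
    with exp.axis_query(
        measurement_name,
        obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
    ) as query:
        # Fetch only soma_joinid on both axes to keep the obs/var reads small.
        adata = query.to_anndata(
            X_name=x_layer_name,
            column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
        )
        logger.info("quick check done")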
```
ran in 30 seconds, and I was thrilled!

But as soon as I tried to query 10k random cell indices, I ran into a long delay:
```python
import numpy as np

logger.info("starting quick shuffled check")
# 10k cell indices drawn at random (without replacement) from all 3M cells
inds = np.random.permutation(3_000_000)[:10000].tolist()
with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
    with exp.axis_query(
        measurement_name,
        obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
    ) as query:
        adata = query.to_anndata(
            X_name=x_layer_name,
            column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
        )
        logger.info("quick shuffled check done")
```
The above took an hour to run.

**Questions:**

1. Am I doing something wrong / suboptimal above?
2. Is a much longer query time expected for random access? Is it just inherent to TileDB, in that a truly random query forces it to open a ton of tiles and so is always going to take a really long time?
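
To make question 2 concrete, here is a rough back-of-the-envelope sketch of what I'm picturing. It only counts how many distinct row tiles each set of indices would fall into; it does not touch TileDB at all. The tile extent of 2048 is a hypothetical value I picked for illustration (I haven't checked the actual extent of my arrays), and `tiles_touched` is just a helper name I made up.

```python
import numpy as np

N_CELLS = 3_000_000  # cells in the experiment
N_QUERY = 10_000     # cells requested per query
TILE_EXTENT = 2048   # hypothetical tile extent along the obs axis (assumption)


def tiles_touched(indices, tile_extent=TILE_EXTENT):
    """Count the distinct row tiles that a set of obs indices falls into."""
    return int(np.unique(np.asarray(indices) // tile_extent).size)


rng = np.random.default_rng(0)
contiguous = np.arange(N_QUERY)
shuffled = rng.choice(N_CELLS, size=N_QUERY, replace=False)

print("contiguous:", tiles_touched(contiguous))
print("shuffled:  ", tiles_touched(shuffled))
```

Under that assumed extent, the contiguous query lands in 5 tiles, while the shuffled one lands in most of the roughly 1,465 tiles along the obs axis. If the real layout is anything like this, the random query would have to fetch a large share of the array from the bucket, which might explain the gap.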

Thanks!!
