
ingest() of SOMA Sparse Array with unspecified dimensions causes ZeroDivisionError #585

@jggatter

Description


Following up on #565, there is still some unsafe math happening: the size and dimensions of an NDSparseArray are taken from the full array domain (9223372036854773759 x 9223372036854773759) rather than from the non-empty domain:

        # Compute task parameters for main ingestion.
        if input_vectors_per_work_item == -1:
            # We scale the input_vectors_per_work_item to maintain the DEFAULT_PARTITION_BYTE_SIZE
            input_vectors_per_work_item = int(
                DEFAULT_PARTITION_BYTE_SIZE
                / dimensions
                / np.dtype(vector_type).itemsize
            )
        input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))
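
For context, those huge values are the array's full domain rather than the extent that actually holds data; the non-empty domain is what we want. A quick illustration with tiledb-py (hypothetical snippet, assuming a 2-D SOMA-style sparse array at embedding_uri):

import tiledb

with tiledb.open(embedding_uri) as A:
    # Full (allocated) domain of the second dimension: a huge int64 range,
    # which is where 9223372036854773759 comes from.
    print(A.schema.domain.dim(1).domain)
    # Non-empty domain: the coordinate ranges that actually contain data,
    # e.g. ((0, n_rows - 1), (0, 511)) for 512-dimensional embeddings.
    print(A.nonempty_domain())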

Reproduction:

index = tvs.ingest(
    index_uri=INDEX_URI,
    index_type='IVF_FLAT',
    source_uri=embedding_uri,
    source_type='TILEDB_SPARSE_ARRAY',
    # size=len(matrix),
    # dimensions=512,
    verbose=True,
)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[64], line 1
----> 1 index = tvs.ingest(
      2     index_uri=INDEX_URI,
      3     index_type='IVF_FLAT',
      4     source_uri=embedding_uri,
      5     source_type='TILEDB_SPARSE_ARRAY',
      6     # size=len(matrix),
      7     # dimensions=512,
      8     verbose=True,
      9 )

File ~/venv/cynapse312/lib/python3.12/site-packages/tiledb/vector_search/ingestion.py:3202, in ingest(index_type, index_uri, input_vectors, source_uri, source_type, external_ids, external_ids_uri, external_ids_type, updates_uri, index_timestamp, config, namespace, size, dimensions, partitions, num_subspaces, l_build, r_max_degree, training_sampling_policy, copy_centroids_uri, training_sample_size, training_input_vectors, training_source_uri, training_source_type, workers, input_vectors_per_work_item, max_tasks_per_stage, input_vectors_per_work_item_during_sampling, max_sampling_tasks, storage_version, verbose, trace_id, use_sklearn, mode, acn, ingest_resources, consolidate_partition_resources, copy_centroids_resources, random_sample_resources, kmeans_resources, compute_new_centroids_resources, assign_points_and_partial_new_centroids_resources, write_centroids_resources, partial_index_resources, distance_metric, normalized, **kwargs)
   3195 if input_vectors_per_work_item == -1:
   3196     # We scale the input_vectors_per_work_item to maintain the DEFAULT_PARTITION_BYTE_SIZE
   3197     input_vectors_per_work_item = int(
   3198         DEFAULT_PARTITION_BYTE_SIZE
   3199         / dimensions
   3200         / np.dtype(vector_type).itemsize
   3201     )
-> 3202 input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))
   3203 input_vectors_work_tasks = input_vectors_work_items
   3204 input_vectors_work_items_per_worker = 1

ZeroDivisionError: division by zero
[2025-09-30 15:51:43,641] [ingestion] [setup] [DEBUG] tiledb.cloud=0.14.2, tiledb=(0, 35, 0), libtiledb=(2, 29, 0)
[2025-09-30 15:51:44,131] [ingestion] [ingest] [DEBUG] Using dimensions: 9223372036854773759 (detected: 9223372036854773759)
[2025-09-30 15:51:44,132] [ingestion] [ingest] [DEBUG] Ingesting Vectors into 'tiledb://Cellarity-dev/s3://tiledb-dev/groups/vs_index_jg'
[2025-09-30 15:51:44,531] [ingestion] [ingest] [DEBUG] Group 'tiledb://Cellarity-dev/s3://tiledb-dev/groups/vs_index_jg' already exists
[2025-09-30 15:51:44,793] [ingestion] [ingest] [DEBUG] Input dataset size 9223372036854773759
[2025-09-30 15:51:44,794] [ingestion] [ingest] [DEBUG] Input dataset dimensions 9223372036854773759
[2025-09-30 15:51:44,794] [ingestion] [ingest] [DEBUG] Vector dimension type float32
[2025-09-30 15:51:44,795] [ingestion] [ingest] [DEBUG] Partitions 10000
[2025-09-30 15:51:44,795] [ingestion] [ingest] [DEBUG] Training sample size 0
[2025-09-30 15:51:44,796] [ingestion] [ingest] [DEBUG] Training source uri None and type None
[2025-09-30 15:51:44,796] [ingestion] [ingest] [DEBUG] Number of workers 1

What is happening is that input_vectors_per_work_item is set to 0: because dimensions is larger than DEFAULT_PARTITION_BYTE_SIZE, the quotient is below 1 and int() truncates it to 0, and that 0 is then used as the divisor when computing input_vectors_work_items:

import math

import numpy as np

DEFAULT_PARTITION_BYTE_SIZE = 1073741824
dimensions = 9223372036854773759   # bigger than `DEFAULT_PARTITION_BYTE_SIZE`!
size = 9223372036854773759
vector_type = np.float32

# DEFAULT_PARTITION_BYTE_SIZE / dimensions is < 1, so int() truncates it to 0 ...
input_vectors_per_work_item = int(
    DEFAULT_PARTITION_BYTE_SIZE
    / dimensions
    / np.dtype(vector_type).itemsize
)
# ... and that 0 then divides `size` -> ZeroDivisionError
input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))
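
One possible guard (just a sketch, not the library's actual fix) would be to fail fast with an actionable message instead of letting the zero divisor through:

# Hypothetical check between the two statements above:
if input_vectors_per_work_item <= 0:
    raise ValueError(
        f"dimensions={dimensions} means a single vector exceeds "
        f"DEFAULT_PARTITION_BYTE_SIZE={DEFAULT_PARTITION_BYTE_SIZE}; "
        "if the source is a sparse array, pass explicit `size` and "
        "`dimensions` to ingest()."
    )
input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))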

It's great that dimensions was added as a parameter, but new users don't know that they need to specify both size and dimensions when ingesting from a sparse array.
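
For anyone else hitting this, a workaround is to derive both values from the non-empty domain and pass them explicitly (sketch, assuming the same 2-D layout as above):

import tiledb
import tiledb.vector_search as tvs  # assuming this is the `tvs` alias used above

with tiledb.open(embedding_uri) as A:
    (row_lo, row_hi), (col_lo, col_hi) = A.nonempty_domain()

index = tvs.ingest(
    index_uri=INDEX_URI,
    index_type='IVF_FLAT',
    source_uri=embedding_uri,
    source_type='TILEDB_SPARSE_ARRAY',
    size=row_hi - row_lo + 1,        # number of vectors actually written
    dimensions=col_hi - col_lo + 1,  # vector length (512 in my case)
    verbose=True,
)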

As a side note, I'm a bit confused by the name dimensions. It doesn't seem to refer to the number of dimensions of the TileDB array (e.g. 2 for a 2D array) but rather to the non-empty length of the second dimension (i.e. soma_dim_1 for a SOMA array).
