
ingest() of SOMA Sparse Array with unspecified dimensions causes ZeroDivisionError #585

@jggatter

Description


Following up on #565, there is still some unsafe math happening: the size and dimensions of an NDSparseArray are taken from the full array domain (9223372036854773759 x 9223372036854773759) rather than from the non-empty domain:

        # Compute task parameters for main ingestion.
        if input_vectors_per_work_item == -1:
            # We scale the input_vectors_per_work_item to maintain the DEFAULT_PARTITION_BYTE_SIZE
            input_vectors_per_work_item = int(
                DEFAULT_PARTITION_BYTE_SIZE
                / dimensions
                / np.dtype(vector_type).itemsize
            )
        input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))
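
For context, those huge values are the array's full domain rather than the extent that actually holds data; the non-empty domain is what we want. A quick illustration with tiledb-py (hypothetical snippet, assuming a 2-D SOMA-style sparse array at embedding_uri):

import tiledb

with tiledb.open(embedding_uri) as A:
    # Full (allocated) domain of the second dimension: a huge int64 range,
    # which is where 9223372036854773759 comes from.
    print(A.schema.domain.dim(1).domain)
    # Non-empty domain: the coordinate ranges that actually contain data,
    # e.g. ((0, n_rows - 1), (0, 511)) for 512-dimensional embeddings.
    print(A.nonempty_domain())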

Reproduction:

index = tvs.ingest(
    index_uri=INDEX_URI,
    index_type='IVF_FLAT',
    source_uri=embedding_uri,
    source_type='TILEDB_SPARSE_ARRAY',
    # size=len(matrix),
    # dimensions=512,
    verbose=True,
)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[64], line 1
----> 1 index = tvs.ingest(
      2     index_uri=INDEX_URI,
      3     index_type='IVF_FLAT',
      4     source_uri=embedding_uri,
      5     source_type='TILEDB_SPARSE_ARRAY',
      6     # size=len(matrix),
      7     # dimensions=512,
      8     verbose=True,
      9 )

File ~/venv/cynapse312/lib/python3.12/site-packages/tiledb/vector_search/ingestion.py:3202, in ingest(index_type, index_uri, input_vectors, source_uri, source_type, external_ids, external_ids_uri, external_ids_type, updates_uri, index_timestamp, config, namespace, size, dimensions, partitions, num_subspaces, l_build, r_max_degree, training_sampling_policy, copy_centroids_uri, training_sample_size, training_input_vectors, training_source_uri, training_source_type, workers, input_vectors_per_work_item, max_tasks_per_stage, input_vectors_per_work_item_during_sampling, max_sampling_tasks, storage_version, verbose, trace_id, use_sklearn, mode, acn, ingest_resources, consolidate_partition_resources, copy_centroids_resources, random_sample_resources, kmeans_resources, compute_new_centroids_resources, assign_points_and_partial_new_centroids_resources, write_centroids_resources, partial_index_resources, distance_metric, normalized, **kwargs)
   3195 if input_vectors_per_work_item == -1:
   3196     # We scale the input_vectors_per_work_item to maintain the DEFAULT_PARTITION_BYTE_SIZE
   3197     input_vectors_per_work_item = int(
   3198         DEFAULT_PARTITION_BYTE_SIZE
   3199         / dimensions
   3200         / np.dtype(vector_type).itemsize
   3201     )
-> 3202 input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))
   3203 input_vectors_work_tasks = input_vectors_work_items
   3204 input_vectors_work_items_per_worker = 1

ZeroDivisionError: division by zero
[2025-09-30 15:51:43,641] [ingestion] [setup] [DEBUG] tiledb.cloud=0.14.2, tiledb=(0, 35, 0), libtiledb=(2, 29, 0)
[2025-09-30 15:51:44,131] [ingestion] [ingest] [DEBUG] Using dimensions: 9223372036854773759 (detected: 9223372036854773759)
[2025-09-30 15:51:44,132] [ingestion] [ingest] [DEBUG] Ingesting Vectors into 'tiledb://Cellarity-dev/s3://tiledb-dev/groups/vs_index_jg'
[2025-09-30 15:51:44,531] [ingestion] [ingest] [DEBUG] Group 'tiledb://Cellarity-dev/s3://tiledb-dev/groups/vs_index_jg' already exists
[2025-09-30 15:51:44,793] [ingestion] [ingest] [DEBUG] Input dataset size 9223372036854773759
[2025-09-30 15:51:44,794] [ingestion] [ingest] [DEBUG] Input dataset dimensions 9223372036854773759
[2025-09-30 15:51:44,794] [ingestion] [ingest] [DEBUG] Vector dimension type float32
[2025-09-30 15:51:44,795] [ingestion] [ingest] [DEBUG] Partitions 10000
[2025-09-30 15:51:44,795] [ingestion] [ingest] [DEBUG] Training sample size 0
[2025-09-30 15:51:44,796] [ingestion] [ingest] [DEBUG] Training source uri None and type None
[2025-09-30 15:51:44,796] [ingestion] [ingest] [DEBUG] Number of workers 1

What is happening is that input_vectors_per_work_item is set to 0: because dimensions is larger than DEFAULT_PARTITION_BYTE_SIZE, the quotient is below 1 and int() truncates it to 0, and that 0 is then used as the divisor when computing input_vectors_work_items:

import math

import numpy as np

DEFAULT_PARTITION_BYTE_SIZE = 1073741824
dimensions = 9223372036854773759   # bigger than `DEFAULT_PARTITION_BYTE_SIZE`!
size = 9223372036854773759
vector_type = np.float32

# DEFAULT_PARTITION_BYTE_SIZE / dimensions is < 1, so int() truncates it to 0 ...
input_vectors_per_work_item = int(
    DEFAULT_PARTITION_BYTE_SIZE
    / dimensions
    / np.dtype(vector_type).itemsize
)
# ... and that 0 then divides `size` -> ZeroDivisionError
input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))
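
One possible guard (just a sketch, not the library's actual fix) would be to fail fast with an actionable message instead of letting the zero divisor through:

# Hypothetical check between the two statements above:
if input_vectors_per_work_item <= 0:
    raise ValueError(
        f"dimensions={dimensions} means a single vector exceeds "
        f"DEFAULT_PARTITION_BYTE_SIZE={DEFAULT_PARTITION_BYTE_SIZE}; "
        "if the source is a sparse array, pass explicit `size` and "
        "`dimensions` to ingest()."
    )
input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))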

It's great that dimensions was added as a parameter, but new users don't know that they need to specify both size and dimensions when ingesting from a sparse array.
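
For anyone else hitting this, a workaround is to derive both values from the non-empty domain and pass them explicitly (sketch, assuming the same 2-D layout as above):

import tiledb
import tiledb.vector_search as tvs  # assuming this is the `tvs` alias used above

with tiledb.open(embedding_uri) as A:
    (row_lo, row_hi), (col_lo, col_hi) = A.nonempty_domain()

index = tvs.ingest(
    index_uri=INDEX_URI,
    index_type='IVF_FLAT',
    source_uri=embedding_uri,
    source_type='TILEDB_SPARSE_ARRAY',
    size=row_hi - row_lo + 1,        # number of vectors actually written
    dimensions=col_hi - col_lo + 1,  # vector length (512 in my case)
    verbose=True,
)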

As a side note, I'm a bit confused by the name dimensions. It doesn't seem to refer to the number of dimensions of the TileDB array (e.g. 2 for a 2D array) but rather to the non-empty length of the second dimension (i.e. soma_dim_1 for a SOMA array).
