Description
Following up on #565, there is still some unsafe math happening because the size and dimensions of an NDSparseArray are assumed to be 9223372036854773759 x 9223372036854773759 instead of being taken from the non-empty domain:
# Compute task parameters for main ingestion.
if input_vectors_per_work_item == -1:
    # We scale the input_vectors_per_work_item to maintain the DEFAULT_PARTITION_BYTE_SIZE
    input_vectors_per_work_item = int(
        DEFAULT_PARTITION_BYTE_SIZE
        / dimensions
        / np.dtype(vector_type).itemsize
    )
input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))

Reproduction:
index = tvs.ingest(
    index_uri=INDEX_URI,
    index_type='IVF_FLAT',
    source_uri=embedding_uri,
    source_type='TILEDB_SPARSE_ARRAY',
    # size=len(matrix),
    # dimensions=512,
    verbose=True,
)

---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
Cell In[64], line 1
----> 1 index = tvs.ingest(
2 index_uri=INDEX_URI,
3 index_type='IVF_FLAT',
4 source_uri=embedding_uri,
5 source_type='TILEDB_SPARSE_ARRAY',
6 # size=len(matrix),
7 # dimensions=512,
8 verbose=True,
9 )
File ~/venv/cynapse312/lib/python3.12/site-packages/tiledb/vector_search/ingestion.py:3202, in ingest(index_type, index_uri, input_vectors, source_uri, source_type, external_ids, external_ids_uri, external_ids_type, updates_uri, index_timestamp, config, namespace, size, dimensions, partitions, num_subspaces, l_build, r_max_degree, training_sampling_policy, copy_centroids_uri, training_sample_size, training_input_vectors, training_source_uri, training_source_type, workers, input_vectors_per_work_item, max_tasks_per_stage, input_vectors_per_work_item_during_sampling, max_sampling_tasks, storage_version, verbose, trace_id, use_sklearn, mode, acn, ingest_resources, consolidate_partition_resources, copy_centroids_resources, random_sample_resources, kmeans_resources, compute_new_centroids_resources, assign_points_and_partial_new_centroids_resources, write_centroids_resources, partial_index_resources, distance_metric, normalized, **kwargs)
3195 if input_vectors_per_work_item == -1:
3196 # We scale the input_vectors_per_work_item to maintain the DEFAULT_PARTITION_BYTE_SIZE
3197 input_vectors_per_work_item = int(
3198 DEFAULT_PARTITION_BYTE_SIZE
3199 / dimensions
3200 / np.dtype(vector_type).itemsize
3201 )
-> 3202 input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))
3203 input_vectors_work_tasks = input_vectors_work_items
3204 input_vectors_work_items_per_worker = 1
ZeroDivisionError: division by zero
[2025-09-30 15:51:43,641] [ingestion] [setup] [DEBUG] tiledb.cloud=0.14.2, tiledb=(0, 35, 0), libtiledb=(2, 29, 0)
[2025-09-30 15:51:44,131] [ingestion] [ingest] [DEBUG] Using dimensions: 9223372036854773759 (detected: 9223372036854773759)
[2025-09-30 15:51:44,132] [ingestion] [ingest] [DEBUG] Ingesting Vectors into 'tiledb://Cellarity-dev/s3://tiledb-dev/groups/vs_index_jg'
[2025-09-30 15:51:44,531] [ingestion] [ingest] [DEBUG] Group 'tiledb://Cellarity-dev/s3://tiledb-dev/groups/vs_index_jg' already exists
[2025-09-30 15:51:44,793] [ingestion] [ingest] [DEBUG] Input dataset size 9223372036854773759
[2025-09-30 15:51:44,794] [ingestion] [ingest] [DEBUG] Input dataset dimensions 9223372036854773759
[2025-09-30 15:51:44,794] [ingestion] [ingest] [DEBUG] Vector dimension type float32
[2025-09-30 15:51:44,795] [ingestion] [ingest] [DEBUG] Partitions 10000
[2025-09-30 15:51:44,795] [ingestion] [ingest] [DEBUG] Training sample size 0
[2025-09-30 15:51:44,796] [ingestion] [ingest] [DEBUG] Training source uri None and type None
[2025-09-30 15:51:44,796] [ingestion] [ingest] [DEBUG] Number of workers 1
What is happening is that input_vectors_per_work_item is being set to 0 because dimensions > DEFAULT_PARTITION_BYTE_SIZE, and that zero is then used as the divisor when computing input_vectors_work_items:
import math

import numpy as np

DEFAULT_PARTITION_BYTE_SIZE = 1073741824
dimensions = 9223372036854773759  # bigger than `DEFAULT_PARTITION_BYTE_SIZE`!
size = 9223372036854773759
vector_type = np.float32

# DEFAULT_PARTITION_BYTE_SIZE / dimensions is a tiny fraction, so int() truncates it to 0.
input_vectors_per_work_item = int(
    DEFAULT_PARTITION_BYTE_SIZE
    / dimensions
    / np.dtype(vector_type).itemsize
)
input_vectors_work_items = int(math.ceil(size / input_vectors_per_work_item))  # ZeroDivisionError

It's great that dimensions was added as a parameter, but new users don't know that they need to specify size and dimensions.
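As a workaround until the defaults come from the non-empty domain, the values can be computed from the source array and passed explicitly. A rough sketch, assuming a 2D sparse array whose first dimension indexes the vectors and whose second dimension indexes the vector components (embedding_uri and INDEX_URI as in the reproduction above; cloud config/credentials omitted):

import tiledb
import tiledb.vector_search as tvs

# Non-empty domain of the 2D sparse array: one (min, max) pair per dimension.
with tiledb.open(embedding_uri) as A:
    (row_min, row_max), (col_min, col_max) = A.nonempty_domain()

size = int(row_max - row_min + 1)        # number of vectors actually written
dimensions = int(col_max - col_min + 1)  # length of each vector

index = tvs.ingest(
    index_uri=INDEX_URI,
    index_type='IVF_FLAT',
    source_uri=embedding_uri,
    source_type='TILEDB_SPARSE_ARRAY',
    size=size,
    dimensions=dimensions,
    verbose=True,
)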
As a side note, I'm a bit confused by the name dimensions. To me, it doesn't seem to refer to the number of dimensions of the TileDB array (e.g. 2 for a 2D array) but rather to the non-empty length of the second dimension (i.e. soma_dim_1 for a SOMA array).
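To make the distinction concrete, a small sketch against the same embedding_uri as above (512 is just the vector length in my data):

import tiledb

with tiledb.open(embedding_uri) as A:
    print(A.schema.ndim)           # 2: the number of TileDB dimensions of the array
    print(A.schema.domain.dim(1))  # the second dimension (soma_dim_1);
                                   # `dimensions` refers to its non-empty length
                                   # (512 here), not to the 2 above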