Skip to content

Commit 353f2d1

Browse files
authored
Addition of from_numpy Support for mode={ingest,schema_only,append} (#1185)
* Addition of `from_numpy` Support for `mode={ingest,schema_only,append}` * Addition of `start_idx` Parameter for `from_numpy` * Only Call `nonempty_domain` In Append Mode
1 parent ca88c61 commit 353f2d1

File tree

5 files changed

+267
-21
lines changed

5 files changed

+267
-21
lines changed

HISTORY.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
## API Changes
1010
* Support `QueryCondition` for dense arrays [#1198](https://github.com/TileDB-Inc/TileDB-Py/pull/1198)
1111
* Querying dense array with `[:]` returns shape that matches nonempty domain, consistent with `.df[:]` and `.multi_index[:]` [#1199](https://github.com/TileDB-Inc/TileDB-Py/pull/1199)
12+
* Addition of `from_numpy` support for `mode={ingest,schema_only,append}` [#1185](https://github.com/TileDB-Inc/TileDB-Py/pull/1185)
1213

1314
# TileDB-Py 0.16.1 Release Notes
1415

tiledb/dataframe_.py

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -405,7 +405,6 @@ def from_pandas(uri: str, dataframe: "pd.DataFrame", **kwargs):
405405
406406
:param uri: URI for new TileDB array
407407
:param dataframe: pandas DataFrame
408-
:param mode: Creation mode, one of 'ingest' (default), 'schema_only', 'append'
409408
410409
:Keyword Arguments:
411410
@@ -416,7 +415,7 @@ def from_pandas(uri: str, dataframe: "pd.DataFrame", **kwargs):
416415
which `tiledb.read_csv` checks for in order to correctly read a file batchwise.
417416
* **index_dims** (``List[str]``) -- List of column name(s) to use as dimension(s) in TileDB array schema. This is the recommended way to create dimensions.
418417
* **allows_duplicates** - Generated schema should allow duplicates
419-
* **mode** - (default ``ingest``), Ingestion mode: ``ingest``, ``schema_only``, ``append``
418+
* **mode** - Creation mode, one of 'ingest' (default), 'schema_only', 'append'
420419
* **attr_filters** - FilterList to apply to Attributes: FilterList or Dict[str -> FilterList] for any attribute(s). Unspecified attributes will use default.
421420
* **dim_filters** - FilterList to apply to Dimensions: FilterList or Dict[str -> FilterList] for any dimensions(s). Unspecified dimensions will use default.
422421
* **offsets_filters** - FilterList to apply to all offsets
@@ -457,7 +456,7 @@ def _from_pandas(uri, dataframe, tiledb_args):
457456
mode = tiledb_args.get("mode", "ingest")
458457

459458
if mode != "append" and tiledb.array_exists(uri):
460-
raise TileDBError("Array URI '{}' already exists!".format(uri))
459+
raise TileDBError(f"Array URI '{uri}' already exists!")
461460

462461
sparse = tiledb_args["sparse"]
463462
index_dims = tiledb_args.get("index_dims") or ()
@@ -476,7 +475,7 @@ def _from_pandas(uri, dataframe, tiledb_args):
476475
"Cannot append to dense array without 'row_start_idx'"
477476
)
478477
elif mode != "ingest":
479-
raise TileDBError("Invalid mode specified ('{}')".format(mode))
478+
raise TileDBError(f"Invalid mode specified ('{mode}')")
480479

481480
# TODO: disentangle the full_domain logic
482481
full_domain = tiledb_args.get("full_domain", False)
@@ -696,7 +695,7 @@ def from_csv(uri: str, csv_file: Union[str, List[str]], **kwargs):
696695
* **sparse** - (default True) Create sparse schema
697696
* **index_dims** (``List[str]``) -- List of column name(s) to use as dimension(s) in TileDB array schema. This is the recommended way to create dimensions. (note: the Pandas ``read_csv`` argument ``index_col`` will be passed through if provided, which results in indexes that will be converted to dimnesions by default; however ``index_dims`` is preferred).
698697
* **allows_duplicates** - Generated schema should allow duplicates
699-
* **mode** - (default ``ingest``), Ingestion mode: ``ingest``, ``schema_only``, ``append``
698+
* **mode** - Creation mode, one of 'ingest' (default), 'schema_only', 'append'
700699
* **attr_filters** - FilterList to apply to Attributes: FilterList or Dict[str -> FilterList] for any attribute(s). Unspecified attributes will use default.
701700
* **dim_filters** - FilterList to apply to Dimensions: FilterList or Dict[str -> FilterList] for any dimensions(s). Unspecified dimensions will use default.
702701
* **offsets_filters** - FilterList to apply to all offsets

tiledb/highlevel.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -77,6 +77,14 @@ def from_numpy(uri, array, config=None, ctx=None, **kwargs):
7777
:raises TypeError: cannot convert ``uri`` to unicode string
7878
:raises: :py:exc:`tiledb.TileDBError`
7979
80+
:Keyword Arguments:
81+
82+
* **full_domain** - Dimensions should be created with full range of the dtype (default: False)
83+
* **mode** - Creation mode, one of 'ingest' (default), 'schema_only', 'append'
84+
* **append_dim** - The dimension along which the Numpy array is append (default: 0).
85+
* **start_idx** - The starting index to append to. By default, append to the end of the existing data.
86+
* **timestamp** - Write TileDB array at specific timestamp.
87+
8088
**Example:**
8189
8290
>>> import tiledb, numpy as np, tempfile

tiledb/libtiledb.pyx

Lines changed: 131 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ import html
1111
import sys
1212
import warnings
1313
from collections import OrderedDict
14+
from collections.abc import Sequence
1415

1516
from .ctx import default_ctx
1617
from .filter import FilterList
@@ -240,13 +241,48 @@ def schema_like_numpy(array, ctx=None, **kw):
240241
tiling = regularize_tiling(kw.pop('tile', None), array.ndim)
241242

242243
attr_name = kw.pop('attr_name', '')
243-
dim_dtype = kw.pop('dim_dtype', np.uint64)
244+
dim_dtype = kw.pop('dim_dtype', np.dtype("uint64"))
245+
full_domain = kw.pop('full_domain', False)
244246
dims = []
247+
245248
for (dim_num,d) in enumerate(range(array.ndim)):
246249
# support smaller tile extents by kw
247250
# domain is based on full shape
248251
tile_extent = tiling[d] if tiling else array.shape[d]
249-
domain = (0, array.shape[d] - 1)
252+
if full_domain:
253+
if dim_dtype not in (np.bytes_, np.str_):
254+
# Use the full type domain, deferring to the constructor
255+
dtype_min, dtype_max = dtype_range(dim_dtype)
256+
dim_max = dtype_max
257+
if dim_dtype.kind == "M":
258+
date_unit = np.datetime_data(dim_dtype)[0]
259+
dim_min = np.datetime64(dtype_min, date_unit)
260+
tile_max = np.iinfo(np.uint64).max - tile_extent
261+
if np.uint64(dtype_max - dtype_min) > tile_max:
262+
dim_max = np.datetime64(dtype_max - tile_extent, date_unit)
263+
else:
264+
dim_min = dtype_min
265+
266+
if np.issubdtype(dim_dtype, np.integer):
267+
tile_max = np.iinfo(np.uint64).max - tile_extent
268+
if np.uint64(dtype_max - dtype_min) > tile_max:
269+
dim_max = dtype_max - tile_extent
270+
domain = (dim_min, dim_max)
271+
else:
272+
domain = (None, None)
273+
274+
if np.issubdtype(dim_dtype, np.integer) or dim_dtype.kind == "M":
275+
# we can't make a tile larger than the dimension range or lower than 1
276+
tile_extent = max(1, min(tile_extent, np.uint64(dim_max - dim_min)))
277+
elif np.issubdim_dtype(dim_dtype, np.floating):
278+
# this difference can be inf
279+
with np.errstate(over="ignore"):
280+
dim_range = dim_max - dim_min
281+
if dim_range < tile_extent:
282+
tile_extent = np.ceil(dim_range)
283+
else:
284+
domain = (0, array.shape[d] - 1)
285+
250286
dims.append(Dim(domain=domain, tile=tile_extent, dtype=dim_dtype, ctx=ctx))
251287

252288
var = False
@@ -4240,7 +4276,7 @@ cdef class DenseArrayImpl(Array):
42404276
def __init__(self, *args, **kw):
42414277
super().__init__(*args, **kw)
42424278
if self.schema.sparse:
4243-
raise ValueError("Array at {} is not a dense array".format(self.uri))
4279+
raise ValueError(f"Array at {self.uri} is not a dense array")
42444280
return
42454281

42464282
@staticmethod
@@ -4250,20 +4286,38 @@ cdef class DenseArrayImpl(Array):
42504286
"""
42514287
if not ctx:
42524288
ctx = default_ctx()
4289+
4290+
mode = kw.pop("mode", "ingest")
4291+
timestamp = kw.pop("timestamp", None)
42534292

4254-
# pop the write timestamp before creating schema
4255-
timestamp = kw.pop('timestamp', None)
4256-
4257-
schema = schema_like_numpy(array, ctx=ctx, **kw)
4258-
Array.create(uri, schema)
4293+
if mode not in ("ingest", "schema_only", "append"):
4294+
raise TileDBError(f"Invalid mode specified ('{mode}')")
42594295

4296+
if mode in ("ingest", "schema_only"):
4297+
try:
4298+
with Array.load_typed(uri):
4299+
raise TileDBError(f"Array URI '{uri}' already exists!")
4300+
except TileDBError:
4301+
pass
4302+
4303+
if mode == "append":
4304+
kw["append_dim"] = kw.get("append_dim", 0)
4305+
if ArraySchema.load(uri).sparse:
4306+
raise TileDBError("Cannot append to sparse array")
4307+
4308+
if mode in ("ingest", "schema_only"):
4309+
schema = schema_like_numpy(array, ctx=ctx, **kw)
4310+
Array.create(uri, schema)
4311+
4312+
if mode in ("ingest", "append"):
4313+
kw["mode"] = mode
4314+
with DenseArray(uri, mode='w', ctx=ctx, timestamp=timestamp) as arr:
4315+
# <TODO> probably need better typecheck here
4316+
if array.dtype == object:
4317+
arr[:] = array
4318+
else:
4319+
arr.write_direct(np.ascontiguousarray(array), **kw)
42604320

4261-
with DenseArray(uri, mode='w', ctx=ctx, timestamp=timestamp) as arr:
4262-
# <TODO> probably need better typecheck here
4263-
if array.dtype == object:
4264-
arr[:] = array
4265-
else:
4266-
arr.write_direct(np.ascontiguousarray(array))
42674321
return DenseArray(uri, mode='r', ctx=ctx)
42684322

42694323
def __len__(self):
@@ -4687,7 +4741,7 @@ cdef class DenseArrayImpl(Array):
46874741
return array.astype(dtype)
46884742
return array
46894743

4690-
def write_direct(self, np.ndarray array not None):
4744+
def write_direct(self, np.ndarray array not None, **kw):
46914745
"""
46924746
Write directly to given array attribute with minimal checks,
46934747
assumes that the numpy array is the same shape as the array's domain
@@ -4698,6 +4752,10 @@ cdef class DenseArrayImpl(Array):
46984752
:raises: :py:exc:`tiledb.TileDBError`
46994753
47004754
"""
4755+
append_dim = kw.pop("append_dim", None)
4756+
mode = kw.pop("mode", "ingest")
4757+
start_idx = kw.pop("start_idx", None)
4758+
47014759
if not self.isopen or self.mode != 'w':
47024760
raise TileDBError("DenseArray is not opened for writing")
47034761
if self.schema.nattr != 1:
@@ -4715,6 +4773,7 @@ cdef class DenseArrayImpl(Array):
47154773

47164774
cdef void* buff_ptr = np.PyArray_DATA(array)
47174775
cdef uint64_t buff_size = array.nbytes
4776+
cdef np.ndarray subarray = np.zeros(2*array.ndim, np.uint64)
47184777

47194778
use_global_order = self.ctx.config().get("py.use_global_order_1d_write", False) == "true"
47204779

@@ -4733,13 +4792,69 @@ cdef class DenseArrayImpl(Array):
47334792
rc = tiledb_query_set_layout(ctx_ptr, query_ptr, layout)
47344793
if rc != TILEDB_OK:
47354794
_raise_ctx_err(ctx_ptr, rc)
4736-
rc = tiledb_query_set_buffer(ctx_ptr, query_ptr, attr_name_ptr, buff_ptr, &buff_size)
4795+
4796+
range_start_idx = start_idx or 0
4797+
for n in range(array.ndim):
4798+
subarray[n*2] = range_start_idx
4799+
subarray[n*2 + 1] = array.shape[n] + range_start_idx - 1
4800+
4801+
if mode == "append":
4802+
with Array.load_typed(self.uri) as A:
4803+
ned = A.nonempty_domain()
4804+
4805+
if array.ndim <= append_dim:
4806+
raise IndexError("`append_dim` out of range")
4807+
4808+
if array.ndim != len(ned):
4809+
raise ValueError(
4810+
"The number of dimension of the TileDB array and "
4811+
"Numpy array to append do not match"
4812+
)
4813+
4814+
for n in range(array.ndim):
4815+
if n == append_dim:
4816+
if start_idx is not None:
4817+
range_start_idx = start_idx
4818+
range_end_idx = array.shape[n] + start_idx -1
4819+
else:
4820+
range_start_idx = ned[n][1] + 1
4821+
range_end_idx = array.shape[n] + ned[n][1]
4822+
4823+
subarray[n*2] = range_start_idx
4824+
subarray[n*2 + 1] = range_end_idx
4825+
else:
4826+
if array.shape[n] != ned[n][1] - ned[n][0] + 1:
4827+
raise ValueError(
4828+
"The input Numpy array must be of the same "
4829+
"shape as the TileDB array, exluding the "
4830+
"`append_dim`, but the Numpy array at index "
4831+
f"{n} has {array.shape[n]} dimension(s) and "
4832+
f"the TileDB array has {ned[n][1]-ned[n][0]}."
4833+
)
4834+
4835+
rc = tiledb_query_set_subarray(
4836+
ctx_ptr,
4837+
query_ptr,
4838+
<void*>np.PyArray_DATA(subarray)
4839+
)
47374840
if rc != TILEDB_OK:
47384841
_raise_ctx_err(ctx_ptr, rc)
4842+
4843+
rc = tiledb_query_set_buffer(
4844+
ctx_ptr,
4845+
query_ptr,
4846+
attr_name_ptr,
4847+
buff_ptr,
4848+
&buff_size
4849+
)
4850+
if rc != TILEDB_OK:
4851+
_raise_ctx_err(ctx_ptr, rc)
4852+
47394853
with nogil:
47404854
rc = tiledb_query_submit(ctx_ptr, query_ptr)
47414855
if rc != TILEDB_OK:
47424856
_raise_ctx_err(ctx_ptr, rc)
4857+
47434858
with nogil:
47444859
rc = tiledb_query_finalize(ctx_ptr, query_ptr)
47454860
if rc != TILEDB_OK:

0 commit comments

Comments
 (0)