Closed
Description
After receiving some advice on using TileDB, we have decided to switch to Dense Arrays for our TileDB usage. This is being worked on in #114
Key points
- TileDB from_pandas cannot handle flexible ingestion, and requires an entire row for each insertion to the database. This would probably not be feasible for any larger projects, and would require a significant rework of our chunking/tiling.
  - I made an issue in TileDB-py about this and implemented a solution, but have received no response.
- SilviMetric currently adapts extract information based on all of the shatters that have happened in the past, combining any overlapped data and rerunning those cells.
  - TileDB Dense Arrays will only show you the most recent information that was input to a cell.
  - Sparse arrays allowed for duplication, so each cell value is essentially an array with one entry per shatter process that touched that cell, and we can rerun easily with the combined point data if necessary.
  - Dense arrays do not allow for duplication, so the only way to get values from separate shatter processes is to iterate through the shatter entries (time travel), combine all of those separate entries into one dataframe, and then apply the same logic as with the sparse arrays.
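To make the sparse/dense difference concrete, here is a minimal pure-Python sketch of the read behavior described above. It does not call TileDB at all: the `shatters` data, the function names, and the integer timestamps are all hypothetical stand-ins, and the "time travel" step is only modeled as reading one write at a time.

```python
from collections import defaultdict

# Hypothetical shatter runs: (timestamp, {cell: [point values]}).
shatters = [
    (1, {(0, 0): [1.0, 2.0], (0, 1): [5.0]}),
    (2, {(0, 0): [3.0]}),
]


def sparse_read(shatters):
    """Sparse-style model: duplicates are kept, one entry per shatter run."""
    cells = defaultdict(list)
    for _, written in shatters:
        for cell, values in written.items():
            cells[cell].append(values)
    return dict(cells)


def dense_read(shatters, start=None, end=None):
    """Dense-style model: within the time window, a later write replaces an earlier one."""
    cells = {}
    for ts, written in sorted(shatters, key=lambda s: s[0]):
        if (start is None or ts >= start) and (end is None or ts <= end):
            cells.update(written)
    return cells


def dense_read_combined(shatters):
    """Model of the workaround: read each shatter's window separately, then merge."""
    combined = defaultdict(list)
    for ts, _ in sorted(shatters, key=lambda s: s[0]):
        for cell, values in dense_read(shatters, start=ts, end=ts).items():
            combined[cell].extend(values)
    return dict(combined)


sparse_view = sparse_read(shatters)
dense_view = dense_read(shatters)
combined_view = dense_read_combined(shatters)
```

In this model the plain dense read loses the first shatter's points for cell `(0, 0)`, while the per-window loop recovers the same point data the sparse read keeps by default, at the cost of one read per shatter entry.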
Conclusions:
- Because from_pandas is not flexible when dealing with dense arrays and we've received no response from the TileDB team, we'll be switching back to the previous usage, which unfortunately means having to work around TileDB's array requirements, documented here
Question still to be answered:
Should the value of a cell when we do extract be...
1. The value that was most recently shattered?
2. The value that is the combination of all processes that touched this cell?
Thoughts @bmcgaughey1 and @hobu ?
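For reference while deciding, a small sketch of how the two options can give different results for the same cell. The `cell_history` values, the function names, and the use of a mean as the metric are illustrative assumptions, not SilviMetric's actual extract code.

```python
from statistics import mean

# Hypothetical history for one cell: one list of point values per shatter run, oldest first.
cell_history = [[10.0, 12.0], [20.0]]


def extract_most_recent(history, metric=mean):
    """Option 1: compute the metric from the most recent shatter only."""
    return metric(history[-1])


def extract_combined(history, metric=mean):
    """Option 2: pool the points from every shatter that touched the cell, then compute."""
    points = [value for run in history for value in run]
    return metric(points)


most_recent_value = extract_most_recent(cell_history)
combined_value = extract_combined(cell_history)
```

With this made-up history, option 1 reports only the latest run, while option 2 weights every point from every run equally, so overlapping shatters change the answer.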