2.0.0: Dense arrays, large dataset optimization, attribute/metric filters

@kylemann16 released this 05 Feb 22:18
· 1 commit to main since this release
48e7075

Functionality Changes

  • SilviMetric now writes to a dense TileDB array. Dense arrays let us take better advantage of what TileDB offers, with few drawbacks.
  • Attribute and Metric Filters. We now write Attribute and Metric data with TileDB's ZstdFilter, with the level set to 7. Variable-length arrays now take advantage of the PositiveOffsetFilter. These changes reduce the size of the output data.
  • Storage config now requires xsize and ysize variables to indicate how large the extents of TileDB tiles should be. This is in response to memory problems in TileDB when the tile size was unspecified.
  • Updated info call:
    • added a metrics option to the cli
    • fixed history output
    • removed unnecessary info from the concise output
  • Updated extract call:
    • handle_overlaps speedup
    • removed extent indexing, which was too slow; extents can be obtained in other ways
  • Added start_datetime and end_datetime to the TileDB attributes being written, in a similar fashion to count
    • This allows users to query by start and end time
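
To illustrate how xsize and ysize tile extents partition a grid, here is a minimal sketch in plain Python. It is not SilviMetric's implementation: the function name `tile_chunks`, the cell-index bounds, and the example grid dimensions are all assumptions made for illustration.

```python
from typing import Iterator, Tuple

# (x_start, x_end, y_start, y_end) in cell indices, end-exclusive.
Bounds = Tuple[int, int, int, int]


def tile_chunks(x_cells: int, y_cells: int, xsize: int, ysize: int) -> Iterator[Bounds]:
    """Yield tile-aligned chunks covering an x_cells by y_cells grid.

    Hypothetical helper: xsize and ysize play the role of the tile
    extents from the storage config.
    """
    if xsize <= 0 or ysize <= 0:
        raise ValueError("xsize and ysize must be positive")
    for y0 in range(0, y_cells, ysize):
        for x0 in range(0, x_cells, xsize):
            # Tiles on the far edges are clipped to the grid boundary.
            yield (x0, min(x0 + xsize, x_cells), y0, min(y0 + ysize, y_cells))


# Example: a 250 x 100 cell grid with 100 x 100 tiles.
chunks = list(tile_chunks(x_cells=250, y_cells=100, xsize=100, ysize=100))
for bounds in chunks:
    print(bounds)
```

Each chunk touches exactly one tile, so a process that works through one chunk at a time only needs to hold a single tile's worth of cells in memory.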

Behind the Scenes

  • SilviMetric no longer writes to a specific timestamp for each write, which turns out to be a TileDB anti-pattern. We now write at the current timestamp and store start and end timestamp attributes for the collection dates of the data. These attributes can be queried with normal TileDB operations.
  • Deletions will now be overwrites. TileDB dense arrays don't support deletion operations, so we'll instead be writing new data at the current timestamp over the old data.
  • To operate better on larger datasets, SilviMetric now works in chunks the size of the TileDB x and y sizes (see the note in Functionality Changes about StorageConfig changes). This leaves very little need to consolidate commits to the array and should improve speed and memory usage.
  • Updated metrics:
    • added nan_value member variable
    • added nan_policy member variable
    • added logic to handle bad return values and bad dependency values depending on nan_policy
  • Updated storage config to convert a relative tdb dir path to an absolute path
  • Adjusted aad metrics to use variables that were already created
  • Added nan handling to several metrics where it was possible
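
As a rough sketch of how nan_value and nan_policy members could work together, the example below uses a hypothetical `SketchMetric` class. It is not SilviMetric's Metric class, and the policy names `"propagate"` and `"raise"` are assumptions. It covers only bad return values, not bad dependency values.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SketchMetric:
    """Hypothetical stand-in for a metric; not SilviMetric's actual class."""

    name: str
    method: Callable[[Sequence[float]], float]
    nan_value: float = math.nan    # value substituted for a bad result
    nan_policy: str = "propagate"  # assumed options: "propagate" or "raise"

    def run(self, values: Sequence[float]) -> float:
        try:
            result = self.method(values)
        except (ValueError, ZeroDivisionError):
            # Treat a failed computation the same as a NaN result.
            result = math.nan
        if math.isnan(result):
            if self.nan_policy == "raise":
                raise ValueError(f"{self.name} produced a bad value")
            return self.nan_value
        return result


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


lenient = SketchMetric("mean", mean, nan_value=-9999.0)
strict = SketchMetric("mean", mean, nan_policy="raise")

good = lenient.run([1.0, 2.0, 3.0])  # normal result
filled = lenient.run([])             # empty cell: falls back to nan_value
```

With the `"raise"` policy, `strict.run([])` raises a `ValueError` instead of writing the fill value, which can be useful when a bad value should stop processing rather than be stored.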

Full Changelog: 1.4.0...2.0.0