Adjustments for large datasets #124

kylemann16 · 2025-11-19T15:12:01Z

This PR collects the changes that make SilviMetric easier to use with large datasets.

The biggest change is the move from sparse to dense arrays in TileDB, along with the changes needed to make that possible.

- Deletions are now overwrites.
- SM no longer writes to a specific timestamp for a write, which turns out to be a TileDB anti-pattern. It now writes at the current timestamp and stores start and end timestamp attributes for the collection dates of the data. These attributes can be queried with normal TileDB operations.
- Storage config now requires xsize and ysize values that set how large the TileDB tile extents should be. This was added in response to memory problems in TileDB when the extent was left unspecified.
- Many changes and iterations on consolidation to make writing and reading more efficient.

hobu · 2025-11-19T15:20:03Z

src/silvimetric/cli/cli.py

            filtered['metrics'] = ms

-        app.log.info(json.dumps(filtered, indent=2))
+        print(json.dumps(filtered, indent=2))


are we not capturing to a redirectable log?

The logging function makes using things like jq impossible since it prepends some boilerplate to stdout
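
One way to keep both behaviours is to send log records to stderr and reserve stdout for the JSON payload, so the output stays pipeable. The sketch below is only an illustration of that split; `emit_info` and the logger name are hypothetical and not part of SilviMetric.

```python
import json
import logging
import sys

# Route log records to stderr so stdout carries only the JSON payload.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)
log = logging.getLogger('info_demo')  # hypothetical logger name


def emit_info(filtered: dict) -> None:
    """Log a diagnostic line, then print the machine-readable result."""
    log.info('Emitting info with %d top-level keys', len(filtered))
    # print() writes to stdout without any logging prefix.
    print(json.dumps(filtered, indent=2))
```

With this split, redirecting or piping stdout yields clean JSON, while the log lines can still be captured separately by redirecting stderr.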

hobu · 2025-11-19T15:20:40Z

src/silvimetric/cli/cli.py

+    '--xsize', type=float, default=1000, help='TileDB X Tile size.'
+)
+@click.option(
+    '--ysize', type=float, default=1000, help='TileDB Y Tile size.'


Update the help text to say what units the size is in.

This was actually a question I meant to ask you about, since these are not referencing units, but the number of cells in X/Y direction to make up a TileDB Tile Extent. I know x/ysize is a bit overloaded, but didn't know what else to call it.

it's really a pixel size, right?

No, it's a TileDB Space Tile for a dense array. Essentially tells TileDB what the size of a working unit is, and tells SilviMetric how to split stuff up for more efficient TileDB IO
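
To make the "number of cells per tile" reading concrete, here is a minimal sketch of splitting a grid into windows aligned to the tile extent. It is not SilviMetric's implementation: `tile_windows` is a hypothetical helper, and it assumes xsize and ysize are integer cell counts, as described in the reply above.

```python
from typing import Iterator


def tile_windows(
    nx: int, ny: int, xsize: int, ysize: int
) -> Iterator[tuple[int, int, int, int]]:
    """Yield (x0, x1, y0, y1) half-open cell-index windows aligned to tiles.

    nx, ny are the grid dimensions in cells; xsize, ysize are the tile
    extents in cells (assumed to be integers in this sketch).
    """
    if xsize <= 0 or ysize <= 0:
        raise ValueError('tile extents must be positive cell counts')
    for y0 in range(0, ny, ysize):
        for x0 in range(0, nx, xsize):
            # Clamp the last row and column of tiles to the grid edge.
            yield x0, min(x0 + xsize, nx), y0, min(y0 + ysize, ny)


# A 2500 x 1200 cell grid with 1000 x 1000 tiles yields 3 x 2 windows.
windows = list(tile_windows(nx=2500, ny=1200, xsize=1000, ysize=1000))
```

Each window can then be read or written as one unit, so no single operation straddles a tile boundary.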

hobu · 2025-11-19T15:21:17Z

src/silvimetric/cli/cli.py

            with performance_report(report_path):
                shatter.shatter(config)
-            print(f'Writing report to {report_path}.')
+            app.log.debug(f'Writing report to {report_path}.')


last time we were changing app.log.debug -> print...

hobu · 2025-11-19T15:22:49Z

src/silvimetric/cli/common.py

    # TODO add import similar to metrics
    def convert(self, value, param, ctx) -> list[Attribute]:
-        if isinstance(value, list):
+        attrs: set[Attribute] = set()


TODO: Add more comments telling the reader what we're doing here. It looks like we're making a set of attribute instances but defaulting to some common ones?

hobu · 2025-11-19T15:24:43Z

src/silvimetric/cli/common.py

-                    elif val == 'grid_metrics':
-                        metrics.update(list(grid_metrics.get_grid_metrics().values()))
+                    elif 'grid_metric' in val:
+                        args = val.split('_')


Are we parsing parameters out of grid metric names?

hobu · 2025-11-19T15:26:54Z

src/silvimetric/commands/extract.py

        root=root_bounds,
    )
+    cell_size = 0
+    for a in config.attrs:


loopity loop loop loop ...

Can't this be set as we're creating stuff?

Yeah, could add this as an attribute on the storage object

hobu · 2025-11-19T15:27:39Z

src/silvimetric/commands/extract.py

-        # TODO should output in sections so we don't run into memory problems
-        dtype = final[ma].dtype
+        dtype = schema.attr(ma).dtype
+        nan_val = -9999 if dtype.kind in ['i', 'f'] else 0


nan defaults to -9999 for PDAL, but the users could and quite often will set this to something else. You should parameterize this in a config somewhere.
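
A minimal sketch of what parameterizing this could look like, assuming the value lives on some options object. `ExtractOptions` and `fill_value` are hypothetical names for illustration, not existing SilviMetric classes or functions; the dtype check mirrors the one in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractOptions:  # hypothetical config holder
    # Default taken from the diff above; users can override it.
    nodata: float = -9999.0


def fill_value(dtype_kind: str, opts: ExtractOptions) -> float:
    """Return the fill value for a NumPy-style dtype kind code."""
    # 'i' is signed integer and 'f' is floating point, as in the diff.
    return opts.nodata if dtype_kind in ('i', 'f') else 0.0


default_fill = fill_value('f', ExtractOptions())
custom_fill = fill_value('i', ExtractOptions(nodata=-1.0))
```

Where the option is exposed (CLI flag, storage config, or application config) is left open here.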

hobu · 2025-11-19T15:31:23Z

src/silvimetric/commands/scan.py

        next_split_x = (maxx - minx) / 2
        next_split_y = (maxy - miny) / 2

+        def end_early():


describe this method and why it is useful/needed.

hobu · 2025-11-19T15:32:41Z

src/silvimetric/commands/shatter.py

-        config.finished = True
+        # TODO make this a batched operation of like 1000 tasks at a time
+        # similar to how scan does it
+        leaf_batch = list(itertools.batched(leaves, 2000))


2000 seems like a magical number you might want to change if you were running down performance issues, yeah?

Yeah most definitely. I can add it to the app config.
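
As a sketch of that idea, the batch size can be a named, overridable value rather than a literal. The helper below also avoids `itertools.batched`, which only exists in Python 3.12 and later (a later commit in this PR removes it for Python 3.11 compatibility). `DEFAULT_BATCH_SIZE` and this `batched` helper are hypothetical names; unlike `itertools.batched`, it yields lists rather than tuples.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar('T')

# Value taken from the diff; in practice it would come from the app config.
DEFAULT_BATCH_SIZE = 2000


def batched(items: Iterable[T], n: int = DEFAULT_BATCH_SIZE) -> Iterator[list[T]]:
    """Yield successive lists of at most n items from any iterable."""
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(items)
    # islice pulls up to n items per pass; an empty list ends the loop.
    while batch := list(islice(it, n)):
        yield batch


# 4500 items in batches of 2000 gives sizes 2000, 2000, 500.
leaf_batches = list(batched(range(4500), n=2000))
```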

kylemann16 added 30 commits June 18, 2025 14:31

WIP changing time slot/timestamp interactions with database

ad088c9

commiting before attempting to switch up config writing

c876633

fixing tests

555223d

removing commented out code

80c703f

wip

4bbbc54

wip dense arrays with new tiledb setup

360980d

changes to cli to better accept list of attributes and cover grid_metrics generation better

0d7c0b9

wip

41f625a

wip, pushing so we can access from pip install

d99e88d

wip, adding other consolidation modes

71001a2

adding vacuuming to its own function in storage

5259955

moving around consolidation

3e97ecc

lower consolidation increment

38be163

removing an unnecessary mutex

250bf6a

wip

9dd744e

changing to hilbert ordering and adding sort=False for groupby usage for potential speedup

1914731

pushing consolidation to client queue

374a0c6

changing consolidation to fire_and_forget

cc33818

moving around usage of dask, attempting to speed up some long running methods

546587d

simplifying process, making small fixes

26c80c2

remove numba import

df21fb1

tweaks

0dc171b

intermediate commit

d885392

trying new chunking methods

a28a5d7

fixing empty list bug

3c29aed

fixing empty write bug

9e83cd3

fixing chunking

b9f8b46

lowering consolidation to 300MB

0a6d382

made timestamp usage more consistent, fixed grid metrics to use copies,

fea27bb

made info and get_history have fewer required parameters

consolidate in a dask task

4e2b6a9

kylemann16 added 22 commits October 1, 2025 11:44

moving extraction to use latest input for arrays with duplicate values

6412b60

Merge commit 'dc5730642b11c33662bb9f87bd6f7b9c17faa338' into test_merge

e28fd35

dense storage largely working. removing delete still in progress

bfeb8bc

Merge branch 'test_merge' into big_test_dense

44f546c

update gitignore to include vscode workspace files

532f7f3

delete feature working as well as it can

adf0951

committing before dim changeup

ea58feb

dense working with configs, shatter, commands, extract, info

74ad9bd

cli and tests happy with dense changes and move to date attributes

f89c509

getting rid of allowed dims for now

3c1fd35

removing vacuum after every extent, fixing index error from pdal python return

211ac87

wip

9fd255d

wip

d8a91e0

updating environment to latest tiledb for mem fix

6d46ae3

fixing env usage

6dcce26

env update

43328ab

debugged tiledb and bad access problems

e118bba

wip

7e00f84

demo changes for tiling according to tiledb accurately

1282ec6

move deletion to overwriting mode

287dacc

tests working again

79dcc10

linting and test updates

54ce4a5

hobu reviewed Nov 19, 2025

kylemann16 added 5 commits November 20, 2025 15:12

fixing missing file path and linting

ed7a9d6

fix method usage in test_remote_creation

c7e05d9

removing itertools.batched for now so that sm works with python3.11

01d415a

using np.isclose to better compare floats

3c445c0

pin jupyter-book to <2.0.0

5959501

kylemann16 merged commit 004b506 into main Dec 16, 2025
15 of 21 checks passed

kylemann16 mentioned this pull request Dec 16, 2025

SilviMetric should switch to Dense TileDB arrays #116

Closed

3 participants