Skip to content

Conversation

@kylemann16
Copy link
Collaborator

This PR represents all of the changes to SilviMetric to make it easier to work with for large datasets.

The biggest of these is moving from sparse to dense arrays in tiledb and the necessary changes to make this possible.

  • Deletions are now overwrites
  • SM no longer writes to a specific timestamp for a write, this turns out to be a TileDB anti-pattern. We now write to the current timestamp and write a start and end timestamp attribute for collection dates of data. These attributes can be queried with normal tiledb operations.
  • Storage config now requires a xsize and ysize variable to indicate how big the extents of tiledb tiles should be. This was in response to memory problems from tiledb when it was unspecified.
  • Lots of changes and iterations on consolidation efforts to make writing and reading more efficient

filtered['metrics'] = ms

app.log.info(json.dumps(filtered, indent=2))
print(json.dumps(filtered, indent=2))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we not capturing to a redirectable log?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging function makes using things like jq impossible since it prepends some boilerplate to stdout

'--xsize', type=float, default=1000, help='TileDB X Tile size.'
)
@click.option(
'--ysize', type=float, default=1000, help='TileDB Y Tile size.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update help text to tell us size in what units

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually a question I meant to ask you about, since these are not referencing units, but the number of cells in X/Y direction to make up a TileDB Tile Extent. I know x/ysize is a bit overloaded, but didn't know what else to call it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's really a pixel size, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's a TileDB Space Tile for a dense array. Essentially tells TileDB what the size of a working unit is, and tells SilviMetric how to split stuff up for more efficient TileDB IO

with performance_report(report_path):
shatter.shatter(config)
print(f'Writing report to {report_path}.')
app.log.debug(f'Writing report to {report_path}.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last time we were changing app.log.debug -> print...

# TODO add import similar to metrics
def convert(self, value, param, ctx) -> list[Attribute]:
if isinstance(value, list):
attrs: set[Attribute] = set()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Add more comments tell the reader what we're doing here. Looks like we're making a set of attribute instances but defaulting to some common ones?

elif val == 'grid_metrics':
metrics.update(list(grid_metrics.get_grid_metrics().values()))
elif 'grid_metric' in val:
args = val.split('_')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we parsing parameters out of grid metric names?

root=root_bounds,
)
cell_size = 0
for a in config.attrs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loopity loop loop loop ...

Can't this be set as we're creating stuff?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, could add this as an attribute on the storage object

# TODO should output in sections so we don't run into memory problems
dtype = final[ma].dtype
dtype = schema.attr(ma).dtype
nan_val = -9999 if dtype.kind in ['i', 'f'] else 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nan defaults to -9999 for PDAL, but the users could and quite often will set this to something else. You should parameterize this in a config somewhere.

next_split_x = (maxx - minx) / 2
next_split_y = (maxy - miny) / 2

def end_early():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

describe this method and why it is useful/needed.

config.finished = True
# TODO make this a batched operation of like 1000 tasks at a time
# similar to how scan does it
leaf_batch = list(itertools.batched(leaves, 2000))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2000 seems like a magical number you might want to change if you were running down performance issues, yeah?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah most definitely. I can add it to the app config.

@kylemann16 kylemann16 merged commit 004b506 into main Dec 16, 2025
15 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants