-
Notifications
You must be signed in to change notification settings - Fork 4
Adjustments for large datasets #124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…rics generation better
…for potential speedup
made info and get_history have fewer required parameters
| filtered['metrics'] = ms | ||
|
|
||
| app.log.info(json.dumps(filtered, indent=2)) | ||
| print(json.dumps(filtered, indent=2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we not capturing to a redirectable log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logging function makes using things like jq impossible since it prepends some boilerplate to stdout
| '--xsize', type=float, default=1000, help='TileDB X Tile size.' | ||
| ) | ||
| @click.option( | ||
| '--ysize', type=float, default=1000, help='TileDB Y Tile size.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update help text to tell us size in what units
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually a question I meant to ask you about, since these are not referencing units, but the number of cells in X/Y direction to make up a TileDB Tile Extent. I know x/ysize is a bit overloaded, but didn't know what else to call it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's really a pixel size, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's a TileDB Space Tile for a dense array. Essentially tells TileDB what the size of a working unit is, and tells SilviMetric how to split stuff up for more efficient TileDB IO
| with performance_report(report_path): | ||
| shatter.shatter(config) | ||
| print(f'Writing report to {report_path}.') | ||
| app.log.debug(f'Writing report to {report_path}.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last time we were changing app.log.debug -> print...
| # TODO add import similar to metrics | ||
| def convert(self, value, param, ctx) -> list[Attribute]: | ||
| if isinstance(value, list): | ||
| attrs: set[Attribute] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Add more comments tell the reader what we're doing here. Looks like we're making a set of attribute instances but defaulting to some common ones?
| elif val == 'grid_metrics': | ||
| metrics.update(list(grid_metrics.get_grid_metrics().values())) | ||
| elif 'grid_metric' in val: | ||
| args = val.split('_') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we parsing parameters out of grid metric names?
| root=root_bounds, | ||
| ) | ||
| cell_size = 0 | ||
| for a in config.attrs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loopity loop loop loop ...
Can't this be set as we're creating stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, could add this as an attribute on the storage object
| # TODO should output in sections so we don't run into memory problems | ||
| dtype = final[ma].dtype | ||
| dtype = schema.attr(ma).dtype | ||
| nan_val = -9999 if dtype.kind in ['i', 'f'] else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nan defaults to -9999 for PDAL, but the users could and quite often will set this to something else. You should parameterize this in a config somewhere.
| next_split_x = (maxx - minx) / 2 | ||
| next_split_y = (maxy - miny) / 2 | ||
|
|
||
| def end_early(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
describe this method and why it is useful/needed.
src/silvimetric/commands/shatter.py
Outdated
| config.finished = True | ||
| # TODO make this a batched operation of like 1000 tasks at a time | ||
| # similar to how scan does it | ||
| leaf_batch = list(itertools.batched(leaves, 2000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2000 seems like a magical number you might want to change if you were running down performance issues, yeah?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah most definitely. I can add it to the app config.
This PR represents all of the changes to SilviMetric to make it easier to work with for large datasets.
The biggest of these is moving from sparse to dense arrays in tiledb and the necessary changes to make this possible.
xsizeandysizevariable to indicate how big the extents of tiledb tiles should be. This was in response to memory problems from tiledb when it was unspecified.