# Add summaries for benchmarking and tools for working with chunked datasets (#1)
This textbook gives scientists:
- a guide to designing parallel processing algorithms to work efficiently with chunked datasets
- a guide to exporting chunked datasets to other 'traditional' formats
## Benchmarking for Zarr
We created a [set of benchmarks](https://github.com/HEFTIEProject/zarr-benchmarks) for reading and writing Zarr data with a range of different configurations. These benchmarks provide guidance on how the choice of configuration affects data size and read/write performance. The parameters we varied, and the headline result for each, were (a configuration sketch based on these results follows the list):
- Type of image

  > **Contributor comment:** Here you're mixing the different parameters (e.g. type of image = heart / dense / sparse) with benchmarking results (e.g. compressor = blosc-zstd provides the best compression ratio). Maybe just list the different parameter options here, and link to the final report for results? Not sure what @dstansby's preference is.

  - Heart: a HiP-CT scan of a heart from the Human Organ Atlas
  - Dense: segmented neurons from electron microscopy
  - Sparse: a few selected segmented neurons from electron microscopy
- Software libraries
  - TensorStore (fastest for both reading and writing data)
  - zarr-python version 3
  - zarr-python version 2 (slowest for both reading and writing data)
- Compressor
  - blosc-zstd provides the best compression ratio for both image and segmentation data (the options were blosc-blosclz, blosc-lz4, blosc-lz4hc, blosc-zlib and blosc-zstd, as well as gzip and zstd)
- Compression level
  - Setting compression levels beyond ~3 results in slightly better compression but much longer write times. Compression level does not affect read time.
- Shuffle
  - Setting the shuffle option increases data compression with no adverse effect on read/write times (shuffle, bitshuffle and noshuffle were the three options)
- Zarr format version
  - There was no noticeable difference between Zarr format 2 and Zarr format 3 data
- Chunk size
  - Setting a low chunk size (below around 90) has an adverse effect on read and write times
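As a concrete illustration of the results above, the sketch below shows how the favoured settings (blosc-zstd, compression level around 3, shuffle enabled, reasonably large chunks) might be applied when writing an image with zarr-python version 3. This is a minimal example written for this summary rather than code taken from the benchmark repository; the store path, array name, shape and chunk size are illustrative assumptions.

```python
# Minimal sketch (not from the zarr-benchmarks repository): writing an image
# with zarr-python v3 using the configuration the benchmarks favoured.
# The store path, array name, shape and chunk size are illustrative assumptions.
import numpy as np
import zarr
from zarr.codecs import BloscCodec, BloscShuffle

# Stand-in for a real image volume (e.g. a HiP-CT scan)
image = np.random.randint(0, 2**16, size=(256, 256, 256), dtype="uint16")

array = zarr.create_array(
    store="heart_scan.zarr",
    name="image",
    shape=image.shape,
    dtype=image.dtype,
    chunks=(128, 128, 128),  # avoid very small chunk sizes (below ~90)
    compressors=BloscCodec(
        cname="zstd",                  # blosc-zstd gave the best compression ratio
        clevel=3,                      # higher levels mostly just slow down writes
        shuffle=BloscShuffle.shuffle,  # shuffle improved compression at no read/write cost
    ),
)
array[:] = image
```

Reading the same data back with TensorStore (the fastest library in the benchmarks) might then look like this, assuming the array was written in Zarr format 3 (the zarr-python v3 default):

```python
import tensorstore as ts

dataset = ts.open({
    "driver": "zarr3",  # Zarr format 3 driver
    "kvstore": {"driver": "file", "path": "heart_scan.zarr/image"},
}).result()
data = dataset.read().result()  # returns a NumPy array
```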
## Tools for working with chunked datasets
Contributions have been made to the zarr-python repository:

> **Contributor comment:** Do we need a brief description of what each of these contributions does and why it was useful? Again, not sure how much detail we need to go into here.
- [Add CLI for converting v2 metadata to v3](https://github.com/zarr-developers/zarr-python/pull/3257)
- [Added ArrayNotFoundError](https://github.com/zarr-developers/zarr-python/pull/3367)

> **Contributor comment:** Not sure if we want to include them, but I did do a couple of other minor PRs:
- [Better document acceptable values for StoreLike](https://github.com/zarr-developers/zarr-python/pull/3480)
PRs have also been opened in the zarr-python repository:
- [Prevent creation of arrays/groups under a parent array](https://github.com/zarr-developers/zarr-python/pull/3407)

> **Contributor comment:** This PR (prevent creation of arrays/groups under a parent array) has been merged now, so it can move up into 'contributions made'.
- [Holding space - LRUStoreCache]
Issues have also been opened in other repositories:

> **Contributor comment:** Only issues have been opened for these.
- [Document supported file formats for dask_image.imread](https://github.com/dask/dask-image/issues/407)
- [Document supported file formats for skimage.io](https://github.com/scikit-image/scikit-image/issues/7879)
## Improvements to cloud visualisation
## Acknowledgements

> **Contributor comment:** Link to the final report? e.g. 'The final benchmarking results are summarised into a short report to help researchers choose the best options for their image datasets.'