Commit 8c71dc7

Merge pull request #1 from HEFTIEProject/rmg/benchmarks
Add summaries for benchmarking and tools for working with chunked datasets

2 parents 3c6c8a1 + 1a0983b
README.md

Lines changed: 40 additions & 0 deletions
@@ -13,8 +13,48 @@ This textbook gives scientists:

- a guide to designing parallel processing algorithms to work efficiently with chunked datasets
- a guide to exporting chunked datasets to other 'traditional' datasets

## Benchmarking for Zarr

We created a [set of benchmarks](https://github.com/HEFTIEProject/zarr-benchmarks) for reading and writing data to Zarr with a range of different configurations. These benchmarks provide guidance on how the choice of configuration affects data size and read/write performance; a configuration sketch based on these findings follows the parameter list below.

The parameters varied were:

- Type of image
  - Heart: HiP-CT scan of a heart from the Human Organ Atlas
  - Dense: segmented neurons from electron microscopy
  - Sparse: a few selected segmented neurons from electron microscopy
- Software libraries
  - Tensorstore (fastest for both reading and writing data)
  - zarr-python version 3
  - zarr-python version 2 (slowest for both reading and writing data)
- Compressor
  - blosc-zstd provides the best compression ratio for both image and segmentation data (the options were blosc-blosclz, blosc-lz4, blosc-lz4hc, blosc-zlib and blosc-zstd, as well as gzip and zstd)
- Compression level
  - Compression levels above ~3 give slightly better compression but much longer write times; compression level does not affect read time
- Shuffle
  - Enabling the shuffle option increases compression with no adverse effect on read/write times (the three options were shuffle, bitshuffle and noshuffle)
- Zarr format version
  - There was no noticeable difference between Zarr format 2 and Zarr format 3 data
- Chunk size
  - Low chunk sizes (below around 90) adversely affect read and write times
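
Putting these findings together, below is a minimal sketch of writing a volume with the settings the benchmarks favour: blosc-zstd with shuffle enabled, compression level 3, and chunk sizes well above the ~90 threshold. It assumes the zarr-python version 3 API; the store path, shape, and dtype are illustrative.

```python
import numpy as np
import zarr
from zarr.codecs import BloscCodec, BloscShuffle

# Illustrative 3D volume standing in for a real image.
image = np.random.randint(0, 2**16, size=(256, 256, 256), dtype="uint16")

# Benchmark-informed settings: blosc-zstd, compression level ~3,
# shuffle enabled, and chunk edges comfortably above ~90.
array = zarr.create_array(
    store="image.zarr",
    shape=image.shape,
    chunks=(128, 128, 128),
    dtype=image.dtype,
    compressors=BloscCodec(cname="zstd", clevel=3, shuffle=BloscShuffle.shuffle),
)
array[:] = image
```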
## Tools for working with chunked datasets

42+
Contributions have been made to the zarr-python repository:

- [Add CLI for converting v2 metadata to v3](https://github.com/zarr-developers/zarr-python/pull/3257)
- [Added ArrayNotFoundError](https://github.com/zarr-developers/zarr-python/pull/3367) (see the sketch after this list)
- [Better document acceptable values for StoreLike](https://github.com/zarr-developers/zarr-python/pull/3480)
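
The new exception makes a missing-array failure explicit and easy to catch. A small sketch, assuming the error is exported as `zarr.errors.ArrayNotFoundError` (per the linked PR); the store and path are hypothetical:

```python
import zarr
from zarr.errors import ArrayNotFoundError  # assumption: exported here per the linked PR

try:
    arr = zarr.open_array("image.zarr", path="does/not/exist", mode="r")
except ArrayNotFoundError:
    print("No array at that path; check the store layout")
```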
PRs have been opened in the zarr-python repository:

- [Prevent creation of arrays/groups under a parent array](https://github.com/zarr-developers/zarr-python/pull/3407) (see the sketch after this list)
- [Holding space - LRUStoreCache]
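
For context on the hierarchy PR: arrays are leaf nodes in a Zarr hierarchy, so nothing should be creatable beneath them. A hedged sketch of the pattern the PR rejects; the store path and node names are illustrative:

```python
import zarr

root = zarr.open_group("store.zarr", mode="w")
volume = root.create_array("volume", shape=(64, 64), dtype="uint8")

# Arrays are leaves, so a child node under "volume" is invalid; with the
# linked PR, a call like the following raises instead of silently creating
# nodes inside the array's prefix:
# root.create_group("volume/child")
```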
Issues have also been opened in other projects:

- [Document supported file formats for dask_image.imread](https://github.com/dask/dask-image/issues/407)
- [Document supported file formats for skimage.io](https://github.com/scikit-image/scikit-image/issues/7879)
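
For reference, a typical `dask_image.imread` call that such documentation would cover; the glob pattern is hypothetical:

```python
from dask_image.imread import imread

# Lazily read a stack of 2D slices into a single dask array; which file
# formats are accepted here is what the linked issue asks to document.
stack = imread("slices/*.tif")
print(stack.shape, stack.chunks)
```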
## Improvements to cloud visualisation

## Acknowledgements