---
layout: post
title: "Zarr Python 2.3 release"
date: 2019-05-23
categories: zarr python release
---
| 7 | + |
| 8 | +Recently we released version 2.3 of the [Python Zarr |
| 9 | +package](https://zarr.readthedocs.io/en/stable/), which implements the |
| 10 | +Zarr protocol for storing N-dimensional typed arrays, and is designed |
| 11 | +for use in distributed and parallel computing. This post provides an |
| 12 | +overview of new features in this release, and some information about |
| 13 | +future directions for Zarr. |
| 14 | + |
## New storage options for distributed and cloud computing

A key feature of the Zarr protocol is that the underlying storage
system is decoupled from other components via a simple key/value
interface. In Python, this interface corresponds to the
[`MutableMapping`
interface](https://docs.python.org/3/glossary.html#term-mapping),
which is the interface that Python
[`dict`](https://docs.python.org/3/library/stdtypes.html#dict)
implements. In other words, anything `dict`-like can be used to store
Zarr data. The simplicity of this interface means it is relatively
straightforward to add support for a range of different storage
systems. The 2.3 release adds support for storage using
[SQLite](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.SQLiteStore),
[Redis](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.RedisStore),
[MongoDB](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.MongoDBStore)
and [Azure Blob Storage](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ABSStore).
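
To see just how simple the interface is, note that a plain Python
`dict` can serve as an in-memory store. A minimal sketch (the shape,
chunks and dtype here are arbitrary):

{% highlight python %}
import zarr

# any MutableMapping can serve as a store; here, a plain dict in memory
store = {}
z = zarr.create(shape=(1000, 1000), chunks=(100, 100), dtype='i4', store=store)
z[:] = 7

# the store now holds metadata and chunk objects under string keys,
# e.g. '.zarray', '0.0', '0.1', ...
print(sorted(store)[:3])
{% endhighlight %}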

Here's an example that creates an array using the new MongoDB store:

{% highlight python %}
import zarr
store = zarr.MongoDBStore('localhost')
root = zarr.group(store=store, overwrite=True)
foo = root.create_group('foo')
bar = foo.create_dataset('bar', shape=(10000, 1000), chunks=(1000, 100))
bar[:] = 42
store.close()
{% endhighlight %}

To do the same thing but storing the data in the cloud via Azure
Blob Storage, replace the instantiation of the `store` object with:

{% highlight python %}
store = zarr.ABSStore(container='test', account_name='foo', account_key='bar')
{% endhighlight %}
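
The other new stores follow the same pattern, with only the store
instantiation changing. A sketch for SQLite and Redis (the file path
and connection details are hypothetical; keyword arguments to
`RedisStore` are passed through to the `redis` client):

{% highlight python %}
import zarr

# SQLite: the entire array hierarchy lives in a single file on disk
store = zarr.SQLiteStore('data/example.sqldb')

# Redis: data held in a Redis database
store = zarr.RedisStore(host='localhost', port=6379)
{% endhighlight %}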

Support for other cloud object storage services was already
available via other packages, with Amazon S3 supported via the
[s3fs](http://s3fs.readthedocs.io/en/latest/) package, and Google
Cloud Storage supported via the
[gcsfs](https://gcsfs.readthedocs.io/en/latest/) package. Further
notes on using cloud storage are available from the [Zarr
tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage).
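
For instance, opening a public dataset from S3 might look like the
following sketch (the bucket and path are hypothetical):

{% highlight python %}
import s3fs
import zarr

# map an S3 bucket prefix onto the MutableMapping interface
s3 = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root='example-bucket/path/to/data.zarr', s3=s3, check=False)
root = zarr.open_group(store, mode='r')
{% endhighlight %}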

The attraction of cloud storage is that total I/O bandwidth scales
linearly with the size of a computing cluster, so there are no
technical limits to the size of the data or computation you can scale
up to. Here's a slide from a recent presentation by Ryan Abernathey
showing how I/O scales when using Zarr over Google Cloud Storage:

<script async class="speakerdeck-embed" data-slide="22" data-id="1621118c5987411fb55fdcf503cb331d" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>

## Optimisations for cloud storage: consolidated metadata

One issue with using cloud object storage is that, although total I/O
throughput can be high, the latency involved in each request to read
the contents of an object can be >100 ms, even when reading from
compute nodes within the same data centre. This latency can add up
when reading metadata from many arrays, because in Zarr each array has
its own metadata stored in a separate object.

To work around this, the 2.3 release adds an experimental feature to
consolidate metadata for all arrays and groups within a hierarchy into
a single object, which can be read once via a single request. Although
this is not suitable for rapidly changing datasets, it can be good for
large datasets which are relatively static.

To use this feature, two new convenience functions have been
added. The
[`consolidate_metadata()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.consolidate_metadata)
function performs the initial consolidation, reading all metadata and
combining them into a single object. Once you have done that and
deployed the data to a cloud object store, the
[`open_consolidated()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.open_consolidated)
function can be used to read data, making use of the consolidated
metadata.
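
In code, the workflow might look like the following sketch (the
directory store and path are hypothetical):

{% highlight python %}
import zarr

# one-off step after writing: combine all array and group metadata
# into a single object under the key '.zmetadata'
store = zarr.DirectoryStore('data/example.zarr')
zarr.consolidate_metadata(store)

# later, e.g. against a cloud object store: open the hierarchy with a
# single read of the consolidated metadata
root = zarr.open_consolidated(store, mode='r')
{% endhighlight %}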

Support for the new consolidated metadata feature is also now
available via
[xarray](http://xarray.pydata.org/en/stable/generated/xarray.open_zarr.html)
and
[intake-xarray](https://intake-xarray.readthedocs.io/en/latest/index.html)
(see [this blog
post](https://www.anaconda.com/intake-taking-the-pain-out-of-data-access/)
for an introduction to intake), and many of the datasets in [Pangeo's
cloud data catalog](https://pangeo-data.github.io/pangeo-datastore/)
use Zarr with consolidated metadata.

Here's an example of how to open a Zarr dataset from Pangeo's data
catalog via intake:

{% highlight python %}
import intake
cat_url = 'https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml'
cat = intake.Catalog(cat_url)
ds = cat.atmosphere.gmet_v1.to_dask()
{% endhighlight %}

...and [here's the underlying catalog
entry](https://github.com/pangeo-data/pangeo-datastore/blob/aa3f12bcc3be9584c1a9071235874c9d6af94a4e/intake-catalogs/atmosphere.yaml#L6).

## Compatibility with N5

Around the same time that development on Zarr was getting started, a
separate team led by [Stephan
Saalfeld](https://www.janelia.org/lab/saalfeld-lab) at the Janelia
research campus was experiencing similar challenges storing and
computing with large amounts of neural imaging data, and developed a
software library called [N5](https://github.com/saalfeldlab/n5). N5 is
implemented in Java but is very similar to Zarr in the approach it
takes to storing both metadata and data chunks, and to decoupling the
storage backend to enable efficient use of cloud storage.

There is a lot of commonality between Zarr and N5, and we are working
jointly to bring the two approaches together. As a first experimental
step towards that goal, the Zarr 2.3 release includes an [N5 storage
adapter](https://zarr.readthedocs.io/en/stable/api/n5.html#zarr.n5.N5Store)
which allows reading and writing of data on disk in the N5
format.
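
Using the adapter looks just like using any other store. A minimal
sketch (the file path is hypothetical):

{% highlight python %}
import zarr
from zarr.n5 import N5Store

# data on disk is laid out in N5 format rather than Zarr format
store = N5Store('data/example.n5')
root = zarr.group(store=store, overwrite=True)
z = root.zeros('foo', shape=(1000, 1000), chunks=(100, 100))
z[:] = 42
{% endhighlight %}
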
## Support for the buffer protocol

Zarr is intended to work efficiently across a range of different
storage systems with different latencies and bandwidths, from cloud
object stores to local disk and memory. In many of these settings,
making efficient use of local memory, and avoiding memory copies
wherever possible, can make a substantial difference to
performance. This is particularly true within the
[Numcodecs](http://numcodecs.rtfd.io) package, which is a companion to
Zarr and provides implementations of compression and filter codecs
such as Blosc and Zstandard. A key aspect of achieving fewer memory
copies has been to leverage the Python buffer protocol.

The [Python buffer
protocol](https://docs.python.org/3/c-api/buffer.html) is a
specification for how to share large blocks of memory between
different libraries without copying. The protocol was originally
introduced in Python 2 and later revamped in Python 3 (with backports
to Python 2.6 and 2.7). Because of the changes in its behaviour
between Python 2 and Python 3, and differences in which objects
support which implementation of the protocol, it was challenging to
leverage effectively in Zarr.

Thanks to some under-the-hood changes in Zarr 2.3 and Numcodecs 0.6,
the buffer protocol is now cleanly supported for Python 2/3 in both
libraries when working with data. In addition to improved memory
handling and performance, this should make it easier for users
developing their own stores, compressors, and filters to use with
Zarr. It has also cut down the amount of code specialised for
handling different Python versions.
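
As an illustration, Numcodecs codecs accept any object exposing the
buffer protocol, so a NumPy array can be passed straight to a codec
without first copying it into a `bytes` object. A sketch using the
Zstandard codec:

{% highlight python %}
import numpy as np
from numcodecs import Zstd

codec = Zstd(level=1)
a = np.arange(1000000, dtype='i4')

# encode() accepts any buffer-protocol object, e.g. a NumPy array
compressed = codec.encode(a)

# decode() returns a bytes-like object that can be viewed without copying
b = np.frombuffer(codec.decode(compressed), dtype='i4')
assert (a == b).all()
{% endhighlight %}
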
## Future developments

There is a growing community of interest around new approaches to
storage of array-like data, particularly in the cloud. For example,
Theo McCaie from the UK Met Office Informatics Lab recently wrote a
series of blog posts about the challenges involved in [storing 200TB
of "high momentum" weather model data every
day](https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671). This
is an exciting space to be working in and we'd like to do what we can
to build connections and share knowledge and ideas between
communities. We've started a [regular
teleconference](https://github.com/zarr-developers/zarr/issues/315)
which is open to anyone to join, and there is a new [Gitter
channel](https://gitter.im/zarr-developers/community) for general
discussion.

The main focus of our conversations so far has been work towards a
new set of specifications that support the features of both Zarr and
N5 and provide a platform for exploring and developing new features,
while also identifying a minimal core protocol that can be
implemented in a range of different programming languages. It is
still relatively early days and there are lots of open questions to
work through, both on the technical side and in terms of how we
organise and coordinate efforts. However, the community is very
friendly and supportive, and anyone is welcome to participate, so if
you have an interest please do consider getting involved.

If you would like to stay in touch with or contribute to new
developments, keep an eye on the
[zarr](https://github.com/zarr-developers/zarr) and
[zarr-specs](https://github.com/zarr-developers/zarr-specs) GitHub
repositories, and please feel free to raise issues or add comments if
you have any questions or ideas.


## And finally... SciPy!

If you're coming to SciPy this year, we're very pleased to be giving a
talk on Zarr on [day 1 of the conference (Wednesday 10
July)](https://www.eiseverywhere.com/ehome/381993). Several members of
the Zarr community will be at the conference, and there are sprints
going on after the conference in a number of related areas, including
an Xarray sprint on the Saturday. Please do say hi or [drop us a
comment on this
issue](https://github.com/zarr-developers/zarr/issues/396) if you'd
like to connect and discuss anything.