---
layout: post
title: "Zarr Python 2.3 release"
date: 2019-05-23
categories: zarr python release
---
| 7 | + |
| 8 | +Recently we released version 2.3 of the [Python Zarr |
| 9 | +package](https://zarr.readthedocs.io/en/stable/), which implements the |
| 10 | +Zarr protocol for storing N-dimensional typed arrays, and is designed |
| 11 | +for use in distributed and parallel computing. This post provides an |
| 12 | +overview of new features in this release, and some information about |
| 13 | +future directions for Zarr. |
| 14 | + |
## New storage options for distributed and cloud computing

A key feature of the Zarr protocol is that the underlying storage
system is decoupled from other components via a simple key/value
interface. In Python, this interface corresponds to the
[`MutableMapping`
interface](https://docs.python.org/3/glossary.html#term-mapping),
which is the interface that Python
[`dict`](https://docs.python.org/3/library/stdtypes.html#dict)
implements. In other words, anything `dict`-like can be used to store
Zarr data. The simplicity of this interface means it is relatively
straightforward to add support for a range of different storage
systems. The 2.3 release adds support for storage using
[SQLite](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.SQLiteStore),
[Redis](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.RedisStore),
[MongoDB](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.MongoDBStore)
and [Azure Blob Storage](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ABSStore).
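
To see just how simple the interface is, note that a plain Python
`dict` can serve as an in-memory store. A minimal sketch (the shape,
chunks and dtype here are arbitrary):

{% highlight python %}
import zarr

# any MutableMapping can serve as a store; here, a plain dict in memory
store = {}
z = zarr.create(shape=(1000, 1000), chunks=(100, 100), dtype='i4', store=store)
z[:] = 7

# the store now holds metadata and chunk objects under string keys,
# e.g. '.zarray', '0.0', '0.1', ...
print(sorted(store)[:3])
{% endhighlight %}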

Here's an example that creates an array using the new MongoDB store:

{% highlight python %}
import zarr
store = zarr.MongoDBStore('localhost')
root = zarr.group(store=store, overwrite=True)
foo = root.create_group('foo')
bar = foo.create_dataset('bar', shape=(10000, 1000), chunks=(1000, 100))
bar[:] = 42
store.close()
{% endhighlight %}

To do the same thing but storing the data in the cloud via Azure
Blob Storage, replace the instantiation of the `store` object with:

{% highlight python %}
store = zarr.ABSStore(container='test', account_name='foo', account_key='bar')
{% endhighlight %}
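
The other new stores follow the same pattern, with only the store
instantiation changing. A sketch for SQLite and Redis (the file path
and connection details are hypothetical; keyword arguments to
`RedisStore` are passed through to the `redis` client):

{% highlight python %}
import zarr

# SQLite: the entire array hierarchy lives in a single file on disk
store = zarr.SQLiteStore('data/example.sqldb')

# Redis: data held in a Redis database
store = zarr.RedisStore(host='localhost', port=6379)
{% endhighlight %}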

Support for other cloud object storage services was already
available via other packages, with Amazon S3 supported via the
[s3fs](http://s3fs.readthedocs.io/en/latest/) package, and Google
Cloud Storage supported via the
[gcsfs](https://gcsfs.readthedocs.io/en/latest/) package. Further
notes on using cloud storage are available from the [Zarr
tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage).
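
For instance, opening a public dataset from S3 might look like the
following sketch (the bucket and path are hypothetical):

{% highlight python %}
import s3fs
import zarr

# map an S3 bucket prefix onto the MutableMapping interface
s3 = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root='example-bucket/path/to/data.zarr', s3=s3, check=False)
root = zarr.open_group(store, mode='r')
{% endhighlight %}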

The attraction of cloud storage is that total I/O bandwidth scales
linearly with the size of a computing cluster, so there are no
technical limits to the size of the data or computation you can scale
up to. Here's a slide from a recent presentation by Ryan Abernathey
showing how I/O scales when using Zarr over Google Cloud Storage:

<script async class="speakerdeck-embed" data-slide="22" data-id="1621118c5987411fb55fdcf503cb331d" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>

## Optimisations for cloud storage: consolidated metadata

One issue with using cloud object storage is that, although total I/O
throughput can be high, the latency involved in each request to read
the contents of an object can be >100 ms, even when reading from
compute nodes within the same data centre. This latency can add up
when reading metadata from many arrays, because in Zarr each array has
its own metadata stored in a separate object.

To work around this, the 2.3 release adds an experimental feature to
consolidate metadata for all arrays and groups within a hierarchy into
a single object, which can be read once via a single request. Although
this is not suitable for rapidly changing datasets, it can be good for
large datasets which are relatively static.

To use this feature, two new convenience functions have been
added. The
[`consolidate_metadata()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.consolidate_metadata)
function performs the initial consolidation, reading all metadata and
combining them into a single object. Once you have done that and
deployed the data to a cloud object store, the
[`open_consolidated()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.open_consolidated)
function can be used to read data, making use of the consolidated
metadata.
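
In code, the workflow might look like the following sketch (the
directory store and path are hypothetical):

{% highlight python %}
import zarr

# one-off step after writing: combine all array and group metadata
# into a single object under the key '.zmetadata'
store = zarr.DirectoryStore('data/example.zarr')
zarr.consolidate_metadata(store)

# later, e.g. against a cloud object store: open the hierarchy with a
# single read of the consolidated metadata
root = zarr.open_consolidated(store, mode='r')
{% endhighlight %}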

Support for the new consolidated metadata feature is also now
available via
[xarray](http://xarray.pydata.org/en/stable/generated/xarray.open_zarr.html)
and
[intake-xarray](https://intake-xarray.readthedocs.io/en/latest/index.html)
(see [this blog
post](https://www.anaconda.com/intake-taking-the-pain-out-of-data-access/)
for an introduction to intake), and many of the datasets in [Pangeo's
cloud data catalog](https://pangeo-data.github.io/pangeo-datastore/)
use Zarr with consolidated metadata.

Here's an example of how to open a Zarr dataset from Pangeo's data
catalog via intake:

{% highlight python %}
import intake
cat_url = 'https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml'
cat = intake.Catalog(cat_url)
ds = cat.atmosphere.gmet_v1.to_dask()
{% endhighlight %}

...and [here's the underlying catalog
entry](https://github.com/pangeo-data/pangeo-datastore/blob/aa3f12bcc3be9584c1a9071235874c9d6af94a4e/intake-catalogs/atmosphere.yaml#L6).

## Compatibility with N5

Around the same time that development on Zarr was getting started, a
separate team led by [Stephan
Saalfeld](https://www.janelia.org/lab/saalfeld-lab) at the Janelia
research campus was experiencing similar challenges storing and
computing with large amounts of neural imaging data, and developed a
software library called [N5](https://github.com/saalfeldlab/n5). N5 is
implemented in Java but is very similar to Zarr in the approach it
takes to storing both metadata and data chunks, and to decoupling the
storage backend to enable efficient use of cloud storage.

There is a lot of commonality between Zarr and N5, and we are working
jointly to bring the two approaches together. As a first experimental
step towards that goal, the Zarr 2.3 release includes an [N5 storage
adapter](https://zarr.readthedocs.io/en/stable/api/n5.html#zarr.n5.N5Store)
which allows reading and writing of data on disk in the N5
format.
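
Using the adapter looks just like using any other store. A minimal
sketch (the file path is hypothetical):

{% highlight python %}
import zarr
from zarr.n5 import N5Store

# data on disk is laid out in N5 format rather than Zarr format
store = N5Store('data/example.n5')
root = zarr.group(store=store, overwrite=True)
z = root.zeros('foo', shape=(1000, 1000), chunks=(100, 100))
z[:] = 42
{% endhighlight %}
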
## Support for the buffer protocol

Zarr is intended to work efficiently across a range of different
storage systems with different latencies and bandwidths, from cloud
object stores to local disk and memory. In many of these settings,
making efficient use of local memory, and avoiding memory copies
wherever possible, can make a substantial difference to
performance. This is particularly true within the
[Numcodecs](http://numcodecs.rtfd.io) package, which is a companion to
Zarr and provides implementations of compression and filter codecs
such as Blosc and Zstandard. A key aspect of achieving fewer memory
copies has been to leverage the Python buffer protocol.

The [Python buffer
protocol](https://docs.python.org/3/c-api/buffer.html) is a
specification for how to share large blocks of memory between
different libraries without copying. The protocol was originally
introduced in Python 2 and later revamped in Python 3 (with backports
to Python 2.6 and 2.7). Because of the changes in its behaviour
between Python 2 and Python 3, and differences in which objects
support which implementation of the protocol, it was challenging to
leverage effectively in Zarr.

Thanks to some under-the-hood changes in Zarr 2.3 and Numcodecs 0.6,
the buffer protocol is now cleanly supported for Python 2/3 in both
libraries when working with data. In addition to improved memory
handling and performance, this should make it easier for users
developing their own stores, compressors, and filters to use with
Zarr. It has also cut down the amount of code specialised for
handling different Python versions.
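
As an illustration, Numcodecs codecs accept any object exposing the
buffer protocol, so a NumPy array can be passed straight to a codec
without first copying it into a `bytes` object. A sketch using the
Zstandard codec:

{% highlight python %}
import numpy as np
from numcodecs import Zstd

codec = Zstd(level=1)
a = np.arange(1000000, dtype='i4')

# encode() accepts any buffer-protocol object, e.g. a NumPy array
compressed = codec.encode(a)

# decode() returns a bytes-like object that can be viewed without copying
b = np.frombuffer(codec.decode(compressed), dtype='i4')
assert (a == b).all()
{% endhighlight %}
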
## Future developments

There is a growing community of interest around new approaches to
storage of array-like data, particularly in the cloud. For example,
Theo McCaie from the UK Met Office Informatics Lab recently wrote a
series of blog posts about the challenges involved in [storing 200TB
of "high momentum" weather model data every
day](https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671). This
is an exciting space to be working in and we'd like to do what we can
to build connections and share knowledge and ideas between
communities. We've started a [regular
teleconference](https://github.com/zarr-developers/zarr/issues/315)
which is open to anyone to join, and there is a new [Gitter
channel](https://gitter.im/zarr-developers/community) for general
discussion.

The main focus of our conversations so far has been work towards a
new set of specifications that support the features of both Zarr and
N5 and provide a platform for exploring and developing new features,
while also identifying a minimal core protocol that can be
implemented in a range of different programming languages. It is
still relatively early days and there are lots of open questions to
work through, both on the technical side and in terms of how we
organise and coordinate efforts. However, the community is very
friendly and supportive, and anyone is welcome to participate, so if
you have an interest please do consider getting involved.

If you would like to stay in touch with or contribute to new
developments, keep an eye on the
[zarr](https://github.com/zarr-developers/zarr) and
[zarr-specs](https://github.com/zarr-developers/zarr-specs) GitHub
repositories, and please feel free to raise issues or add comments if
you have any questions or ideas.


## And finally... SciPy!

If you're coming to SciPy this year, we're very pleased to be giving a
talk on Zarr on [day 1 of the conference (Wednesday 10
July)](https://www.eiseverywhere.com/ehome/381993). Several members of
the Zarr community will be at the conference, and there are sprints
going on after the conference in a number of related areas, including
an Xarray sprint on the Saturday. Please do say hi or [drop us a
comment on this
issue](https://github.com/zarr-developers/zarr/issues/396) if you'd
like to connect and discuss anything.