
Commit 52df8c9

Merge pull request #2 from zarr-developers/zarr-2.3-post
Zarr Python 2.3 release post
2 parents 6a72fb4 + c196e56 commit 52df8c9

File tree

3 files changed: +224 -29 lines


Gemfile.lock

Lines changed: 4 additions & 4 deletions

@@ -9,7 +9,7 @@ GEM
      eventmachine (>= 0.12.9)
      http_parser.rb (~> 0.6.0)
    eventmachine (1.2.7)
-   ffi (1.10.0)
+   ffi (1.11.1)
    forwardable-extended (2.6.0)
    http_parser.rb (0.6.0)
    i18n (0.9.5)
@@ -31,8 +31,8 @@ GEM
      jekyll (>= 3.7, < 5.0)
    jekyll-sass-converter (1.5.2)
      sass (~> 3.4)
-   jekyll-seo-tag (2.6.0)
-     jekyll (~> 3.3)
+   jekyll-seo-tag (2.6.1)
+     jekyll (>= 3.3, < 5.0)
    jekyll-watch (2.2.1)
      listen (~> 3.0)
    kramdown (1.17.0)
@@ -55,7 +55,7 @@ GEM
    rouge (3.3.0)
    ruby_dep (1.5.0)
    safe_yaml (1.0.5)
-   sass (3.7.3)
+   sass (3.7.4)
    sass-listen (~> 4.0.0)
    sass-listen (4.0.0)
      rb-fsevent (~> 0.9, >= 0.9.4)

_posts/2019-03-29-welcome-to-jekyll.markdown

Lines changed: 0 additions & 25 deletions
This file was deleted.

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
---
layout: post
title: "Zarr Python 2.3 release"
date: 2019-05-23
categories: zarr python release
---

Recently we released version 2.3 of the [Python Zarr package](https://zarr.readthedocs.io/en/stable/), which implements the Zarr protocol for storing N-dimensional typed arrays, and is designed for use in distributed and parallel computing. This post provides an overview of new features in this release, and some information about future directions for Zarr.

## New storage options for distributed and cloud computing

A key feature of the Zarr protocol is that the underlying storage system is decoupled from other components via a simple key/value interface. In Python, this interface corresponds to the [`MutableMapping` interface](https://docs.python.org/3/glossary.html#term-mapping), which is the interface that the Python [`dict`](https://docs.python.org/3/library/stdtypes.html#dict) implements. In other words, anything `dict`-like can be used to store Zarr data. The simplicity of this interface means it is relatively straightforward to add support for a range of different storage systems. The 2.3 release adds support for storage using [SQLite](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.SQLiteStore), [Redis](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.RedisStore), [MongoDB](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.MongoDBStore) and [Azure Blob Storage](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ABSStore).
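
As a quick illustration of how minimal this interface is, a plain in-memory `dict` can serve as a store (a small sketch; the array shape and chunking here are arbitrary):

{% highlight python %}
import zarr

# any MutableMapping will do; here a plain dict acts as an in-memory store
store = dict()
z = zarr.zeros((1000, 1000), chunks=(100, 100), store=store)
z[:] = 1

# the store now maps keys to array metadata ('.zarray') and compressed chunks
print(sorted(store)[:3])
{% endhighlight %}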

For example, here's code that creates an array using MongoDB:

{% highlight python %}
import zarr
store = zarr.MongoDBStore('localhost')
root = zarr.group(store=store, overwrite=True)
foo = root.create_group('foo')
bar = foo.create_dataset('bar', shape=(10000, 1000), chunks=(1000, 100))
bar[:] = 42
store.close()
{% endhighlight %}

To do the same thing but store the data in the cloud via Azure Blob Storage, replace the instantiation of the `store` object with:

{% highlight python %}
store = zarr.ABSStore(container='test', account_name='foo', account_key='bar')
{% endhighlight %}

Support for other cloud object storage services was already available via other packages, with Amazon S3 supported via the [s3fs](http://s3fs.readthedocs.io/en/latest/) package and Google Cloud Storage supported via the [gcsfs](https://gcsfs.readthedocs.io/en/latest/) package. Further notes on using cloud storage are available in the [Zarr tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage).
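
For instance, here's a rough sketch of opening a Zarr group stored in a Google Cloud Storage bucket via gcsfs (the bucket and path are placeholders, anonymous read access is assumed, and the gcsfs API shown is the one current at the time of writing):

{% highlight python %}
import gcsfs
import zarr

# map a location in a Google Cloud Storage bucket onto the key/value interface
gcs = gcsfs.GCSFileSystem(token='anon')
store = gcsfs.mapping.GCSMap('my-bucket/my-dataset.zarr', gcs=gcs, check=False)

# open the hierarchy read-only; chunks are fetched on demand
root = zarr.open_group(store=store, mode='r')
{% endhighlight %}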

The attraction of cloud storage is that total I/O bandwidth scales linearly with the size of a computing cluster, so there are no technical limits to the size of the data or computation you can scale up to. Here's a slide from a recent presentation by Ryan Abernathey showing how I/O scales when using Zarr over Google Cloud Storage:

<script async class="speakerdeck-embed" data-slide="22" data-id="1621118c5987411fb55fdcf503cb331d" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>

## Optimisations for cloud storage: consolidated metadata

One issue with using cloud object storage is that, although total I/O throughput can be high, the latency involved in each request to read the contents of an object can be >100 ms, even when reading from compute nodes within the same data centre. This latency can add up when reading metadata from many arrays, because in Zarr each array has its own metadata stored in a separate object.

To work around this, the 2.3 release adds an experimental feature to consolidate the metadata for all arrays and groups within a hierarchy into a single object, which can then be read via a single request. Although this is not suitable for rapidly changing datasets, it works well for large datasets which are relatively static.

To use this feature, two new convenience functions have been added. The [`consolidate_metadata()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.consolidate_metadata) function performs the initial consolidation, reading all of the metadata and combining it into a single object. Once you have done that and deployed the data to a cloud object store, the [`open_consolidated()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.open_consolidated) function can be used to read the data, making use of the consolidated metadata.
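
Here's a minimal sketch of that workflow using a local `DirectoryStore` (the path and array name are placeholders):

{% highlight python %}
import zarr

# write a small hierarchy to a local directory store
store = zarr.DirectoryStore('example.zarr')
root = zarr.group(store=store, overwrite=True)
root.create_dataset('baz', shape=(10000, 10000), chunks=(1000, 1000))

# combine all array and group metadata into a single object under the
# '.zmetadata' key
zarr.consolidate_metadata(store)

# later, e.g. after copying the data to a cloud object store, open the
# hierarchy using only the consolidated metadata
root = zarr.open_consolidated(store)
{% endhighlight %}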

Support for the new consolidated metadata feature is also now available via [xarray](http://xarray.pydata.org/en/stable/generated/xarray.open_zarr.html) and [intake-xarray](https://intake-xarray.readthedocs.io/en/latest/index.html) (see [this blog post](https://www.anaconda.com/intake-taking-the-pain-out-of-data-access/) for an introduction to intake), and many of the datasets in [Pangeo's cloud data catalog](https://pangeo-data.github.io/pangeo-datastore/) use Zarr with consolidated metadata.
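
For example, with xarray the consolidated metadata is used by passing `consolidated=True` (a sketch; `'example.zarr'` stands in for wherever the dataset actually lives):

{% highlight python %}
import xarray as xr

# read all array and group metadata from the single consolidated object
ds = xr.open_zarr('example.zarr', consolidated=True)
{% endhighlight %}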

Here's an example of how to open a Zarr dataset from Pangeo's data catalog via intake:

{% highlight python %}
import intake
cat_url = 'https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml'
cat = intake.Catalog(cat_url)
ds = cat.atmosphere.gmet_v1.to_dask()
{% endhighlight %}

...and [here's the underlying catalog entry](https://github.com/pangeo-data/pangeo-datastore/blob/aa3f12bcc3be9584c1a9071235874c9d6af94a4e/intake-catalogs/atmosphere.yaml#L6).

## Compatibility with N5

Around the same time that development on Zarr was getting started, a separate team led by [Stephan Saalfeld](https://www.janelia.org/lab/saalfeld-lab) at the Janelia Research Campus was experiencing similar challenges storing and computing with large amounts of neural imaging data, and developed a software library called [N5](https://github.com/saalfeldlab/n5). N5 is implemented in Java but is very similar to Zarr in the approach it takes to storing both metadata and data chunks, and in decoupling the storage backend to enable efficient use of cloud storage.

There is a lot of commonality between Zarr and N5, and we are working jointly to bring the two approaches together. As a first experimental step towards that goal, the Zarr 2.3 release includes an [N5 storage adapter](https://zarr.readthedocs.io/en/stable/api/n5.html#zarr.n5.N5Store) which allows reading and writing of data on disk in the N5 format.
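
Here's a small sketch of writing an array through the N5 adapter (the path is a placeholder):

{% highlight python %}
import zarr

# data written through N5Store is laid out on disk in the N5 format
store = zarr.N5Store('example.n5')
z = zarr.zeros((10000, 10000), chunks=(1000, 1000), store=store, overwrite=True)
z[...] = 42
{% endhighlight %}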

## Support for the buffer protocol

Zarr is intended to work efficiently across a range of different storage systems with different latencies and bandwidths, from cloud object stores to local disk and memory. In many of these settings, making efficient use of local memory, and avoiding memory copies wherever possible, can make a substantial difference to performance. This is particularly true within the [Numcodecs](http://numcodecs.rtfd.io) package, which is a companion to Zarr and provides implementations of compression and filter codecs such as Blosc and Zstandard. A key aspect of achieving fewer memory copies has been to leverage the Python buffer protocol.

The [Python buffer protocol](https://docs.python.org/3/c-api/buffer.html) is a specification for sharing large blocks of memory between different libraries without copying. The protocol has evolved over time, from its original introduction in Python 2 to the revamped implementation added in Python 3 (with backports to Python 2.6 and 2.7). Because of these changes in behavior between Python 2 and Python 3, and differences in which objects support which version of the protocol, it was a bit challenging to leverage effectively in Zarr.

Thanks to some under-the-hood changes in Zarr 2.3 and Numcodecs 0.6, the buffer protocol is now cleanly supported in both libraries, on both Python 2 and Python 3, when working with data. In addition to improved memory handling and performance, this should make it easier for users developing their own stores, compressors, and filters to use with Zarr. It has also cut down on the amount of code specialized for handling different Python versions.
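
As an illustration, Numcodecs codecs accept anything exporting the buffer protocol and can decode into a pre-allocated buffer, avoiding extra copies (a sketch; the codec choice and array size are arbitrary):

{% highlight python %}
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname='zstd', clevel=3)
a = np.arange(1000000, dtype='i4')

# encode() accepts objects exporting the buffer protocol, e.g. an ndarray
compressed = codec.encode(a)

# decode() can write directly into a pre-allocated buffer, avoiding a copy
out = np.empty_like(a)
codec.decode(compressed, out=out)
{% endhighlight %}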

## Future developments

There is a growing community of interest around new approaches to storage of array-like data, particularly in the cloud. For example, Theo McCaie from the UK Met Office Informatics Lab recently wrote a series of blog posts about the challenges involved in [storing 200TB of "high momentum" weather model data every day](https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671). This is an exciting space to be working in, and we'd like to do what we can to build connections and share knowledge and ideas between communities. We've started a [regular teleconference](https://github.com/zarr-developers/zarr/issues/315) which is open for anyone to join, and there is a new [Gitter channel](https://gitter.im/zarr-developers/community) for general discussion.

The main focus of our conversations so far has been setting up work towards a new set of specifications that support the features of both Zarr and N5, and that provide a platform for exploring and developing new features, while also identifying a minimal core protocol that can be implemented in a range of different programming languages. It is still relatively early days and there are lots of open questions to work through, both on the technical side and in terms of how we organise and coordinate efforts. However, the community is very friendly and supportive, and anyone is welcome to participate, so if you have an interest please do consider getting involved.

If you would like to stay in touch with or contribute to new developments, keep an eye on the [zarr](https://github.com/zarr-developers/zarr) and [zarr-specs](https://github.com/zarr-developers/zarr-specs) GitHub repositories, and please feel free to raise issues or add comments if you have any questions or ideas.

## And finally... SciPy!

If you're coming to SciPy this year, we're very pleased to be giving a talk on Zarr on [day 1 of the conference (Wednesday 10 July)](https://www.eiseverywhere.com/ehome/381993). Several members of the Zarr community will be at the conference, and there are sprints going on after the conference in a number of related areas, including an Xarray sprint on the Saturday. Please do say hi or [drop us a comment on this issue](https://github.com/zarr-developers/zarr/issues/396) if you'd like to connect and discuss anything.
