Commit 231f3ea

filled in some examples and links
1 parent e32460c commit 231f3ea

_posts/2019-05-02-zarr-2.3-release.md

Lines changed: 75 additions & 40 deletions
---
layout: post
title: "Zarr Python 2.3 release"
date: 2019-05-23
categories: zarr python release
---

Recently we released version 2.3 of the [Python Zarr
package](https://zarr.readthedocs.io/en/stable/), which implements the
Zarr protocol for storing N-dimensional typed arrays, and is designed
for use in distributed and parallel computing. This post provides an
overview of new features in this release, and some information about
future directions for Zarr.

## New storage options for distributed computing

A key feature of the Zarr protocol is that the underlying storage
system is decoupled from other components via a simple key/value
interface. In Python, this interface corresponds to the
[`MutableMapping`
interface](https://docs.python.org/3/glossary.html#term-mapping),
which is the interface that Python
[`dict`](https://docs.python.org/3/library/stdtypes.html#dict)
implements. The simplicity of this interface means it is relatively
straightforward to add support for a range of different storage
systems. The 2.3 release adds support for storage using
[SQLite](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.SQLiteStore),
[Redis](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.RedisStore),
[MongoDB](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.MongoDBStore) and
[Azure Blob Storage](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ABSStore).
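
Because any `MutableMapping` can serve as a store, the simplest
possible store is a plain `dict`; here's a small illustrative sketch
(not from the release notes) showing the keys Zarr writes:

{% highlight python %}
import zarr
store = dict()  # any MutableMapping can act as a Zarr store
z = zarr.create(shape=(100,), chunks=(10,), store=store)
z[:] = 1
sorted(store)[:3]  # string keys mapping to bytes, e.g. ['.zarray', '0', '1']
{% endhighlight %}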

For example, here's code that creates an array using MongoDB:

{% highlight python %}
import zarr
store = zarr.MongoDBStore('localhost')
root = zarr.group(store=store, overwrite=True)
foo = root.create_group('foo')
bar = foo.create_dataset('bar', shape=(10000, 1000), chunks=(1000, 100))
bar[:] = 42
store.close()
{% endhighlight %}

To do the same thing but storing the data in the cloud via Azure
Blob Storage, replace the instantiation of the `store` object with:

{% highlight python %}
store = zarr.ABSStore(container='test', account_name='foo', account_key='bar')
{% endhighlight %}
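
The other new stores follow the same pattern. For instance, here are
sketches of the equivalent `store` lines for SQLite and Redis (the
file path and host are placeholders):

{% highlight python %}
store = zarr.SQLiteStore('example.sqldb')   # single local database file
store = zarr.RedisStore(host='localhost')   # keys/values held in a Redis server
{% endhighlight %}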

Support for other cloud object storage services was already available
via other packages, with Amazon S3 supported via the
[s3fs](http://s3fs.readthedocs.io/en/latest/) package, and Google
Cloud Storage supported via the
[gcsfs](https://gcsfs.readthedocs.io/en/latest/) package. Further
notes on using cloud storage are available from the [Zarr
tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage).
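
For example, here's a rough sketch of reading an existing array from
S3 via s3fs (the bucket and path are hypothetical):

{% highlight python %}
import s3fs
import zarr
# s3fs provides a MutableMapping view onto a bucket, which Zarr can use directly
s3 = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root='example-bucket/example.zarr', s3=s3)
z = zarr.open(store, mode='r')
{% endhighlight %}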

The attraction of cloud storage is that total I/O bandwidth scales
linearly with the size of a computing cluster, so there are no
technical limits to the size of the data or computation you can scale
up to. Here's a slide from a recent presentation by Ryan Abernathey
showing how I/O scales when using Zarr over Google Cloud Storage:

<script async class="speakerdeck-embed" data-slide="22" data-id="1621118c5987411fb55fdcf503cb331d" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>

## Optimisations for cloud storage: consolidated metadata

One issue with using cloud object storage is that, although total I/O
throughput can be high, the latency involved in each request to read
the contents of an object can be >100 ms, even when reading from
compute nodes within the same data centre. This latency can add up
when reading metadata from many arrays, because in Zarr each array has
its own metadata stored in a separate object.

To work around this, the 2.3 release adds an experimental feature to
consolidate metadata for all arrays and groups within a hierarchy into
a single object. Although this is not suitable for rapidly changing
datasets, it can be good for large datasets which are relatively
static.

To use this feature, two new convenience functions have been
added. The
[`consolidate_metadata()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.consolidate_metadata)
function performs the initial consolidation, reading all metadata and
combining them into a single object. Once you have done that and
deployed the data to a cloud object store, the
[`open_consolidated()`](https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.open_consolidated)
function can be used to read data, making use of the consolidated
metadata.
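As a rough sketch (the local path is hypothetical), the workflow looks
like this:

{% highlight python %}
import zarr
# one-off step after writing: combine all .zarray/.zgroup/.zattrs objects
store = zarr.DirectoryStore('example.zarr')
zarr.consolidate_metadata(store)
# readers then open the hierarchy via the consolidated object,
# avoiding one latency-bound request per array or group
root = zarr.open_consolidated(store)
{% endhighlight %}
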
Support for the new consolidated metadata feature is also now
available via
[xarray](http://xarray.pydata.org/en/stable/generated/xarray.open_zarr.html)
and
[intake-xarray](https://intake-xarray.readthedocs.io/en/latest/index.html)
(see [this blog
post](https://www.anaconda.com/intake-taking-the-pain-out-of-data-access/)
for an introduction to intake), and many of the datasets in [Pangeo's
cloud data catalog](https://pangeo-data.github.io/pangeo-datastore/)
use Zarr with consolidated metadata.
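
With xarray, for instance, opening a consolidated dataset might look
like this (a sketch; the bucket name is hypothetical):

{% highlight python %}
import gcsfs
import xarray as xr
gcs = gcsfs.GCSFileSystem(token='anon')
store = gcsfs.GCSMap('example-bucket/example.zarr', gcs=gcs)
ds = xr.open_zarr(store, consolidated=True)  # reads .zmetadata instead of per-array objects
{% endhighlight %}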

Here's an example of how to open a Zarr dataset from Pangeo's data
catalog via intake:

{% highlight python %}
import intake
cat_url = 'https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml'
cat = intake.Catalog(cat_url)
ds = cat.atmosphere.gmet_v1.to_dask()
{% endhighlight %}

...and [here's the underlying catalog
entry](https://github.com/pangeo-data/pangeo-datastore/blob/aa3f12bcc3be9584c1a9071235874c9d6af94a4e/intake-catalogs/atmosphere.yaml#L6).

## Compatibility with N5

Around the same time that development on Zarr was getting started, a
