
Commit 8d7c904

first complete draft
1 parent 31aa7cb commit 8d7c904

6 files changed, +108 −57 lines changed


slides.md

Lines changed: 1 addition & 1 deletion

@@ -5,4 +5,4 @@ permalink: /slides/
 ---

 * [SciPy 2019](scipy-2019.html)
-* [v3 design update 20190619](v3-update-20190619.html)
+* [Zarr protocol spec v3 design update, 19 June 2019](v3-update-20190619.html)
(binary image file, 593 KB)

(binary image file, 245 KB)

slides/scipy-2019-files/xarray.png

(binary image file, 22 KB)
slides/scipy-2019.html

Lines changed: 4 additions & 0 deletions

@@ -42,6 +42,10 @@
 .reveal p, .reveal li {
   font-size: 0.9em;
 }
+.reveal li>p {
+  margin: 0;
+  line-height: 1;
+}
 .reveal table {
   font-size: 0.7em;
 }

slides/scipy-2019.md

Lines changed: 103 additions & 56 deletions
@@ -5,7 +5,7 @@ Zarr - scalable storage of tensor data for parallel and distributed computing

 Alistair Miles ([@alimanfoo](https://github.com/alimanfoo)) - SciPy 2019

-These slides: @@TODO URL
+<small>These slides: https://zarr-developers.github.io/slides/scipy-2019.html</small>

 ====

@@ -23,7 +23,7 @@ These slides: @@TODO URL

 There is some computation we want to perform.

-Inputs and outputs are tensors.
+Inputs and outputs are multidimensional arrays (a.k.a. tensors).

 5 key features...

@@ -64,8 +64,7 @@ not parallel.
 ### (4) Data are compressible

 * Compression is a very active area of innovation.
-* Modern compressors achieve good compression ratios with high speed.
-* Opportunity to trade I/O for computation.
+* Modern compressors achieve good compression ratios with very high speed.
 * Compression can increase effective I/O bandwidth, sometimes
   dramatically.

@@ -79,13 +78,17 @@ not parallel.

 * E.g., genome sequencing.

-* Modern experiments sequence genomes from 1000s of individuals and
+* Now feasible to sequence genomes from 100,000s of individuals and
   compare them.

-* Each genome is a complete molecular blueprint for an organism.
+* Each genome is a complete molecular blueprint for an organism
+  &rarr; can investigate many different molecular pathways and
+  processes.

-* Each genome is a history book handed down from the beginning of
-  life on Earth, with each generation making its mark.
+* Each genome is a history book handed down through the ages, with
+  each generation making its mark &rarr; can look back in time and
+  infer major demographic and evolutionary events in the history of
+  populations and species.

 ===

@@ -102,26 +105,24 @@ not parallel.

 ## Solution

-1. Chunked, parallel computing framework.
+1. Chunked, parallel tensor computing framework.
 2. Chunked, parallel tensor storage library.

 Align the chunks!

-====
-
-## Aside...
-
 ===

 <p><img style="max-width:30%; max-height:30%" src="scipy-2019-files/dask.svg"></p>

+Parallel computing framework for chunked tensors.
+
 ```python
 import dask.array as da

 a = ... # what goes here?
 x = da.from_array(a)
 y = (x - x.mean(axis=1)) / x.std(axis=1)
-u, s, v = da.svd_compressed(y, 20)
+u, s, v = da.linalg.svd_compressed(y, 20)
 u = u.compute()
 ```


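For intuition about what the lazy Dask snippet in this hunk computes, here is a hypothetical eager NumPy equivalent (exact `numpy.linalg.svd` stands in for Dask's approximate `svd_compressed`; `keepdims=True` is added so the per-row statistics broadcast, and the array shape is an illustrative assumption):

```python
import numpy as np

# Small in-memory stand-in for the chunked array `a` in the slide.
a = np.random.default_rng(42).normal(size=(100, 20))

# Standardise each row, as in the Dask expression; keepdims=True keeps
# the row means/stds as (100, 1) columns so they broadcast over rows.
y = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)

# Exact SVD; Dask's svd_compressed instead computes an approximate
# rank-k factorisation suited to arrays too big for memory.
u, s, vt = np.linalg.svd(y, full_matrices=False)
print(u.shape, s.shape, vt.shape)  # (100, 20) (20,) (20, 20)
```

The point of the Dask version is that the same expression runs chunk-by-chunk in parallel, without materialising `a` in memory.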
@@ -133,7 +134,9 @@ u = u.compute()
 <p class="stretch"><img src="scipy-2019-files/pangeo.png"></p>

 * Scale up ocean / atmosphere / land / climate science.
-* Handle petabyte-scale datasets on HPC and cloud platforms.
+* Aim to handle petabyte-scale datasets on HPC and cloud platforms.
+* Using Dask.
+* Needed a tensor storage solution.
 * Interested to use cloud object stores: Amazon S3, Azure Blob Storage, Google Cloud Storage, ...

 ====
@@ -223,6 +226,17 @@ $ conda install -c conda-forge zarr

 ===

+### Conceptual model based on HDF5
+
+* Multiple arrays (a.k.a. datasets) can be created and organised into
+  a hierarchy of groups.
+
+* Each array is divided into regular shaped chunks.
+
+* Each chunk is compressed before storage.
+
+===
+
 ### Creating a hierarchy

 ```python
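The conceptual model added in this hunk can be sketched with a plain dict as the store. The key names below follow the Zarr v2 layout (`.zgroup`/`.zarray` metadata, `i.j` chunk keys), but the snippet is an illustration of the model, not the real library:

```python
import json
import zlib

# A Zarr store is just a key-value mapping.
store = {}

# Hierarchy: group and array metadata live under reserved keys.
store["foo/.zgroup"] = json.dumps({"zarr_format": 2}).encode()
store["foo/bar/.zarray"] = json.dumps(
    {"shape": [4, 4], "chunks": [2, 2], "dtype": "<i8"}
).encode()

# Each chunk is compressed before storage, under a key like "path/i.j".
chunk = bytes(range(32))  # stand-in for one 2x2 chunk of int64 data
store["foo/bar/0.0"] = zlib.compress(chunk)

# Reading reverses the process.
assert zlib.decompress(store["foo/bar/0.0"]) == chunk
print(sorted(store))  # ['foo/.zgroup', 'foo/bar/.zarray', 'foo/bar/0.0']
```

Because the store is just a mapping of keys to bytes, the same model works over a directory of files, a zip archive, or a cloud object store.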
@@ -554,16 +568,16 @@ class ZipStore(MutableMapping):
             yield key
 ```

-<small>(Actual implementation is slightly more complicated, but this is the essence.)</small>
+<small>(<a href="https://github.com/zarr-developers/zarr-python/blob/e61d6ae77f18e881be0b80e38b5366793f5a2860/zarr/storage.py#L1033">Actual implementation</a> is slightly more complicated, but this is the essence.)</small>

 ====

 ## Parallel computing with Zarr

-* A Zarr array can have multiple concurrent readers*
-* A Zarr array can have multiple concurrent writers*
-* Both multi-thread and multi-process parallelism are supported
-* GIL is released during critical sections (compression and decompression)
+* A Zarr array can have multiple concurrent readers*.
+* A Zarr array can have multiple concurrent writers*.
+* Both multi-thread and multi-process parallelism are supported.
+* GIL is released during critical sections (compression and decompression).

 <small>* Depending on the store.</small>

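The `ZipStore` idea referenced in this hunk — a `MutableMapping` over a zip archive — can be sketched in a few lines. `MiniZipStore` is a hypothetical minimal class, not Zarr's actual implementation (which adds locking, flushing, and mode handling):

```python
import os
import tempfile
import zipfile
from collections.abc import MutableMapping


class MiniZipStore(MutableMapping):
    """Sketch of a zip-backed store: chunk keys become zip entries.
    (Hypothetical minimal class, not Zarr's actual ZipStore.)"""

    def __init__(self, path, mode="a"):
        self.zf = zipfile.ZipFile(path, mode=mode)

    def __getitem__(self, key):
        return self.zf.read(key)

    def __setitem__(self, key, value):
        self.zf.writestr(key, value)

    def __delitem__(self, key):
        # Zip archives do not support in-place deletion of entries.
        raise NotImplementedError

    def __iter__(self):
        return iter(self.zf.namelist())

    def __len__(self):
        return len(self.zf.namelist())

    def close(self):
        self.zf.close()


path = os.path.join(tempfile.mkdtemp(), "example.zip")
store = MiniZipStore(path, mode="w")
store["data/0.0"] = b"chunk-bytes"
store.close()

readback = MiniZipStore(path, mode="r")
print(list(readback))  # ['data/0.0']
```

Because any `MutableMapping` of keys to bytes will do, the same array code runs unchanged over a dict, a directory, a zip file, or an object store — which is also why the concurrency guarantees in the bullets above are "depending on the store".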
@@ -588,10 +602,14 @@ output = big * 42 + ...
 o = output.compute()

 # if output is big, compute and write directly to Zarr
-output.to_zarr(@@TODO)
+da.to_zarr(output, store, component='output')
 ```

-See docs for `da.from_array`, `da.from_zarr`, `da.to_zarr`. @@TODO links
+See docs for
+[`da.from_array()`](https://docs.dask.org/en/latest/array-api.html#dask.array.from_array),
+[`da.from_zarr()`](https://docs.dask.org/en/latest/array-api.html#dask.array.from_zarr),
+[`da.to_zarr()`](https://docs.dask.org/en/latest/array-api.html#dask.array.to_zarr),
+[`da.store()`](https://docs.dask.org/en/latest/array-api.html#dask.array.store).

 ===

@@ -619,15 +637,14 @@ See docs for `da.from_array`, `da.from_zarr`, `da.to_zarr`. @@TODO links

 * Zarr does support chunk-level write locks for either multi-thread or
   multi-process writes.
-
 * But generally easier and better to align writes with chunk
   boundaries where possible.

-@@TODO link to docs
+See Zarr tutorial for [further info on synchronisation](https://zarr.readthedocs.io/en/stable/tutorial.html#parallel-computing-and-synchronization).

 ====

-## Compressors
+## Pluggable compressors

 ===

@@ -639,7 +656,7 @@ See docs for `da.from_array`, `da.from_zarr`, `da.to_zarr`. @@TODO links

 ===

-### Available compressors (via numcodecs)
+### Available compressors (via [numcodecs](https://numcodecs.readthedocs.io/en/stable/))

 Blosc, Zstandard, LZ4, Zlib, BZ2, LZMA, ...

@@ -657,18 +674,25 @@ big2 = root.zeros('big2',
                   compressor=compressor)
 ```

-@@TODO check this works
-
 ===

-### Compressor (codec) interface
+### Compressor interface

-<p class="stretch">
-<img src="scipy-2019-files/codec-api.png" style="float: right">
-The numcodecs Codec interface defines the API for filters and compressors for use with Zarr. Built around the Python buffer protocol.
+<table class="stretch">
+<tr>
+<td style="vertical-align: top">
+<p>
+The numcodecs <a href="https://numcodecs.readthedocs.io/en/stable/abc.html">Codec API</a> defines the interface for filters and compressors for use with Zarr.
 </p>
-
-@@TODO link to buffer protocol
+<p>
+Built around the <a href="https://docs.python.org/3/c-api/buffer.html">Python buffer protocol</a>.
+</p>
+</td>
+<td style="vertical-align: top">
+<img src="scipy-2019-files/codec-api.png">
+</td>
+</tr>
+</table>

 ===

@@ -684,7 +708,7 @@ class Zlib(Codec):
         buf = ensure_contiguous_ndarray(buf)

         # do compression
-        return _zlib.compress(buf, self.level)
+        return zlib.compress(buf, self.level)

     def decode(self, buf, out=None):

@@ -694,7 +718,7 @@ class Zlib(Codec):
         out = ensure_contiguous_ndarray(out)

         # do decompression
-        dec = _zlib.decompress(buf)
+        dec = zlib.decompress(buf)

         return ndarray_copy(dec, out)

@@ -710,12 +734,10 @@ class Zlib(Codec):

 ## Other Zarr implementations

-* z5 - C++ implementation using xtensor
-* Zarr.jl - native Julia implementation
-* @@TODO - Scala implementation
-* WIP: Zarr support in NetCDF C library
-
-@@TODO links
+* [z5](https://github.com/constantinpape/z5) - C++ implementation using xtensor
+* [Zarr.jl](https://github.com/meggart/Zarr.jl) - native Julia implementation
+* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala) - Scala implementation
+* WIP: [NetCDF and native cloud storage access via Zarr](https://www.unidata.ucar.edu/blogs/news/entry/netcdf-and-native-cloud-storage)

 ====

@@ -725,44 +747,69 @@ class Zlib(Codec):

 ### Xarray, Intake, Pangeo

-@@TODO
+<img src="scipy-2019-files/xarray.png">
+
+* [xarray.open_zarr()](http://xarray.pydata.org/en/stable/generated/xarray.open_zarr.html#xarray-open-zarr),
+  [xarray.Dataset.to_zarr()](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html#xarray-dataset-to-zarr).
+
+* [Intake
+  project](https://www.anaconda.com/intake-taking-the-pain-out-of-data-access/)
+  for data catalogs has
+  [intake-xarray](https://intake-xarray.readthedocs.io/en/latest/quickstart.html)
+  plugin with Zarr support.
+
+* Used by Pangeo for their [cloud
+  datastore](https://github.com/pangeo-data/pangeo-datastore) ...
+
+```python
+import intake
+cat_url = 'https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml'
+cat = intake.Catalog(cat_url)
+ds = cat.atmosphere.gmet_v1.to_dask()
+```
+
+<small>(Here's the [underlying data catalog entry](https://github.com/pangeo-data/pangeo-datastore/blob/aa3f12bcc3be9584c1a9071235874c9d6af94a4e/intake-catalogs/atmosphere.yaml#L6).)</small>

 ===

-### "High momentum" weather data
+<p class="stretch"><img src="scipy-2019-files/weather.png"></p>

-@@TODO met office work
+<small>https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671</small>

 ===

-### Open microscopy (OME)
+### Microscopy (OME)
+
+<p class="stretch"><img src="scipy-2019-files/microscopy.png"></p>

-@@TODO
+See [OME's position regarding file formats](https://blog.openmicroscopy.org/community/file-formats/2019/06/25/formats/).

 ===

 ### Single cell biology

-@@TODO
+* [Work by Laserson lab](https://github.com/lasersonlab/single-cell-experiments) using Zarr with [ScanPy](https://scanpy.readthedocs.io/en/stable/) and [AnnData](https://icb-anndata.readthedocs-hosted.com/en/stable/index.html) to scale single cell gene expression analyses.
+* The [Human Cell Atlas](https://prod.data.humancellatlas.org/) data portal uses Zarr for [storage of gene expression matrices](https://prod.data.humancellatlas.org/pipelines/hca-pipelines/data-processing-pipelines/file-formats).
+* Use Zarr for image-based transcriptomics ([starfish](https://spacetx-starfish.readthedocs.io/en/latest/))?

 ====

 ## Future

-* Zarr/N5 convergence.
-* Zarr protocol spec v3.
-* Community!
+* Zarr/[N5](https://github.com/saalfeldlab/n5) convergence.
+* [Zarr protocol spec v3](https://zarr-developers.github.io/zarr/specs/2019/06/19/zarr-v3-update.html).
+* [Community!](https://github.com/zarr-developers/community)

 ====

-## Acknowledgments
+## Credits

-* Thanks to the Zarr core development team.
+* [Zarr core development team](https://github.com/orgs/zarr-developers/teams/core-devs/members).

-* Thanks to everyone who has contributed code or raised or commented
-  on an issue or PR.
+* Everyone who has contributed code or raised or commented on an issue
+  or PR, thank you!

-* Thanks to UK MRC and Wellcome Trust for supporting @alimanfoo.
+* UK MRC and Wellcome Trust for supporting @alimanfoo.

 * Zarr is a community-maintained open source project - please think of
   it as yours!