@@ -15,7 +15,13 @@ Alistair Miles ([@alimanfoo](https://github.com/alimanfoo)) - SciPy 2019
1515
1616===
1717
18- @@TODO image of tensor -> compute -> tensor
18+ ### Problem statement
19+
20+ <p class="stretch"><img src="scipy-2019-files/compute1.png"></p>
21+
22+ There is some computation we want to perform.
23+
24+ Inputs and outputs are tensors.
1925
20265 key features...
2127
@@ -26,22 +32,20 @@ Alistair Miles ([@alimanfoo](https://github.com/alimanfoo)) - SciPy 2019
2632Input and/or output tensors are too big to fit comfortably in main
2733memory.
2834
29- @@TODO image of larger than memory
30-
3135===
3236
3337### (2) Computation can be parallelised
3438
39+ <p class="stretch"><img src="scipy-2019-files/compute2.png"></p>
40+
3541Some part of the computation can be parallelised by processing data in
3642chunks.
3743
38- @@TODO image of tensor -> parallel compute -> compute -> parallel compute -> tensor
39-
4044===
4145
4246### E.g., embarrassingly parallel
4347
44- @@TODO image of tensor -> parallel compute -> tensor
48+ <p class="stretch"><img src="scipy-2019-files/compute3.png"></p>
4549
4650===
4751
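A minimal sketch of the chunk-wise pattern described above, using only the Python standard library. The helper names `chunk_slices` and `parallel_sum` are hypothetical and chosen for illustration; a plain list stands in for a tensor, and a sum stands in for the real computation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_slices(n, chunk_size):
    """Yield slices that together cover range(n) in steps of chunk_size."""
    for start in range(0, n, chunk_size):
        yield slice(start, min(start + chunk_size, n))


def parallel_sum(data, chunk_size, max_workers=4):
    """Sum each chunk independently, then combine the partial results."""
    chunks = [data[s] for s in chunk_slices(len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk is processed without reference to any other chunk,
        # which is what makes the problem embarrassingly parallel.
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


data = list(range(10_000))
total = parallel_sum(data, chunk_size=1_000)
```

Threads are used here only to keep the sketch short. For pure-Python, CPU-bound work a process pool or a compiled kernel would usually be needed to see a real speed-up.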
@@ -50,8 +54,6 @@ chunks.
5054Computational complexity is moderate &rarr; significant amount of time is
5155spent in reading and/or writing data.
5256
53- @@TODO image of tensor -> bottleneck -> parallel compute -> bottleneck -> tensor
54-
5557N.B., the bottleneck may be due to (a) limited I/O bandwidth or (b) I/O
5658that is not parallel.
5759
@@ -60,11 +62,8 @@ not parallel.
6062### (4) Data are compressible
6163
6264* Compression is a very active area of innovation.
63-
6465* Modern compressors achieve good compression ratios with high speed.
65-
6666* Opportunity to trade I/O for computation.
67-
6867* Compression can increase effective I/O bandwidth, sometimes
6968 dramatically.
7069
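As a rough illustration of trading computation for I/O, the sketch below compresses a buffer of low-entropy integers and reports the ratio. The data are made up for the example, and `zlib` from the standard library is only a stand-in for whichever compressor is actually used.

```python
import zlib
from array import array

# Hypothetical data: one million C ints (typically 32-bit) drawn from a
# small range of values, so the byte stream is highly repetitive.
values = array("i", (i % 4 for i in range(1_000_000)))
raw = values.tobytes()

# Level 1 favours speed over compression ratio.
compressed = zlib.compress(raw, 1)
ratio = len(raw) / len(compressed)

# Decompress and rebuild the array to confirm the round trip is lossless.
restored = array("i")
restored.frombytes(zlib.decompress(compressed))

print(f"{len(raw)} bytes -> {len(compressed)} bytes (ratio {ratio:.1f})")
```

If fewer bytes have to cross the storage or network link, the effective bandwidth rises by roughly the compression ratio, provided decompression is fast enough not to become the new bottleneck.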
@@ -75,17 +74,17 @@ not parallel.
7574* Rich datasets &rarr; exploratory science &rarr; interactive analysis
7675 &rarr; many rounds of summarise, visualise, hypothesise, model,
7776 test, repeat.
78-
77+
7978* E.g., genome sequencing.
8079
81- * Each genome is a complete molecular blueprint for an organism.
82-
83- * Each genome is a history book handed down through the ages, with
84- each generation making its mark.
85-
8680 * Modern experiments sequence genomes from 1000s of individuals and
8781 compare them.
8882
83+ * Each genome is a complete molecular blueprint for an organism.
84+
85+ * Each genome is a history book handed down from the beginning of
86+ life on Earth, with each generation making its mark.
87+
8988===
9089
9190### Problem: key features
@@ -207,11 +206,11 @@ object stores?
207206### Zarr Python
208207
209208``` bash
210- pip install zarr
209+ $ pip install zarr
211210```
212211
213212``` bash
214- conda install -c conda-forge zarr
213+ $ conda install -c conda-forge zarr
215214```
216215
217216``` python
@@ -231,20 +230,20 @@ conda install -c conda-forge zarr
231230<zarr.hierarchy.Group '/'>
232231```
233232
234- Using DirectoryStore the data will be stored on the local file
235- system.
233+ Using DirectoryStore, the data will be stored in a directory on the
234+ local file system.
236235
237236===
238237
239238### Creating an array
240239
241240``` python
242- >>> x = root.zeros('x',
243- ...                shape=(10000, 10000),
244- ...                chunks=(1000, 1000),
245- ...                dtype='<i4')
246- >>> x
247- <zarr.core.Array '/x' (10000, 10000) int32>
241+ >>> hello = root.zeros('hello',
242+ ...                    shape=(10000, 10000),
243+ ...                    chunks=(1000, 1000),
244+ ...                    dtype='<i4')
245+ >>> hello
246+ <zarr.core.Array '/hello' (10000, 10000) int32>
248247```
249248
250249* Creates a 2-dimensional array of 32-bit integers with 10,000 rows
@@ -259,12 +258,12 @@ and 10,000 columns.
259258### Creating an array (h5py-style API)
260259
261260``` python
262- >>> x = root.create_dataset('x',
263- ...                         shape=(10000, 10000),
264- ...                         chunks=(1000, 1000),
265- ...                         dtype='<i4')
266- >>> x
267- <zarr.core.Array '/x' (10000, 10000) int32>
261+ >>> hello = root.create_dataset('hello',
262+ ...                             shape=(10000, 10000),
263+ ...                             chunks=(1000, 1000),
264+ ...                             dtype='<i4')
265+ >>> hello
266+ <zarr.core.Array '/hello' (10000, 10000) int32>
268267```
269268
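To make the `shape` and `chunks` arguments concrete, here is a small sketch of the arithmetic they imply. It uses plain Python, not the Zarr API. The helpers `chunk_grid` and `chunk_key` are hypothetical, and the dot-separated key format is assumed from the `0.1` and `1.0` file names in the directory listing.

```python
import math

shape = (10000, 10000)
chunks = (1000, 1000)
itemsize = 4  # bytes per element for dtype '<i4'


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunk_key(index, chunks):
    """Dot-separated key of the chunk that holds the element at `index`."""
    return ".".join(str(i // c) for i, c in zip(index, chunks))


grid = chunk_grid(shape, chunks)             # chunks per dimension
n_chunks = math.prod(grid)                   # total number of chunks
chunk_nbytes = math.prod(chunks) * itemsize  # uncompressed bytes per chunk
key = chunk_key((1500, 200), chunks)         # chunk holding element [1500, 200]
```

For this array the grid is 10 by 10, so there are 100 chunks of 4,000,000 uncompressed bytes each, and the element at row 1500, column 200 falls in chunk `1.0`.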
270269===
@@ -365,7 +364,7 @@ example.zarr
365364│ ├── 0.1
366365│ ├── 1.0
367366│ └── .zarray
368- ├── x
367+ ├── hello
369368│ └── .zarray
370369└── .zgroup
371370
@@ -452,7 +451,7 @@ MemoryError
452451
453452===
454453
455- ### DirectoryStore
454+ ### DirectoryStore (reminder)
456455
457456``` bash
458457$ tree -a example.zarr
@@ -462,7 +461,7 @@ example.zarr
462461│ ├── 0.1
463462│ ├── 1.0
464463│ └── .zarray
465- ├── x
464+ ├── hello
466465│ └── .zarray
467466└── .zgroup
468467