---
title: 'Introducing Versioned HDF5'
published: August 21, 2020
author: melissa-mendonca
description: 'HDF5 is an open technology that implements a hierarchical structure (similar to a file-system structure) for storing large amounts of possibly heterogeneous data within a single binary file. Since HDF5 is a binary format, using regular version control tools (such as git) may prove difficult. The Versioned HDF5 library is a versioned abstraction on top of h5py that allows you to keep a record of the changes made to your HDF5 files and to recover previous versions of those files.'
category: [PyData ecosystem]
featuredImage:
  src: /posts/introducing-versioned-hdf5/feature.png
  alt: 'Diagram illustrating the hierarchical nature of an HDF5 file. An HDF container is shown that contains two groups. Each of these groups then contains datasets and/or subgroups. There is associated metadata for both the top-level container as well as each group and dataset.'
hero:
  imageSrc: /posts/introducing-versioned-hdf5/blog_hero_var2.svg
  imageAlt: 'An illustration of a dark brown hand holding up a microphone, with some graphical elements highlighting the top of the microphone.'
---

The problem of storing and manipulating large amounts of data is a challenge in
many scientific computing and industry applications. One of the standard data
models for this is [HDF5](https://support.hdfgroup.org/HDF5/whatishdf5.html),
an open technology that implements a hierarchical structure (similar to a
file-system structure) for storing large amounts of possibly heterogeneous data
within a single file. Data in an HDF5 file is organized into *groups* and
*datasets*; you can think about these as the folders and files in your local
file system, respectively. You can also optionally store metadata associated
with each item in a file, which makes this a self-describing and powerful data
storage model.

![Diagram illustrating the hierarchical nature of an HDF5 file. An HDF container is shown that contains two groups. Each of these groups then contains datasets and/or subgroups. There is associated metadata for both the top-level container as well as each group and dataset.](/posts/introducing-versioned-hdf5/hdf5_structure4_resized.png)

*Image: Hierarchical Data Format (HDF5) Dataset (From https://www.neonscience.org/about-hdf5)*

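To make the structure concrete, here is a small h5py sketch (not part of the original post; the file name, group, dataset, and attributes are made up for illustration) that creates a group, a dataset inside it, and some metadata:

```py
>>> import h5py
>>> import numpy as np
>>> with h5py.File('example.h5', 'w') as f:
...     group = f.create_group('experiment1')                          # a group, like a folder
...     dset = group.create_dataset('readings', data=np.arange(100))   # a dataset, like a file
...     dset.attrs['units'] = 'volts'                                   # metadata on the dataset
...     group.attrs['date'] = '2020-08-21'                              # metadata on the group
```
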
Since reading and writing operations on these large data files must be fast,
the HDF5 model includes data compression and *chunking*. Chunking allows the
data to be retrieved in subsets that fit into the computer's memory, which
means that the entire file contents don't have to be loaded into memory at
once. All this makes HDF5 a popular format in several domains, and with
[h5py](https://www.h5py.org) it is possible to use a Pythonic interface to
read and write data to an HDF5 file.

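As a hypothetical illustration (the dataset name, chunk size, and compression settings below are arbitrary, not from the original post), chunking and compression are requested when a dataset is created with h5py:

```py
>>> import h5py
>>> import numpy as np
>>> with h5py.File('chunked_example.h5', 'w') as f:
...     dset = f.create_dataset('bigdata', data=np.zeros(1_000_000),
...                             chunks=(1000,), compression='gzip')
...     subset = dset[:1000]   # only the chunks covering this slice are read
```
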
Now, let's say you have an HDF5 file with contents that change over time. You
may want to add or remove datasets, change the contents of the data or the
metadata, and keep a record of which changes occurred when, with a way to
recover previous versions of this file. Since HDF5 is a binary file format,
using regular version control tools (such as git) may prove difficult.

Introducing the Versioned HDF5 library
--------------------------------------

The Versioned HDF5 library is a versioned abstraction on top of h5py. Because
of the flexibility of the HDF5 data model, all versioning data is stored in the
file itself, which means that different versions of the same data (including
version metadata) can be stored in a single HDF5 file.

To see how this works in practice, let's say we create a regular HDF5 file with
h5py called `mydata.h5`.

```py
>>> import h5py
>>> fileobject = h5py.File('mydata.h5', 'w')
```

Now, you can create a `VersionedHDF5File` object:

```py
>>> from versioned_hdf5 import VersionedHDF5File
>>> versioned_file = VersionedHDF5File(fileobject)
```

This file still doesn't have any data or versions stored in it. To create a new
version, you can use a context manager:

```py
>>> import numpy as np
>>> with versioned_file.stage_version('version1') as group:
...     group['mydataset'] = np.ones(10000)
```

The context manager returns an h5py group object, which should be modified
in place to build the new version. When the context manager exits, the version
will be written to the file. From this moment on, any interaction with the
versioned groups and datasets should be done via the Versioned HDF5 API, rather
than h5py.

Now, the `versioned_file` object can be used to expose versioned data by version name:

```py
>>> v1 = versioned_file['version1']
>>> v1
<Committed InMemoryGroup "/_version_data/versions/version1"/>
>>> v1['mydataset']
<InMemoryArrayDataset "mydataset": shape (10000,), type "<f8"/>
```

To access the actual data stored in version `version1`, we use the same syntax
as h5py:

```py
>>> dataset = v1['mydataset']
>>> dataset[()]
array([1., 1., 1., ..., 1., 1., 1.])
```

Suppose now we want to commit a new version, changing just a slice of the data.
We can do this as follows:

```py
>>> with versioned_file.stage_version('version2') as group:
...     group['mydataset'][0] = -10
```

Both versions are now stored in the file, and can be accessed independently.

```py
>>> v2 = versioned_file['version2']
>>> v1['mydataset'][()]
array([1., 1., 1., ..., 1., 1., 1.])
>>> v2['mydataset'][()]
array([-10., 1., 1., ..., 1., 1., 1.])
```

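Because all of the versioning data lives in the file itself, you can close the underlying h5py file as usual, reopen it later, and wrap it again to get every stored version back. This round trip isn't shown in the original walkthrough; the sketch below just reuses the standard h5py open modes:

```py
>>> fileobject.close()
>>> fileobject = h5py.File('mydata.h5', 'r+')         # reopen the same file
>>> versioned_file = VersionedHDF5File(fileobject)    # wrap it again
>>> versioned_file['version2']['mydataset'][()]
array([-10., 1., 1., ..., 1., 1., 1.])
```
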
Current status
--------------

`versioned-hdf5 1.0` has recently been released, and is available on PyPI and conda-forge. You can install it with:

```sh
conda install -c conda-forge versioned-hdf5
```

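Since the package is also published on PyPI, a pip install should work as well (assuming a standard Python environment; this command isn't shown in the original post):

```sh
pip install versioned-hdf5
```
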
Development happens on [GitHub](https://github.com/deshaw/versioned-hdf5).
Currently, the library supports basic use cases, but there is still a lot to
do. We welcome community contributions to the library, including issues and
feature requests.

For now, you can check out the
[documentation](https://deshaw.github.io/versioned-hdf5/) for more details on
what is supported and how the library is built.

Next steps
----------

This is the first post in a series about the Versioned HDF5 library. Next,
we'll discuss the performance of Versioned HDF5 files, and the design of the
library.

The Versioned HDF5 library was created by the [D. E. Shaw
group](https://www.deshaw.com/) in conjunction with
[Quansight](https://www.quansight.com/).

![D.E. Shaw logo](/posts/introducing-versioned-hdf5/black_logo_417x125.png)