Commit f6614b0

document copy functions
1 parent 82160a6 commit f6614b0

4 files changed

+177
-26
lines changed

docs/release.rst

Lines changed: 7 additions & 0 deletions
@@ -127,6 +127,13 @@ Enhancements
127127
* **Added support for ``datetime64`` and ``timedelta64`` data types**;
128128
:issue:`85`, :issue:`215`.
129129

130+
* **New copy functions**. The new functions :func:`zarr.convenience.copy` and
131+
:func:`zarr.convenience.copy_all` provide a way to copy groups and/or arrays
132+
between HDF5 and Zarr, or between two Zarr groups. The
133+
:func:`zarr.convenience.copy_store` function provides a more efficient way to copy
134+
data directly between two Zarr stores. :issue:`87`, :issue:`113`,
135+
:issue:`137`, :issue:`217`.
136+
130137
Bug fixes
131138
~~~~~~~~~
132139

docs/tutorial.rst

Lines changed: 104 additions & 0 deletions
@@ -768,6 +768,110 @@ Here is an example using S3Map to read an array created previously::
768768
b'Hello from the cloud!'
769769

770770

771+
.. _tutorial_copy:
772+
773+
Copying/migrating data
774+
----------------------
775+
776+
If you have some data in an HDF5 file and would like to copy some or all of it
777+
into a Zarr group, or vice-versa, the :func:`zarr.convenience.copy` and
778+
:func:`zarr.convenience.copy_all` functions can be used. Here's an example
779+
copying a group named 'foo' from an HDF5 file to a Zarr group::
780+
781+
>>> import h5py
782+
>>> import zarr
783+
>>> import numpy as np
784+
>>> source = h5py.File('data/example.h5', mode='w')
785+
>>> foo = source.create_group('foo')
786+
>>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
787+
>>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
788+
>>> zarr.tree(source)
789+
/
790+
├── foo
791+
│ └── bar
792+
│ └── baz (100,) int64
793+
└── spam (100,) int64
794+
>>> dest = zarr.open_group('data/example.zarr', mode='w')
795+
>>> from sys import stdout
796+
>>> zarr.copy(source['foo'], dest, log=stdout)
797+
copy /foo
798+
copy /foo/bar
799+
copy /foo/bar/baz (100,) int64
800+
all done: 3 copied, 0 skipped, 800 bytes copied
801+
(3, 0, 800)
802+
>>> dest.tree() # N.B., no spam
803+
/
804+
└── foo
805+
└── bar
806+
└── baz (100,) int64
807+
>>> source.close()
808+
809+
If rather than copying a single group or dataset you would like to copy all
810+
groups and datasets, use :func:`zarr.convenience.copy_all`, e.g.::
811+
812+
>>> source = h5py.File('data/example.h5', mode='r')
813+
>>> dest = zarr.open_group('data/example2.zarr', mode='w')
814+
>>> zarr.copy_all(source, dest, log=stdout)
815+
copy /foo
816+
copy /foo/bar
817+
copy /foo/bar/baz (100,) int64
818+
copy /spam (100,) int64
819+
all done: 4 copied, 0 skipped, 1,600 bytes copied
820+
(4, 0, 1600)
821+
>>> dest.tree()
822+
/
823+
├── foo
824+
│ └── bar
825+
│ └── baz (100,) int64
826+
└── spam (100,) int64
827+
828+
If you need to copy data between two Zarr groups, the
829+
:func:`zarr.convenience.copy` and :func:`zarr.convenience.copy_all` functions can
830+
be used and provide the most flexibility. However, if you want to copy data
831+
in the most efficient way possible, without changing any configuration options,
832+
the :func:`zarr.convenience.copy_store` function can be used. This function
833+
copies data directly between the underlying stores, without any decompression or
834+
re-compression, and so should be faster. E.g.::
835+
836+
>>> import zarr
837+
>>> import numpy as np
838+
>>> store1 = zarr.DirectoryStore('data/example.zarr')
839+
>>> root = zarr.group(store1, overwrite=True)
840+
>>> baz = root.create_dataset('foo/bar/baz', data=np.arange(100), chunks=(50,))
841+
>>> spam = root.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
842+
>>> root.tree()
843+
/
844+
├── foo
845+
│ └── bar
846+
│ └── baz (100,) int64
847+
└── spam (100,) int64
848+
>>> from sys import stdout
849+
>>> store2 = zarr.ZipStore('data/example.zip', mode='w')
850+
>>> zarr.copy_store(store1, store2, log=stdout)
851+
copy .zgroup
852+
copy foo/.zgroup
853+
copy foo/bar/.zgroup
854+
copy foo/bar/baz/.zarray
855+
copy foo/bar/baz/0
856+
copy foo/bar/baz/1
857+
copy spam/.zarray
858+
copy spam/0
859+
copy spam/1
860+
copy spam/2
861+
copy spam/3
862+
all done: 11 copied, 0 skipped, 1,138 bytes copied
863+
(11, 0, 1138)
864+
>>> new_root = zarr.group(store2)
865+
>>> new_root.tree()
866+
/
867+
├── foo
868+
│ └── bar
869+
│ └── baz (100,) int64
870+
└── spam (100,) int64
871+
>>> new_root['foo/bar/baz'][:]
872+
array([ 0, 1, 2, ..., 97, 98, 99])
873+
>>> store2.close() # zip stores need to be closed
874+
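The store-level copy described above can be illustrated with plain dictionaries standing in for stores. This is only a sketch of the idea, not the code in ``zarr.convenience.copy_store``: the name ``copy_store_sketch`` and the sample keys and values are hypothetical, and real stores hold compressed chunk bytes rather than the toy values used here.

```python
from collections.abc import MutableMapping


def copy_store_sketch(source: MutableMapping, dest: MutableMapping, log=None):
    """Copy every key from source to dest without decoding the values.

    Hypothetical helper for illustration; returns a tuple shaped like the
    (n_copied, n_skipped, n_bytes_copied) results shown in the examples.
    """
    n_copied = 0
    n_bytes_copied = 0
    for key in sorted(source.keys()):
        data = source[key]  # raw bytes, passed through untouched
        dest[key] = data
        n_copied += 1
        n_bytes_copied += len(data)
        if log is not None:
            print(f"copy {key}", file=log)
    # Nothing is ever skipped in this simplified version.
    return n_copied, 0, n_bytes_copied


# Toy "stores": metadata documents and one chunk, all as bytes.
source = {
    ".zgroup": b'{"zarr_format": 2}',
    "spam/.zarray": b"{}",
    "spam/0": b"\x00\x01\x02\x03",
}
dest = {}
result = copy_store_sketch(source, dest)
```

Because values are moved as opaque bytes, the destination ends up with the same chunk layout and compression settings as the source, which is why this route cannot change any configuration options.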
771875
.. _tutorial_strings:
772876

773877
String arrays

zarr/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -13,4 +13,5 @@
1313
from zarr.codecs import *
1414
from zarr.convenience import (open, save, save_array, save_group, load, copy_store,
1515
copy, copy_all, tree)
16+
from zarr.errors import CopyError, MetadataError, PermissionError
1617
from zarr.version import version as __version__

zarr/convenience.py

Lines changed: 65 additions & 26 deletions
@@ -684,32 +684,70 @@ def copy(source, dest, name=None, shallow=False, without_attrs=False, log=None,
684684
685685
Examples
686686
--------
687-
>>> import h5py
688-
>>> import zarr
689-
>>> import numpy as np
690-
>>> source = h5py.File('data/example.h5', mode='w')
691-
>>> foo = source.create_group('foo')
692-
>>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
693-
>>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
694-
>>> zarr.tree(source)
695-
/
696-
├── foo
697-
│ └── bar
698-
│ └── baz (100,) int64
699-
└── spam (100,) int64
700-
>>> dest = zarr.group()
701-
>>> from sys import stdout
702-
>>> zarr.copy(source['foo'], dest, log=stdout)
703-
copy /foo
704-
copy /foo/bar
705-
copy /foo/bar/baz (100,) int64
706-
all done: 3 copied, 0 skipped, 800 bytes copied
707-
(3, 0, 800)
708-
>>> dest.tree() # N.B., no spam
709-
/
710-
└── foo
711-
└── bar
712-
└── baz (100,) int64
687+
Here's an example of copying a group named 'foo' from an HDF5 file to a
688+
Zarr group::
689+
690+
>>> import h5py
691+
>>> import zarr
692+
>>> import numpy as np
693+
>>> source = h5py.File('data/example.h5', mode='w')
694+
>>> foo = source.create_group('foo')
695+
>>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
696+
>>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
697+
>>> zarr.tree(source)
698+
/
699+
├── foo
700+
│ └── bar
701+
│ └── baz (100,) int64
702+
└── spam (100,) int64
703+
>>> dest = zarr.group()
704+
>>> from sys import stdout
705+
>>> zarr.copy(source['foo'], dest, log=stdout)
706+
copy /foo
707+
copy /foo/bar
708+
copy /foo/bar/baz (100,) int64
709+
all done: 3 copied, 0 skipped, 800 bytes copied
710+
(3, 0, 800)
711+
>>> dest.tree() # N.B., no spam
712+
/
713+
└── foo
714+
└── bar
715+
└── baz (100,) int64
716+
>>> source.close()
717+
718+
The ``if_exists`` parameter provides options for how to handle pre-existing data in
719+
the destination. Here are some examples of these options, also using
720+
``dry_run=True`` to find out what would happen without actually copying anything::
721+
722+
>>> source = zarr.group()
723+
>>> dest = zarr.group()
724+
>>> baz = source.create_dataset('foo/bar/baz', data=np.arange(100))
725+
>>> spam = source.create_dataset('foo/spam', data=np.arange(1000))
726+
>>> existing_spam = dest.create_dataset('foo/spam', data=np.arange(1000))
727+
>>> from sys import stdout
728+
>>> try:
729+
... zarr.copy(source['foo'], dest, log=stdout, dry_run=True)
730+
... except zarr.CopyError as e:
731+
... print(e)
732+
...
733+
copy /foo
734+
copy /foo/bar
735+
copy /foo/bar/baz (100,) int64
736+
an object 'spam' already exists in destination '/foo'
737+
>>> zarr.copy(source['foo'], dest, log=stdout, if_exists='replace', dry_run=True)
738+
copy /foo
739+
copy /foo/bar
740+
copy /foo/bar/baz (100,) int64
741+
copy /foo/spam (1000,) int64
742+
dry run: 4 copied, 0 skipped
743+
(4, 0, 0)
744+
>>> zarr.copy(source['foo'], dest, log=stdout, if_exists='skip', dry_run=True)
745+
copy /foo
746+
copy /foo/bar
747+
copy /foo/bar/baz (100,) int64
748+
skip /foo/spam (1000,) int64
749+
dry run: 3 copied, 1 skipped
750+
(3, 1, 0)
713751
714752
"""
715753

@@ -978,6 +1016,7 @@ def copy_all(source, dest, shallow=False, without_attrs=False, log=None,
9781016
│ └── bar
9791017
│ └── baz (100,) int64
9801018
└── spam (100,) int64
1019+
>>> source.close()
9811020
9821021
"""
9831022
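The ``if_exists`` and ``dry_run`` behaviour shown in the ``copy`` docstring can be modelled with nested dictionaries, where a dict stands for a group and a bytes value stands for an array. This is a rough sketch under those assumptions, not zarr's implementation: ``copy_tree_sketch`` and ``CopyErrorSketch`` are hypothetical names, and the choice to merge into existing groups rather than raise is inferred from the log output in the docstring example.

```python
class CopyErrorSketch(RuntimeError):
    """Hypothetical stand-in for zarr.CopyError."""


def copy_tree_sketch(source, dest, if_exists="raise", dry_run=False):
    """Recursively copy a nested dict tree from source into dest.

    Returns (n_copied, n_skipped, n_bytes_copied).
    """
    if if_exists not in ("raise", "replace", "skip"):
        raise ValueError(f"unknown if_exists option: {if_exists!r}")
    n_copied = n_skipped = n_bytes = 0
    for name, obj in source.items():
        if isinstance(obj, dict):
            # Groups: reuse an existing group, otherwise create one.
            child = dest.get(name)
            if not isinstance(child, dict):
                child = {}
                if not dry_run:
                    dest[name] = child
            n_copied += 1
            c, s, b = copy_tree_sketch(obj, child, if_exists, dry_run)
            n_copied, n_skipped, n_bytes = n_copied + c, n_skipped + s, n_bytes + b
        else:
            # Arrays: apply the if_exists policy on a name clash.
            if name in dest:
                if if_exists == "raise":
                    raise CopyErrorSketch(f"an object {name!r} already exists")
                if if_exists == "skip":
                    n_skipped += 1
                    continue
            if not dry_run:
                dest[name] = obj
                n_bytes += len(obj)
            n_copied += 1
    return n_copied, n_skipped, n_bytes


source = {"foo": {"bar": {"baz": b"x" * 800}, "spam": b"y" * 8000}}
dest = {"foo": {"spam": b"old"}}
replaced = copy_tree_sketch(source, dest, if_exists="replace", dry_run=True)
skipped = copy_tree_sketch(source, dest, if_exists="skip", dry_run=True)
```

With a dry run the destination is left untouched and the byte count stays at zero, so the two calls report only what would be copied or skipped.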
