
Commit 877f521

Merge branch 'main' into api-doc-struct
2 parents d6866af + 617e2cd commit 877f521

13 files changed, +277 -92 lines changed

docs/user-guide/arrays.rst

Lines changed: 35 additions & 2 deletions
@@ -574,8 +574,41 @@ Any combination of integer and slice can be used for block indexing::
574574
Sharding
575575
--------
576576

577-
Coming soon.
578-
577+
Using small chunk shapes in very large arrays can lead to a very large number of chunks.
578+
This can become a performance issue for file systems and object storage.
579+
With Zarr format 3, a new sharding feature has been added to address this issue.
580+
581+
With sharding, multiple chunks can be stored in a single storage object (e.g. a file).
582+
Within a shard, chunks are compressed and serialized separately.
583+
This allows individual chunks to be read independently.
584+
However, when writing data, a full shard must be written in one go for optimal
585+
performance and to avoid concurrency issues.
586+
That means that shards are the units of writing and chunks are the units of reading.
587+
Users need to configure the chunk and shard shapes accordingly.
588+
589+
Sharded arrays can be created by providing the ``shards`` parameter to :func:`zarr.create_array`.
590+
591+
>>> a = zarr.create_array('data/example-20.zarr', shape=(10000, 10000), shards=(1000, 1000), chunks=(100, 100), dtype='uint8')
592+
>>> a[:] = (np.arange(10000 * 10000) % 256).astype('uint8').reshape(10000, 10000)
593+
>>> a.info_complete()
594+
Type : Array
595+
Zarr format : 3
596+
Data type : DataType.uint8
597+
Shape : (10000, 10000)
598+
Shard shape : (1000, 1000)
599+
Chunk shape : (100, 100)
600+
Order : C
601+
Read-only : False
602+
Store type : LocalStore
603+
Codecs : [{'chunk_shape': (100, 100), 'codecs': ({'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}), 'index_codecs': ({'endian': <Endian.little: 'little'>}, {}), 'index_location': <ShardingCodecIndexLocation.end: 'end'>}]
604+
No. bytes : 100000000 (95.4M)
605+
No. bytes stored : 3981060
606+
Storage ratio : 25.1
607+
Chunks Initialized : 100
608+
609+
In this example, a shard shape of (1000, 1000) and a chunk shape of (100, 100) are used.
610+
This means that 10*10 chunks are stored in each shard, and there are 10*10 shards in total.
611+
Without the ``shards`` argument, there would be 10,000 chunks stored as individual files.
579612

580613
Missing features in 3.0
581614
-----------------------
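
The counts quoted in the new arrays.rst text (10*10 chunks per shard, 10*10 shards, 10,000 chunks without sharding) follow from dividing the shapes axis by axis. A minimal sketch of that arithmetic in plain Python; `shard_layout` is a hypothetical helper written for this note, not a zarr function, and it assumes every shard dimension is a whole multiple of the matching chunk dimension.

```python
import math


def shard_layout(shape, shards, chunks):
    """Count chunks per shard, shards, and chunks for a regular grid.

    Hypothetical helper for illustration only; not part of zarr.
    Assumes each shard dimension is a whole multiple of the chunk dimension.
    """
    for s, c in zip(shards, chunks):
        if s % c != 0:
            raise ValueError("shard shape must be a multiple of chunk shape")
    chunks_per_shard = math.prod(s // c for s, c in zip(shards, chunks))
    # Shards or chunks at the array edge may be partial, so round up per axis.
    n_shards = math.prod(math.ceil(n / s) for n, s in zip(shape, shards))
    n_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))
    return chunks_per_shard, n_shards, n_chunks


layout = shard_layout((10000, 10000), (1000, 1000), (100, 100))
print(layout)  # (100, 100, 10000)
```

The three numbers are chunks per shard, number of shards (storage objects with sharding), and number of chunks (storage objects without it).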

docs/user-guide/performance.rst

Lines changed: 39 additions & 0 deletions
@@ -62,6 +62,45 @@ will be one single chunk for the array::
6262
>>> z5.chunks
6363
(10000, 10000)
6464

65+
66+
Sharding
67+
~~~~~~~~
68+
69+
If you have large arrays but need small chunks to efficiently access the data, you can
70+
use sharding. Sharding provides a mechanism to store multiple chunks in a single
71+
storage object or file. This can be useful because traditional file systems and object
72+
storage systems may have performance issues storing and accessing many files.
73+
Additionally, small files can be inefficient to store if they are smaller than the
74+
block size of the file system.
75+
76+
Picking a good combination of chunk shape and shard shape is important for performance.
77+
The chunk shape determines what unit of your data can be read independently, while the
78+
shard shape determines what unit of your data can be written efficiently.
79+
80+
For example, suppose you have a 100 GB array and need to read small chunks of 1 MB.
81+
Without sharding, each chunk would be one file resulting in 100,000 files. That can
82+
already cause performance issues on some file systems.
83+
With sharding, you could use a shard size of 1 GB. This would result in 1000 chunks per
84+
file and 100 files in total, which seems manageable for most storage systems.
85+
You would still be able to read each 1 MB chunk independently, but you would need to
86+
write your data in 1 GB increments.
87+
88+
To use sharding, you need to specify the ``shards`` parameter when creating the array.
89+
90+
>>> z6 = zarr.create_array(store={}, shape=(10000, 10000, 1000), shards=(1000, 1000, 1000), chunks=(100, 100, 100), dtype='uint8')
91+
>>> z6.info
92+
Type : Array
93+
Zarr format : 3
94+
Data type : DataType.uint8
95+
Shape : (10000, 10000, 1000)
96+
Shard shape : (1000, 1000, 1000)
97+
Chunk shape : (100, 100, 100)
98+
Order : C
99+
Read-only : False
100+
Store type : MemoryStore
101+
Codecs : [{'chunk_shape': (100, 100, 100), 'codecs': ({'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}), 'index_codecs': ({'endian': <Endian.little: 'little'>}, {}), 'index_location': <ShardingCodecIndexLocation.end: 'end'>}]
102+
No. bytes : 100000000000 (93.1G)
103+
65104
.. _user-guide-chunks-order:
66105

67106
Chunk memory layout
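
The `z6` array in the performance.rst hunk matches the 100 GB / 1 GB / 1 MB scenario described in the prose. A rough check of the uncompressed sizes, assuming decimal units (1 GB = 10**9 bytes) and one byte per element for `uint8`; the variable names are chosen for this note only.

```python
import math

shape = (10000, 10000, 1000)
shards = (1000, 1000, 1000)
chunks = (100, 100, 100)
itemsize = 1  # bytes per element for dtype='uint8'

# Uncompressed sizes in bytes; compression will shrink what is actually stored.
array_bytes = math.prod(shape) * itemsize
shard_bytes = math.prod(shards) * itemsize
chunk_bytes = math.prod(chunks) * itemsize

print(array_bytes)                 # 100000000000, i.e. 100 GB
print(shard_bytes)                 # 1000000000, i.e. 1 GB
print(chunk_bytes)                 # 1000000, i.e. 1 MB
print(shard_bytes // chunk_bytes)  # 1000 chunks per shard
print(array_bytes // shard_bytes)  # 100 shards
print(array_bytes // chunk_bytes)  # 100000 chunks without sharding
```

The `93.1G` shown in the `info` output is the same 100000000000 bytes expressed in binary units.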

docs/user-guide/v3_migration.rst

Lines changed: 3 additions & 3 deletions
@@ -156,11 +156,11 @@ Dependencies
156156
When installing using ``pip``:
157157

158158
- The new ``remote`` dependency group can be used to install a supported version of
159-
``fsspec``, required for remote data access.
159+
``fsspec``, required for remote data access.
160160
- The new ``gpu`` dependency group can be used to install a supported version of
161-
``cuda``, required for GPU functionality.
161+
``cuda``, required for GPU functionality.
162162
- The ``jupyter`` optional dependency group has been removed, since v3 contains no
163-
jupyter specific functionality.
163+
jupyter specific functionality.
164164

165165
Miscellaneous
166166
~~~~~~~~~~~~~

pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -397,7 +397,8 @@ ignore = [
397397
checks = [
398398
"GL06",
399399
"GL07",
400-
"GL09",
400+
# Currently broken; see https://github.com/numpy/numpydoc/issues/573
401+
# "GL09",
401402
"GL10",
402403
"SS02",
403404
"SS04",

src/zarr/api/asynchronous.py

Lines changed: 4 additions & 4 deletions
@@ -508,6 +508,10 @@ async def save_group(
508508
async def tree(grp: AsyncGroup, expand: bool | None = None, level: int | None = None) -> Any:
509509
"""Provide a rich display of the hierarchy.
510510
511+
.. deprecated:: 3.0.0
512+
`zarr.tree()` is deprecated and will be removed in a future release.
513+
Use `group.tree()` instead.
514+
511515
Parameters
512516
----------
513517
grp : Group
@@ -521,10 +525,6 @@ async def tree(grp: AsyncGroup, expand: bool | None = None, level: int | None =
521525
-------
522526
TreeRepr
523527
A pretty-printable object displaying the hierarchy.
524-
525-
.. deprecated:: 3.0.0
526-
`zarr.tree()` is deprecated and will be removed in a future release.
527-
Use `group.tree()` instead.
528528
"""
529529
return await grp.tree(expand=expand, level=level)
530530

src/zarr/api/synchronous.py

Lines changed: 4 additions & 4 deletions
@@ -334,6 +334,10 @@ def save_group(
334334
def tree(grp: Group, expand: bool | None = None, level: int | None = None) -> Any:
335335
"""Provide a rich display of the hierarchy.
336336
337+
.. deprecated:: 3.0.0
338+
`zarr.tree()` is deprecated and will be removed in a future release.
339+
Use `group.tree()` instead.
340+
337341
Parameters
338342
----------
339343
grp : Group
@@ -347,10 +351,6 @@ def tree(grp: Group, expand: bool | None = None, level: int | None = None) -> An
347351
-------
348352
TreeRepr
349353
A pretty-printable object displaying the hierarchy.
350-
351-
.. deprecated:: 3.0.0
352-
`zarr.tree()` is deprecated and will be removed in a future release.
353-
Use `group.tree()` instead.
354354
"""
355355
return sync(async_api.tree(grp._async_group, expand=expand, level=level))
356356

src/zarr/core/_info.py

Lines changed: 8 additions & 1 deletion
@@ -80,6 +80,7 @@ class ArrayInfo:
8080
_zarr_format: ZarrFormat
8181
_data_type: np.dtype[Any] | DataType
8282
_shape: tuple[int, ...]
83+
_shard_shape: tuple[int, ...] | None = None
8384
_chunk_shape: tuple[int, ...] | None = None
8485
_order: Literal["C", "F"]
8586
_read_only: bool
@@ -96,7 +97,13 @@ def __repr__(self) -> str:
9697
Type : {_type}
9798
Zarr format : {_zarr_format}
9899
Data type : {_data_type}
99-
Shape : {_shape}
100+
Shape : {_shape}""")
101+
102+
if self._shard_shape is not None:
103+
template += textwrap.dedent("""
104+
Shard shape : {_shard_shape}""")
105+
106+
template += textwrap.dedent("""
100107
Chunk shape : {_chunk_shape}
101108
Order : {_order}
102109
Read-only : {_read_only}
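
The `__repr__` change above builds the template in pieces so that the "Shard shape" row appears only for sharded arrays. A simplified sketch of that pattern, using only `textwrap` from the standard library; `render_info` and its reduced field set are assumptions made for illustration and do not reproduce the real `ArrayInfo` class.

```python
import textwrap


def render_info(shape, chunk_shape, shard_shape=None):
    """Build an info string, adding the shard row only when sharding is used.

    Simplified stand-in for illustration; not the actual ArrayInfo code.
    """
    template = textwrap.dedent("""\
        Shape       : {shape}""")

    if shard_shape is not None:
        # The leading newline inside the literal starts a new output row.
        template += textwrap.dedent("""
            Shard shape : {shard_shape}""")

    template += textwrap.dedent("""
        Chunk shape : {chunk_shape}""")

    return template.format(
        shape=shape, shard_shape=shard_shape, chunk_shape=chunk_shape
    )


print(render_info((10000, 10000), (100, 100)))
print(render_info((10000, 10000), (100, 100), shard_shape=(1000, 1000)))
```

The first call prints two rows and the second prints three, with the shard row between shape and chunk shape, which is the ordering shown in the documentation examples.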

src/zarr/core/array.py

Lines changed: 8 additions & 14 deletions
@@ -432,6 +432,9 @@ async def create(
432432
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
433433
"""Method to create a new asynchronous array instance.
434434
435+
.. deprecated:: 3.0.0
436+
Deprecated in favor of :func:`zarr.api.asynchronous.create_array`.
437+
435438
Parameters
436439
----------
437440
store : StoreLike
@@ -509,9 +512,6 @@ async def create(
509512
-------
510513
AsyncArray
511514
The created asynchronous array instance.
512-
513-
.. deprecated:: 3.0.0
514-
Deprecated in favor of :func:`zarr.api.asynchronous.create_array`.
515515
"""
516516
return await cls._create(
517517
store,
@@ -1573,14 +1573,8 @@ def _info(
15731573
else:
15741574
kwargs["_codecs"] = self.metadata.codecs
15751575
kwargs["_data_type"] = self.metadata.data_type
1576-
# just regular?
1577-
chunk_grid = self.metadata.chunk_grid
1578-
if isinstance(chunk_grid, RegularChunkGrid):
1579-
kwargs["_chunk_shape"] = chunk_grid.chunk_shape
1580-
else:
1581-
raise NotImplementedError(
1582-
"'info' is not yet implemented for chunk grids of type {type(self.metadata.chunk_grid)}"
1583-
)
1576+
kwargs["_chunk_shape"] = self.chunks
1577+
kwargs["_shard_shape"] = self.shards
15841578

15851579
return ArrayInfo(
15861580
_zarr_format=self.metadata.zarr_format,
@@ -1637,6 +1631,9 @@ def create(
16371631
) -> Array:
16381632
"""Creates a new Array instance from an initialized store.
16391633
1634+
.. deprecated:: 3.0.0
1635+
Deprecated in favor of :func:`zarr.create_array`.
1636+
16401637
Parameters
16411638
----------
16421639
store : StoreLike
@@ -1704,9 +1701,6 @@ def create(
17041701
-------
17051702
Array
17061703
Array created from the store.
1707-
1708-
.. deprecated:: 3.0.0
1709-
Deprecated in favor of :func:`zarr.create_array`.
17101704
"""
17111705
return cls._create(
17121706
store,

src/zarr/core/codec_pipeline.py

Lines changed: 15 additions & 6 deletions
@@ -332,12 +332,21 @@ async def write_batch(
332332
drop_axes: tuple[int, ...] = (),
333333
) -> None:
334334
if self.supports_partial_encode:
335-
await self.encode_partial_batch(
336-
[
337-
(byte_setter, value[out_selection], chunk_selection, chunk_spec)
338-
for byte_setter, chunk_spec, chunk_selection, out_selection in batch_info
339-
],
340-
)
335+
# Pass scalar values as is
336+
if len(value.shape) == 0:
337+
await self.encode_partial_batch(
338+
[
339+
(byte_setter, value, chunk_selection, chunk_spec)
340+
for byte_setter, chunk_spec, chunk_selection, out_selection in batch_info
341+
],
342+
)
343+
else:
344+
await self.encode_partial_batch(
345+
[
346+
(byte_setter, value[out_selection], chunk_selection, chunk_spec)
347+
for byte_setter, chunk_spec, chunk_selection, out_selection in batch_info
348+
],
349+
)
341350

342351
else:
343352
# Read existing bytes if not total slice
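
The new branch in `write_batch` passes 0-dimensional values through instead of indexing them with `out_selection`. The sketch below shows why such a branch is needed, with a NumPy array standing in for zarr's buffer type (an assumption; the real code operates on zarr's own buffer objects). `select_value` is a hypothetical helper, not a zarr function.

```python
import numpy as np


def select_value(value, out_selection):
    """Return the piece of `value` destined for one chunk.

    Hypothetical helper mirroring the added branch: 0-dimensional values
    are passed through unchanged, everything else is indexed.
    """
    if len(value.shape) == 0:
        return value
    return value[out_selection]


out_selection = (slice(0, 2), slice(0, 2))

scalar = np.array(7, dtype="uint8")
print(select_value(scalar, out_selection).shape)  # ()

block = np.arange(16, dtype="uint8").reshape(4, 4)
print(select_value(block, out_selection).shape)   # (2, 2)

# Indexing a 0-d NumPy array with a 2-d selection fails, which is the
# situation the scalar branch avoids.
try:
    scalar[out_selection]
except IndexError as err:
    print("IndexError:", err)
```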

src/zarr/core/group.py

Lines changed: 26 additions & 13 deletions
@@ -1148,6 +1148,9 @@ async def create_dataset(
11481148
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
11491149
"""Create an array.
11501150
1151+
.. deprecated:: 3.0.0
1152+
The h5py compatibility methods will be removed in 3.1.0. Use `AsyncGroup.create_array` instead.
1153+
11511154
Arrays are known as "datasets" in HDF5 terminology. For compatibility
11521155
with h5py, Zarr groups also implement the :func:`zarr.AsyncGroup.require_dataset` method.
11531156
@@ -1161,11 +1164,17 @@ async def create_dataset(
11611164
Returns
11621165
-------
11631166
a : AsyncArray
1164-
1165-
.. deprecated:: 3.0.0
1166-
The h5py compatibility methods will be removed in 3.1.0. Use `AsyncGroup.create_array` instead.
11671167
"""
1168-
return await self.create_array(name, shape=shape, **kwargs)
1168+
data = kwargs.pop("data", None)
1169+
# create_dataset in zarr 2.x requires shape but not dtype if data is
1170+
# provided. Allow this configuration by inferring dtype from data if
1171+
# necessary and passing it to create_array
1172+
if "dtype" not in kwargs and data is not None:
1173+
kwargs["dtype"] = data.dtype
1174+
array = await self.create_array(name, shape=shape, **kwargs)
1175+
if data is not None:
1176+
await array.setitem(slice(None), data)
1177+
return array
11691178

11701179
@deprecated("Use AsyncGroup.require_array instead.")
11711180
async def require_dataset(
@@ -1179,6 +1188,9 @@ async def require_dataset(
11791188
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
11801189
"""Obtain an array, creating if it doesn't exist.
11811190
1191+
.. deprecated:: 3.0.0
1192+
The h5py compatibility methods will be removed in 3.1.0. Use `AsyncGroup.require_array` instead.
1193+
11821194
Arrays are known as "datasets" in HDF5 terminology. For compatibility
11831195
with h5py, Zarr groups also implement the :func:`zarr.AsyncGroup.create_dataset` method.
11841196
@@ -1199,9 +1211,6 @@ async def require_dataset(
11991211
Returns
12001212
-------
12011213
a : AsyncArray
1202-
1203-
.. deprecated:: 3.0.0
1204-
The h5py compatibility methods will be removed in 3.1.0. Use `AsyncGroup.require_dataset` instead.
12051214
"""
12061215
return await self.require_array(name, shape=shape, dtype=dtype, exact=exact, **kwargs)
12071216

@@ -2393,6 +2402,10 @@ def create_array(
23932402
def create_dataset(self, name: str, **kwargs: Any) -> Array:
23942403
"""Create an array.
23952404
2405+
.. deprecated:: 3.0.0
2406+
The h5py compatibility methods will be removed in 3.1.0. Use `Group.create_array` instead.
2407+
2408+
23962409
Arrays are known as "datasets" in HDF5 terminology. For compatibility
23972410
with h5py, Zarr groups also implement the :func:`zarr.Group.require_dataset` method.
23982411
@@ -2406,16 +2419,16 @@ def create_dataset(self, name: str, **kwargs: Any) -> Array:
24062419
Returns
24072420
-------
24082421
a : Array
2409-
2410-
.. deprecated:: 3.0.0
2411-
The h5py compatibility methods will be removed in 3.1.0. Use `Group.create_array` instead.
24122422
"""
24132423
return Array(self._sync(self._async_group.create_dataset(name, **kwargs)))
24142424

24152425
@deprecated("Use Group.require_array instead.")
24162426
def require_dataset(self, name: str, *, shape: ShapeLike, **kwargs: Any) -> Array:
24172427
"""Obtain an array, creating if it doesn't exist.
24182428
2429+
.. deprecated:: 3.0.0
2430+
The h5py compatibility methods will be removed in 3.1.0. Use `Group.require_array` instead.
2431+
24192432
Arrays are known as "datasets" in HDF5 terminology. For compatibility
24202433
with h5py, Zarr groups also implement the :func:`zarr.Group.create_dataset` method.
24212434
@@ -2431,9 +2444,6 @@ def require_dataset(self, name: str, *, shape: ShapeLike, **kwargs: Any) -> Arra
24312444
Returns
24322445
-------
24332446
a : Array
2434-
2435-
.. deprecated:: 3.0.0
2436-
The h5py compatibility methods will be removed in 3.1.0. Use `Group.require_array` instead.
24372447
"""
24382448
return Array(self._sync(self._async_group.require_array(name, shape=shape, **kwargs)))
24392449

@@ -2660,6 +2670,9 @@ def array(
26602670
) -> Array:
26612671
"""Create an array within this group.
26622672
2673+
.. deprecated:: 3.0.0
2674+
Use `Group.create_array` instead.
2675+
26632676
This method lightly wraps :func:`zarr.core.array.create_array`.
26642677
26652678
Parameters
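
The `create_dataset` change earlier in this file pops `data` from the keyword arguments and infers `dtype` from it when the caller did not pass one. A minimal sketch of just that keyword handling, assuming NumPy input; `split_dataset_kwargs` is a hypothetical helper for illustration, and it leaves out array creation and the subsequent write.

```python
import numpy as np


def split_dataset_kwargs(kwargs):
    """Separate `data` from creation kwargs, inferring dtype when absent.

    Hypothetical helper for illustration; reproduces only the keyword
    handling, not array creation or the write itself.
    """
    kwargs = dict(kwargs)  # copy so the caller's dict is not mutated
    data = kwargs.pop("data", None)
    if "dtype" not in kwargs and data is not None:
        kwargs["dtype"] = data.dtype
    return data, kwargs


data, create_kwargs = split_dataset_kwargs(
    {"data": np.zeros((4, 4), dtype="float32"), "chunks": (2, 2)}
)
print(create_kwargs["dtype"])  # float32

# An explicit dtype is left untouched.
_, explicit = split_dataset_kwargs(
    {"data": np.zeros(3, dtype="float32"), "dtype": "int16"}
)
print(explicit["dtype"])  # int16
```

Because `data` is removed before the remaining keywords are forwarded, the creation call never sees it, and the data is written in a separate step afterwards.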
