Commit 827cff0

performance docs
1 parent 82e5143 commit 827cff0

3 files changed: 41 additions, 4 deletions

docs/user-guide/arrays.rst

Lines changed: 1 addition & 1 deletion
@@ -595,8 +595,8 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
 Zarr format : 3
 Data type : DataType.uint8
 Shape : (10000, 10000)
-Chunk shape : (100, 100)
 Shard shape : (1000, 1000)
+Chunk shape : (100, 100)
 Order : C
 Read-only : False
 Store type : LocalStore

docs/user-guide/performance.rst

Lines changed: 37 additions & 0 deletions
@@ -62,6 +62,43 @@ will be one single chunk for the array::
 >>> z5.chunks
 (10000, 10000)
 
+
+Sharding
+~~~~~~~~
+
+If you have large arrays but need small chunks to efficiently access the data, you can
+use sharding. Sharding provides a mechanism to store multiple chunks in a single
+storage object or file. This can be useful because traditional file systems and object
+storage systems may have issues with many small files.
+
+Picking a good combination of chunk shape and shard shape is important for performance.
+The chunk shape determines what unit of your data can be read independently, while the
+shard shape determines what unit of your data can be written efficiently.
+
+For example, suppose you have a 100 GB array and need to read small chunks of 1 MB.
+Without sharding, each chunk would be one file resulting in 100,000 files. That can
+already cause performance issues on some file systems.
+With sharding, you could use a shard size of 1 GB. This would result in 1000 chunks per
+file and 100 files in total, which seems manageable for most storage systems.
+You would still be able to read each 1 MB chunk independently, but you would need to
+write your data in 1 GB increments.
+
+To use sharding, you need to specify the ``shards`` parameter when creating the array.
+
+>>> z6 = zarr.create_array(store={}, shape=(10000, 10000, 1000), shards=(1000, 1000, 1000), chunks=(100, 100, 100), dtype='uint8')
+>>> z6.info
+Type : Array
+Zarr format : 3
+Data type : DataType.uint8
+Shape : (10000, 10000, 1000)
+Shard shape : (1000, 1000, 1000)
+Chunk shape : (100, 100, 100)
+Order : C
+Read-only : False
+Store type : MemoryStore
+Codecs : [{'chunk_shape': (100, 100, 100), 'codecs': ({'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}), 'index_codecs': ({'endian': <Endian.little: 'little'>}, {}), 'index_location': <ShardingCodecIndexLocation.end: 'end'>}]
+No. bytes : 100000000000 (93.1G)
+
 .. _user-guide-chunks-order:
 
 Chunk memory layout
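
Reviewer note: the chunk and shard counts for the ``z6`` example in the added Sharding section can be checked with plain arithmetic. The following is a minimal sketch and does not use zarr itself; ``count_blocks`` is a hypothetical helper introduced only for this check, and the shapes are copied from the ``zarr.create_array`` call in the hunk above.

```python
import math


def count_blocks(shape, block_shape):
    """Return how many blocks of ``block_shape`` are needed to tile ``shape``.

    Hypothetical helper for illustration; partial blocks at the edges
    count as whole blocks, hence the ceiling division per axis.
    """
    return math.prod(-(-s // b) for s, b in zip(shape, block_shape))


# Shapes taken from the z6 example (dtype uint8, so 1 element = 1 byte).
shape = (10000, 10000, 1000)
shards = (1000, 1000, 1000)
chunks = (100, 100, 100)

array_bytes = math.prod(shape)                   # total array size in bytes
chunk_bytes = math.prod(chunks)                  # size of one chunk in bytes
shard_bytes = math.prod(shards)                  # size of one shard in bytes

n_chunks = count_blocks(shape, chunks)           # stored objects without sharding
n_shards = count_blocks(shape, shards)           # stored objects with sharding
chunks_per_shard = count_blocks(shards, chunks)  # chunks packed into each shard

print(f"array: {array_bytes} bytes, chunk: {chunk_bytes} bytes, shard: {shard_bytes} bytes")
print(f"objects without sharding: {n_chunks}")
print(f"objects with sharding:    {n_shards} ({chunks_per_shard} chunks each)")
```

The object count without sharding equals ``n_shards * chunks_per_shard`` because the shard shape is an exact multiple of the chunk shape along every axis here.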

src/zarr/core/_info.py

Lines changed: 3 additions & 3 deletions
@@ -80,8 +80,8 @@ class ArrayInfo:
     _zarr_format: ZarrFormat
     _data_type: np.dtype[Any] | DataType
     _shape: tuple[int, ...]
-    _chunk_shape: tuple[int, ...] | None = None
     _shard_shape: tuple[int, ...] | None = None
+    _chunk_shape: tuple[int, ...] | None = None
     _order: Literal["C", "F"]
     _read_only: bool
     _store_type: str
@@ -97,14 +97,14 @@ def __repr__(self) -> str:
         Type : {_type}
         Zarr format : {_zarr_format}
         Data type : {_data_type}
-        Shape : {_shape}
-        Chunk shape : {_chunk_shape}""")
+        Shape : {_shape}""")
 
         if self._shard_shape is not None:
             template += textwrap.dedent("""
             Shard shape : {_shard_shape}""")
 
         template += textwrap.dedent("""
+        Chunk shape : {_chunk_shape}
         Order : {_order}
         Read-only : {_read_only}
         Store type : {_store_type}""")
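
Reviewer note: the effect of this reordering is that the optional "Shard shape" line is now emitted before "Chunk shape" rather than after it. The sketch below reproduces only that template-building pattern in isolation. ``format_info`` is a hypothetical stand-in written for this note, not the actual ``ArrayInfo.__repr__``, and the label spacing is an assumption.

```python
import textwrap


def format_info(shape, chunk_shape, shard_shape=None):
    """Build a reduced info string using the same append-to-template pattern.

    Hypothetical stand-in for illustration; the real class has more fields.
    """
    template = textwrap.dedent("""\
        Shape : {shape}""")

    # The shard line is optional and, after this commit, precedes the chunk line.
    if shard_shape is not None:
        template += textwrap.dedent("""
            Shard shape : {shard_shape}""")

    template += textwrap.dedent("""
        Chunk shape : {chunk_shape}""")

    # str.format ignores keyword arguments that the template does not use.
    return template.format(
        shape=shape, chunk_shape=chunk_shape, shard_shape=shard_shape
    )


sharded = format_info((10000, 10000), (100, 100), shard_shape=(1000, 1000))
unsharded = format_info((10000, 10000), (100, 100))
print(sharded)
print("---")
print(unsharded)
```

For an array without shards, the output simply omits the shard line, so the change in field order is only visible for sharded arrays.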
