example that reveals inefficient sharded writes #3421

@d-v-b

Check out how many times we call get in this example (writing a single shard with 10 chunks):

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zarr
import numpy as np

from zarr.storage._logging import LoggingStore

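# Wrap an in-memory store so that every store call is logged
store = LoggingStore(store=zarr.storage.MemoryStore())

shape = (10,)
chunks = (1,)
shards = (10,)  # one shard holding all 10 chunks
data = np.ones(shape)
zarr.create_array(
    store=store,
    data=data,
    chunks=chunks,
    shards=shards,
    fill_value=0,
    overwrite=True,
)
# Read the whole array back
array = zarr.open_array(store)[:]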
2025-09-01 15:50:45,109 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.read_only
2025-09-01 15:50:45,109 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.read_only [0.00 s]
2025-09-01 15:50:45,109 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore._ensure_open
2025-09-01 15:50:45,109 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore._ensure_open [0.00 s]
2025-09-01 15:50:45,110 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.supports_deletes
2025-09-01 15:50:45,110 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.supports_deletes [0.00 s]
2025-09-01 15:50:45,110 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.delete_dir
2025-09-01 15:50:45,110 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.delete_dir [0.00 s]
2025-09-01 15:50:45,111 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(zarr.json)
2025-09-01 15:50:45,111 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(zarr.json) [0.00 s]
2025-09-01 15:50:45,112 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,112 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,112 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,112 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,113 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,114 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,114 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,114 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,114 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,114 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,114 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,115 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,115 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,118 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,118 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,118 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,118 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,118 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,119 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,120 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,120 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,120 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,120 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,120 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,120 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
2025-09-01 15:50:45,121 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore._ensure_open
2025-09-01 15:50:45,121 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore._ensure_open [0.00 s]
2025-09-01 15:50:45,121 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(zarr.json)
2025-09-01 15:50:45,122 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(zarr.json) [0.00 s]
2025-09-01 15:50:45,122 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(.zarray)
2025-09-01 15:50:45,122 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(.zarray) [0.00 s]
2025-09-01 15:50:45,122 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(.zattrs)
2025-09-01 15:50:45,122 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(.zattrs) [0.00 s]
2025-09-01 15:50:45,122 - LoggingStore(memory://4309622656) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,122 - LoggingStore(memory://4309622656) - INFO - Finished MemoryStore.get(c/0) [0.00 s]

In principle we should only call get(c/0) once, at the very end, when we need to retrieve bytes from it. Instead, the log shows 10 get(c/0) calls during the write, matching the 10 chunks, before that final read. We should also only call set(c/0) once, because we are writing a full shard. Instead, we call set once per chunk, which is extremely inefficient for sharded writes.
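To avoid counting by hand, the log can be tallied with a few lines of standard-library Python. This is only a sketch: the helper name `count_store_calls` and the regex are my own, and they assume the "Calling <Store>.<method>(<key>)" line format shown in the log above. The `sample_log` string is a shortened, hand-made stand-in, not real output.

```python
import re
from collections import Counter

# Assumed format: "... Calling <Store>.<method>(<key>)", as in the log above.
# Lines without a "(key)" part, such as "Calling MemoryStore.read_only", are skipped,
# and "Finished ..." lines never match because the pattern requires "Calling".
CALL_RE = re.compile(r"Calling \w+\.(\w+)\(([^)]*)\)")


def count_store_calls(log_text: str) -> Counter:
    """Count keyed store calls in a LoggingStore log, grouped by (method, key)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CALL_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Hand-made sample in the same format as the log above (hypothetical, shortened).
sample_log = """\
2025-09-01 15:50:45,109 - LoggingStore(memory://1) - INFO -  Calling MemoryStore.read_only
2025-09-01 15:50:45,109 - LoggingStore(memory://1) - INFO - Finished MemoryStore.read_only [0.00 s]
2025-09-01 15:50:45,112 - LoggingStore(memory://1) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,112 - LoggingStore(memory://1) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,113 - LoggingStore(memory://1) - INFO -  Calling MemoryStore.get(c/0)
2025-09-01 15:50:45,113 - LoggingStore(memory://1) - INFO - Finished MemoryStore.get(c/0) [0.00 s]
2025-09-01 15:50:45,118 - LoggingStore(memory://1) - INFO -  Calling MemoryStore.set(c/0)
2025-09-01 15:50:45,118 - LoggingStore(memory://1) - INFO - Finished MemoryStore.set(c/0) [0.00 s]
"""

counts = count_store_calls(sample_log)
for (method, key), n in sorted(counts.items()):
    print(f"{method}({key}): {n}")
```

Pasting the full log from this issue in place of `sample_log` gives the per-key totals for `get` and `set` directly, which makes it easy to compare runs while investigating.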

I'm still trying to figure out how this is being controlled. I suspect it has to do with the batch size of the codec pipeline class, but I haven't confirmed this. I will update this issue when I get further.
