-
Notifications
You must be signed in to change notification settings - Fork 12
Description
Exact Source of O(n) Cost in iohub Zarr Precreation
Summary
The O(n) scaling issue when creating positions with iohub is caused by repeated read-modify-write cycles of the .zattrs metadata file at both the Plate and Well levels.
Call Stack Trace
When you call store.create_position("A", "1", "000001"):
Plate.create_position()
├── create_well() [if well doesn't exist]
│ ├── self.metadata.wells.append(well_meta) # Add well to plate metadata
│ └── self.dump_meta() # ⚠️ EXPENSIVE!
│ └── self.zattrs.update({"plate": ...})
│ └── zarr.attrs.Attributes._update_nosync()
│ ├── d = self._get_nosync() # 📖 READ entire plate .zattrs from disk
│ ├── d.update(...) # Update in memory
│ └── self._put_nosync(d) # 💾 WRITE entire plate .zattrs to disk
│
└── well.create_position(name, acquisition)
├── self.metadata.images.append(image_meta) # Add position to well metadata
└── self.dump_meta() # ⚠️ EXPENSIVE!
└── self.zattrs.update({"well": ...})
└── zarr.attrs.Attributes._update_nosync()
├── d = self._get_nosync() # 📖 READ entire well .zattrs from disk
├── d.update(...) # Update in memory
└── self._put_nosync(d) # 💾 WRITE entire well .zattrs to disk
The O(n) Mechanism
File Growth
As you create more positions, the metadata files grow:
Plate-level .zattrs example (grows with each well):
{
"plate": {
"acquisitions": [{"id": 0}],
"columns": [{"name": "1"}, {"name": "2"}, ...],
"rows": [{"name": "A"}, {"name": "B"}, ...],
"wells": [
{"path": "A/1", "rowIndex": 0, "columnIndex": 0},
{"path": "A/2", "rowIndex": 0, "columnIndex": 1},
{"path": "A/3", "rowIndex": 0, "columnIndex": 2},
... // Grows with each well
]
}
}Well-level .zattrs example (grows with each position):
{
"well": {
"images": [
{"path": "000001"},
{"path": "000002"},
{"path": "000003"},
... // Grows with each position in this well
],
"version": "0.4"
}
}Time Complexity Analysis
Position 1:
- Plate
.zattrs: 1 KB → Read 1 KB + Parse + Update + Serialize + Write 1 KB - Well
.zattrs: 0.2 KB → Read 0.2 KB + Parse + Update + Serialize + Write 0.2 KB - Total I/O: ~2.4 KB (read + write)
Position 1000:
- Plate
.zattrs: ~50 KB (many wells listed) - Well
.zattrs: ~20 KB (many positions listed) - Total I/O: ~140 KB (read + write)
Position 7000:
- Plate
.zattrs: ~300 KB - Well
.zattrs: ~150 KB - Total I/O: ~900 KB (read + write)
On network storage with ~1ms latency per I/O operation:
- Position 1: ~2 ms for metadata I/O
- Position 1000: ~20 ms for metadata I/O
- Position 7000: ~100 ms for metadata I/O
This compounds across 7000 positions!
Source Code Locations
iohub (v0.x)
File: /path/to/iohub/ngff/nodes.py
Plate.create_well() - Line ~1725:
def create_well(self, row_name, col_name, row_index=None, col_index=None):
# ... create well group ...
well_meta = WellMeta(path=well_path, rowIndex=row_index, columnIndex=col_index)
self.metadata.wells.append(well_meta) # Append to growing list
self.dump_meta() # ⚠️ Writes entire plate metadata!
return Well(group=well_grp, ...)Well.create_position() - Line ~1378:
def create_position(self, name: str, acquisition: int = 0):
pos_grp = self._group.create_group(name, overwrite=self._overwrite)
image_meta = ImageMeta(acquisition=acquisition, path=pos_grp.basename)
self.metadata.images.append(image_meta) # Append to growing list
self.dump_meta() # ⚠️ Writes entire well metadata!
return Position(group=pos_grp, ...)Plate.dump_meta() - Line ~1616:
def dump_meta(self, field_count: bool = False):
"""Dumps metadata JSON to the `.zattrs` file."""
if field_count:
self.metadata.field_count = len(list(self.positions()))
self.zattrs.update( # ⚠️ Triggers read-modify-write!
{"plate": self.metadata.model_dump(**TO_DICT_SETTINGS)}
)zarr-python
File: /path/to/zarr/attrs.py
Attributes._update_nosync():
def _update_nosync(self, *args, **kwargs):
# load existing data
d = self._get_nosync() # 📖 READ entire .zattrs from disk (JSON parse)
# update
if self._version == 2:
d.update(*args, **kwargs)
else:
d["attributes"].update(*args, **kwargs)
# put modified data
self._put_nosync(d) # 💾 WRITE entire .zattrs to disk (JSON serialize)Why Our Fast Method Avoids This
ops_analysis.utils.fast_zarr_precreate.create_hcs_store_fast() avoids the O(n) cost by:
-
Building metadata in memory first:
# Group all positions by well wells = _group_positions_by_well(positions) # One pass # Build complete plate metadata upfront plate_zattrs = { "plate": { "wells": [ {"path": f"{r}/{c}", ...} for (r, c) in wells_by_row_col.keys() # All at once! ] } }
-
Writing each file exactly once:
# Write plate .zattrs once with open(store_path / ".zattrs", "w") as f: json.dump(plate_zattrs, f) # Write each well .zattrs once for well_path, well_positions in wells.items(): well_zattrs = { "well": { "images": [{"path": fid} for fid in well_positions] # All at once! } } with open(well_dir / ".zattrs", "w") as f: json.dump(well_zattrs, f)
-
Result: O(1) per position
- No matter if it's position 1 or position 7000
- Each position writes exactly 4 small JSON files (~2-4 KB total)
- No reading/parsing of existing metadata
- No repeated writes to the same files
Performance Impact
7000 positions on network storage:
| Method | Metadata I/O Pattern | Total Time | Scaling |
|---|---|---|---|
| iohub | 7000 × (read + write growing file) | ~30 minutes | O(n) |
| fast_zarr_precreate | 1 × (write all files) | 14 seconds | O(1) |
| Speedup | 128x |
Validation Commands
To see this in action, run our benchmark:
# Test with increasing position counts
python tests/test_zarr_precreation_benchmark.py --n-positions 100
python tests/test_zarr_precreation_benchmark.py --n-positions 500
python tests/test_zarr_precreation_benchmark.py --n-positions 1000
# Watch the slowdown pattern:
# Position 1: 0.004s
# Position 100: 0.0025s (faster due to caching)
# Position 500: 0.008s (starting to slow down)
# Position 1000: 0.013s (O(n) trend visible)You can also monitor file sizes during creation:
watch -n 1 'ls -lh /tmp/test.zarr/.zattrs'
# File grows with each position!Conclusion
The O(n) cost is definitively located in:
- zarr.attrs.Attributes._update_nosync() - Always does full read-modify-write
- Called by iohub's Plate.dump_meta() and Well.dump_meta() - After every position creation
- Files grow with each position - Making each subsequent read/write slower
- Amplified on network storage - Where I/O latency dominates
Our fast implementation eliminates all of this by building metadata in memory and writing each file exactly once.