Commit 15418c3 (1 parent: 7ef4e61)

Address Zarr reviewer feedback: optional metadata fields

- Make size field optional (nullable) for large hierarchical data
- Add Performance Considerations section documenting expensive operations
- Add Extension Field section clarifying ext is a tooling hint
- Add Storage Access Architecture section noting fsspec pluggability
- Add comprehensive Zarr and Large Hierarchical Data section
- Update ObjectRef dataclass to support optional size
- Add test for Zarr-style JSON with null size

File tree: 3 files changed, +163 -13 lines

docs/src/design/tables/file-type-spec.md

Lines changed: 130 additions & 4 deletions

````diff
@@ -288,18 +288,50 @@ The `object` type is stored as a `JSON` column in MySQL containing:
 }
 ```
 
+**Zarr example (large dataset, metadata fields omitted for performance):**
+```json
+{
+  "path": "my_schema/Recording/objects/subject_id=123/session_id=45/neural_data_kM3nP2qR.zarr",
+  "size": null,
+  "hash": null,
+  "ext": ".zarr",
+  "is_dir": true,
+  "timestamp": "2025-01-15T10:30:00Z"
+}
+```
+
 ### JSON Schema
 
 | Field | Type | Required | Description |
 |-------|------|----------|-------------|
 | `path` | string | Yes | Full path/key within storage backend (includes token) |
-| `size` | integer | Yes | Total size in bytes (sum for folders) |
+| `size` | integer/null | No | Total size in bytes (sum for folders), or null if not computed. See [Performance Considerations](#performance-considerations). |
 | `hash` | string/null | Yes | Content hash with algorithm prefix, or null (default) |
-| `ext` | string/null | Yes | File extension (e.g., `.dat`, `.zarr`) or null |
-| `is_dir` | boolean | Yes | True if stored content is a directory |
+| `ext` | string/null | Yes | File extension as tooling hint (e.g., `.dat`, `.zarr`) or null. See [Extension Field](#extension-field). |
+| `is_dir` | boolean | Yes | True if stored content is a directory/key-prefix (e.g., Zarr store) |
 | `timestamp` | string | Yes | ISO 8601 upload timestamp |
 | `mime_type` | string | No | MIME type (files only, auto-detected from extension) |
-| `item_count` | integer | No | Number of files (folders only) |
+| `item_count` | integer | No | Number of files (folders only), or null if not computed. See [Performance Considerations](#performance-considerations). |
+
+### Extension Field
+
+The `ext` field is a **tooling hint** that preserves the original file extension or provides a conventional suffix for directory-based formats. It is:
+
+- **Not a content-type declaration**: Unlike `mime_type`, it does not attempt to describe the internal content format
+- **Useful for tooling**: Enables file browsers, IDEs, and other tools to display appropriate icons or suggest applications
+- **Conventional for formats like Zarr**: The `.zarr` extension is recognized by the ecosystem even though a Zarr store contains mixed content (JSON metadata + binary chunks)
+
+For single files, `ext` is extracted from the source filename. For staged inserts (like Zarr), it can be explicitly provided.
+
+### Performance Considerations
+
+For large hierarchical data like Zarr stores, computing certain metadata can be expensive:
+
+- **`size`**: Requires listing all objects and summing their sizes. For stores with millions of chunks, this can take minutes or hours.
+- **`item_count`**: Requires listing all objects. Same performance concern as `size`.
+- **`hash`**: Requires reading all content. Explicitly not supported for staged inserts.
+
+**These fields are optional** and default to `null` for staged inserts. Users can explicitly request computation when needed, understanding the performance implications.
 
 ### Content Hashing
 
@@ -996,6 +1028,20 @@ gcs = ["gcsfs"]
 azure = ["adlfs"]
 ```
 
+### Storage Access Architecture
+
+The `object` type separates **data declaration** (the JSON metadata stored in the database) from **storage access** (the library used to read/write objects):
+
+- **Data declaration**: The JSON schema (path, size, hash, etc.) is a pure data structure with no library dependencies
+- **Storage access**: Currently uses `fsspec` as the default accessor, but the architecture supports alternative backends
+
+**Why this matters**: While `fsspec` is a mature and widely-used library, alternatives like [`obstore`](https://github.com/developmentseed/obstore) offer performance advantages for certain workloads. By keeping the data model independent of the access library, future versions can support pluggable storage accessors without schema changes.
+
+**Current implementation**: The `ObjectRef` class provides fsspec-based accessors (`fs`, `store` properties). Future versions may add:
+- Pluggable accessor interface
+- Alternative backends (obstore, custom implementations)
+- Backend selection per-operation or per-configuration
+
 ## Comparison with Existing Types
 
 | Feature | `attach@store` | `filepath@store` | `object` |
@@ -1073,6 +1119,87 @@ Each record owns its file exclusively. There is no deduplication or reference co
 - `object` type is additive - new tables only
 - Future: Migration utilities to convert existing external storage
 
+## Zarr and Large Hierarchical Data
+
+The `object` type is designed with Zarr and similar hierarchical data formats (HDF5 via kerchunk, TileDB) in mind. This section provides guidance for these use cases.
+
+### Recommended Workflow
+
+For large Zarr stores, use **staged insert** to write directly to object storage:
+
+```python
+import zarr
+import numpy as np
+
+with Recording.staged_insert1 as staged:
+    staged.rec['subject_id'] = 123
+    staged.rec['session_id'] = 45
+
+    # Write Zarr directly to object storage
+    store = staged.store('neural_data', '.zarr')
+    root = zarr.open(store, mode='w')
+    root.create_dataset('spikes', shape=(1000000, 384), chunks=(10000, 384), dtype='f4')
+
+    # Stream data without local intermediate copy
+    for i, chunk in enumerate(acquisition_stream):
+        root['spikes'][i*10000:(i+1)*10000] = chunk
+
+    staged.rec['neural_data'] = root
+
+# Metadata recorded, no expensive size/hash computation
+```
+
+### JSON Metadata for Zarr
+
+For Zarr stores, the recommended JSON metadata omits expensive-to-compute fields:
+
+```json
+{
+  "path": "schema/Recording/objects/subject_id=123/session_id=45/neural_data_kM3nP2qR.zarr",
+  "size": null,
+  "hash": null,
+  "ext": ".zarr",
+  "is_dir": true,
+  "timestamp": "2025-01-15T10:30:00Z"
+}
+```
+
+**Field notes for Zarr:**
+- **`size`**: Set to `null` - computing total size requires listing all chunks
+- **`hash`**: Always `null` for staged inserts - no merkle tree support currently
+- **`ext`**: Set to `.zarr` as a conventional tooling hint
+- **`is_dir`**: Set to `true` - Zarr stores are key prefixes (logical directories)
+- **`item_count`**: Omitted - counting chunks is expensive and rarely useful
+- **`mime_type`**: Omitted - Zarr contains mixed content types
+
+### Reading Zarr Data
+
+The `ObjectRef` provides direct access compatible with Zarr and xarray:
+
+```python
+record = Recording.fetch1()
+obj_ref = record['neural_data']
+
+# Direct Zarr access
+z = zarr.open(obj_ref.store, mode='r')
+print(z['spikes'].shape)
+
+# xarray integration
+import xarray as xr
+ds = xr.open_zarr(obj_ref.store)
+
+# Dask integration (lazy loading)
+import dask.array as da
+arr = da.from_zarr(obj_ref.store, component='spikes')
+```
+
+### Performance Tips
+
+1. **Use chunked writes**: Write data in chunks that match your Zarr chunk size
+2. **Avoid metadata computation**: Let `size` and `item_count` default to `null`
+3. **Use appropriate chunk sizes**: Balance between too many small files (overhead) and too few large files (memory)
+4. **Consider compression**: Configure Zarr compression (blosc, zstd) to reduce storage costs
+
 ## Future Extensions
 
 - [ ] Compression options (gzip, lz4, zstd)
````
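The opt-in metadata computation this commit specifies — expensive fields stay `null` until a caller explicitly asks for them — can be sketched as a stand-alone, stdlib-only example. `LazyObjectMeta` and `compute_stats` are hypothetical names for illustration; the real implementation is `ObjectRef` in `src/datajoint/objectref.py`:

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LazyObjectMeta:
    """Metadata record whose expensive fields default to None (JSON null)."""

    path: str
    size: int | None = None        # stays null until explicitly computed
    item_count: int | None = None  # stays null until explicitly computed

    def compute_stats(self, root: Path) -> None:
        """Walk the store once, filling in size and item_count.

        This full listing is the cost the spec warns about; callers opt in
        rather than paying it on every insert.
        """
        files = [p for p in root.rglob("*") if p.is_file()]
        self.size = sum(p.stat().st_size for p in files)
        self.item_count = len(files)


# Build a tiny fake Zarr-like store on disk, then compute stats on demand.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "demo.zarr"
    (store / "0").mkdir(parents=True)
    (store / ".zattrs").write_bytes(b"{}")              # 2 bytes
    (store / "0" / "chunk0").write_bytes(b"\x00" * 16)  # 16 bytes

    meta = LazyObjectMeta(path=str(store))
    assert meta.size is None  # cheap insert: nothing listed yet
    meta.compute_stats(store)
    print(meta.size, meta.item_count)  # 18 2
```

For a store with millions of chunks, the `rglob` walk becomes the minutes-to-hours listing described under Performance Considerations, which is exactly why the schema leaves `size` and `item_count` nullable.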

src/datajoint/objectref.py

Lines changed: 15 additions & 9 deletions

````diff
@@ -35,17 +35,20 @@ class ObjectRef:
 
     Attributes:
         path: Full path/key within storage backend (includes token)
-        size: Total size in bytes (sum for folders)
+        size: Total size in bytes (sum for folders), or None if not computed.
+            For large hierarchical data like Zarr stores, size computation can
+            be expensive and is optional.
         hash: Content hash with algorithm prefix, or None if not computed
-        ext: File extension (e.g., ".dat", ".zarr") or None
-        is_dir: True if stored content is a directory
+        ext: File extension as tooling hint (e.g., ".dat", ".zarr") or None.
+            This is a conventional suffix for tooling, not a content-type declaration.
+        is_dir: True if stored content is a directory/key-prefix (e.g., Zarr store)
         timestamp: ISO 8601 upload timestamp
         mime_type: MIME type (files only, auto-detected from extension)
-        item_count: Number of files (folders only)
+        item_count: Number of files (folders only), or None if not computed
     """
 
     path: str
-    size: int
+    size: int | None
     hash: str | None
     ext: str | None
     is_dir: bool
@@ -307,10 +310,13 @@ def _verify_file(self) -> bool:
        if not self._backend.exists(self.path):
            raise IntegrityError(f"File does not exist: {self.path}")
 
-        # Check size
-        actual_size = self._backend.size(self.path)
-        if actual_size != self.size:
-            raise IntegrityError(f"Size mismatch for {self.path}: expected {self.size}, got {actual_size}")
+        # Check size if available
+        if self.size is not None:
+            actual_size = self._backend.size(self.path)
+            if actual_size != self.size:
+                raise IntegrityError(
+                    f"Size mismatch for {self.path}: expected {self.size}, got {actual_size}"
+                )
 
         # Check hash if available
         if self.hash:
````
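The guarded size check in `_verify_file` (verify only what was recorded) also shows why verification needs just a tiny accessor surface, which is what keeps the data model backend-agnostic per the spec's Storage Access Architecture. A rough sketch under stated assumptions — the `Backend` protocol, `DictBackend`, and `verify` below are hypothetical, not the library's API:

```python
from __future__ import annotations

from typing import Protocol


class Backend(Protocol):
    """Minimal surface a storage accessor must expose for verification."""

    def exists(self, path: str) -> bool: ...
    def size(self, path: str) -> int: ...


class DictBackend:
    """Toy in-memory backend: maps path -> content bytes."""

    def __init__(self, objects: dict[str, bytes]):
        self.objects = objects

    def exists(self, path: str) -> bool:
        return path in self.objects

    def size(self, path: str) -> int:
        return len(self.objects[path])


def verify(backend: Backend, path: str, expected_size: int | None) -> bool:
    """Always verify existence; verify size only if it was recorded."""
    if not backend.exists(path):
        raise FileNotFoundError(path)
    if expected_size is not None and backend.size(path) != expected_size:
        raise ValueError(f"size mismatch for {path}")
    return True


backend = DictBackend({"a/b.dat": b"12345"})
print(verify(backend, "a/b.dat", 5))     # True: size recorded and matching
print(verify(backend, "a/b.dat", None))  # True: size never computed, check skipped
```

Any object that satisfies the protocol (fsspec-backed, obstore-backed, or custom) can serve as the verifier's backend without schema changes.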

tests/test_object.py

Lines changed: 18 additions & 0 deletions

````diff
@@ -166,6 +166,24 @@ def test_from_json_dict(self):
         assert obj.is_dir is True
         assert obj.item_count == 42
 
+    def test_from_json_zarr_style(self):
+        """Test creating ObjectRef from Zarr-style JSON with null size."""
+        data = {
+            "path": "schema/Recording/objects/id=1/neural_data_abc123.zarr",
+            "size": None,
+            "hash": None,
+            "ext": ".zarr",
+            "is_dir": True,
+            "timestamp": "2025-01-15T10:30:00+00:00",
+        }
+        obj = ObjectRef.from_json(data)
+        assert obj.path == "schema/Recording/objects/id=1/neural_data_abc123.zarr"
+        assert obj.size is None
+        assert obj.hash is None
+        assert obj.ext == ".zarr"
+        assert obj.is_dir is True
+        assert obj.item_count is None
+
     def test_to_json(self):
         """Test converting ObjectRef to JSON dict."""
         from datetime import datetime, timezone
````
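The new test depends on `None` surviving the serialize/parse round trip that a MySQL `JSON` column performs. That invariant can be checked with the stdlib alone; the dict below mirrors the test fixture and stands in for the stored column value:

```python
import json

# Zarr-style metadata with expensive fields left unset (null in JSON).
meta = {
    "path": "schema/Recording/objects/id=1/neural_data_abc123.zarr",
    "size": None,
    "hash": None,
    "ext": ".zarr",
    "is_dir": True,
    "timestamp": "2025-01-15T10:30:00+00:00",
}

# None <-> null round-trips cleanly, so optional fields need no sentinel
# values (no -1 sizes or empty-string hashes) in the stored document.
restored = json.loads(json.dumps(meta))
assert restored == meta
print(restored["size"])  # None
```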
