
Commit 4e90c1e

Clarify staged insert compatibility: Zarr/TileDB yes, HDF5 no
- HDF5 requires random-access seek/write operations incompatible with object storage's PUT/GET model
- Staged inserts work with chunk-based formats (Zarr, TileDB) where each chunk is a separate object
- Added compatibility table and HDF5 copy-insert example
- Recommend Zarr over HDF5 for cloud-native workflows
1 parent 3e32188 commit 4e90c1e

File tree

1 file changed: +29 −4 lines


docs/src/design/tables/object-type-spec.md

Lines changed: 29 additions & 4 deletions
````diff
@@ -21,7 +21,7 @@ Once an object is **finalized** (either via copy-insert or staged-insert complet
 | Mode | Use Case | Workflow |
 |------|----------|----------|
 | **Copy** | Small files, existing data | Local file → copy to storage → insert record |
-| **Staged** | Large objects, Zarr/HDF5 | Reserve path → write directly to storage → finalize record |
+| **Staged** | Large objects, Zarr, TileDB | Reserve path → write directly to storage → finalize record |
 
 ### Augmented Schema vs External References
 
````
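The two insert modes in the table above can be sketched end to end. The snippet below is a minimal illustration of the **Copy** mode only, not the real API: `copy_insert` is a hypothetical helper, and a local temporary directory stands in for the object store.

```python
# Hedged sketch of the "Copy" mode: local file -> copy to storage -> insert record.
# `copy_insert` is a hypothetical helper; a temp directory stands in for the bucket.
import os
import shutil
import tempfile

def copy_insert(local_path: str, store_root: str, object_key: str) -> str:
    """Copy a finished local file into the store; return the path to record."""
    dest = os.path.join(store_root, object_key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(local_path, dest)  # one full PUT of the whole object
    return dest  # the insert step would record this path in the table

store_root = tempfile.mkdtemp()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"small payload")
stored = copy_insert(f.name, store_root, "session_1/data.bin")
print(open(stored, "rb").read())  # b'small payload'
```

Because the file is complete before the copy, this mode works for any format, including monolithic ones like HDF5.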
````diff
@@ -1144,11 +1144,36 @@ Each record owns its file exclusively. There is no deduplication or reference co
 - `object` type is additive - new tables only
 - Future: Migration utilities to convert existing external storage
 
-## Zarr and Large Hierarchical Data
+## Zarr, TileDB, and Large Hierarchical Data
 
-The `object` type is designed with Zarr and similar hierarchical data formats (HDF5 via kerchunk, TileDB) in mind. This section provides guidance for these use cases.
+The `object` type is designed with **chunk-based formats** like Zarr and TileDB in mind. These formats store each chunk as a separate object, which maps naturally to object storage.
 
-### Recommended Workflow
+### Staged Insert Compatibility
+
+**Staged inserts work with formats that support chunk-based writes:**
+
+| Format | Staged Insert | Why |
+|--------|---------------|-----|
+| **Zarr** | ✅ Yes | Each chunk is a separate object |
+| **TileDB** | ✅ Yes | Fragment-based storage maps to objects |
+| **HDF5** | ❌ No | Single monolithic file requires random-access seek/write |
+
+**HDF5 limitation**: HDF5 files have internal B-tree structures that require random-access modifications. Object storage only supports full object PUT/GET operations, not partial updates. For HDF5, use **copy insert**:
+
+```python
+# HDF5: Write locally, then copy to object storage
+import h5py
+import tempfile
+
+with tempfile.NamedTemporaryFile(suffix='.h5', delete=False) as f:
+    with h5py.File(f.name, 'w') as h5:
+        h5.create_dataset('data', data=large_array)
+    Recording.insert1({..., 'data_file': f.name})
+```
+
+For cloud-native workflows with large arrays, **Zarr is recommended** over HDF5.
+
+### Recommended Workflow (Zarr)
 
 For large Zarr stores, use **staged insert** to write directly to object storage:
 
````
0 commit comments
