Clarify staged insert compatibility: Zarr/TileDB yes, HDF5 no
- HDF5 requires random-access seek/write operations incompatible with
object storage's PUT/GET model
- Staged inserts work with chunk-based formats (Zarr, TileDB) where
each chunk is a separate object
- Added compatibility table and HDF5 copy-insert example
- Recommend Zarr over HDF5 for cloud-native workflows
docs/src/design/tables/object-type-spec.md
+29 −4 (29 additions, 4 deletions)
@@ -21,7 +21,7 @@ Once an object is **finalized** (either via copy-insert or staged-insert complet
 | Mode | Use Case | Workflow |
 |------|----------|----------|
 | **Copy** | Small files, existing data | Local file → copy to storage → insert record |
-| **Staged** | Large objects, Zarr/HDF5 | Reserve path → write directly to storage → finalize record |
+| **Staged** | Large objects, Zarr, TileDB | Reserve path → write directly to storage → finalize record |

 ### Augmented Schema vs External References

@@ -1144,11 +1144,36 @@ Each record owns its file exclusively. There is no deduplication or reference co
 - `object` type is additive - new tables only
 - Future: Migration utilities to convert existing external storage

-## Zarr and Large Hierarchical Data
+## Zarr, TileDB, and Large Hierarchical Data

-The `object` type is designed with Zarr and similar hierarchical data formats (HDF5 via kerchunk, TileDB) in mind. This section provides guidance for these use cases.
+The `object` type is designed with **chunk-based formats** like Zarr and TileDB in mind. These formats store each chunk as a separate object, which maps naturally to object storage.

-### Recommended Workflow
+### Staged Insert Compatibility
+
+**Staged inserts work with formats that support chunk-based writes:**
+
+| Format | Staged Insert | Why |
+|--------|---------------|-----|
+| **Zarr** | ✅ Yes | Each chunk is a separate object |
+| **TileDB** | ✅ Yes | Fragment-based storage maps to objects |
+| **HDF5** | ❌ No | Single monolithic file requires random-access seek/write |
+
+**HDF5 limitation**: HDF5 files have internal B-tree structures that require random-access modifications. Object storage only supports full object PUT/GET operations, not partial updates. For HDF5, use **copy insert**:
+
+```python
+# HDF5: Write locally, then copy to object storage
+import h5py
+import tempfile
+
+with tempfile.NamedTemporaryFile(suffix='.h5', delete=False) as f:
+    with h5py.File(f.name, 'w') as h5:
+        h5.create_dataset('data', data=large_array)
+    Recording.insert1({..., 'data_file': f.name})
+```
+
+For cloud-native workflows with large arrays, **Zarr is recommended** over HDF5.
+
+### Recommended Workflow (Zarr)

 For large Zarr stores, use **staged insert** to write directly to object storage:
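For contrast, the HDF5 copy-insert path in the diff can be exercised without the DataJoint table: write the monolithic `.h5` file locally (where random-access writes are fine), then move the finished file to storage in a single copy, analogous to one PUT. A runnable sketch; `bucket_dir` and `data.h5` are illustrative stand-ins for the bucket and object key:

```python
import os
import shutil
import tempfile

import h5py
import numpy as np

# Local directory standing in for an object-storage bucket.
bucket_dir = tempfile.mkdtemp()

# Step 1: write the HDF5 file on local disk, where seek/write is supported.
with tempfile.NamedTemporaryFile(suffix=".h5", delete=False) as f:
    local_path = f.name
with h5py.File(local_path, "w") as h5:
    h5.create_dataset("data", data=np.arange(1000))

# Step 2: copy the finished file to storage in one shot (one whole-object PUT).
shutil.copy(local_path, os.path.join(bucket_dir, "data.h5"))
print(sorted(os.listdir(bucket_dir)))
```

The entire dataset lands as a single object, which is exactly why staged (chunk-at-a-time) inserts cannot apply to HDF5.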