- Make size field optional (nullable) for large hierarchical data
- Add Performance Considerations section documenting expensive operations
- Add Extension Field section clarifying ext is a tooling hint
- Add Storage Access Architecture section noting fsspec pluggability
- Add comprehensive Zarr and Large Hierarchical Data section
- Update ObjectRef dataclass to support optional size
- Add test for Zarr-style JSON with null size
 |`path`| string | Yes | Full path/key within storage backend (includes token) |
-|`size`| integer | Yes | Total size in bytes (sum for folders) |
+|`size`| integer/null | No | Total size in bytes (sum for folders), or null if not computed. See [Performance Considerations](#performance-considerations). |
 |`hash`| string/null | Yes | Content hash with algorithm prefix, or null (default) |
-|`is_dir`| boolean | Yes | True if stored content is a directory |
+|`ext`| string/null | Yes | File extension as tooling hint (e.g., `.dat`, `.zarr`) or null. See [Extension Field](#extension-field). |
+|`is_dir`| boolean | Yes | True if stored content is a directory/key-prefix (e.g., Zarr store) |
 |`timestamp`| string | Yes | ISO 8601 upload timestamp |
 |`mime_type`| string | No | MIME type (files only, auto-detected from extension) |
-|`item_count`| integer | No | Number of files (folders only) |
+|`item_count`| integer | No | Number of files (folders only), or null if not computed. See [Performance Considerations](#performance-considerations). |
+
+### Extension Field
+
+The `ext` field is a **tooling hint** that preserves the original file extension or provides a conventional suffix for directory-based formats. It is:
+
+- **Not a content-type declaration**: Unlike `mime_type`, it does not attempt to describe the internal content format
+- **Useful for tooling**: Enables file browsers, IDEs, and other tools to display appropriate icons or suggest applications
+- **Conventional for formats like Zarr**: The `.zarr` extension is recognized by the ecosystem even though a Zarr store contains mixed content (JSON metadata + binary chunks)
+
+For single files, `ext` is extracted from the source filename. For staged inserts (like Zarr), it can be explicitly provided.
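
The two sources of `ext` described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the helper name `infer_ext` and its parameters are hypothetical, and it assumes that only the final suffix of the filename is kept.

```python
from pathlib import PurePosixPath
from typing import Optional


def infer_ext(source_name: str, explicit_ext: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: choose the `ext` tooling hint for a stored object."""
    if explicit_ext is not None:
        # Staged inserts (e.g., a Zarr store) may provide the suffix explicitly.
        return explicit_ext
    # Single files: take the final suffix of the source filename (assumption).
    suffix = PurePosixPath(source_name).suffix
    # An empty suffix means there is no extension, which maps to null.
    return suffix or None
```

Under these assumptions, `infer_ext("recording.dat")` yields `".dat"`, a name without a suffix yields `None`, and a staged insert can pass `".zarr"` directly.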
+
+### Performance Considerations
+
+For large hierarchical data like Zarr stores, computing certain metadata can be expensive:
+
+- **`size`**: Requires listing all objects and summing their sizes. For stores with millions of chunks, this can take minutes or hours.
+- **`item_count`**: Requires listing all objects. Same performance concern as `size`.
+- **`hash`**: Requires reading all content. Explicitly not supported for staged inserts.
+
+**These fields are optional** and default to `null` for staged inserts. Users can explicitly request computation when needed, understanding the performance implications.
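
To make the nullable fields concrete, here is a simplified stand-in for the `ObjectRef` dataclass that accepts Zarr-style JSON with `size` set to `null`. The field set, the `from_json` method name, and the example path are assumptions for illustration; the real class in this PR may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectRef:
    """Simplified stand-in; the field set and defaults are assumptions."""
    path: str
    is_dir: bool
    timestamp: str
    size: Optional[int] = None        # null when not computed
    hash: Optional[str] = None        # null by default
    ext: Optional[str] = None         # tooling hint, may be null
    item_count: Optional[int] = None  # null when not computed

    @classmethod
    def from_json(cls, text: str) -> "ObjectRef":
        data = json.loads(text)
        return cls(
            path=data["path"],
            is_dir=data["is_dir"],
            timestamp=data["timestamp"],
            # .get() maps both a missing key and an explicit null to None.
            size=data.get("size"),
            hash=data.get("hash"),
            ext=data.get("ext"),
            item_count=data.get("item_count"),
        )


# Hypothetical metadata for a staged Zarr insert: nothing expensive was computed.
zarr_json = json.dumps({
    "path": "example/objects/recording_Ab3x.zarr",  # made-up path
    "size": None,
    "hash": None,
    "ext": ".zarr",
    "is_dir": True,
    "timestamp": "2025-01-15T10:30:00Z",
})
ref = ObjectRef.from_json(zarr_json)
```

Callers that need a byte count should check `ref.size is None` before using it, since `0` and "not computed" are different states.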

 ### Content Hashing
@@ -996,6 +1028,20 @@ gcs = ["gcsfs"]
 azure = ["adlfs"]
 ```

+### Storage Access Architecture
+
+The `object` type separates **data declaration** (the JSON metadata stored in the database) from **storage access** (the library used to read/write objects):
+
+- **Data declaration**: The JSON schema (path, size, hash, etc.) is a pure data structure with no library dependencies
+- **Storage access**: Currently uses `fsspec` as the default accessor, but the architecture supports alternative backends
+
+**Why this matters**: While `fsspec` is a mature and widely-used library, alternatives like [`obstore`](https://github.com/developmentseed/obstore) offer performance advantages for certain workloads. By keeping the data model independent of the access library, future versions can support pluggable storage accessors without schema changes.
+
+**Current implementation**: The `ObjectRef` class provides fsspec-based accessors (`fs`, `store` properties). Future versions may add:
+- Pluggable accessor interface
+- Alternative backends (obstore, custom implementations)
+- Backend selection per-operation or per-configuration
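
One way such a pluggable accessor interface could look is sketched below. Everything here is hypothetical: `StorageAccessor`, `InMemoryAccessor`, and `copy_object` are illustrative names and are not part of the current implementation, which exposes fsspec-based accessors only. The point is that code written against the interface never touches the JSON metadata or a specific storage library.

```python
from typing import Dict, Protocol


class StorageAccessor(Protocol):
    """Hypothetical accessor interface; not part of the current implementation."""

    def read_bytes(self, path: str) -> bytes: ...

    def write_bytes(self, path: str, data: bytes) -> None: ...


class InMemoryAccessor:
    """Toy backend for illustration; a real one might wrap fsspec or obstore."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def read_bytes(self, path: str) -> bytes:
        return self._objects[path]

    def write_bytes(self, path: str, data: bytes) -> None:
        self._objects[path] = data


def copy_object(accessor: StorageAccessor, src: str, dst: str) -> None:
    """Depends only on the interface, so the backend can be swapped freely."""
    accessor.write_bytes(dst, accessor.read_bytes(src))


accessor = InMemoryAccessor()
accessor.write_bytes("a/chunk0", b"\x00\x01")
copy_object(accessor, "a/chunk0", "b/chunk0")
```

Because `StorageAccessor` is a structural `Protocol`, a backend does not need to inherit from it; any class with matching methods satisfies the interface.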
@@ -1073,6 +1119,86 @@ Each record owns its file exclusively. There is no deduplication or reference co
 - `object` type is additive - new tables only
 - Future: Migration utilities to convert existing external storage

+## Zarr and Large Hierarchical Data
+
+The `object` type is designed with Zarr and similar hierarchical data formats (HDF5 via kerchunk, TileDB) in mind. This section provides guidance for these use cases.
+
+### Recommended Workflow
+
+For large Zarr stores, use **staged insert** to write directly to object storage: