Commit 0ea880a

Make content hashing optional, add folder manifests
- Hash is null by default to avoid performance overhead for large objects
- Optional hash parameter on insert: hash="sha256", "md5", or "xxhash"
- Staged inserts never compute hashes (no local copy to hash from)
- Folders get a manifest file (.manifest.json) with file list and sizes
- Manifest enables integrity verification without content hashing
- Add ObjectRef.verify() method for integrity checking
1 parent 6c6349b commit 0ea880a

1 file changed, +75 -4 lines changed

docs/src/design/tables/file-type-spec.md

Lines changed: 75 additions & 4 deletions
@@ -245,6 +245,19 @@ The `object` type is stored as a `JSON` column in MySQL containing:
 
 **File example:**
 ```json
+{
+  "path": "my_schema/Recording/objects/subject_id=123/session_id=45/raw_data_Ax7bQ2kM.dat",
+  "size": 12345,
+  "hash": null,
+  "ext": ".dat",
+  "is_dir": false,
+  "timestamp": "2025-01-15T10:30:00Z",
+  "mime_type": "application/octet-stream"
+}
+```
+
+**File with optional hash:**
+```json
 {
   "path": "my_schema/Recording/objects/subject_id=123/session_id=45/raw_data_Ax7bQ2kM.dat",
   "size": 12345,
@@ -261,7 +274,7 @@ The `object` type is stored as a `JSON` column in MySQL containing:
 {
   "path": "my_schema/Recording/objects/subject_id=123/session_id=45/raw_data_pL9nR4wE",
   "size": 567890,
-  "hash": "sha256:fedcba9876...",
+  "hash": null,
   "ext": null,
   "is_dir": true,
   "timestamp": "2025-01-15T10:30:00Z",
@@ -275,13 +288,59 @@ The `object` type is stored as a `JSON` column in MySQL containing:
 |-------|------|----------|-------------|
 | `path` | string | Yes | Full path/key within storage backend (includes token) |
 | `size` | integer | Yes | Total size in bytes (sum for folders) |
-| `hash` | string | Yes | Content hash with algorithm prefix |
+| `hash` | string/null | Yes | Content hash with algorithm prefix, or null (default) |
 | `ext` | string/null | Yes | File extension (e.g., `.dat`, `.zarr`) or null |
 | `is_dir` | boolean | Yes | True if stored content is a directory |
 | `timestamp` | string | Yes | ISO 8601 upload timestamp |
 | `mime_type` | string | No | MIME type (files only, auto-detected from extension) |
 | `item_count` | integer | No | Number of files (folders only) |
 
+### Content Hashing
+
+By default, **no content hash is computed** to avoid performance overhead for large objects. Storage backend integrity is trusted.
+
+**Optional hashing** can be requested per-insert:
+
+```python
+# Default - no hash (fast)
+Recording.insert1({..., "raw_data": "/path/to/large.dat"})
+
+# Request hash computation
+Recording.insert1({..., "raw_data": "/path/to/important.dat"}, hash="sha256")
+```
+
+Supported hash algorithms: `sha256`, `md5`, `xxhash` (xxh3, faster for large files)
+
+**Staged inserts never compute hashes** - data is written directly to storage without a local copy to hash.
+
+### Folder Manifests
+
+For folders (directories), a **manifest file** is created alongside the folder to enable integrity verification without computing content hashes:
+
+```
+raw_data_pL9nR4wE/
+raw_data_pL9nR4wE.manifest.json
+```
+
+**Manifest content:**
+```json
+{
+  "files": [
+    {"path": "file1.dat", "size": 1234},
+    {"path": "subdir/file2.dat", "size": 5678},
+    {"path": "subdir/file3.dat", "size": 91011}
+  ],
+  "total_size": 567890,
+  "item_count": 42,
+  "created": "2025-01-15T10:30:00Z"
+}
+```
+
+The manifest enables:
+- Quick verification that all expected files exist
+- Size validation without reading file contents
+- Detection of missing or extra files
+
 ### Filename Convention
 
 The stored filename is **always derived from the field name**:
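
Note on the Content Hashing hunk above: the field table describes the stored value as a "content hash with algorithm prefix" (for example `sha256:...`), with null as the default. A minimal sketch of how such a value could be produced is shown here, reading the file in chunks so large objects are never loaded into memory at once. `compute_prefixed_hash` and its parameters are hypothetical names, not part of this commit or of any library API. Only `sha256` and `md5` are handled, because `xxhash` would need a third-party package.

```python
import hashlib


# Hypothetical helper for illustration; not part of the committed spec.
def compute_prefixed_hash(path, algorithm=None, chunk_size=1 << 20):
    """Return '<algorithm>:<hexdigest>' for a file, or None if no algorithm is requested."""
    if algorithm is None:
        # Mirrors the new default in the spec: "hash": null
        return None
    if algorithm not in ("sha256", "md5"):
        # xxhash is listed in the spec but needs a third-party package,
        # so this stdlib-only sketch rejects it.
        raise ValueError(f"unsupported algorithm in this sketch: {algorithm}")
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return f"{algorithm}:{hasher.hexdigest()}"
```

Hashing still reads every byte of the file once, which is the cost the commit avoids by default.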
@@ -736,7 +795,7 @@ file_ref = record["raw_data"]
 # Access metadata (no I/O)
 print(file_ref.path) # Full storage path
 print(file_ref.size) # File size in bytes
-print(file_ref.hash) # Content hash
+print(file_ref.hash) # Content hash (if computed) or None
 print(file_ref.ext) # File extension (e.g., ".dat") or None
 print(file_ref.is_dir) # True if stored content is a folder
 
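
Because `file_ref.hash` may now be `None`, code that previously assumed a string needs a guard. A small sketch of one way to handle both cases follows, assuming the `<algorithm>:<digest>` format from the field table. `split_hash` is a hypothetical name, and the digest in the usage lines is a placeholder rather than a real hash.

```python
# Hypothetical helper for illustration; not part of the committed spec.
def split_hash(value):
    """Split 'sha256:abc' into ('sha256', 'abc'); return None when no hash was stored."""
    if value is None:
        return None
    algorithm, sep, digest = value.partition(":")
    if not sep or not algorithm or not digest:
        raise ValueError(f"expected '<algorithm>:<digest>', got {value!r}")
    return algorithm, digest


# Usage with a missing hash and a placeholder digest.
for stored in (None, "sha256:0123abcd"):
    parsed = split_hash(stored)
    if parsed is None:
        print("no hash stored; fall back to a size check")
    else:
        print(f"algorithm={parsed[0]} digest={parsed[1]}")
```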
@@ -840,7 +899,7 @@ class ObjectRef:
 
     path: str
     size: int
-    hash: str
+    hash: str | None # content hash (if computed) or None
     ext: str | None # file extension (e.g., ".dat") or None
     is_dir: bool
     timestamp: datetime
@@ -875,6 +934,18 @@ class ObjectRef:
     # Common operations
     def download(self, destination: Path | str, subpath: str | None = None) -> Path: ...
     def exists(self, subpath: str | None = None) -> bool: ...
+
+    # Integrity verification
+    def verify(self) -> bool:
+        """
+        Verify object integrity.
+
+        For files: checks size matches, and hash if available.
+        For folders: validates manifest (all files exist with correct sizes).
+
+        Returns True if valid, raises IntegrityError with details if not.
+        """
+        ...
 ```
 
 #### fsspec Integration
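
Note on the manifest and `verify()` additions: the commit specifies the manifest shape and the `verify()` contract but not an implementation. The sketch below shows how folder verification against a manifest could work on a local directory, under several assumptions. `build_manifest` and `verify_folder` are hypothetical names, `IntegrityError` is a local stand-in for the exception named in the docstring, extra files are treated as failures, and a real backend would list objects through its storage layer rather than `pathlib`.

```python
from datetime import datetime, timezone
from pathlib import Path


class IntegrityError(Exception):
    """Local stand-in for the IntegrityError named in the verify() docstring."""


def _list_files(folder):
    """Map each file's POSIX-style relative path to its size in bytes."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): p.stat().st_size
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def build_manifest(folder):
    """Build a dict with the same keys as the manifest example in the spec."""
    sizes = _list_files(folder)
    return {
        "files": [{"path": path, "size": size} for path, size in sizes.items()],
        "total_size": sum(sizes.values()),
        "item_count": len(sizes),
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def verify_folder(folder, manifest):
    """Return True if the folder matches the manifest, else raise IntegrityError."""
    expected = {entry["path"]: entry["size"] for entry in manifest["files"]}
    actual = _list_files(folder)
    missing = sorted(expected.keys() - actual.keys())
    extra = sorted(actual.keys() - expected.keys())
    size_mismatch = sorted(
        path for path in expected.keys() & actual.keys()
        if expected[path] != actual[path]
    )
    if missing or extra or size_mismatch:
        raise IntegrityError(
            f"missing={missing} extra={extra} size_mismatch={size_mismatch}"
        )
    return True
```

This check only compares file names and sizes, so it detects missing, extra, and truncated files without reading contents. It cannot detect same-size corruption, which is the case the optional content hash covers.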
