
Commit 1ad327d

Rename content type to hash for clarity
Rename <content@> to <hash@> throughout documentation:
- More descriptive: indicates hash-based addressing mechanism
- Familiar concept: works like a hash data structure
- Storage folder: _content/ → _hash/
- Registry: ContentRegistry → HashRegistry

The <hash@> type provides:
- SHA256 hash-based addressing
- Automatic deduplication
- External-only storage (requires @)
- Used as dtype by <blob@> and <attach@>

Co-authored-by: dimitri-yatsenko <[email protected]>
1 parent d6bdd80 commit 1ad327d

2 files changed: 67 additions and 67 deletions

docs/src/design/tables/attributes.md

Lines changed: 2 additions & 2 deletions
@@ -98,7 +98,7 @@ The `@` character indicates **external storage** (object store vs database):
9898
NumPy arrays, dicts, lists, datetime objects, and nested structures. Stores in
9999
database. Compatible with MATLAB. See [custom types](customtype.md) for details.
100100

101-
- `<blob@>` / `<blob@store>`: Like `<blob>` but stores externally with content-
101+
- `<blob@>` / `<blob@store>`: Like `<blob>` but stores externally with hash-
102102
addressed deduplication. Use for large arrays that may be duplicated across rows.
103103

104104
**File storage types** - for managed files:
@@ -107,7 +107,7 @@ The `@` character indicates **external storage** (object store vs database):
107107
from primary key. Supports Zarr, HDF5, and direct writes via fsspec. Returns
108108
`ObjectRef` for lazy access. External only. See [object storage](object.md).
109109

110-
- `<content@>` / `<content@store>`: Content-addressed storage for raw bytes with
110+
- `<hash@>` / `<hash@store>`: Hash-addressed storage for raw bytes with
111111
SHA256 deduplication. External only. Use via `<blob@>` or `<attach@>` rather than directly.
112112

113113
**File attachment types** - for file transfer:

docs/src/design/tables/storage-types-spec.md

Lines changed: 65 additions & 65 deletions
@@ -12,7 +12,7 @@ This document defines a three-layer type architecture:
1212
┌───────────────────────────────────────────────────────────────────┐
1313
│ AttributeTypes (Layer 3) │
1414
│ │
15-
│ Built-in: <blob> <attach> <object@> <content@> <filepath@> │
15+
│ Built-in: <blob> <attach> <object@> <hash@> <filepath@> │
1616
│ User: <custom> <mytype> ... │
1717
├───────────────────────────────────────────────────────────────────┤
1818
│ Core DataJoint Types (Layer 2) │
@@ -39,7 +39,7 @@ This document defines a three-layer type architecture:
3939
| Region | Path Pattern | Addressing | Use Case |
4040
|--------|--------------|------------|----------|
4141
| Object | `{schema}/{table}/{pk}/` | Primary key | Large objects, Zarr, HDF5 |
42-
| Content | `_content/{hash}` | Content hash | Deduplicated blobs/files |
42+
| Hash | `_hash/{hash}` | SHA256 hash | Deduplicated blobs/files |
4343

4444
### External References
4545

@@ -193,7 +193,7 @@ The `@` character in AttributeType syntax indicates **external storage** (object
193193
- **`@` alone**: Use default store - e.g., `<blob@>`
194194
- **`@name`**: Use named store - e.g., `<blob@cold>`
195195

196-
Some types support both modes (`<blob>`, `<attach>`), others are external-only (`<object@>`, `<content@>`, `<filepath@>`).
196+
Some types support both modes (`<blob>`, `<attach>`), others are external-only (`<object@>`, `<hash@>`, `<filepath@>`).
197197

198198
### Type Resolution and Chaining
199199

@@ -204,16 +204,16 @@ returns the appropriate dtype based on storage mode:
204204
Resolution at declaration time:
205205
206206
<blob> → get_dtype(False) → "bytes" → LONGBLOB/BYTEA
207-
<blob@> → get_dtype(True) → "<content>" → json → JSON/JSONB
208-
<blob@cold> → get_dtype(True) → "<content>" → json (store=cold)
207+
<blob@> → get_dtype(True) → "<hash>" → json → JSON/JSONB
208+
<blob@cold> → get_dtype(True) → "<hash>" → json (store=cold)
209209
210210
<attach> → get_dtype(False) → "bytes" → LONGBLOB/BYTEA
211-
<attach@> → get_dtype(True) → "<content>" → json → JSON/JSONB
211+
<attach@> → get_dtype(True) → "<hash>" → json → JSON/JSONB
212212
213213
<object@> → get_dtype(True) → "json" → JSON/JSONB
214214
<object> → get_dtype(False) → ERROR (external only)
215215
216-
<content@> → get_dtype(True) → "json" → JSON/JSONB
216+
<hash@> → get_dtype(True) → "json" → JSON/JSONB
217217
<filepath@s> → get_dtype(True) → "json" → JSON/JSONB
218218
```
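
The resolution chain in this hunk can be illustrated with a small sketch. It is not DataJoint's implementation: the `ATTRIBUTE_TYPES` and `CORE_TYPES` tables and the `resolve` helper are hypothetical names introduced here, and only the dtype strings (`"<hash>"`, `"bytes"`, `"json"`) come from the spec text.

```python
# Hypothetical lookup tables; only the dtype strings are taken from the spec.
CORE_TYPES = {"bytes": "LONGBLOB/BYTEA", "json": "JSON/JSONB"}


def dual_mode_dtype(is_external):
    """<blob> and <attach>: in-database bytes, or <hash> when external."""
    return "<hash>" if is_external else "bytes"


def hash_dtype(is_external):
    """<hash@>: external only, stores JSON metadata."""
    if not is_external:
        raise ValueError("<hash> requires @ (external storage only)")
    return "json"


ATTRIBUTE_TYPES = {
    "blob": dual_mode_dtype,
    "attach": dual_mode_dtype,
    "hash": hash_dtype,
}


def resolve(spec):
    """Follow the dtype chain for a spec such as '<blob@cold>' to a column type."""
    is_external = "@" in spec  # storage mode is fixed by the declared spec
    chain = [spec]
    current = spec
    while current.startswith("<"):
        name = current.strip("<>").split("@")[0]
        current = ATTRIBUTE_TYPES[name](is_external)
        chain.append(current)
    chain.append(CORE_TYPES[current])
    return chain


print(" → ".join(resolve("<blob@>")))   # <blob@> → <hash> → json → JSON/JSONB
print(" → ".join(resolve("<attach>")))  # <attach> → bytes → LONGBLOB/BYTEA
```

The sketch carries the external flag from the declared spec through every step, which is why the intermediate `"<hash>"` dtype resolves to `json` even though it is written without `@`.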
219219

@@ -262,15 +262,15 @@ class ObjectType(AttributeType):
262262
return ObjectRef(store=get_store(stored["store"]), path=stored["path"])
263263
```
264264

265-
### `<content@>` / `<content@store>` - Content-Addressed Storage
265+
### `<hash@>` / `<hash@store>` - Hash-Addressed Storage
266266

267267
**Built-in AttributeType. External only.**
268268

269-
Content-addressed storage with deduplication:
269+
Hash-addressed storage with deduplication:
270270

271271
- **Single blob only**: stores a single file or serialized object (not folders)
272272
- **Per-project scope**: content is shared across all schemas in a project (not per-schema)
273-
- Path derived from content hash: `_content/{hash[:2]}/{hash[2:4]}/{hash}`
273+
- Path derived from content hash: `_hash/{hash[:2]}/{hash[2:4]}/{hash}`
274274
- Many-to-one: multiple rows (even across schemas) can reference same content
275275
- Reference counted for garbage collection
276276
- Deduplication: identical content stored once across the entire project
@@ -282,48 +282,48 @@ store_root/
282282
├── {schema}/{table}/{pk}/ # object storage (path-addressed by PK)
283283
│ └── {attribute}/
284284
285-
└── _content/ # content storage (content-addressed)
285+
└── _hash/ # content storage (hash-addressed)
286286
└── {hash[:2]}/{hash[2:4]}/{hash}
287287
```
288288

289289
#### Implementation
290290

291291
```python
292-
class ContentType(AttributeType):
293-
"""Content-addressed storage. External only."""
294-
type_name = "content"
292+
class HashType(AttributeType):
293+
"""Hash-addressed storage. External only."""
294+
type_name = "hash"
295295

296296
def get_dtype(self, is_external: bool) -> str:
297297
if not is_external:
298-
raise DataJointError("<content> requires @ (external storage only)")
298+
raise DataJointError("<hash> requires @ (external storage only)")
299299
return "json"
300300

301301
def encode(self, data: bytes, *, key=None, store_name=None) -> dict:
302302
"""Store content, return metadata as JSON."""
303-
content_hash = hashlib.sha256(data).hexdigest()
303+
hash_id = hashlib.sha256(data).hexdigest()
304304
store = get_store(store_name or dj.config['stores']['default'])
305-
path = f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
305+
path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"
306306

307307
if not store.exists(path):
308308
store.put(path, data)
309-
ContentRegistry().insert1({
310-
'content_hash': content_hash,
309+
HashRegistry().insert1({
310+
'hash_id': hash_id,
311311
'store': store_name,
312312
'size': len(data)
313313
}, skip_duplicates=True)
314314

315-
return {"hash": content_hash, "store": store_name, "size": len(data)}
315+
return {"hash": hash_id, "store": store_name, "size": len(data)}
316316

317317
def decode(self, stored: dict, *, key=None) -> bytes:
318318
"""Retrieve content by hash."""
319319
store = get_store(stored["store"])
320-
path = f"_content/{stored['hash'][:2]}/{stored['hash'][2:4]}/{stored['hash']}"
320+
path = f"_hash/{stored['hash'][:2]}/{stored['hash'][2:4]}/{stored['hash']}"
321321
return store.get(path)
322322
```
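
The deduplication behaviour of the renamed `HashType` can be exercised without a real object store. The sketch below is an assumption-laden stand-in: `MemoryStore` is a hypothetical in-memory backend, the registry insert is omitted, and `encode`/`decode` are free functions that only mirror the path and metadata logic shown in the diff.

```python
import hashlib


class MemoryStore:
    """Hypothetical in-memory stand-in for an object store backend."""

    def __init__(self):
        self.objects = {}
        self.writes = 0  # counts actual uploads, to make deduplication visible

    def exists(self, path):
        return path in self.objects

    def put(self, path, data):
        self.objects[path] = data
        self.writes += 1

    def get(self, path):
        return self.objects[path]


def hash_path(hash_id):
    """Build the two-level fan-out path used under _hash/."""
    return f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"


def encode(store, data, store_name="main"):
    """Write data once per distinct SHA256 digest and return JSON-style metadata."""
    hash_id = hashlib.sha256(data).hexdigest()
    path = hash_path(hash_id)
    if not store.exists(path):
        store.put(path, data)
    return {"hash": hash_id, "store": store_name, "size": len(data)}


def decode(store, stored):
    """Read the bytes back using only the stored hash."""
    return store.get(hash_path(stored["hash"]))


store = MemoryStore()
first = encode(store, b"abc")
second = encode(store, b"abc")  # identical bytes: no second upload
print(first == second, store.writes)
print(decode(store, first))
```

Because the path is a pure function of the bytes, two rows that insert identical data end up with identical metadata and share one stored object.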
323323

324324
#### Database Column
325325

326-
The `<content@>` type stores JSON metadata:
326+
The `<hash@>` type stores JSON metadata:
327327

328328
```sql
329329
-- content column (MySQL)
@@ -442,7 +442,7 @@ column_name JSONB NOT NULL
442442
```
443443

444444
The `json` database type:
445-
- Used as dtype by built-in AttributeTypes (`<object@>`, `<content@>`, `<filepath@store>`)
445+
- Used as dtype by built-in AttributeTypes (`<object@>`, `<hash@>`, `<filepath@store>`)
446446
- Stores arbitrary JSON-serializable data
447447
- Automatically uses appropriate type for database backend
448448
- Supports JSON path queries where available
@@ -457,7 +457,7 @@ Serializes Python objects (NumPy arrays, dicts, lists, etc.) using DataJoint's
457457
blob format. Compatible with MATLAB.
458458

459459
- **`<blob>`**: Stored in database (`bytes` → `LONGBLOB`/`BYTEA`)
460-
- **`<blob@>`**: Stored externally via `<content@>` with deduplication
460+
- **`<blob@>`**: Stored externally via `<hash@>` with deduplication
461461
- **`<blob@store>`**: Stored in specific named store
462462

463463
```python
@@ -467,7 +467,7 @@ class BlobType(AttributeType):
467467
type_name = "blob"
468468

469469
def get_dtype(self, is_external: bool) -> str:
470-
return "<content>" if is_external else "bytes"
470+
return "<hash>" if is_external else "bytes"
471471

472472
def encode(self, value, *, key=None, store_name=None) -> bytes:
473473
from . import blob
@@ -497,7 +497,7 @@ class ProcessedData(dj.Computed):
497497
Stores files with filename preserved. On fetch, extracts to configured download path.
498498

499499
- **`<attach>`**: Stored in database (`bytes` → `LONGBLOB`/`BYTEA`)
500-
- **`<attach@>`**: Stored externally via `<content@>` with deduplication
500+
- **`<attach@>`**: Stored externally via `<hash@>` with deduplication
501501
- **`<attach@store>`**: Stored in specific named store
502502

503503
```python
@@ -507,7 +507,7 @@ class AttachType(AttributeType):
507507
type_name = "attach"
508508

509509
def get_dtype(self, is_external: bool) -> str:
510-
return "<content>" if is_external else "bytes"
510+
return "<hash>" if is_external else "bytes"
511511

512512
def encode(self, filepath, *, key=None, store_name=None) -> bytes:
513513
path = Path(filepath)
@@ -567,7 +567,7 @@ class ImageType(AttributeType):
567567
type_name = "image"
568568

569569
def get_dtype(self, is_external: bool) -> str:
570-
return "<content>" if is_external else "bytes"
570+
return "<hash>" if is_external else "bytes"
571571

572572
def encode(self, image, *, key=None, store_name=None) -> bytes:
573573
# Convert PIL Image to PNG bytes
@@ -584,31 +584,31 @@ class ImageType(AttributeType):
584584
| Type | get_dtype | Resolves To | Storage Location | Dedup | Returns |
585585
|------|-----------|-------------|------------------|-------|---------|
586586
| `<blob>` | `bytes` | `LONGBLOB`/`BYTEA` | Database | No | Python object |
587-
| `<blob@>` | `<content>` | `json` | `_content/{hash}` | Yes | Python object |
588-
| `<blob@s>` | `<content>` | `json` | `_content/{hash}` | Yes | Python object |
587+
| `<blob@>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Python object |
588+
| `<blob@s>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Python object |
589589
| `<attach>` | `bytes` | `LONGBLOB`/`BYTEA` | Database | No | Local file path |
590-
| `<attach@>` | `<content>` | `json` | `_content/{hash}` | Yes | Local file path |
591-
| `<attach@s>` | `<content>` | `json` | `_content/{hash}` | Yes | Local file path |
590+
| `<attach@>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Local file path |
591+
| `<attach@s>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Local file path |
592592
| `<object@>` | `json` | `JSON`/`JSONB` | `{schema}/{table}/{pk}/` | No | ObjectRef |
593593
| `<object@s>` | `json` | `JSON`/`JSONB` | `{schema}/{table}/{pk}/` | No | ObjectRef |
594-
| `<content@>` | `json` | `JSON`/`JSONB` | `_content/{hash}` | Yes | bytes |
595-
| `<content@s>` | `json` | `JSON`/`JSONB` | `_content/{hash}` | Yes | bytes |
594+
| `<hash@>` | `json` | `JSON`/`JSONB` | `_hash/{hash}` | Yes | bytes |
595+
| `<hash@s>` | `json` | `JSON`/`JSONB` | `_hash/{hash}` | Yes | bytes |
596596
| `<filepath@s>` | `json` | `JSON`/`JSONB` | Configured store | No | ObjectRef |
597597

598-
## Reference Counting for Content Type
598+
## Reference Counting for Hash Type
599599

600-
The `ContentRegistry` is a **project-level** table that tracks content-addressed objects
600+
The `HashRegistry` is a **project-level** table that tracks hash-addressed objects
601601
across all schemas. This differs from the legacy `~external_*` tables which were per-schema.
602602

603603
```python
604-
class ContentRegistry:
604+
class HashRegistry:
605605
"""
606-
Project-level content registry.
607-
Stored in a designated database (e.g., `{project}_content`).
606+
Project-level hash registry.
607+
Stored in a designated database (e.g., `{project}_hash`).
608608
"""
609609
definition = """
610-
# Content-addressed object registry (project-wide)
611-
content_hash : char(64) # SHA256 hex
610+
# Hash-addressed object registry (project-wide)
611+
hash_id : char(64) # SHA256 hex
612612
---
613613
store : varchar(64) # Store name
614614
size : bigint unsigned # Size in bytes
@@ -620,34 +620,34 @@ Garbage collection scans **all schemas** in the project:
620620

621621
```python
622622
def garbage_collect(project):
623-
"""Remove content not referenced by any table in any schema."""
623+
"""Remove data not referenced by any table in any schema."""
624624
# Get all registered hashes
625-
registered = set(ContentRegistry().fetch('content_hash', 'store'))
625+
registered = set(HashRegistry().fetch('hash_id', 'store'))
626626

627627
# Get all referenced hashes from ALL schemas in the project
628628
referenced = set()
629629
for schema in project.schemas:
630630
for table in schema.tables:
631631
for attr in table.heading.attributes:
632-
if attr.type in ('content', 'content@...'):
632+
if attr.type in ('hash', 'hash@...'):
633633
hashes = table.fetch(attr.name)
634634
referenced.update((h, attr.store) for h in hashes)
635635

636-
# Delete orphaned content
637-
for content_hash, store in (registered - referenced):
636+
# Delete orphaned data
637+
for hash_id, store in (registered - referenced):
638638
store_backend = get_store(store)
639-
store_backend.delete(content_path(content_hash))
640-
(ContentRegistry() & {'content_hash': content_hash}).delete()
639+
store_backend.delete(hash_path(hash_id))
640+
(HashRegistry() & {'hash_id': hash_id}).delete()
641641
```
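
The core of the garbage-collection pass is a set difference over `(hash_id, store)` pairs. A minimal sketch, assuming the registry is a plain Python set and the store is a dict keyed by `(store, hash_id)`; both containers and the helper names `find_orphans` and `collect` are hypothetical and stand in for `HashRegistry` and the store backend.

```python
def find_orphans(registered, referenced):
    """Return (hash_id, store) pairs that are registered but no longer referenced."""
    return sorted(set(registered) - set(referenced))


def collect(objects, registry, referenced):
    """Delete orphaned entries from a dict-backed store and a set-backed registry."""
    removed = find_orphans(registry, referenced)
    for hash_id, store in removed:
        objects.pop((store, hash_id), None)  # drop the stored bytes
        registry.discard((hash_id, store))   # drop the registry row
    return removed


registry = {("aa11", "main"), ("bb22", "main"), ("aa11", "cold")}
objects = {(store, hash_id): b"..." for hash_id, store in registry}
referenced = {("aa11", "main")}

removed = collect(objects, registry, referenced)
print(removed)
print(sorted(registry))
```

Keeping the store name in each pair means the same hash held in two stores is tracked and collected independently.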
642642

643643
## Built-in AttributeType Comparison
644644

645-
| Feature | `<blob>` | `<attach>` | `<object@>` | `<content@>` | `<filepath@>` |
645+
| Feature | `<blob>` | `<attach>` | `<object@>` | `<hash@>` | `<filepath@>` |
646646
|---------|----------|------------|-------------|--------------|---------------|
647647
| Storage modes | Both | Both | External only | External only | External only |
648648
| Internal dtype | `bytes` | `bytes` | N/A | N/A | N/A |
649-
| External dtype | `<content>` | `<content>` | `json` | `json` | `json` |
650-
| Addressing | Content hash | Content hash | Primary key | Content hash | Relative path |
649+
| External dtype | `<hash>` | `<hash>` | `json` | `json` | `json` |
650+
| Addressing | Hash | Hash | Primary key | Hash | Relative path |
651651
| Deduplication | Yes (external) | Yes (external) | No | Yes | No |
652652
| Structure | Single blob | Single file | Files, folders | Single blob | Any |
653653
| Returns | Python object | Local path | ObjectRef | bytes | ObjectRef |
@@ -657,7 +657,7 @@ def garbage_collect(project):
657657
- **`<blob>`**: Serialized Python objects (NumPy arrays, dicts). Use `<blob@>` for large/duplicated data
658658
- **`<attach>`**: File attachments with filename preserved. Use `<attach@>` for large files
659659
- **`<object@>`**: Large/complex file structures (Zarr, HDF5) where DataJoint controls organization
660-
- **`<content@>`**: Raw bytes with deduplication (typically used via `<blob@>` or `<attach@>`)
660+
- **`<hash@>`**: Raw bytes with deduplication (typically used via `<blob@>` or `<attach@>`)
661661
- **`<filepath@store>`**: Portable references to externally-managed files
662662
- **`varchar`**: Arbitrary URLs/paths where ObjectRef semantics aren't needed
663663

@@ -671,9 +671,9 @@ def garbage_collect(project):
671671
3. **AttributeTypes use angle brackets**: `<blob>`, `<object@store>`, `<filepath@main>` - distinguishes from core types
672672
4. **`@` indicates external storage**: No `@` = database, `@` present = object store
673673
5. **`get_dtype(is_external)` method**: Types resolve dtype at declaration time based on storage mode
674-
6. **AttributeTypes are composable**: `<blob@>` uses `<content@>`, which uses `json`
674+
6. **AttributeTypes are composable**: `<blob@>` uses `<hash@>`, which uses `json`
675675
7. **Built-in external types use JSON dtype**: Stores metadata (path, hash, store name, etc.)
676-
8. **Two OAS regions**: object (PK-addressed) and content (hash-addressed) within managed stores
676+
8. **Two OAS regions**: object (PK-addressed) and hash (hash-addressed) within managed stores
677677
9. **Filepath for portability**: `<filepath@store>` uses relative paths within stores for environment portability
678678
10. **No `uri` type**: For arbitrary URLs, use `varchar`—simpler and more transparent
679679
11. **Naming conventions**:
@@ -682,7 +682,7 @@ def garbage_collect(project):
682682
- `@` alone = default store
683683
- `@name` = named store
684684
12. **Dual-mode types**: `<blob>` and `<attach>` support both internal and external storage
685-
13. **External-only types**: `<object@>`, `<content@>`, `<filepath@>` require `@`
685+
13. **External-only types**: `<object@>`, `<hash@>`, `<filepath@>` require `@`
686686
14. **Transparent access**: AttributeTypes return Python objects or file paths
687687
15. **Lazy access**: `<object@>` and `<filepath@store>` return ObjectRef
688688

@@ -699,20 +699,20 @@ def garbage_collect(project):
699699
### Migration from Legacy `~external_*` Stores
700700

701701
Legacy external storage used per-schema `~external_{store}` tables. Migration to the new
702-
per-project `ContentRegistry` requires:
702+
per-project `HashRegistry` requires:
703703

704704
```python
705705
def migrate_external_store(schema, store_name):
706706
"""
707-
Migrate legacy ~external_{store} to new ContentRegistry.
707+
Migrate legacy ~external_{store} to new HashRegistry.
708708
709709
1. Read all entries from ~external_{store}
710710
2. For each entry:
711711
- Fetch content from legacy location
712712
- Compute SHA256 hash
713-
- Copy to _content/{hash}/ if not exists
713+
- Copy to _hash/{hash}/ if not exists
714714
- Update table column from UUID to hash
715-
- Register in ContentRegistry
715+
- Register in HashRegistry
716716
3. After all schemas migrated, drop ~external_{store} tables
717717
"""
718718
external_table = schema.external[store_name]
@@ -724,17 +724,17 @@ def migrate_external_store(schema, store_name):
724724
content = external_table.get(legacy_uuid)
725725

726726
# Compute new content hash
727-
content_hash = hashlib.sha256(content).hexdigest()
727+
hash_id = hashlib.sha256(content).hexdigest()
728728

729729
# Store in new location if not exists
730-
new_path = f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
730+
new_path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"
731731
store = get_store(store_name)
732732
if not store.exists(new_path):
733733
store.put(new_path, content)
734734

735-
# Register in project-wide ContentRegistry
736-
ContentRegistry().insert1({
737-
'content_hash': content_hash,
735+
# Register in project-wide HashRegistry
736+
HashRegistry().insert1({
737+
'hash_id': hash_id,
738738
'store': store_name,
739739
'size': len(content)
740740
}, skip_duplicates=True)
@@ -755,4 +755,4 @@ def migrate_external_store(schema, store_name):
755755
## Open Questions
756756

757757
1. How long should the backward compatibility layer support legacy `~external_*` format?
758-
2. Should `<content@>` (without store name) use a default store or require explicit store name?
758+
2. Should `<hash@>` (without store name) use a default store or require explicit store name?
