Commit af7c76c

Remove HashRegistry table, use JSON field scanning for GC
Hash metadata (hash, store, size) is stored directly in each table's JSON column - no separate registry table is needed. Garbage collection now scans all tables to find referenced hashes in JSON fields directly.

Co-authored-by: dimitri-yatsenko <[email protected]>
1 parent 7d0a5a5 commit af7c76c
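
The mark-and-sweep approach described in the commit message can be sketched in a self-contained form. This is only an illustration under stated assumptions, not the project's implementation: stores are modeled as in-memory dicts keyed by SHA-256 hex digest, tables as lists of row dicts whose JSON columns hold `{"hash", "store", "size"}` metadata, and the helper names `put` and `referenced_hashes` are hypothetical. The spec's pseudocode leaves `project`, `get_store`, `uses_hash_storage`, and `hash_path` undefined; the sketch replaces them with these stand-ins.

```python
import hashlib


def put(stores, store_name, data):
    """Store bytes under their SHA-256 hex digest and return the JSON metadata."""
    digest = hashlib.sha256(data).hexdigest()
    stores[store_name][digest] = data
    return {"hash": digest, "store": store_name, "size": len(data)}


def referenced_hashes(tables, store_name):
    """Collect every hash that some row's JSON field points at in the given store."""
    referenced = set()
    for rows in tables.values():
        for row in rows:
            for value in row.values():
                # Only dict-valued columns can carry hash metadata; skip NULLs and scalars.
                if isinstance(value, dict) and value.get("store") == store_name and "hash" in value:
                    referenced.add(value["hash"])
    return referenced


def garbage_collect(stores, tables, store_name):
    """Delete objects in one store that no table row references; return their hashes."""
    store = stores[store_name]
    orphans = set(store) - referenced_hashes(tables, store_name)
    for digest in orphans:
        del store[digest]
    return orphans


# Hypothetical data: one referenced object and one orphan in the "main" store.
stores = {"main": {}}
kept = put(stores, "main", b"kept")
orphan = put(stores, "main", b"orphan")
tables = {"session": [{"id": 1, "image": kept}, {"id": 2, "image": None}]}
removed = garbage_collect(stores, tables, "main")
```

One consequence of dropping the registry is worth keeping in mind: an object that has been written to the store but whose row is not yet committed looks like an orphan to this scan, so a real implementation would presumably need some guard against running concurrently with inserts.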

1 file changed: +17 -35 lines changed

docs/src/design/tables/storage-types-spec.md

Lines changed: 17 additions & 35 deletions
@@ -609,49 +609,31 @@ class ImageType(AttributeType):
 | `<hash@s>` | `json` | `JSON`/`JSONB` | `_hash/{hash}` | Yes | bytes |
 | `<filepath@s>` | `json` | `JSON`/`JSONB` | Configured store | No | ObjectRef |
 
-## Reference Counting for Hash Type
+## Garbage Collection for Hash Storage
 
-The `HashRegistry` is a **project-level** table that tracks hash-addressed objects
-across all schemas. This differs from the legacy `~external_*` tables which were per-schema.
+Hash metadata (hash, store, size) is stored directly in each table's JSON column - no separate
+registry table is needed. Garbage collection scans all tables to find referenced hashes:
 
 ```python
-class HashRegistry:
-    """
-    Project-level hash registry.
-    Stored in a designated database (e.g., `{project}_hash`).
-    """
-    definition = """
-    # Hash-addressed object registry (project-wide)
-    hash_id : char(64) # SHA256 hex
-    ---
-    store : varchar(64) # Store name
-    size : uint64 # Size in bytes
-    created = CURRENT_TIMESTAMP : datetime
-    """
-```
-
-Garbage collection scans **all schemas** in the project:
-
-```python
-def garbage_collect(project):
-    """Remove data not referenced by any table in any schema."""
-    # Get all registered hashes
-    registered = set(HashRegistry().fetch('hash_id', 'store'))
+def garbage_collect(store_name):
+    """Remove hash-addressed data not referenced by any table."""
+    # Scan store for all hash files
+    store = get_store(store_name)
+    all_hashes = set(store.list_hashes()) # from _hash/ directory
 
-    # Get all referenced hashes from ALL schemas in the project
+    # Scan all tables for referenced hashes
     referenced = set()
     for schema in project.schemas:
         for table in schema.tables:
             for attr in table.heading.attributes:
-                if attr.type in ('hash', 'hash@...'):
-                    hashes = table.fetch(attr.name)
-                    referenced.update((h, attr.store) for h in hashes)
-
-    # Delete orphaned data
-    for hash_id, store in (registered - referenced):
-        store_backend = get_store(store)
-        store_backend.delete(hash_path(hash_id))
-        (HashRegistry() & {'hash_id': hash_id}).delete()
+                if uses_hash_storage(attr): # <blob@>, <attach@>, <hash@>
+                    for row in table.fetch(attr.name):
+                        if row and row.get('store') == store_name:
+                            referenced.add(row['hash'])
+
+    # Delete orphaned files
+    for hash_id in (all_hashes - referenced):
+        store.delete(hash_path(hash_id))
 ```
 
 ## Built-in AttributeType Comparison
