Commit 9bd37f6

Add DJBlobType and migration utilities for blob columns

Introduces `<djblob>` as an explicit AttributeType for DataJoint's native blob serialization, allowing users to be explicit about serialization behavior in table definitions.

Key changes:

- Add a DJBlobType class with a `serializes=True` flag to indicate that it handles its own serialization (avoiding double pack/unpack)
- Update table.py and fetch.py to respect the `serializes` flag, skipping blob.pack/unpack when the adapter handles serialization
- Add a `dj.migrate` module with utilities for migrating existing schemas to use explicit `<djblob>` type declarations
- Add tests for DJBlobType functionality
- Document the `<djblob>` type and the migration procedure

The migration is metadata-only: the blob data format is unchanged. Existing `longblob` columns continue to work with implicit serialization for backward compatibility.

1 parent af9bd8d commit 9bd37f6

File tree

7 files changed: +572 -14 lines

docs/src/design/tables/customtype.md

Lines changed: 114 additions & 0 deletions
@@ -476,3 +476,117 @@ def test_graph_type_roundtrip():

## Built-in Types

DataJoint includes a built-in type for explicit blob serialization:

### `<djblob>` - DataJoint Blob Serialization

The `<djblob>` type provides explicit control over DataJoint's native binary serialization. It supports:

- NumPy arrays (compatible with MATLAB)
- Python dicts, lists, tuples, sets
- datetime objects, Decimals, UUIDs
- Nested data structures
- Optional compression

```python
@schema
class ProcessedData(dj.Manual):
    definition = """
    data_id : int
    ---
    results : <djblob>    # Explicit serialization
    raw_bytes : longblob  # Backward-compatible (auto-serialized)
    """
```

#### When to Use `<djblob>`

- **New tables**: Prefer `<djblob>` for clarity and future-proofing
- **Custom types**: Use `<djblob>` when your type chains to blob storage
- **Migration**: Existing `longblob` columns can be migrated to `<djblob>`

#### Backward Compatibility

For backward compatibility, `longblob` columns without an explicit type still receive automatic serialization. The behavior is identical to `<djblob>`, but using `<djblob>` makes the serialization explicit in your code.
## Schema Migration

When upgrading existing schemas to use explicit type declarations, DataJoint provides migration utilities.

### Analyzing Blob Columns

```python
import datajoint as dj

schema = dj.schema("my_database")

# Check migration status
status = dj.migrate.check_migration_status(schema)
print(f"Blob columns: {status['total_blob_columns']}")
print(f"Already migrated: {status['migrated']}")
print(f"Pending migration: {status['pending']}")
```
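A minimal sketch of how such a status check could be computed, assuming type tags are stored as a `:<djblob>:` prefix in column comments (the convention this commit uses). The sample rows and the in-memory representation are hypothetical; the real utility presumably inspects the database's information schema:

```python
import re

# Hypothetical stand-in for a schema's blob columns;
# each row is (table, column, comment).
BLOB_COLUMNS = [
    ("session", "raw", "acquired signal"),              # pending
    ("analysis", "results", ":<djblob>:model output"),  # already tagged
]

TYPE_TAG = re.compile(r"^:<[a-z_]+>:")  # e.g. ":<djblob>:user comment"

def check_migration_status(columns):
    """Count blob columns whose comment already carries a type tag."""
    migrated = sum(1 for _, _, comment in columns if TYPE_TAG.match(comment))
    return {
        "total_blob_columns": len(columns),
        "migrated": migrated,
        "pending": len(columns) - migrated,
    }
```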
### Generating Migration SQL

```python
# Preview migration (dry run)
result = dj.migrate.migrate_blob_columns(schema, dry_run=True)
for sql in result['sql_statements']:
    print(sql)
```

### Applying Migration

```python
# Apply migration
result = dj.migrate.migrate_blob_columns(schema, dry_run=False)
print(f"Migrated {result['migrated']} columns")
```
### Migration Details

The migration updates MySQL column comments to include the type declaration. This is a **metadata-only** change - the actual blob data format is unchanged.

Before migration:

- Column: `longblob`
- Comment: `user comment`
- Behavior: Auto-serialization (implicit)

After migration:

- Column: `longblob`
- Comment: `:<djblob>:user comment`
- Behavior: Explicit serialization via `<djblob>`
### Updating Table Definitions

After database migration, update your Python table definitions for consistency:

```python
# Before
class MyTable(dj.Manual):
    definition = """
    id : int
    ---
    data : longblob  # stored data
    """

# After
class MyTable(dj.Manual):
    definition = """
    id : int
    ---
    data : <djblob>  # stored data
    """
```

Both definitions work identically after migration, but using `<djblob>` makes the serialization explicit and documents the intended behavior.
src/datajoint/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -58,6 +58,7 @@
 ]

 from . import errors
+from . import migrate
 from .admin import kill
 from .attribute_adapter import AttributeAdapter
 from .attribute_type import AttributeType, list_types, register_type
```
src/datajoint/attribute_type.py

Lines changed: 125 additions & 0 deletions
```diff
@@ -153,6 +153,10 @@ def decode(self, stored: Any, *, key: dict | None = None) -> Any:
         """
         ...

+    # Class attribute: if True, encode() produces final binary data (no blob.pack needed).
+    # Override in subclasses that handle their own serialization.
+    serializes: bool = False
+
     def validate(self, value: Any) -> None:
         """
         Validate a value before encoding.
@@ -409,3 +413,124 @@ def resolve_dtype(dtype: str, seen: set[str] | None = None) -> tuple[str, list[A

     # Not a custom type - return as-is
     return dtype, chain
+
+
+# =============================================================================
+# Built-in Attribute Types
+# =============================================================================
+
+
+class DJBlobType(AttributeType):
+    """
+    Built-in type for DataJoint's native serialization format.
+
+    This type handles serialization of arbitrary Python objects (including NumPy arrays,
+    dictionaries, lists, etc.) using DataJoint's binary blob format. The format includes:
+
+    - Protocol headers (``mYm`` for MATLAB-compatible, ``dj0`` for Python-native)
+    - Optional compression (zlib)
+    - Support for NumPy arrays, datetime objects, UUIDs, and nested structures
+
+    The ``<djblob>`` type is the explicit way to specify DataJoint's serialization.
+    It stores data in a MySQL ``LONGBLOB`` column.
+
+    Example:
+        @schema
+        class ProcessedData(dj.Manual):
+            definition = '''
+            data_id : int
+            ---
+            results : <djblob>    # Explicit DataJoint serialization
+            raw_bytes : longblob  # Raw bytes (no serialization)
+            '''
+
+    Note:
+        For backward compatibility, ``longblob`` columns without an explicit type
+        still use automatic serialization. Use ``<djblob>`` to be explicit about
+        serialization behavior.
+    """
+
+    type_name = "djblob"
+    dtype = "longblob"
+    serializes = True  # This type handles its own serialization
+
+    def encode(self, value: Any, *, key: dict | None = None) -> bytes:
+        """
+        Serialize a Python object to DataJoint's blob format.
+
+        Args:
+            value: Any serializable Python object (dict, list, numpy array, etc.)
+            key: Primary key values (unused for blob serialization).
+
+        Returns:
+            Serialized bytes with protocol header and optional compression.
+        """
+        from . import blob
+
+        return blob.pack(value, compress=True)
+
+    def decode(self, stored: bytes, *, key: dict | None = None) -> Any:
+        """
+        Deserialize DataJoint blob format back to a Python object.
+
+        Args:
+            stored: Serialized blob bytes.
+            key: Primary key values (unused for blob serialization).
+
+        Returns:
+            The deserialized Python object.
+        """
+        from . import blob
+
+        return blob.unpack(stored, squeeze=False)
+
+
+class DJBlobExternalType(AttributeType):
+    """
+    Built-in type for externally-stored DataJoint blobs.
+
+    Similar to ``<djblob>`` but stores data in external blob storage instead
+    of inline in the database. Useful for large objects.
+
+    The store name is specified when defining the column type.
+
+    Example:
+        @schema
+        class LargeData(dj.Manual):
+            definition = '''
+            data_id : int
+            ---
+            large_array : blob@mystore  # External storage with auto-serialization
+            '''
+    """
+
+    # Note: this type isn't directly usable via <djblob_external> syntax.
+    # It's used internally when blob@store syntax is detected.
+    type_name = "djblob_external"
+    dtype = "blob@store"  # Placeholder - actual store is determined at declaration time
+    serializes = True  # This type handles its own serialization
+
+    def encode(self, value: Any, *, key: dict | None = None) -> bytes:
+        """Serialize a Python object to DataJoint's blob format."""
+        from . import blob
+
+        return blob.pack(value, compress=True)
+
+    def decode(self, stored: bytes, *, key: dict | None = None) -> Any:
+        """Deserialize DataJoint blob format back to a Python object."""
+        from . import blob
+
+        return blob.unpack(stored, squeeze=False)
+
+
+def _register_builtin_types() -> None:
+    """
+    Register DataJoint's built-in attribute types.
+
+    Called automatically during module initialization.
+    """
+    register_type(DJBlobType)
+
+
+# Register built-in types when module is loaded.
+_register_builtin_types()
```

src/datajoint/fetch.py

Lines changed: 10 additions & 12 deletions
```diff
@@ -88,18 +88,16 @@ def adapt(x):
         safe_write(local_filepath, data.split(b"\0", 1)[1])
         return adapt(str(local_filepath))  # download file from remote store

-    return adapt(
-        uuid.UUID(bytes=data)
-        if attr.uuid
-        else (
-            blob.unpack(
-                extern.get(uuid.UUID(bytes=data)) if attr.is_external else data,
-                squeeze=squeeze,
-            )
-            if attr.is_blob
-            else data
-        )
-    )
+    if attr.uuid:
+        return adapt(uuid.UUID(bytes=data))
+    elif attr.is_blob:
+        blob_data = extern.get(uuid.UUID(bytes=data)) if attr.is_external else data
+        # Skip unpack if adapter handles its own deserialization
+        if attr.adapter and getattr(attr.adapter, "serializes", False):
+            return attr.adapter.decode(blob_data, key=None)
+        return adapt(blob.unpack(blob_data, squeeze=squeeze))
+    else:
+        return adapt(data)


 class Fetch:
```
