## Overview

The `file` type introduces a new paradigm for managed file storage in DataJoint. Unlike the existing `attach@store` and `filepath@store` types, which reference named stores, the `file` type uses a **unified storage backend** that is tightly coupled with the schema and configured at the pipeline level.

## Storage Architecture

### Single Storage Backend Per Pipeline

Each DataJoint pipeline has **one** associated storage backend configured in `datajoint.toml`. DataJoint fully controls the path structure within this backend.

### Supported Backends

DataJoint uses **[`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/)** to ensure compatibility across multiple storage backends (see the sketch after this list):

- **Local storage** – POSIX-compliant file systems (e.g., NFS, SMB)
- **Cloud-based object storage** – Amazon S3, Google Cloud Storage, Azure Blob, MinIO
- **Hybrid storage** – combining local and cloud storage for flexibility

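For a rough illustration (not DataJoint's actual API), the value of `fsspec` is that the same filesystem interface addresses any of these backends; the protocols and paths below are placeholders chosen for the sketch:

```python
import fsspec

# Illustrative only: the same fsspec interface addresses any supported backend.
fs = fsspec.filesystem("file")                   # local POSIX storage (NFS, SMB mounts)
# fs = fsspec.filesystem("s3", anon=False)       # Amazon S3 (requires the s3fs package)
# fs = fsspec.filesystem("gcs")                  # Google Cloud Storage (requires gcsfs)

# Writing, listing, and reading look the same regardless of backend:
with fs.open("/tmp/example.dat", "wb") as f:     # hypothetical path, for demonstration
    f.write(b"hello")
print(fs.ls("/tmp"))
```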
## Project Structure

A DataJoint project creates a structured hierarchical storage pattern:

```
📁 project_name/
├── datajoint.toml
├── 📁 schema_name1/
├── 📁 schema_name2/
├── 📁 schema_name3/
│   ├── schema.py
│   ├── 📁 tables/
│   │   ├── table1/key1-value1.parquet
│   │   ├── table2/key2-value2.parquet
│   │   ...
│   ├── 📁 objects/
│   │   ├── table1-field1/key3-value3.zarr
│   │   ├── table1-field2/key3-value3.gif
│   │   ...
```

### Object Storage Keys

When using cloud object storage:

```
s3://bucket/project_name/schema_name3/objects/table1/key1-value1.parquet
s3://bucket/project_name/schema_name3/objects/table1-field1/key3-value3.zarr
```

## Configuration

### `datajoint.toml` Structure

```toml
[project]
name = "my_project"

[storage]
backend = "s3"  # or "file", "gcs", "azure"
bucket = "my-bucket"
# For local: path = "/data/my_project"

[storage.credentials]
# Backend-specific credentials (or reference to a secrets manager)

[object_storage]
partition_pattern = "subject{subject_id}/session{session_id}"
```
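For illustration, such a file could be read with the standard library's TOML parser (`tomllib`, Python 3.11+); the keys mirror the example above and are not a confirmed DataJoint API:

```python
import tomllib  # standard library in Python 3.11+; the tomli package offers the same API earlier

# Illustrative only: load the pipeline-level storage settings shown above.
with open("datajoint.toml", "rb") as f:
    config = tomllib.load(f)

backend = config["storage"]["backend"]                                # e.g. "s3" or "file"
pattern = config.get("object_storage", {}).get("partition_pattern")   # optional partitioning
print(backend, pattern)
```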
### Partition Pattern

The organizational structure of stored objects is configurable, allowing partitioning based on **primary key attributes**.

```toml
[object_storage]
partition_pattern = "subject{subject_id}/session{session_id}"
```

Placeholders `{subject_id}` and `{session_id}` are dynamically replaced with actual primary key values.
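For example, the substitution can be done with ordinary string formatting against the row's primary key (a sketch, not necessarily the internal implementation):

```python
# Sketch: fill the configured partition pattern from a row's primary key.
pattern = "subject{subject_id}/session{session_id}"
key = {"subject_id": 123, "session_id": 45}

partition = pattern.format(**key)
print(partition)  # subject123/session45
```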
**Example with partitioning:**

```
s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table1/key1-value1/image1.tiff
s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table2/key2-value2/movie2.zarr
```

## Syntax

```python
@schema
class Recording(dj.Manual):
    definition = """
    subject_id : int
    session_id : int
    ---
    raw_data : file     # managed file storage
    processed : file    # another file attribute
    """
```

Note: no `@store` suffix is needed; the storage location is determined by the pipeline configuration.

## Database Storage

The `file` type is stored as a `JSON` column in MySQL containing:

```json
{
  "path": "subject123/session45/schema_name/objects/Recording-raw_data/...",
  "size": 12345,
  "hash": "sha256:abcdef1234...",
  "original_name": "recording.dat",
  "timestamp": "2025-01-15T10:30:00Z",
  "mime_type": "application/octet-stream"
}
```

### Metadata Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `path` | string | Yes | Full path/key within the storage backend |
| `size` | integer | Yes | File size in bytes |
| `hash` | string | Yes | Content hash with algorithm prefix |
| `original_name` | string | Yes | Original filename at insert time |
| `timestamp` | string | Yes | ISO 8601 upload timestamp |
| `mime_type` | string | No | MIME type (auto-detected or provided) |

## Path Generation

DataJoint generates storage paths using:

1. **Project name** - from configuration
2. **Partition values** - from the primary key (if configured)
3. **Schema name** - from the table's schema
4. **Object directory** - `objects/`
5. **Table-field identifier** - `{table_name}-{field_name}/`
6. **Key identifier** - derived from primary key values
7. **Original filename** - preserved from insert

Example path construction:

```
{project}/{partition}/{schema}/objects/{table}-{field}/{key_hash}/{original_name}
```
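A sketch of how these pieces might be combined; the helper name `build_storage_path` and the key-hashing choice are assumptions, not the actual implementation:

```python
import hashlib
from pathlib import PurePosixPath

def build_storage_path(project, partition, schema, table, field, key, original_name):
    """Illustrative only: assemble a storage key from the components listed above."""
    # Derive a short, deterministic identifier from the primary key values.
    key_hash = hashlib.sha256(repr(sorted(key.items())).encode()).hexdigest()[:16]
    return str(PurePosixPath(project, partition, schema, "objects",
                             f"{table}-{field}", key_hash, original_name))

print(build_storage_path("project_name", "subject123/session45", "schema_name3",
                         "table1", "field1",
                         {"subject_id": 123, "session_id": 45}, "image1.tiff"))
# project_name/subject123/session45/schema_name3/objects/table1-field1/<key_hash>/image1.tiff
```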

## Insert Behavior

At insert time, the `file` attribute accepts:

1. **File path** (string or `Path`): path to an existing file
2. **Stream object**: file-like object with a `read()` method
3. **Tuple of (name, stream)**: stream with an explicit filename

```python
# From a file path
Recording.insert1({
    "subject_id": 123,
    "session_id": 45,
    "raw_data": "/local/path/to/recording.dat"
})

# From a stream with an explicit name
with open("/local/path/data.bin", "rb") as f:
    Recording.insert1({
        "subject_id": 123,
        "session_id": 45,
        "raw_data": ("custom_name.dat", f)
    })
```

### Insert Processing Steps

1. Resolve storage backend from the schema's pipeline configuration
2. Read file content (from path or stream)
3. Compute content hash (SHA-256)
4. Generate storage path using partition pattern and primary key
5. Upload file to storage backend via `fsspec`
6. Build JSON metadata structure
7. Store JSON in database column

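Steps 2 through 5 above might look roughly like the following sketch; the helper `store_file` and the `fsspec` upload call are illustrative assumptions, not the actual insert code:

```python
import hashlib
import fsspec

def store_file(local_path: str, storage_path: str, protocol: str = "file") -> dict:
    """Illustrative only: hash a local file and copy it into the backend (steps 2-5)."""
    sha = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # stream in 1 MiB chunks
            sha.update(chunk)

    fs = fsspec.filesystem(protocol)
    fs.put(local_path, storage_path)                        # upload to the configured backend

    # A subset of the JSON metadata that would be stored in the database column:
    return {"path": storage_path, "hash": f"sha256:{sha.hexdigest()}"}
```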

## Fetch Behavior

On fetch, the `file` type returns a `FileRef` object:

```python
record = Recording.fetch1()
file_ref = record["raw_data"]

# Access metadata
print(file_ref.path)           # Full storage path
print(file_ref.size)           # File size in bytes
print(file_ref.hash)           # Content hash
print(file_ref.original_name)  # Original filename

# Read content directly (streams from backend)
content = file_ref.read()  # Returns bytes

# Download to local path
local_path = file_ref.download("/local/destination/")

# Open as fsspec file object
with file_ref.open() as f:
    data = f.read()
```

## Implementation Components

### 1. Storage Backend (`storage.py` - new module)

- `StorageBackend` class wrapping `fsspec` (sketched below)
- Methods: `upload()`, `download()`, `open()`, `exists()`, `delete()`
- Path generation with partition support
- Configuration loading from `datajoint.toml`

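A minimal sketch of what such a wrapper could look like; the class shape and method bodies are assumptions, not the final design:

```python
import fsspec

class StorageBackend:
    """Illustrative wrapper around one fsspec filesystem per pipeline."""

    def __init__(self, protocol: str, root: str, **options):
        self.fs = fsspec.filesystem(protocol, **options)
        self.root = root.rstrip("/")

    def _full(self, key: str) -> str:
        return f"{self.root}/{key}"

    def upload(self, local_path: str, key: str) -> None:
        self.fs.put(local_path, self._full(key))

    def download(self, key: str, local_path: str) -> None:
        self.fs.get(self._full(key), local_path)

    def open(self, key: str, mode: str = "rb"):
        return self.fs.open(self._full(key), mode)

    def exists(self, key: str) -> bool:
        return self.fs.exists(self._full(key))

    def delete(self, key: str) -> None:
        self.fs.rm(self._full(key))
```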

### 2. Type Declaration (`declare.py`)

- Add `FILE` pattern: `file$` (illustrated below)
- Add to `SPECIAL_TYPES`
- Substitute to `JSON` type in the database

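For illustration, the new pattern simply matches the bare keyword; this is a sketch mirroring how other special types are matched, not the exact `declare.py` code:

```python
import re

# Illustrative: unlike attach@store / filepath@store, the bare keyword has no store suffix.
FILE = re.compile(r"^file$", re.IGNORECASE)

assert FILE.match("file")
assert not FILE.match("file@raw")
```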

### 3. Schema Integration (`schemas.py`)

- Associate storage backend with schema
- Load configuration on schema creation

### 4. Insert Processing (`table.py`)

- New `__process_file_attribute()` method
- Path generation using primary key and partition pattern
- Upload via storage backend

### 5. Fetch Processing (`fetch.py`)

- New `FileRef` class
- Lazy loading from storage backend
- Metadata access interface

### 6. FileRef Class (`fileref.py` - new module)

```python
from datetime import datetime
from pathlib import Path
from typing import IO


class FileRef:
    """Reference to a file stored in the pipeline's storage backend."""

    path: str
    size: int
    hash: str
    original_name: str
    timestamp: datetime
    mime_type: str | None

    def read(self) -> bytes: ...
    def open(self, mode="rb") -> IO: ...
    def download(self, destination: Path) -> Path: ...
    def exists(self) -> bool: ...
```
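As a usage-oriented sketch, the stubbed methods above could delegate to the pipeline's storage backend; the `_backend` attribute here is a hypothetical wiring detail, not part of the spec:

```python
# Hypothetical sketch only: how the stubbed FileRef methods might delegate to the backend.
class _FileRefSketch:
    def __init__(self, backend, path: str):
        self._backend = backend          # a StorageBackend-like object (see component 1)
        self.path = path

    def read(self) -> bytes:
        with self._backend.open(self.path, "rb") as f:
            return f.read()

    def exists(self) -> bool:
        return self._backend.exists(self.path)
```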

## Dependencies

New dependency: `fsspec`, with optional backend-specific packages:

```toml
[project]
dependencies = ["fsspec>=2023.1.0"]

[project.optional-dependencies]
s3 = ["s3fs"]
gcs = ["gcsfs"]
azure = ["adlfs"]
```

## Comparison with Existing Types

| Feature | `attach@store` | `filepath@store` | `file` |
|---------|----------------|------------------|--------|
| Store config | Per-attribute | Per-attribute | Per-pipeline |
| Path control | DataJoint | User-managed | DataJoint |
| DB column | binary(16) UUID | binary(16) UUID | JSON |
| Backend | File/S3 | File/S3 | fsspec (any) |
| Partitioning | Hash-based | User path | Configurable |
| Metadata | External table | External table | Inline JSON |

## Migration Path

- Existing `attach@store` and `filepath@store` remain unchanged
- The `file` type is additive - new tables only
- Future: migration utilities to convert existing external storage

## Future Extensions

- [ ] Directory/folder support (store entire directories)
- [ ] Compression options (gzip, lz4, zstd)
- [ ] Encryption at rest
- [ ] Versioning support
- [ ] Streaming upload for large files
- [ ] Checksum verification options
- [ ] Cache layer for frequently accessed files
299+ - [ ] Cache layer for frequently accessed files
0 commit comments