Commit ba3c66b

Revise file type spec: unified storage backend with fsspec
- Single storage backend per pipeline (no @store suffix)
- Use fsspec for multi-backend support (local, S3, GCS, Azure)
- Configuration via datajoint.toml at project level
- Configurable partition patterns based on primary key attributes
- Hierarchical project structure with tables/ and objects/ dirs
1 parent 4518b36 commit ba3c66b

File tree

1 file changed: +209 -100

docs/src/design/tables/file-type-spec.md

Lines changed: 209 additions & 100 deletions

## Overview

The `file` type introduces a new paradigm for managed file storage in DataJoint. Unlike the existing `attach@store` and `filepath@store` types, which reference named stores, the `file` type uses a **unified storage backend** that is tightly coupled with the schema and configured at the pipeline level.

## Storage Architecture

### Single Storage Backend Per Pipeline

Each DataJoint pipeline has **one** associated storage backend configured in `datajoint.toml`. DataJoint fully controls the path structure within this backend.

### Supported Backends

DataJoint uses **[`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/)** to ensure compatibility across multiple storage backends:

- **Local storage** – POSIX-compliant file systems (e.g., NFS, SMB)
- **Cloud-based object storage** – Amazon S3, Google Cloud Storage, Azure Blob, MinIO
- **Hybrid storage** – Combining local and cloud storage for flexibility
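
As an illustration of the backends listed above, `fsspec` exposes all of them through a single interface. The sketch below uses placeholder paths and bucket names; the S3 line assumes the optional `s3fs` package is installed.

```python
import fsspec

# Local POSIX filesystem (path is a placeholder)
local_fs = fsspec.filesystem("file")
print(local_fs.exists("/data/my_project/datajoint.toml"))

# Amazon S3 via the optional s3fs package (bucket name is a placeholder)
s3_fs = fsspec.filesystem("s3")
print(s3_fs.exists("my-bucket/project_name/datajoint.toml"))
```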

## Project Structure

A DataJoint project creates a structured hierarchical storage pattern:

```
📁 project_name/
├── datajoint.toml
├── 📁 schema_name1/
├── 📁 schema_name2/
├── 📁 schema_name3/
│   ├── schema.py
│   ├── 📁 tables/
│   │   ├── table1/key1-value1.parquet
│   │   ├── table2/key2-value2.parquet
│   │   ...
│   ├── 📁 objects/
│   │   ├── table1-field1/key3-value3.zarr
│   │   ├── table1-field2/key3-value3.gif
│   │   ...
```

### Object Storage Keys

When using cloud object storage:

```
s3://bucket/project_name/schema_name3/objects/table1/key1-value1.parquet
s3://bucket/project_name/schema_name3/objects/table1-field1/key3-value3.zarr
```

## Configuration

### `datajoint.toml` Structure

```toml
[project]
name = "my_project"

[storage]
backend = "s3"  # or "file", "gcs", "azure"
bucket = "my-bucket"
# For local: path = "/data/my_project"

[storage.credentials]
# Backend-specific credentials (or reference to secrets manager)

[object_storage]
partition_pattern = "subject{subject_id}/session{session_id}"
```
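
A minimal sketch of loading this configuration, assuming Python 3.11+ for the standard-library `tomllib`; the helper function is hypothetical and only the key names follow the example above.

```python
import tomllib
from pathlib import Path


def load_storage_config(project_root: Path) -> dict:
    """Read datajoint.toml and return its [storage] section (illustrative)."""
    with open(project_root / "datajoint.toml", "rb") as f:
        config = tomllib.load(f)
    return config["storage"]


# Example usage with a placeholder project root
# storage = load_storage_config(Path("/data/my_project"))
# print(storage["backend"], storage.get("bucket"))
```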

### Partition Pattern

The organizational structure of stored objects is configurable, allowing partitioning based on **primary key attributes**.

```toml
[object_storage]
partition_pattern = "subject{subject_id}/session{session_id}"
```

Placeholders `{subject_id}` and `{session_id}` are dynamically replaced with actual primary key values.

**Example with partitioning:**

```
s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table1/key1-value1/image1.tiff
s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table2/key2-value2/movie2.zarr
```
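
Illustratively, the substitution behaves like standard Python string formatting over the primary key values; this is a sketch, not the actual implementation.

```python
# Hypothetical illustration of partition-pattern substitution
partition_pattern = "subject{subject_id}/session{session_id}"
primary_key = {"subject_id": 123, "session_id": 45}

partition = partition_pattern.format(**primary_key)
print(partition)  # subject123/session45
```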

## Syntax

```python
@schema
class Recording(dj.Manual):
    definition = """
    subject_id : int
    session_id : int
    ---
    raw_data : file       # managed file storage
    processed : file      # another file attribute
    """
```

Note: no `@store` suffix is needed; storage is determined by the pipeline configuration.

## Database Storage

The `file` type is stored as a `JSON` column in MySQL containing:

```json
{
  "path": "subject123/session45/schema_name/objects/Recording-raw_data/...",
  "size": 12345,
  "hash": "sha256:abcdef1234...",
  "original_name": "recording.dat",
  "timestamp": "2025-01-15T10:30:00Z",
  "mime_type": "application/octet-stream"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `path` | string | Yes | Full path/key within storage backend |
| `size` | integer | Yes | File size in bytes |
| `hash` | string | Yes | Content hash with algorithm prefix |
| `original_name` | string | Yes | Original filename at insert time |
| `timestamp` | string | Yes | ISO 8601 upload timestamp |
| `mime_type` | string | No | MIME type (auto-detected or provided) |
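
For illustration, the required fields could be checked when the stored JSON is read back; the validation helper below is hypothetical, and only the field names follow the table above.

```python
import json

REQUIRED_FIELDS = {"path", "size", "hash", "original_name", "timestamp"}


def parse_file_metadata(raw: str) -> dict:
    """Parse stored file metadata and verify all required fields are present."""
    metadata = json.loads(raw)
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError(f"file metadata missing required fields: {sorted(missing)}")
    return metadata
```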

## Path Generation

DataJoint generates storage paths using:

1. **Project name** - from configuration
2. **Partition values** - from primary key (if configured)
3. **Schema name** - from the table's schema
4. **Object directory** - `objects/`
5. **Table-field identifier** - `{table_name}-{field_name}/`
6. **Key identifier** - derived from primary key values
7. **Original filename** - preserved from insert

Example path construction:

```
{project}/{partition}/{schema}/objects/{table}-{field}/{key_hash}/{original_name}
```
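
A hypothetical illustration of this template: the function and the key-hash scheme are made up for the example; only the template itself comes from the spec.

```python
import hashlib


def build_object_path(project: str, partition: str, schema: str,
                      table: str, field: str, primary_key: dict,
                      original_name: str) -> str:
    """Illustrative path builder following the template above."""
    # Hypothetical key identifier: short hash of the sorted primary key items
    key_hash = hashlib.sha256(repr(sorted(primary_key.items())).encode()).hexdigest()[:16]
    return f"{project}/{partition}/{schema}/objects/{table}-{field}/{key_hash}/{original_name}"


# Example with placeholder values
print(build_object_path("project_name", "subject123/session45", "schema_name3",
                        "Recording", "raw_data",
                        {"subject_id": 123, "session_id": 45}, "recording.dat"))
```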

## Insert Behavior

At insert time, the `file` attribute accepts:

1. **File path** (string or `Path`): Path to an existing file
2. **Stream object**: File-like object with `read()` method
3. **Tuple of (name, stream)**: Stream with explicit filename

```python
# From file path
Recording.insert1({
    "subject_id": 123,
    "session_id": 45,
    "raw_data": "/local/path/to/recording.dat"
})

# From stream with explicit name
with open("/local/path/data.bin", "rb") as f:
    Recording.insert1({
        "subject_id": 123,
        "session_id": 45,
        "raw_data": ("custom_name.dat", f)
    })
```

### Insert Processing Steps

1. Resolve storage backend from schema's pipeline configuration
2. Read file content (from path or stream)
3. Compute content hash (SHA-256)
4. Generate storage path using partition pattern and primary key
5. Upload file to storage backend via `fsspec`
6. Build JSON metadata structure
7. Store JSON in database column
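
A condensed, hypothetical sketch of this flow. The helper name and signature are illustrative; only the `hashlib`, `json`, and `fsspec` calls are real APIs.

```python
import hashlib
import json
from datetime import datetime, timezone

import fsspec


def process_file_attribute(local_path: str, storage_path: str, protocol: str = "file") -> str:
    """Illustrative insert flow: hash, upload, and build the metadata JSON."""
    fs = fsspec.filesystem(protocol)              # 1. resolve backend (placeholder)
    with open(local_path, "rb") as f:             # 2. read content
        content = f.read()
    digest = hashlib.sha256(content).hexdigest()  # 3. compute content hash
    # 4. storage_path is assumed pre-built from the partition pattern and primary key
    with fs.open(storage_path, "wb") as out:      # 5. upload via fsspec
        out.write(content)
    metadata = {                                  # 6. build metadata structure
        "path": storage_path,
        "size": len(content),
        "hash": f"sha256:{digest}",
        "original_name": local_path.rsplit("/", 1)[-1],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(metadata)                   # 7. value to store in the JSON column
```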

## Fetch Behavior

On fetch, the `file` type returns a `FileRef` object:

```python
record = Recording.fetch1()
file_ref = record["raw_data"]

# Access metadata
print(file_ref.path)           # Full storage path
print(file_ref.size)           # File size in bytes
print(file_ref.hash)           # Content hash
print(file_ref.original_name)  # Original filename

# Read content directly (streams from backend)
content = file_ref.read()  # Returns bytes

# Download to local path
local_path = file_ref.download("/local/destination/")

# Open as fsspec file object
with file_ref.open() as f:
    data = f.read()
```

## Implementation Components

### 1. Storage Backend (`storage.py` - new module)

- `StorageBackend` class wrapping `fsspec`
- Methods: `upload()`, `download()`, `open()`, `exists()`, `delete()`
- Path generation with partition support
- Configuration loading from `datajoint.toml`
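
A minimal sketch of such a wrapper, assuming the method names in the bullets above; the constructor arguments and path handling are illustrative, not the shipped implementation.

```python
from pathlib import Path

import fsspec


class StorageBackend:
    """Thin wrapper over an fsspec filesystem for one pipeline (illustrative)."""

    def __init__(self, protocol: str, root: str, **options):
        self.fs = fsspec.filesystem(protocol, **options)
        self.root = root.rstrip("/")

    def upload(self, local_path: str, key: str) -> None:
        self.fs.put_file(local_path, f"{self.root}/{key}")

    def download(self, key: str, destination: str) -> str:
        target = str(Path(destination) / Path(key).name)
        self.fs.get_file(f"{self.root}/{key}", target)
        return target

    def open(self, key: str, mode: str = "rb"):
        return self.fs.open(f"{self.root}/{key}", mode)

    def exists(self, key: str) -> bool:
        return self.fs.exists(f"{self.root}/{key}")

    def delete(self, key: str) -> None:
        self.fs.rm(f"{self.root}/{key}")
```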

### 2. Type Declaration (`declare.py`)

- Add `FILE` pattern: `file$`
- Add to `SPECIAL_TYPES`
- Substitute to `JSON` type in database
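
Illustratively, dropping the `@store` suffix reduces the declaration pattern to a bare keyword match, roughly as below; the exact regex used in `declare.py` may differ.

```python
import re

# Hypothetical pattern: the bare keyword `file`, case-insensitive, no @store suffix
FILE = re.compile(r"^file$", re.I)

print(bool(FILE.match("file")))       # True
print(bool(FILE.match("file@raw")))   # False - named stores are not used
```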

### 3. Schema Integration (`schemas.py`)

- Associate storage backend with schema
- Load configuration on schema creation

### 4. Insert Processing (`table.py`)

- New `__process_file_attribute()` method
- Path generation using primary key and partition pattern
- Upload via storage backend

### 5. Fetch Processing (`fetch.py`)

- New `FileRef` class
- Lazy loading from storage backend
- Metadata access interface

### 6. FileRef Class (`fileref.py` - new module)

```python
from datetime import datetime
from pathlib import Path
from typing import IO


class FileRef:
    """Reference to a file stored in the pipeline's storage backend."""

    path: str
    size: int
    hash: str
    original_name: str
    timestamp: datetime
    mime_type: str | None

    def read(self) -> bytes: ...
    def open(self, mode="rb") -> IO: ...
    def download(self, destination: Path) -> Path: ...
    def exists(self) -> bool: ...
```
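
For illustration, `read()` and `open()` could lazily delegate to a backend wrapper like the `StorageBackend` sketched earlier; this is hypothetical, not the actual implementation.

```python
# Hypothetical sketch: a FileRef-like object delegating I/O to the storage backend
class LazyFileRef:
    def __init__(self, backend, path: str):
        self._backend = backend   # e.g. the illustrative StorageBackend above
        self.path = path

    def open(self, mode: str = "rb"):
        # Lazy: nothing is transferred until the caller reads
        return self._backend.open(self.path, mode)

    def read(self) -> bytes:
        with self.open("rb") as f:
            return f.read()

    def exists(self) -> bool:
        return self._backend.exists(self.path)
```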

## Dependencies

New dependency: `fsspec`, with optional backend-specific packages:

```toml
[project]
dependencies = ["fsspec>=2023.1.0"]

[project.optional-dependencies]
s3 = ["s3fs"]
gcs = ["gcsfs"]
azure = ["adlfs"]
```
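
Illustratively, a pipeline could verify that the optional driver for the configured backend is installed; this is a sketch, and only the mapping mirrors the extras listed above.

```python
import importlib.util

# Hypothetical mapping from configured backend to its optional driver package
OPTIONAL_DRIVERS = {"s3": "s3fs", "gcs": "gcsfs", "azure": "adlfs"}


def require_backend(protocol: str) -> None:
    """Raise if the driver package for the configured backend is missing."""
    driver = OPTIONAL_DRIVERS.get(protocol)
    if driver and importlib.util.find_spec(driver) is None:
        raise RuntimeError(f"Backend '{protocol}' requires the optional package '{driver}'")
```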

## Comparison with Existing Types

| Feature | `attach@store` | `filepath@store` | `file` |
|---------|----------------|------------------|--------|
| Store config | Per-attribute | Per-attribute | Per-pipeline |
| Path control | DataJoint | User-managed | DataJoint |
| DB column | binary(16) UUID | binary(16) UUID | JSON |
| Backend | File/S3 | File/S3 | fsspec (any) |
| Partitioning | Hash-based | User path | Configurable |
| Metadata | External table | External table | Inline JSON |

## Migration Path

- Existing `attach@store` and `filepath@store` remain unchanged
- `file` type is additive - new tables only
- Future: Migration utilities to convert existing external storage

## Future Extensions

- [ ] Directory/folder support (store entire directories)
- [ ] Compression options (gzip, lz4, zstd)
- [ ] Encryption at rest
- [ ] Versioning support
- [ ] Streaming upload for large files
- [ ] Checksum verification options
- [ ] Cache layer for frequently accessed files