
Commit 4518b36

Add initial specification for file column type
Draft specification document for the new `file@store` column type that stores files with JSON metadata. Includes syntax, storage format, insert/fetch behavior, and comparison with existing attachment types.
1 parent 028dbad commit 4518b36

1 file changed: +190 −0

# File Column Type Specification

## Overview

The `file` type is a new DataJoint column type that provides managed file storage with metadata tracking. Unlike the existing attachment types, `file` stores structured metadata as JSON in the database while keeping the file content in a configurable storage location.

## Syntax

```python
@schema
class MyTable(dj.Manual):
    definition = """
    id : int
    ---
    data_file : file@store  # managed file with metadata
    """
```
## Database Storage

The `file` type is stored as a `JSON` column in MySQL. The JSON structure contains:

```json
{
  "path": "relative/path/to/file.ext",
  "size": 12345,
  "hash": "sha256:abcdef1234...",
  "original_name": "original_filename.ext",
  "timestamp": "2025-01-15T10:30:00Z",
  "mime_type": "application/octet-stream"
}
```
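For illustration only, here is a minimal sketch of how such a metadata structure could be assembled with the standard library. The helper name `build_file_metadata` and its arguments are hypothetical, not part of the proposed implementation.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def build_file_metadata(source: Path, relative_path: str) -> dict:
    """Assemble the metadata fields described above (illustrative only)."""
    content = source.read_bytes()
    return {
        "path": relative_path,
        "size": len(content),
        "hash": "sha256:" + hashlib.sha256(content).hexdigest(),
        "original_name": source.name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "mime_type": mimetypes.guess_type(source.name)[0] or "application/octet-stream",
    }
```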
### JSON Schema

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `path` | string | Yes | Relative path within the store |
| `size` | integer | Yes | File size in bytes |
| `hash` | string | Yes | Content hash with algorithm prefix |
| `original_name` | string | Yes | Original filename at insert time |
| `timestamp` | string | Yes | ISO 8601 upload timestamp |
| `mime_type` | string | No | MIME type (auto-detected or provided) |
## Insert Behavior

At insert time, the `file` attribute accepts:

1. **File path (string or `Path`)**: path to an existing file
2. **Stream object**: file-like object with a `read()` method
3. **Tuple of `(name, stream)`**: stream with an explicit filename
### Insert Flow

```python
# From file path
table.insert1({"id": 1, "data_file": "/path/to/file.dat"})
table.insert1({"id": 2, "data_file": Path("/path/to/file.dat")})

# From stream
with open("/path/to/file.dat", "rb") as f:
    table.insert1({"id": 3, "data_file": f})

# From stream with explicit name
with open("/path/to/file.dat", "rb") as f:
    table.insert1({"id": 4, "data_file": ("custom_name.dat", f)})
```
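Internally, the three accepted input forms could be reduced to a common `(filename, stream)` pair before processing. A rough sketch, where `_normalize_file_input` is a hypothetical helper name rather than the actual method:

```python
from pathlib import Path
from typing import BinaryIO, Tuple, Union

FileInput = Union[str, Path, BinaryIO, Tuple[str, BinaryIO]]


def _normalize_file_input(value: FileInput) -> Tuple[str, BinaryIO]:
    """Reduce the accepted input forms to a (filename, readable stream) pair."""
    if isinstance(value, (str, Path)):          # 1. file path
        path = Path(value)
        return path.name, open(path, "rb")
    if isinstance(value, tuple):                # 3. (name, stream)
        name, stream = value
        return name, stream
    if hasattr(value, "read"):                  # 2. bare stream
        name = getattr(value, "name", "unnamed")
        return Path(name).name, value
    raise TypeError(f"Unsupported file input: {type(value)!r}")
```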
### Processing Steps

1. Read file content (from path or stream)
2. Compute content hash (SHA-256)
3. Generate storage path using hash-based subfolding
4. Copy file to target location in store
5. Build JSON metadata structure
6. Store JSON in database column
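A non-authoritative sketch of these steps for a local, `file`-protocol store; the helper name `stage_file` and the `subfold` default are illustrative, and the metadata fields follow the schema above.

```python
import hashlib
import shutil
from pathlib import Path


def stage_file(source: Path, store_root: Path, subfold=(2, 2)) -> dict:
    """Illustrative walk-through of steps 1-6 for a local store."""
    content = source.read_bytes()                          # 1. read file content
    digest = hashlib.sha256(content).hexdigest()           # 2. compute content hash

    # 3. hash-based subfolding, e.g. ab/cd/<digest><suffix>
    parts, start = [], 0
    for width in subfold:
        parts.append(digest[start:start + width])
        start += width
    relative = Path(*parts) / (digest + source.suffix)

    target = Path(store_root) / relative                   # 4. copy into the store
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)

    return {                                               # 5. build metadata (step 6 stores it
        "path": str(relative),                             #    in the table's JSON column)
        "size": len(content),
        "hash": f"sha256:{digest}",
        "original_name": source.name,
        # timestamp and mime_type as in the earlier sketch
    }
```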
## Fetch Behavior

On fetch, the `file` type returns a `FileRef` object (or can be configured to return the path string directly).

```python
# Fetch returns a FileRef object
record = table.fetch1()
file_ref = record["data_file"]

# Access metadata
print(file_ref.path)           # Full path to the file
print(file_ref.size)           # File size in bytes
print(file_ref.hash)           # Content hash
print(file_ref.original_name)  # Original filename

# Read content
content = file_ref.read()      # Returns bytes

# Get as path
path = file_ref.as_path()      # Returns a Path object
```
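The `FileRef` interface used above could look roughly like the following sketch. The constructor signature and the `verify()` helper are assumptions for illustration; only the attribute names come from the metadata schema.

```python
import hashlib
import warnings
from pathlib import Path


class FileRef:
    """Illustrative fetch-side handle over the stored metadata (not the final API)."""

    def __init__(self, metadata: dict, store_root: Path):
        self._meta = metadata
        self._store_root = Path(store_root)

    @property
    def path(self) -> Path:
        return self._store_root / self._meta["path"]

    @property
    def size(self) -> int:
        return self._meta["size"]

    @property
    def hash(self) -> str:
        return self._meta["hash"]

    @property
    def original_name(self) -> str:
        return self._meta["original_name"]

    def read(self) -> bytes:
        return self.path.read_bytes()

    def as_path(self) -> Path:
        return self.path

    def verify(self) -> bool:
        """Warn (rather than raise) when the file no longer matches its recorded hash."""
        algorithm, _, digest = self.hash.partition(":")
        actual = hashlib.new(algorithm, self.read()).hexdigest()
        if actual != digest:
            warnings.warn(f"Hash mismatch for {self.path}; consider re-downloading.")
            return False
        return True
```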
### Fetch Options

```python
# Return path strings instead of FileRef objects
records = table.fetch(download_path="/local/path", format="path")

# Return raw JSON metadata
records = table.fetch(format="metadata")
```
## Store Configuration

The `file` type uses the existing external store infrastructure:

```python
dj.config["stores"] = {
    "raw": {
        "protocol": "file",
        "location": "/data/raw-files",
        "subfolding": (2, 2),  # hash-based directory structure
    },
    "s3store": {
        "protocol": "s3",
        "endpoint": "s3.amazonaws.com",
        "bucket": "my-bucket",
        "location": "datajoint-files",
        "access_key": "...",
        "secret_key": "...",
    },
}
```
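For a `file`-protocol store, a record's on-disk location is simply the store `location` joined with the metadata `path`. A small illustration with made-up values (real values come from `dj.config` and the fetched JSON column):

```python
from pathlib import Path

# Hypothetical values for illustration only
store_spec = {"protocol": "file", "location": "/data/raw-files", "subfolding": (2, 2)}
metadata = {"path": "ab/cd/abcdef1234.dat", "size": 12345}

absolute_path = Path(store_spec["location"]) / metadata["path"]
print(absolute_path)  # /data/raw-files/ab/cd/abcdef1234.dat
```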
## Comparison with Existing Types

| Feature | `attach` | `filepath` | `file` |
|---------|----------|------------|--------|
| Storage | External store | External store | External store |
| DB column | `binary(16)` UUID | `binary(16)` UUID | `JSON` |
| Metadata | Limited | Path + hash | Full structured JSON |
| Deduplication | By content | By path | By content |
| Fetch returns | Downloaded path | Staged path | `FileRef` object |
| History tracking | No | Via hash | Yes (in JSON) |
## Implementation Components

### 1. Type Declaration (`declare.py`)

- Add a `FILE` pattern: `file@(?P<store>[a-z][\-\w]*)$`
- Add it to `SPECIAL_TYPES`
- Substitute the `JSON` type in the database declaration
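A quick check of how the proposed pattern would parse a declared type and extract the store name (the surrounding declaration machinery is omitted):

```python
import re

# Pattern as proposed above, anchored at the end of the type specification
FILE = re.compile(r"file@(?P<store>[a-z][\-\w]*)$")

match = FILE.match("file@raw")
assert match and match.group("store") == "raw"

# Declarations without a valid store name do not match
assert FILE.match("file@") is None
assert FILE.match("file@1store") is None
```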
### 2. Insert Processing (`table.py`)

- New `__process_file_attribute()` method
- Handle file path, stream, and `(name, stream)` inputs
- Copy to store and build metadata JSON

### 3. Fetch Processing (`fetch.py`)

- New `FileRef` class for return values
- Optional download/staging behavior
- Metadata access interface

### 4. Heading Support (`heading.py`)

- Track `is_file` attribute flag
- Store detection from comment
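Assuming the store name is recorded in the column comment with a prefix such as `:file@raw:` (an assumption mirroring how other special types are detected; the exact encoding is not fixed by this spec), detection could look like:

```python
import re

# Hypothetical comment convention; the actual encoding is an implementation detail
FILE_COMMENT = re.compile(r"^:file@(?P<store>[a-z][\-\w]*):")


def detect_file_store(comment: str):
    """Return the store name if the column comment marks a `file` attribute, else None."""
    match = FILE_COMMENT.match(comment)
    return match.group("store") if match else None


assert detect_file_store(":file@raw:managed file with metadata") == "raw"
assert detect_file_store("ordinary comment") is None
```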
## Error Handling

| Scenario | Behavior |
|----------|----------|
| File not found | Raise `DataJointError` at insert |
| Stream not readable | Raise `DataJointError` at insert |
| Store not configured | Raise `DataJointError` at insert |
| File missing on fetch | Raise `DataJointError` with metadata |
| Hash mismatch on fetch | Warning + option to re-download |
## Migration Considerations

- No migration needed: this is a new type, used only in new tables
- Existing `attach@store` and `filepath@store` columns are unchanged
- All three types can coexist in the same schema

## Future Extensions

- [ ] Compression options (gzip, lz4)
- [ ] Encryption at rest
- [ ] Versioning support
- [ ] Lazy loading / streaming fetch
- [ ] Checksum verification options
