Commit 667e740

Add filename collision avoidance and transaction handling to spec
- Random hash suffix for filenames (URL-safe, filename-safe base64)
- Configurable hash_length setting (default: 8, range: 4-16)
- Upload-first transaction strategy with cleanup on failure
- Batch insert atomicity handling
- Orphaned file detection/cleanup utilities (future)
1 parent 965a30f commit 667e740


docs/src/design/tables/file-type-spec.md

Lines changed: 85 additions & 5 deletions
@@ -85,6 +85,7 @@ For local filesystem storage:
| `object_storage.bucket` | string | For cloud | Bucket name (S3, GCS, Azure) |
| `object_storage.endpoint` | string | For S3 | S3 endpoint URL |
| `object_storage.partition_pattern` | string | No | Path pattern with `{attribute}` placeholders |
+| `object_storage.hash_length` | int | No | Random suffix length for filenames (default: 8, range: 4-16) |
| `object_storage.access_key` | string | For cloud | Access key (can use secrets file) |
| `object_storage.secret_key` | string | For cloud | Secret key (can use secrets file) |

@@ -149,7 +150,7 @@ The `file` type is stored as a `JSON` column in MySQL containing:

```json
{
-  "path": "subject123/session45/schema_name/objects/Recording-raw_data/recording.dat",
+  "path": "subject123/session45/schema_name/objects/Recording-raw_data/recording_Ax7bQ2kM.dat",
  "size": 12345,
  "hash": "sha256:abcdef1234...",
  "original_name": "recording.dat",
@@ -178,15 +179,41 @@ DataJoint generates storage paths using:
3. **Schema name** - from the table's schema
4. **Object directory** - `objects/`
5. **Table-field identifier** - `{TableName}-{field_name}/`
-6. **Primary key hash** - unique identifier for the record
-7. **Original filename** - preserved from insert
+6. **Suffixed filename** - original name with random hash suffix

Example path construction:

```
-{location}/{partition}/{schema}/objects/{Table}-{field}/{pk_hash}/{original_name}
+{location}/{partition}/{schema}/objects/{Table}-{field}/{basename}_{hash}.{ext}
```
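
A minimal sketch of how the new pattern could be assembled (the helper name `build_storage_path` and its signature are illustrative, not part of the spec), assuming the `{attribute}` placeholders in the partition pattern are filled from the record's primary-key values:

```python
# Hypothetical illustration of the path template above; names and values are not part of the spec.
def build_storage_path(location: str, partition_pattern: str | None, key: dict,
                       schema: str, table: str, field: str, stored_name: str) -> str:
    """Assemble {location}/{partition}/{schema}/objects/{Table}-{field}/{stored_name}."""
    # e.g. "{subject}/{session}" filled from {"subject": "subject123", "session": "session45"}
    partition = partition_pattern.format(**key) if partition_pattern else ""
    parts = [location, partition, schema, "objects", f"{table}-{field}", stored_name]
    return "/".join(p for p in parts if p)


# build_storage_path("s3://my-bucket", "{subject}/{session}",
#                    {"subject": "subject123", "session": "session45"},
#                    "schema_name", "Recording", "raw_data", "recording_Ax7bQ2kM.dat")
# -> "s3://my-bucket/subject123/session45/schema_name/objects/Recording-raw_data/recording_Ax7bQ2kM.dat"
```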

### Filename Collision Avoidance

To prevent filename collisions, each stored file receives a **random hash suffix** appended to its basename:

```
original: recording.dat
stored:   recording_Ax7bQ2kM.dat

original: image.analysis.tiff
stored:   image.analysis_pL9nR4wE.tiff
```

#### Hash Suffix Specification

- **Alphabet**: URL-safe and filename-safe Base64 characters: `A-Z`, `a-z`, `0-9`, `-`, `_`
- **Length**: Configurable via `object_storage.hash_length` (default: 8, range: 4-16)
- **Generation**: Cryptographically random using `secrets.token_urlsafe()`

At the default length of 8 characters with 64 possible values per character, there are 64^8 ≈ 281 trillion possible suffixes, so collisions within a single directory are vanishingly unlikely.
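
A minimal sketch of this behavior (the helper name `with_hash_suffix` is illustrative, not part of the spec); `secrets.token_urlsafe()` draws from exactly the URL- and filename-safe alphabet listed above:

```python
import os
import secrets


def with_hash_suffix(filename: str, hash_length: int = 8) -> str:
    """Append a random, URL- and filename-safe suffix before the last extension."""
    if not 4 <= hash_length <= 16:
        raise ValueError("hash_length must be in the range 4-16")
    # token_urlsafe(n) yields roughly 1.33*n characters from A-Za-z0-9-_; trim to the exact length
    suffix = secrets.token_urlsafe(hash_length)[:hash_length]
    base, ext = os.path.splitext(filename)  # "image.analysis.tiff" -> ("image.analysis", ".tiff")
    return f"{base}_{suffix}{ext}"


# with_hash_suffix("recording.dat")       -> e.g. "recording_Ax7bQ2kM.dat"
# with_hash_suffix("image.analysis.tiff") -> e.g. "image.analysis_pL9nR4wE.tiff"
```
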
#### Rationale

- Avoids collisions without requiring existence checks
- Preserves original filename for human readability
- URL-safe for web-based access to cloud storage
- Filesystem-safe across all supported platforms

### No Deduplication

Each insert stores a separate copy of the file, even if identical content was previously stored. This ensures:
@@ -224,11 +251,63 @@ with open("/local/path/data.bin", "rb") as f:
1. Resolve storage backend from pipeline configuration
2. Read file content (from path or stream)
3. Compute content hash (SHA-256)
-4. Generate storage path using partition pattern and primary key
+4. Generate storage path with random suffix
5. Upload file to storage backend via `fsspec`
6. Build JSON metadata structure
7. Store JSON in database column
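
A condensed sketch of steps 2-3 and 5-7 (path generation with the random suffix was sketched earlier); the helper name `store_file` is illustrative, and reading the whole file into memory is for brevity only:

```python
import hashlib
import json

import fsspec


def store_file(local_path: str, storage_path: str, original_name: str) -> str:
    """Read, hash, and upload one file; return the JSON metadata for the database column."""
    with open(local_path, "rb") as f:
        content = f.read()                            # 2. read content
    digest = hashlib.sha256(content).hexdigest()      # 3. SHA-256 content hash
    with fsspec.open(storage_path, "wb") as remote:   # 5. upload via fsspec
        remote.write(content)
    metadata = {                                      # 6. JSON metadata structure
        "path": storage_path,
        "size": len(content),
        "hash": f"sha256:{digest}",
        "original_name": original_name,
    }
    return json.dumps(metadata)                       # 7. stored in the database column
```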

## Transaction Handling

File uploads and database inserts must be coordinated to maintain consistency. Since storage backends cannot participate in a distributed transaction with MySQL, DataJoint uses an **upload-first** strategy with cleanup on failure.

### Insert Transaction Flow

```
┌─────────────────────────────────────────────────────────┐
│ 1. Validate input and generate storage path             │
├─────────────────────────────────────────────────────────┤
│ 2. Upload file to storage backend                       │
│    └─ On failure: raise error (nothing to clean up)     │
├─────────────────────────────────────────────────────────┤
│ 3. Build JSON metadata with storage path                │
├─────────────────────────────────────────────────────────┤
│ 4. Execute database INSERT                              │
│    └─ On failure: delete uploaded file, raise error     │
├─────────────────────────────────────────────────────────┤
│ 5. Commit database transaction                          │
│    └─ On failure: delete uploaded file, raise error     │
└─────────────────────────────────────────────────────────┘
```
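
A compact sketch of this flow (the `backend`, `cursor`, table, and column names are illustrative stand-ins, not DataJoint's actual internals):

```python
import logging

import fsspec

logger = logging.getLogger(__name__)


def insert_with_file(cursor, backend: fsspec.AbstractFileSystem,
                     local_path: str, storage_path: str, metadata_json: str) -> None:
    backend.put(local_path, storage_path)   # 2. upload first: on failure, nothing to clean up
    try:
        cursor.execute(                     # 4. INSERT with the JSON metadata in the file column
            "INSERT INTO recording (subject, session, raw_data) VALUES (%s, %s, %s)",
            ("subject123", "session45", metadata_json),
        )
        cursor.connection.commit()          # 5. commit the transaction
    except Exception:
        try:
            backend.rm(storage_path)        # insert/commit failed: delete the uploaded file
        except Exception:
            logger.warning("cleanup failed; orphaned file left at %s", storage_path)
        raise
```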

### Failure Scenarios

| Scenario | State at Failure | Recovery Action | Result |
|----------|------------------|-----------------|--------|
| Upload fails | No file, no record | None needed | Clean failure |
| DB insert fails | File exists, no record | Delete file | Clean failure |
| DB commit fails | File exists, uncommitted record | Delete file | Clean failure |
| Cleanup fails | File exists, no record | Log warning | Orphaned file |

### Orphaned File Handling

In rare cases (e.g., process crash, network failure during cleanup), orphaned files may remain in storage. These can be identified and cleaned:

```python
# Future utility methods
schema.external_storage.find_orphaned()     # List files not referenced in DB
schema.external_storage.cleanup_orphaned()  # Delete orphaned files
```
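
These utilities are left for future work; one possible shape for the detection step (purely illustrative, assuming an fsspec filesystem and a set of paths collected from the JSON column) is:

```python
import fsspec


def find_orphaned(backend: fsspec.AbstractFileSystem, prefix: str,
                  referenced_paths: set[str]) -> list[str]:
    """Return stored files under `prefix` that no database record references."""
    stored = set(backend.find(prefix))   # recursive listing of every file under the prefix
    return sorted(stored - referenced_paths)
```
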
### Batch Insert Handling

For batch inserts with multiple `file` attributes:

1. Upload all files first (collect paths)
2. Execute batch INSERT with all metadata
3. On any failure: delete all uploaded files from this batch

This ensures atomicity at the batch level: either all records are inserted with their files, or none are.
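
The cleanup logic extends naturally to batches, as sketched below (again with illustrative stand-ins for the backend, cursor, and table):

```python
import fsspec


def batch_insert(cursor, backend: fsspec.AbstractFileSystem,
                 uploads: list[tuple[str, str]], rows: list[tuple]) -> None:
    uploaded = []
    try:
        for local_path, storage_path in uploads:   # 1. upload all files first
            backend.put(local_path, storage_path)
            uploaded.append(storage_path)
        cursor.executemany(                        # 2. one batch INSERT with all metadata
            "INSERT INTO recording (subject, session, raw_data) VALUES (%s, %s, %s)",
            rows,
        )
        cursor.connection.commit()
    except Exception:
        for path in uploaded:                      # 3. delete every file uploaded for this batch
            try:
                backend.rm(path)
            except Exception:
                pass  # best effort; leftovers are handled as orphans
        raise
```
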
## Fetch Behavior

On fetch, the `file` type returns a `FileRef` object:
@@ -275,6 +354,7 @@ class ObjectStorageSettings(BaseSettings):
    bucket: str | None = None
    endpoint: str | None = None
    partition_pattern: str | None = None
+    hash_length: int = Field(default=8, ge=4, le=16)
    access_key: str | None = None
    secret_key: SecretStr | None = None
```
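
For illustration, the `ge`/`le` bounds on `Field` are what enforce the documented 4-16 range; a minimal standalone check (assuming pydantic v2 with pydantic-settings, as the class above suggests):

```python
from pydantic import Field, ValidationError
from pydantic_settings import BaseSettings


class ObjectStorageSettings(BaseSettings):
    hash_length: int = Field(default=8, ge=4, le=16)


print(ObjectStorageSettings().hash_length)                  # 8 (default)
print(ObjectStorageSettings(hash_length=12).hash_length)    # 12
try:
    ObjectStorageSettings(hash_length=32)                   # outside 4-16
except ValidationError as err:
    print("rejected:", err.errors()[0]["type"])             # less_than_equal
```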
