# Filepath Datatype

Note: Filepath Datatype is available as a preview feature in DataJoint Python v0.12.
This means that the feature must be explicitly enabled. To do so, set the
environment variable `FILEPATH_FEATURE_SWITCH=TRUE` prior to use.
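If configuring the shell environment is inconvenient, the variable can also be set from Python itself. This is a sketch under the assumption that DataJoint reads the switch from the process environment at use time rather than at interpreter startup:

```python
import os

# Enable the preview feature; must run before the switch is consulted.
# (Assumes DataJoint reads the variable from the process environment.)
os.environ['FILEPATH_FEATURE_SWITCH'] = 'TRUE'
```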

## Configuration & Usage

As discussed in issue
[#481](https://github.com/datajoint/datajoint-python/issues/481),
the `filepath` attribute type links DataJoint records to files already
managed outside of DataJoint. This can aid in sharing data with
other systems, such as allowing an image viewer application to
directly use files from a DataJoint pipeline, or to allow downstream
tables to reference data which reside outside of DataJoint
pipelines.

To define a table using the `filepath` datatype, an existing DataJoint
[store](../../sysadmin/external-store.md) should be created and then referenced in the
new table definition. For example, given a simple store:

```python
dj.config['stores'] = {
    'data': {
        'protocol': 'file',
        'location': '/data',
        'stage': '/data'
    }
}
```

we can define a `ScanImages` table as follows:

```python
@schema
class ScanImages(dj.Manual):
    definition = """
    -> Session
    image_id: int
    ---
    image_path: filepath@data
    """
```

This table can now be used for tracking paths within the `/data` local directory.
For example:

```python
...
```

As can be seen from the example, unlike [blob](blobs.md) records, file
paths are managed as path locations to the underlying file.
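To make the pattern concrete, here is a minimal sketch of inserting and fetching such a record. The `session_id` key, path, and values are hypothetical, and the `insert1`/`fetch1` calls assume a live pipeline connection, so they are shown commented out:

```python
# Hypothetical record for the ScanImages table defined above; the path
# must fall inside the store's '/data' location for filepath@data.
row = {
    'session_id': 1,                             # assumed Session key
    'image_id': 1,
    'image_path': '/data/session1/scan01.tif',   # hypothetical file
}

# With a live connection, one would insert the record and fetch back
# the managed local path (checksum verified on fetch):
#   ScanImages.insert1(row)
#   local_path = (ScanImages & row).fetch1('image_path')
```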

## Integrity Notes

Unlike other data in DataJoint, data in `filepath` records are
deliberately intended for shared use outside of DataJoint. To help
ensure integrity of `filepath` records, DataJoint will record a
checksum of the file data on `insert`, and will verify this checksum
on `fetch`. However, since the underlying file data may be shared
with other applications, special care should be taken to ensure
records stored in `filepath` attributes are not modified outside
of the pipeline, or, if they are, that records in the pipeline are
updated accordingly. A safe method of changing `filepath` data is
as follows:

1. Delete the `filepath` database record.
   This will ensure that any downstream records in the pipeline depending
   on the `filepath` record are purged from the database.
2. Modify the `filepath` data.
3. Re-insert the corresponding `filepath` record.
   This will add the record back to DataJoint with an updated file checksum.
4. Compute any downstream dependencies, if needed.
   This will ensure that downstream results dependent on the `filepath`
   record are updated to reflect the newer `filepath` contents.
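The reason step 3 is needed can be illustrated without DataJoint at all: changing a file's contents changes its checksum, so a record holding the old checksum would fail verification on fetch. A plain-Python sketch (MD5 is used here only for illustration; the hash DataJoint uses internally is an implementation detail):

```python
import hashlib
import os
import tempfile

def file_checksum(path):
    # Hash the file's full contents (illustrative only).
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

path = os.path.join(tempfile.mkdtemp(), 'scan01.tif')

with open(path, 'wb') as f:
    f.write(b'original image bytes')
before = file_checksum(path)          # checksum recorded on insert

with open(path, 'wb') as f:           # step 2: modify the file data
    f.write(b'modified image bytes')
after = file_checksum(path)           # checksum recorded on re-insert

# A record still holding `before` would no longer match the file.
assert before != after
```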

### Disable Fetch Verification

Note: Skipping the checksum is not recommended, since the checksum ensures file
integrity, i.e. that downloaded files are not corrupted. With S3 stores, most of
the time to complete a `.fetch()` is spent on the file download itself rather than
on evaluating the checksum, so this option primarily benefits `filepath` usage
connected to a local `file` store.

To disable checksum verification for large files, set a size threshold in bytes;
checksums are skipped for files larger than the threshold, as in the example below:

```python
dj.config["filepath_checksum_size_limit"] = 5 * 1024 ** 3  # skip for files larger than 5 GiB
```

The default is `None`, which means checksums are always verified.
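The threshold behaviour can be summarized as follows (logic re-implemented here purely for illustration; this is not DataJoint's internal code):

```python
# 5 GiB, matching the configuration example above.
limit = 5 * 1024 ** 3

def should_verify_checksum(file_size_bytes, limit):
    # A limit of None means "always verify" -- the documented default.
    return limit is None or file_size_bytes <= limit

small_file = 10 * 1024 ** 2   # 10 MiB: checksum is verified
large_file = 6 * 1024 ** 3    # 6 GiB: checksum is skipped
```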
<!-- TODO: purging filepath data -->