| 1 | +================= |
| 2 | +Data organization |
| 3 | +================= |
| 4 | + |
| 5 | +``aind-data-schema`` validates and writes JSON files containing metadata. We store those |
| 6 | +JSON files in a particular directory structure to support our ability to rapidly and openly |
| 7 | +share data. |
| 8 | + |
| 9 | +Core principles |
| 10 | +=============== |
| 11 | + |
| 12 | +**Immutability** |
| 13 | + |
| 14 | +Derived data cannot affect input data. This is essential for reproducibility. |
| 15 | +All data, once produced, should be treated as “read only”: downstream processes |
| 16 | +must not append information to input files or add files to existing |
| 17 | +directories. |
| 18 | + |
| 19 | +**Acquisition sessions first** |
| 20 | + |
| 21 | +The fundamental logical unit for primary data is the acquisition session (time). |
| 22 | + |
| 23 | +There are many ways to logically group data. We group all data acquired at the |
| 24 | +same time. This is for two reasons: |
| 25 | + |
| 26 | +First, it is helpful to logically group data that directly affect each other. The |
| 27 | +treadmill data stream is tightly coupled to the video capturing the body of the |
| 28 | +mouse, which naturally affects neural activity. Grouping these simultaneously |
| 29 | +collected data streams together helps users to understand the data they process |
| 30 | +and analyze. |
| 31 | + |
| 32 | +Second, organizing by session (time) facilitates immutable rapid sharing. Were |
| 33 | +we to share data at the project or dataset level, our ability to share would |
| 34 | +hinge on difficult decisions tied to the project’s intended use of the data. |
| 35 | +For example, delaying release until all data meet the quality control |
| 36 | +criteria defined by a particular project assumes that those criteria apply to all |
| 37 | +potential uses of the data. |
| 38 | + |
| 39 | +**Flat structure** |
| 40 | + |
| 41 | +We avoid using hierarchies to encode metadata. Grouping data into hierarchies via |
| 42 | +directories - or implied hierarchies with complex ordered file naming conventions - is |
| 43 | +a common practice to facilitate search. However, any type of hierarchy dramatically |
| 44 | +impacts how data can be used. Grouping data by project makes it difficult to find data |
| 45 | +by modality. Grouping data by modality makes it difficult to find data by mouse. |
| 46 | + |
| 47 | +A flat structure organized by time is unopinionated about what metadata will be most |
| 48 | +useful. We will rely on flexible database queries to facilitate data discovery |
| 49 | +along any dimension, rather than biasing in favor of one field or another. |
| 50 | + |
| 51 | +**Processing is a session** |
| 52 | + |
| 53 | +Processing sessions are analogous to primary data acquisition sessions. Processed data |
| 54 | +files should therefore be logically grouped together, separate from primary data. |
| 55 | +Timestamping processed results allows us to flexibly reprocess without affecting primary |
| 56 | +data. The generic term we use to describe acquisition sessions and processing sessions |
| 57 | +is the data asset. |
| 58 | + |
| 59 | +We could consider separate data assets for different processing pipeline steps (e.g. one |
| 60 | +asset for stitching transforms, one asset for fused results, one asset for segmented neurons, |
| 61 | +etc.). However, at this point that seems like unnecessary complexity. |
| 62 | + |
| 63 | +**Standard processing, flexible analysis** |
| 64 | + |
| 65 | +We define processing as basic feature extraction - spike sorting for electrophysiology, |
| 66 | +limb positions extracted from behavior videos, cell positions from light microscopy. |
| 67 | + |
| 68 | +Analysis is taking processed features and using them to answer a scientific question. |
| 69 | +For physiology, the NWB file is a key marker between processing and analysis. |
| 70 | + |
| 71 | +We separate data processing and analysis to facilitate flexible use of data. Whereas |
| 72 | +analytical uses of processed features can vary widely, the set of generally useful features |
| 73 | +is often constrained and well-understood (though the features are rarely easy to generate). |
| 74 | + |
| 75 | +Processing results must be represented in community-standard formats (NWB-Zarr, OME-Zarr). |
| 76 | +Analysis results can also be captured in standard formats, when applicable, and internally |
| 77 | +consistent formats when standards don’t exist. |
| 78 | + |
| 79 | + |
| 80 | +Primary data conventions |
| 81 | +======================== |
| 82 | + |
| 83 | +All data acquired in a single acquisition session will be stored together. This |
| 84 | +group needs a name, but it must be as simple as possible. It is critical that this |
| 85 | +name be unique, but we should not use this name to encode essential metadata. |
| 86 | + |
| 87 | +All primary data assets have the following naming convention: |
| 88 | + |
| 89 | + <platform-abbreviation>_<subject-id>_<acquisition-date>_<acquisition-time> |
| 90 | + |
| 91 | +A platform is a standardized system for collecting one or more modalities of data. |
| 92 | + |
| 93 | +A few points: |
| 94 | + |
| 95 | +- ``<acquisition-date>``: yyyy-mm-dd at end of acquisition |
| 96 | +- ``<acquisition-time>``: hh-mm-ss at end of acquisition |
| 97 | +- Acquisition date and time are essential for uniqueness |
| 98 | +- Acquisition date and time are in the local time zone |
| 99 | +- The time zone is documented in metadata |
| 100 | +- All tokens (e.g. ``<platform-abbreviation>``, ``<subject-id>``) must not contain underscores or illegal filename characters. |
| 101 | +- ``<platform-abbreviation>``: a shorthand of fewer than 10 characters for a data acquisition platform |
| 102 | + |
| 103 | +Again, this name is strictly for uniqueness. We could use a GUID, but choose |
| 104 | +to have a relatively simple naming convention to facilitate casual browsing. |
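The naming convention above can be sketched in Python. This is a minimal illustration, not part of ``aind-data-schema``: the helper names ``build_primary_name`` and ``parse_primary_name`` are hypothetical, and the allowed character set (letters, digits, hyphens) is an assumption standing in for "no underscores or illegal filename characters".

```python
import re
from datetime import datetime

# Assumption: tokens are limited to letters, digits, and hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9-]+")
# yyyy-mm-dd and hh-mm-ss, joined by the underscore that separates tokens.
DATETIME_FORMAT = "%Y-%m-%d_%H-%M-%S"


def build_primary_name(platform: str, subject_id: str, acquired_at: datetime) -> str:
    """Join platform, subject, and end-of-acquisition local date/time with underscores."""
    for token in (platform, subject_id):
        if not TOKEN_RE.fullmatch(token):
            raise ValueError(f"invalid token: {token!r}")
    if len(platform) >= 10:
        raise ValueError("platform abbreviation must be shorter than 10 characters")
    return f"{platform}_{subject_id}_{acquired_at.strftime(DATETIME_FORMAT)}"


def parse_primary_name(name: str) -> dict:
    """Split a primary asset name back into platform, subject, and timestamp."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated tokens, got {len(parts)}")
    platform, subject_id, date, time = parts
    acquired_at = datetime.strptime(f"{date}_{time}", DATETIME_FORMAT)
    return {"platform": platform, "subject_id": subject_id, "acquired_at": acquired_at}


name = build_primary_name("EFIP", "655568", datetime(2022, 4, 26, 11, 48, 9))
print(name)  # EFIP_655568_2022-04-26_11-48-09
```

Because tokens never contain underscores, splitting on ``_`` is enough to recover each token. The parsed timestamp is naive (local time); the time zone itself would still come from the metadata files.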
| 105 | + |
| 106 | +Primary data assets are organized as follows: |
| 107 | + |
| 108 | +    - <asset name> |
| 109 | +        - data_description.json (administrative information, funders, licenses, projects, etc) |
| 110 | +        - subject.json (species, sex, DOB, unique identifier, genotype, etc) |
| 111 | +        - procedures.json (subject surgeries, tissue preparation, water restriction, training protocols, etc) |
| 112 | +        - instrument.json/rig.json (static hardware components) |
| 113 | +        - acquisition.json/session.json (device settings that change acquisition-to-acquisition) |
| 114 | +        - <modality-1> |
| 115 | +            - <list of data files> |
| 116 | +        - <modality-2> |
| 117 | +            - <list of data files> |
| 118 | +        - <modality-n> |
| 119 | +            - <list of data files> |
| 120 | +        - derivatives (processed data generated during acquisition) |
| 121 | +            - <label> (e.g. MIP) |
| 122 | +                - <list of files> |
| 123 | +        - logs (general log files generated by the instrument or rig that are not modality-specific) |
| 124 | +            - <list of files> |
| 125 | + |
| 126 | +Platform abbreviation and modality terms come from controlled vocabularies in aind-data-schema-models. |
| 127 | + |
| 128 | +Example for simultaneous electrophysiology with optotagging and fiber photometry: |
| 129 | + |
| 130 | +    - EFIP_655568_2022-04-26_11-48-09 |
| 131 | +        - <metadata JSON files> |
| 132 | +        - FIB |
| 133 | +            - L415_2022-04-26T11_48_09.csv |
| 134 | +            - L470_2022-04-26T11_48_09.csv |
| 135 | +            - L560_2022-04-26T11_48_09.3024512-07_00 |
| 136 | +            - Raw2022-04-26T11_48_09.csv |
| 137 | +            - TTL_2022-04-26T11_48_08.1780864-07_00 |
| 138 | +            - TTL_TS2022-04-26T11_48_08.csv |
| 139 | +            - TimeStamp_2022-04-26T11_48_08.csv |
| 140 | +        - ecephys |
| 141 | +            - 220426114809_655568.opto.csv |
| 142 | +            - Record Node 104 |
| 143 | +                - <files> |
| 144 | +        - behavior-videos |
| 145 | +            - face_camera.mp4 |
| 146 | +            - body_camera.mp4 |
| 147 | + |
| 148 | +Example for lightsheet microscopy data acquired on the ExaSPIM platform: |
| 149 | + |
| 150 | +    - exaSPIM_655568_2022-04-26_11-48-09 |
| 151 | +        - <metadata JSON files> |
| 152 | +        - SPIM |
| 153 | +            - SPIM.ome.zarr |
| 154 | +        - derivatives |
| 155 | +            - MIP |
| 156 | +                - <list of e.g. tiff files> |
| 157 | + |
| 158 | +Derived data conventions |
| 159 | +======================== |
| 160 | + |
| 161 | +Anything computed in a single run should be logically grouped in a folder. The folder should be named: |
| 162 | + |
| 163 | + <primary-asset-name>_<process-label>_<process-date>_<process-time> |
| 164 | + |
| 165 | +For example: |
| 166 | + |
| 167 | +- ``exaSPIM_ANM457202_2022-07-11_22-11-32_processed_2022-08-11_22-11-32`` |
| 168 | +- ``ecephys_595262_2022-02-21_15-18-07_processed_2022-08-11_22-11-32`` |
| 169 | + |
| 170 | +Processed outputs are usually the result of a multi-stage pipeline, so ``<process-label>`` should often |
| 171 | +just be ``processed``. Other common process labels include: |
| 172 | + |
| 173 | +- ``curation`` - tags assigned to input data (e.g. merge/split/noise calls for ephys units) |
| 174 | +- ... |
| 175 | + |
| 176 | +Overlong names are difficult to read, so do not daisy-chain. The goal is to keep names as simple |
| 177 | +as possible while being readable, not to encode all metadata or the entire provenance chain. If |
| 178 | +various stages of processing are being performed manually over extended periods of time, anchor |
| 179 | +each derived asset on the primary data asset. |
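One way to read the "do not daisy-chain" rule is sketched below. The helpers ``build_derived_name`` and ``primary_name_of`` are hypothetical names, the underscore check on the process label is an assumption carried over from the token rules for primary assets, and the curation timestamp is made up for illustration.

```python
from datetime import datetime

DATETIME_FORMAT = "%Y-%m-%d_%H-%M-%S"


def primary_name_of(asset_name: str) -> str:
    """Return the first four tokens, i.e. the primary asset an asset is anchored on."""
    parts = asset_name.split("_")
    if len(parts) < 4:
        raise ValueError(f"not a valid asset name: {asset_name!r}")
    return "_".join(parts[:4])


def build_derived_name(primary_name: str, process_label: str, processed_at: datetime) -> str:
    """Append a process label and processing date/time to a primary asset name."""
    if "_" in process_label:
        # Assumption: process labels follow the same no-underscore rule as other tokens.
        raise ValueError("process label must not contain underscores")
    return f"{primary_name}_{process_label}_{processed_at.strftime(DATETIME_FORMAT)}"


primary = "ecephys_595262_2022-02-21_15-18-07"
processed = build_derived_name(primary, "processed", datetime(2022, 8, 11, 22, 11, 32))

# A later manual curation step is anchored on the primary asset name,
# not on the processed asset's full name, so names never grow past seven tokens.
curated = build_derived_name(
    primary_name_of(processed), "curation", datetime(2022, 8, 12, 9, 0, 0)
)

print(processed)  # ecephys_595262_2022-02-21_15-18-07_processed_2022-08-11_22-11-32
print(curated)    # ecephys_595262_2022-02-21_15-18-07_curation_2022-08-12_09-00-00
```

The link between the curation asset and the processed asset it was computed from is not encoded in the name; under this reading it would be recorded in metadata such as ``processing.json``.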
| 180 | + |
| 181 | +Processed result folder organization is as follows: |
| 182 | + |
| 183 | +    - <asset name> |
| 184 | +        - data_description.json |
| 185 | +        - processing.json (describes the code, input parameters, outputs) |
| 186 | +        - subject.json (copied from primary asset) |
| 187 | +        - procedures.json (copied from primary asset) |
| 188 | +        - instrument.json (copied from primary asset) |
| 189 | +        - acquisition.json (copied from primary asset) |
| 190 | +        - <process-label-1> |
| 191 | +            - <list of files> |
| 192 | +        - <process-label-2> |
| 193 | +            - <list of files> |
| 194 | +        - <process-label-n> |
| 195 | +            - <list of files> |
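A rough sketch of how a derived asset folder could be laid out without touching the primary asset, using only the Python standard library. ``create_derived_asset`` is a hypothetical helper, ``spikesorting`` is a made-up process label, and the demo runs in a temporary directory; writing ``data_description.json`` and ``processing.json`` for the new asset is left out.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable

# Files the layout above marks as "copied from primary asset".
COPIED_METADATA = ("subject.json", "procedures.json", "instrument.json", "acquisition.json")


def create_derived_asset(primary_dir: Path, derived_dir: Path, process_labels: Iterable[str]) -> Path:
    """Create a new derived asset folder and copy shared metadata into it."""
    # exist_ok=False refuses to add files to an existing directory (immutability).
    derived_dir.mkdir(parents=True, exist_ok=False)
    for filename in COPIED_METADATA:
        source = primary_dir / filename
        if source.exists():  # copy only the metadata files the primary asset has
            shutil.copy2(source, derived_dir / filename)
    for label in process_labels:
        (derived_dir / label).mkdir()
    return derived_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    primary = root / "ecephys_595262_2022-02-21_15-18-07"
    primary.mkdir()
    (primary / "subject.json").write_text("{}")

    derived = create_derived_asset(
        primary,
        root / "ecephys_595262_2022-02-21_15-18-07_processed_2022-08-11_22-11-32",
        ["spikesorting"],  # hypothetical process label
    )
    created = sorted(p.name for p in derived.iterdir())
    primary_after = sorted(p.name for p in primary.iterdir())

print(created)        # ['spikesorting', 'subject.json']
print(primary_after)  # ['subject.json']
```

The derived folder is a sibling of the primary folder, which keeps the structure flat and leaves the primary asset's contents unchanged.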
| 196 | + |
| 197 | +File name guidelines |
| 198 | +==================== |
| 199 | + |
| 200 | +When naming files, we should: |
| 201 | + |
| 202 | +- use terms from vocabularies defined in aind-data-schema, e.g. |
| 203 | +  - platform names and modalities behavior video file names |
| 204 | +- use ``yyyy-mm-dd`` and ``hh-mm-ss`` in local time zone for dates and times |
| 205 | +- separate tokens with underscores, and not include underscores in tokens, e.g. |
| 206 | +  - Do this: ``EFIP_655568_2022-04-26_11-48-09`` |
| 207 | +  - Not this: ``EFIP-655568-2022_04_26-11_48_09`` |
| 208 | +- not include illegal filename characters in tokens |
| 209 | + |