
Commit e894069

Feat getting started (#980)
* initial integration of data organization conventions
* upgrading
* vocabs
* fixing links
* cleanup
* shuffling faq
* adding example workflow code
* draft
* formatting
* tabs
* typos
* typos
* tabs
* lint example
* line nos
* lint
* typo
* tweaks
* lint
* better dates
* never assume
* start times
* date formats
* shuffling
* reframing database question
* typo
1 parent 12fd6f2 commit e894069

File tree

13 files changed

+556
-31
lines changed


docs/source/acquisition.rst

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-Acquisition schema
-==================
+Acquisition
+===========
 
 **Q: What is Acquisition?**
 

docs/source/data_description.rst

Lines changed: 19 additions & 2 deletions
@@ -1,5 +1,5 @@
-Data description schema
-=======================
+Data description
+================
 
 **Q: What is the data description file?**
 
@@ -39,4 +39,21 @@ No. The funding for internally funded AIND or AIBS work is listed as “Allen In
 
 Congratulations! The funding information is pulled from the Funding Smartsheet that Shelby maintains. Work with Shelby
 to make sure your grant is on that sheet.
+
+**Q: What are “Institution” and “Group” doing in data_description.json?**
+
+In the future we may need to tag cloud resources based on the originating
+group, which may or may not be in AIND, in order to track usage and spending.
+
+**Q: What happened to the “experiment type” asset label? Why are we using platform names instead?**
+
+Formerly we used a short label called “experiment type” in asset names instead of platform
+names. This concept was confusing because it was difficult to distinguish from a “modality”.
+Most of our data contains multiple modalities. A recording session may contain trained behavior
+event data (e.g. lick times), behavior videos (e.g. face camera), neuropixels recordings, and
+fiber photometry recordings.
+
+Anchoring browsing on data collection platforms is clearer. We will tag sessions in our metadata
+database to indicate which modalities are present in which sessions.
 

docs/source/data_organization.rst

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
=================
Data organization
=================

``aind-data-schema`` validates and writes JSON files containing metadata. We store those
JSON files in a particular directory structure to support our ability to rapidly and openly
share data.

Core principles
===============

**Immutability**

Derived data cannot affect input data. This is essential for reproducibility.
All data, once produced, should be treated as “read only”. Derived processes
cannot change input data. This means no appending information to input files,
and no adding files to existing directories.

**Acquisition sessions first**

The fundamental logical unit for primary data is the acquisition session (time).

There are many ways to logically group data. We group all data acquired at the
same time, for two reasons:

First, it is helpful to logically group data that directly affect each other. The
treadmill data stream is tightly coupled to the video capturing the body of the
mouse, which naturally affects neural activity. Grouping these simultaneously
collected data streams together helps users understand the data they process
and analyze.

Second, organizing by session (time) facilitates immutable rapid sharing. Were
we to share data at the project or dataset level, our ability to share would hinge
on difficult decisions tied to the project’s intended use of the data. For example,
waiting to release data that all meet the quality control criteria defined by a
particular project assumes that those criteria apply to all potential uses of the data.

**Flat structure**

We avoid using hierarchies to encode metadata. Grouping data into hierarchies via
directories - or implied hierarchies with complex ordered file naming conventions - is
a common practice to facilitate search. However, any type of hierarchy dramatically
constrains how data can be used. Grouping data by project makes it difficult to find data
by modality. Grouping data by modality makes it difficult to find data by mouse.

A flat structure organized by time is unopinionated about what metadata will be most
useful. We will instead rely on flexible database queries to facilitate data discovery
along any dimension, rather than biasing in favor of one field or another.

**Processing is a session**

Processing sessions are analogous to primary data acquisition sessions. Processed data
files should therefore be logically grouped together, separate from primary data.
Timestamping processed results allows us to flexibly reprocess without affecting primary
data. The generic term we use for both acquisition sessions and processing sessions
is the data asset.

We could consider separate data assets for different processing pipeline steps (e.g. one
asset for stitching transforms, one asset for fused results, one asset for segmented neurons,
etc.). However, at this point that seems like unnecessary complexity.

**Standard processing, flexible analysis**

We define processing as basic feature extraction - spike sorting for electrophysiology,
limb positions extracted from behavior videos, cell positions from light microscopy.

Analysis is taking processed features and using them to answer a scientific question.
For physiology, the NWB file is a key marker between processing and analysis.

We separate data processing and analysis to facilitate flexible use of data. Whereas
analytical use of processed features can vary widely, the set of generally useful features
is often constrained and well understood (though such features are rarely easy to generate).

Processing results must be represented in community-standard formats (NWB-Zarr, OME-Zarr).
Analysis results can also be captured in standard formats, when applicable, and in internally
consistent formats when standards don’t exist.

Primary data conventions
========================

All data acquired in a single acquisition session is stored together. This
group needs a name, but the name must be as simple as possible. It is critical that this
name be unique, but we should not use it to encode essential metadata.

All primary data assets have the following naming convention:

<platform-abbreviation>_<subject-id>_<acquisition-date>_<acquisition-time>

A platform is a standardized system for collecting one or more modalities of data.

A few points:

- ``<acquisition-date>``: yyyy-mm-dd at the end of acquisition
- ``<acquisition-time>``: hh-mm-ss at the end of acquisition
- Acquisition date and time are essential for uniqueness
- Acquisition date and time are in the local time zone
- The time zone is documented in the metadata
- Tokens (e.g. ``<platform-abbreviation>``, ``<subject-id>``) must not contain underscores or illegal filename characters
- ``<platform-abbreviation>``: a shorthand of fewer than 10 characters for a data acquisition platform

Again, this name is strictly for uniqueness. We could use a GUID, but we choose
a relatively simple naming convention to facilitate casual browsing.
Primary data assets are organized as follows:

- <asset name>
  - data_description.json (administrative information: funders, licenses, projects, etc.)
  - subject.json (species, sex, DOB, unique identifier, genotype, etc.)
  - procedures.json (subject surgeries, tissue preparation, water restriction, training protocols, etc.)
  - instrument.json/rig.json (static hardware components)
  - acquisition.json/session.json (device settings that change acquisition-to-acquisition)
  - <modality-1>
    - <list of data files>
  - <modality-2>
    - <list of data files>
  - <modality-n>
    - <list of data files>
  - derivatives (processed data generated during acquisition)
    - <label> (e.g. MIP)
      - <list of files>
  - logs (general log files generated by the instrument or rig that are not modality-specific)
    - <list of files>

Platform abbreviation and modality terms come from controlled vocabularies in aind-data-schema-models.

Example for simultaneous electrophysiology with optotagging and fiber photometry:

- EFIP_655568_2022-04-26_11-48-09
  - <metadata JSON files>
  - FIB
    - L415_2022-04-26T11_48_09.csv
    - L470_2022-04-26T11_48_09.csv
    - L560_2022-04-26T11_48_09.3024512-07_00
    - Raw2022-04-26T11_48_09.csv
    - TTL_2022-04-26T11_48_08.1780864-07_00
    - TTL_TS2022-04-26T11_48_08.csv
    - TimeStamp_2022-04-26T11_48_08.csv
  - ecephys
    - 220426114809_655568.opto.csv
    - Record Node 104
      - <files>
  - behavior-videos
    - face_camera.mp4
    - body_camera.mp4

Example for lightsheet microscopy data acquired on the ExaSPIM platform:

- exaSPIM_655568_2022-04-26_11-48-09
  - <metadata JSON files>
  - SPIM
    - SPIM.ome.zarr
  - derivatives
    - MIP
      - <list of e.g. tiff files>
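As a concrete illustration, this skeleton can be laid out with a small helper. A hedged sketch (``make_primary_asset_skeleton`` is our illustration, not a schema API), assuming modality folder names are drawn from the controlled vocabulary:

from pathlib import Path

def make_primary_asset_skeleton(root: Path, asset_name: str, modalities: list[str]) -> Path:
    """Create the bare directory layout for a primary data asset."""
    asset = root / asset_name
    for modality in modalities:
        (asset / modality).mkdir(parents=True, exist_ok=True)  # one folder per modality
    (asset / "derivatives").mkdir(parents=True, exist_ok=True)
    (asset / "logs").mkdir(parents=True, exist_ok=True)
    return asset

# make_primary_asset_skeleton(Path("."), "EFIP_655568_2022-04-26_11-48-09", ["FIB", "ecephys", "behavior-videos"])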
Derived data conventions
========================

Anything computed in a single run should be logically grouped in a folder, named:

<primary-asset-name>_<process-label>_<process-date>_<process-time>

For example:

- ``exaSPIM_ANM457202_2022-07-11_22-11-32_processed_2022-08-11_22-11-32``
- ``ecephys_595262_2022-02-21_15-18-07_processed_2022-08-11_22-11-32``

Processed outputs are usually the result of a multi-stage pipeline, so <process-label> should
often just be “processed”. Other common process labels include:

- ``curation`` - tags assigned to input data (e.g. merge/split/noise calls for ephys units)
- ...

Overlong names are difficult to read, so do not daisy-chain process labels. The goal is to keep
names as simple as possible while remaining readable, not to encode all metadata or the entire
provenance chain. If various stages of processing are performed manually over extended periods
of time, anchor each derived asset on the primary data asset, as in the sketch below.
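A minimal sketch of this naming rule (``derived_asset_name`` is a hypothetical helper, not a schema API); note that it anchors on the primary asset name rather than daisy-chaining earlier process labels:

from datetime import datetime

def derived_asset_name(primary_asset_name: str, process_label: str, processed_end: datetime) -> str:
    """Build <primary-asset-name>_<process-label>_<yyyy-mm-dd>_<hh-mm-ss>."""
    if "_" in process_label:
        raise ValueError("process label must not contain underscores")
    return f"{primary_asset_name}_{process_label}_{processed_end:%Y-%m-%d}_{processed_end:%H-%M-%S}"

# derived_asset_name("ecephys_595262_2022-02-21_15-18-07", "processed", datetime(2022, 8, 11, 22, 11, 32))
# -> "ecephys_595262_2022-02-21_15-18-07_processed_2022-08-11_22-11-32"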
Processed result folder organization is as follows:

- <asset name>
  - data_description.json
  - processing.json (describes the code, input parameters, and outputs)
  - subject.json (copied from the primary asset)
  - procedures.json (copied from the primary asset)
  - instrument.json (copied from the primary asset)
  - acquisition.json (copied from the primary asset)
  - <process-label-1>
    - <list of files>
  - <process-label-2>
    - <list of files>
  - <process-label-n>
    - <list of files>

File name guidelines
====================

When naming files, we should:

- use terms from the controlled vocabularies defined in aind-data-schema, e.g. platform names and modalities in behavior video file names
- use “yyyy-mm-dd” dates and “hh-mm-ss” times in the local time zone
- separate tokens with underscores, and not include underscores within tokens, e.g.
  - Do this: ``EFIP_655568_2022-04-26_11-48-09``
  - Not this: ``EFIP-655568-2022_04_26-11_48_09``
- not include illegal filename characters in tokens

These token rules are illustrated in the sketch below.
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
import os

import pandas as pd

from aind_data_schema_models.modalities import Modality
from aind_data_schema_models.organizations import Organization
from aind_data_schema_models.pid_names import PIDName
from aind_data_schema_models.platforms import Platform

from aind_data_schema.core.data_description import Funding, RawDataDescription
from aind_data_schema.core.subject import Subject, Species, BreedingInfo, Housing
from aind_data_schema.core.procedures import (
    NanojectInjection,
    Procedures,
    Surgery,
    ViralMaterial,
    Perfusion,
)

sessions_df = pd.read_excel("example_workflow.xlsx", sheet_name="sessions")
mice_df = pd.read_excel("example_workflow.xlsx", sheet_name="mice")
procedures_df = pd.read_excel("example_workflow.xlsx", sheet_name="procedures")

# everything was done by one person, so it's not in the spreadsheet
experimenter = "Sam Student"

# in our spreadsheet, we stored sex as M/F instead of Male/Female
subject_sex_lookup = {
    "F": "Female",
    "M": "Male",
}

# everything is covered by the same IACUC protocol
iacuc_protocol = "2109"

# loop through all of the sessions
for session_idx, session in sessions_df.iterrows():

    # our data always contains planar optical physiology and behavior videos
    d = RawDataDescription(
        modality=[Modality.POPHYS, Modality.BEHAVIOR_VIDEOS],
        platform=Platform.BEHAVIOR,
        subject_id=str(session["mouse_id"]),
        creation_time=session["end_time"].to_pydatetime(),
        institution=Organization.OTHER,
        investigators=[PIDName(name="Some Investigator")],
        funding_source=[Funding(funder=Organization.NIMH)],
    )

    # we will store our json files in a directory named after the session
    os.makedirs(d.name, exist_ok=True)

    d.write_standard_file(output_directory=d.name)

    # look up the mouse used in this session, along with its dam and sire
    mouse = mice_df[mice_df["id"] == session["mouse_id"]].iloc[0]
    dam = mice_df[mice_df["id"] == mouse["dam_id"]].iloc[0]
    sire = mice_df[mice_df["id"] == mouse["sire_id"]].iloc[0]

    # construct the subject
    s = Subject(
        subject_id=str(mouse["id"]),
        species=Species.MUS_MUSCULUS,  # all subjects are mice
        sex=subject_sex_lookup.get(mouse["sex"]),
        date_of_birth=mouse["dob"],
        genotype=mouse["genotype"],
        breeding_info=BreedingInfo(
            maternal_id=str(dam["id"]),
            maternal_genotype=dam["genotype"],
            paternal_id=str(sire["id"]),
            paternal_genotype=sire["genotype"],
            breeding_group="unknown",  # not in spreadsheet
        ),
        housing=Housing(
            home_cage_enrichment=["Running wheel"],  # all subjects had a running wheel in their cage
            cage_id="unknown",  # not in spreadsheet
        ),
        background_strain="C57BL/6J",
        source=Organization.OTHER,
    )
    s.write_standard_file(output_directory=d.name)

    # look up the procedures performed on this mouse
    proc_row = procedures_df[procedures_df["mouse_id"] == mouse["id"]].iloc[0]

    # we stored the injection coordinates as a comma-delimited string: AP,ML,DV,angle
    coords = proc_row.injection_coord.split(",")

    # in this example, a single protocol covers all surgical procedures
    protocol = str(proc_row["protocol"])

    p = Procedures(
        subject_id=str(mouse["id"]),
        subject_procedures=[
            Surgery(
                start_date=proc_row["injection_date"].to_pydatetime().date(),
                protocol_id=protocol,
                iacuc_protocol=iacuc_protocol,
                experimenter_full_name=experimenter,
                procedures=[
                    NanojectInjection(
                        protocol_id=protocol,
                        injection_materials=[
                            ViralMaterial(
                                material_type="Virus",
                                name=proc_row["virus_name"],
                                titer=proc_row["virus_titer"],
                            )
                        ],
                        targeted_structure=proc_row["brain_area"],
                        injection_coordinate_ml=float(coords[1]),
                        injection_coordinate_ap=float(coords[0]),
                        injection_angle=float(coords[3]),
                        # multiple injection volumes at different depths are allowed, but not used here
                        injection_coordinate_depth=[float(coords[2])],
                        injection_volume=[float(proc_row["injection_volume"])],
                    )
                ],
            ),
            Surgery(
                start_date=proc_row["perfusion_date"].to_pydatetime().date(),
                experimenter_full_name=experimenter,
                iacuc_protocol=iacuc_protocol,
                protocol_id=protocol,
                procedures=[Perfusion(protocol_id=protocol, output_specimen_ids=["1"])],
            ),
        ],
    )
    p.write_standard_file(output_directory=d.name)
