Status of Database Creation Pipeline
This page documents the progress towards getting a working pipeline that processes all of our messy experimental PAM data into a cleanly organized data table.
Each data table row represents a well during a single experiment (i.e., a well from a particular plate number and measurement method). Together, the well id (row x column), plate number (1-20 or 99), and measurement id (M1-6) function as a unique identifier.
- No duplicated rows
- 100% of the rows should have a number of frames = 84 or 164
- 100% of Y(II) values are floats (and not NaNs)
- Y(II) validity is counted over the actual Y(II) values, i.e. each row has 41 or 81 values, and a fraction is reported for each row
- Duplicated rows
Out of 33792 unique wells:
Fraction duplicated: 0.06818181818181818
Number Duplicated: 2304
Notes: duplicated plates might be real duplicates from the 99 well plates. The solution is on the gdrive side: duplicate files need to be removed.
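The duplicate check above can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline's actual code: the field names `plate`, `measurement`, and `well_id` are hypothetical, and it assumes "duplicated" means a unique identifier that occurs on more than one row.

```python
from collections import Counter

def duplicate_summary(rows):
    """Count identifiers that occur on more than one row.

    rows: iterable of dicts with the hypothetical keys
    'plate', 'measurement', and 'well_id'.
    """
    keys = [(r["plate"], r["measurement"], r["well_id"]) for r in rows]
    counts = Counter(keys)
    duplicated = sorted(k for k, n in counts.items() if n > 1)
    n_unique = len(counts)
    return {
        "n_unique": n_unique,
        "n_duplicated": len(duplicated),
        "fraction_duplicated": len(duplicated) / n_unique if n_unique else 0.0,
        "duplicated_keys": duplicated,
    }

# Toy input: the plate 99 well appears twice with the same identifier.
rows = [
    {"plate": 1, "measurement": "M1", "well_id": "A01"},
    {"plate": 1, "measurement": "M1", "well_id": "A02"},
    {"plate": 99, "measurement": "M1", "well_id": "A01"},
    {"plate": 99, "measurement": "M1", "well_id": "A01"},
]
summary = duplicate_summary(rows)
print(summary)
```

Listing the duplicated keys, rather than only counting them, makes it easier to find the matching files on the gdrive side.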
- Correct number of frames

Out of 33792 unique wells:
Fraction with too many frames: 0.03409090909090909
Number with too many frames: 1152
Plates with extra frames should be discarded.
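A frame-count check along these lines could look like the sketch below. The allowed counts (84 and 164) come from the criteria list; the dictionary layout and the choice to group flagged wells by (plate, measurement) are assumptions made for illustration.

```python
# Frame counts accepted by the check (from the criteria list).
ALLOWED_FRAME_COUNTS = {84, 164}

def frame_count_report(frames_per_well):
    """Flag wells whose frame count is not one of the allowed values.

    frames_per_well: dict mapping (plate, measurement, well_id) to the
    number of frames recorded for that well (hypothetical layout).
    """
    bad = {k: n for k, n in frames_per_well.items()
           if n not in ALLOWED_FRAME_COUNTS}
    total = len(frames_per_well)
    # Plates containing at least one flagged well are discard candidates.
    plates_to_discard = sorted({(plate, meas) for plate, meas, _ in bad})
    return {
        "n_wells": total,
        "n_bad": len(bad),
        "fraction_bad": len(bad) / total if total else 0.0,
        "plates_to_discard": plates_to_discard,
    }

# Toy input: plate 2 has one well with two extra frames.
frames_per_well = {
    (1, "M1", "A01"): 84,
    (1, "M1", "A02"): 164,
    (2, "M1", "A01"): 166,
    (2, "M1", "A02"): 164,
}
report = frame_count_report(frames_per_well)
print(report)
```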
- Valid Y(II) values

Out of 32640 unique wells with the correct number of frames:
Fraction of NaN timeseries: 0.08250612745098039
Number of NaN timeseries: 2693
The light/dark pairing might be off; this is all due to the Fv/Fm values being off.
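The Y(II) validity check can be sketched as a per-row fraction of usable values. The function names are hypothetical, and the toy series are shortened to four values for readability (real rows have 41 or 81 values).

```python
import math

def valid_fraction(series):
    """Fraction of entries that are floats and not NaN."""
    if not series:
        return 0.0
    n_valid = sum(
        1 for v in series if isinstance(v, float) and not math.isnan(v)
    )
    return n_valid / len(series)

def is_all_nan(series):
    """True when a time series contains no usable value at all."""
    return valid_fraction(series) == 0.0

good = [0.61, 0.58, 0.55, 0.57]
partial = [0.61, float("nan"), 0.55, 0.57]
no_data = [float("nan")] * 4

all_series = (good, partial, no_data)
fractions = [valid_fraction(s) for s in all_series]
n_all_nan = sum(is_all_nan(s) for s in all_series)
print(fractions, n_all_nan)
```

Reporting a fraction per row separates fully missing time series from rows that are only partly affected.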
Out of 33792 wells with data collected, 29947 wells (88.6%) have the full set of valid data specified by these checks.
Other notes: make sure that the Fv/Fm of the 3 WTs is different from the previous day.
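The figures reported on this page can be reproduced with a few lines of arithmetic. The variable names are only illustrative. Note that the total of 29947 follows from subtracting the frame-count and NaN failures alone; the duplicated count is not subtracted in this calculation.

```python
total_wells = 33792
n_duplicated = 2304
n_too_many_frames = 1152
n_nan_series = 2693

# Wells left after dropping those with the wrong number of frames.
correct_frames = total_wells - n_too_many_frames
# Wells that also have a valid Y(II) time series.
fully_valid = correct_frames - n_nan_series

fraction_duplicated = n_duplicated / total_wells
fraction_too_many = n_too_many_frames / total_wells
fraction_nan = n_nan_series / correct_frames
percent_valid = round(100 * fully_valid / total_wells, 1)

print(correct_frames, fully_valid, percent_valid)
```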
Commit hash to reproduce: 670cee00a37c0d345d8bfba3d167417fbf3a6fee
I started from the main branch and implemented some changes in order to re-run database creation locally on my laptop. The total set of TIF and CSV data from the Google Drive came to 6.7G.
Number of mutant_ID values: 6461
Number of Y(II) time series which are not all NaN: 33,395


