Status of Database Creation Pipeline

Murray Cutforth edited this page Apr 8, 2024 · 8 revisions

This page documents the progress towards getting a working pipeline that processes all of our messy experimental PAM data into a cleanly organized data table.

Each data table row represents a well during a single experiment (i.e. a well from a particular plate # and measurement method). Together, the well id (row x column), plate number (1-20 or 99) and measurement id (M1-6) function as a unique identifier.
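As a minimal sketch of how this composite identifier could be used to spot duplicated rows, assume each row is available as a Python dict. The field names `plate`, `measurement` and `well_id`, and the `A01`-style well id, are hypothetical and may not match the column names used in the pipeline.

```python
from collections import Counter


def well_key(row):
    """Return the composite identifier (plate, measurement, well_id) for a row."""
    return (row["plate"], row["measurement"], row["well_id"])


def find_duplicate_keys(rows):
    """Return the set of identifiers that occur in more than one row."""
    counts = Counter(well_key(row) for row in rows)
    return {key for key, n in counts.items() if n > 1}


# Illustrative rows only; these are not taken from the real data set.
rows = [
    {"plate": 1, "measurement": "M1", "well_id": "A01"},
    {"plate": 1, "measurement": "M2", "well_id": "A01"},
    {"plate": 99, "measurement": "M1", "well_id": "A01"},
    {"plate": 99, "measurement": "M1", "well_id": "A01"},  # repeated identifier
]

duplicates = find_duplicate_keys(rows)
print(duplicates)  # {(99, 'M1', 'A01')}
```

The same well id on a different plate or measurement is not a duplicate; only a repeat of the full triple counts.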

Targets

  1. No duplicated rows
  2. 100% of the rows should have a number of frames = 84 or 164
  3. 100% of Y(II) values are floats (and not NaNs)
    1. counting by actual Y(II) values -- i.e. each row has 41 or 81 values, and a fraction is reported for each row
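Targets 2 and 3 could be checked per row roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes the Y(II) values of a row are available as a plain Python list.

```python
import math

# Allowed frame counts, taken from target 2 above.
VALID_FRAME_COUNTS = {84, 164}


def has_valid_frame_count(num_frames):
    """Target 2: the row must have 84 or 164 frames."""
    return num_frames in VALID_FRAME_COUNTS


def fraction_valid_yii(values):
    """Target 3: fraction of Y(II) values that are floats and not NaN."""
    if not values:
        return 0.0
    valid = sum(1 for v in values if isinstance(v, float) and not math.isnan(v))
    return valid / len(values)


# Illustrative values only; a real row would hold 41 or 81 Y(II) values.
example = [0.61, 0.58, float("nan"), 0.55]

print(has_valid_frame_count(84))    # True
print(has_valid_frame_count(100))   # False
print(fraction_valid_yii(example))  # 0.75
```

Reporting a fraction per row, rather than a single pass/fail flag, makes it possible to tell a row that is entirely NaN from one with only a few invalid values.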

3/18/24

  1. Duplicated rows
Out of 33792 unique wells: 
Fraction duplicated: 0.06818181818181818
Number Duplicated: 2304

Notes: the duplicated plates might be real duplicates from the 99 well plates. The solution is on the gdrive side: the duplicate files need to be removed.

  2. Correct number of frames

Screenshot 2024-03-18 at 12 44 11 PM

Out of 33792 unique wells: 
Fraction with too many frames: 0.03409090909090909
Number with too many frames: 1152

Plates with extra frames should be discarded.

  3. Valid Y(II) values

Screenshot 2024-03-18 at 1 23 47 PM

Out of 32640 unique wells with the correct number of frames: 
Fraction of NaN timeseries: 0.08250612745098039
Number of NaN timeseries: 2693

The light/dark pairing might be off; this is all due to the Fv/Fm values being off.

Summary

Out of 33792 wells with data collected, 29947 wells (88.6%) have the full set of valid data specified by these checks.
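The summary figure can be reproduced from the counts reported above. The variable names below are illustrative. The arithmetic suggests that the 2304 duplicated rows are not subtracted in this total, although that is an inference from the numbers rather than something stated on this page.

```python
# Counts reported in the 3/18/24 checks above.
total_wells = 33792
too_many_frames = 1152
nan_timeseries = 2693

# Wells left after discarding those with too many frames.
correct_frames = total_wells - too_many_frames

# Wells that also have a non-NaN Y(II) time series.
fully_valid = correct_frames - nan_timeseries

percent_valid = round(100 * fully_valid / total_wells, 1)

print(correct_frames)  # 32640
print(fully_valid)     # 29947
print(percent_valid)   # 88.6
```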

Other notes: make sure that the Fv/Fm of the 3 WT's is different from the previous day.

8th April 2024 update from Murray

Commit hash to reproduce: 670cee00a37c0d345d8bfba3d167417fbf3a6fee

I started from the main branch and implemented some changes in order to re-run database creation locally on my laptop. The full set of tif and csv data from the Google Drive came to 6.7 GB.

Some summary data:

Number of mutant_ID values: 6461
Number of Y(II) time series which are not all NaN: 33,395

Raw Y(II) data:

image

Light regime / plate coverage:

image

Number of blank wells per plate:

image
