|
| 1 | +# Passing Files to DCP |
| 2 | + |
| 3 | +Distributed-CellProfiler can be told what files to use through LoadData.csv, Batch Files, or file lists. |
| 4 | + |
| 5 | +## Load Data |
| 6 | + |
| 7 | + |
| 8 | + |
| 9 | +LoadData.csv files are CSVs that tell CellProfiler how the images should be parsed. |
| 10 | +At a minimum, this CSV should contain PathName_{NameOfChannel} and FileName_{NameOfChannel} columns for each of your channels, as well as Metadata_{PieceOfMetadata} for each kind of metadata being used to group your image sets. |
| 11 | +It can contain any other metadata you would like to track. |
| 12 | +Some users have reported issues with relative paths in the PathName columns; absolute paths beginning with `/home/ubuntu/bucket/{relativepath}` tend to be more reliable. |
| 13 | + |
| 14 | +### Creating LoadData.csv |
| 15 | + |
| 16 | +You can create this CSV yourself via your favorite scripting language. |
| 17 | +We maintain a script for creating LoadData.csv from Phenix metadata XML files called [pe2loaddata](https://github.com/broadinstitute/pe2loaddata). |
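If you script the CSV yourself, the following minimal Python sketch shows the column layout described above. The file naming scheme, the channel names, the `Metadata_Plate`/`Metadata_Well`/`Metadata_Site` columns, and the `projects/example/images` folder are all hypothetical placeholders, not something DCP requires; adapt them to your own data.

```python
import csv
import io


def build_load_data_rows(image_dir, sites, channels, plate, well):
    """Return one dict per image set with FileName_, PathName_ and Metadata_ columns."""
    rows = []
    for site in sites:
        row = {"Metadata_Plate": plate, "Metadata_Well": well, "Metadata_Site": site}
        for channel in channels:
            # Hypothetical naming scheme: <well>_s<site>_<channel>.tiff
            row[f"FileName_{channel}"] = f"{well}_s{site}_{channel}.tiff"
            row[f"PathName_{channel}"] = image_dir
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text, using the keys of the first row as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Assumed example values; the absolute path follows the /home/ubuntu/bucket convention.
rows = build_load_data_rows(
    "/home/ubuntu/bucket/projects/example/images",
    sites=[1, 2],
    channels=["DNA", "Actin"],
    plate="Plate1",
    well="A01",
)
load_data_csv = rows_to_csv(rows)
```

Writing `load_data_csv` to a file named `LoadData.csv` gives one row per image set, with one `FileName_`/`PathName_` pair per channel.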
| 18 | + |
| 19 | +You can also create the LoadData.csv in a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups. |
| 20 | +More written and video information about using the input modules can be found [here](https://broad.io/CellProfilerInput). |
| 21 | +After loading in your images, use the Export->Image Set Listing command. |
| 22 | +You will then need to replace the local paths with the paths where the files can be found in the cloud. |
| 23 | +If your files are in the same structure, this can be done with a simple find and replace in any text editing software. |
| 24 | +(e.g. Find '/Users/eweisbar/Desktop' and replace with '/home/ubuntu/bucket') |
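The find and replace can also be scripted. The sketch below is one possible approach, assuming the local paths live in columns whose names start with `PathName_`; the function name `remap_path_columns` and the sample row are illustrative only. Check the columns in your own exported CSV before relying on it.

```python
import csv
import io


def remap_path_columns(csv_text, local_root, cloud_root):
    """Swap local_root for cloud_root at the start of every PathName_ value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column, value in row.items():
            # Only touch path columns whose value actually starts with the local root.
            if column.startswith("PathName_") and value and value.startswith(local_root):
                row[column] = cloud_root + value[len(local_root):]
        writer.writerow(row)
    return out.getvalue()


# Illustrative one-row CSV using the example roots from the text above.
example = "FileName_DNA,PathName_DNA\nimg1.tiff,/Users/eweisbar/Desktop/project/images\n"
remapped = remap_path_columns(example, "/Users/eweisbar/Desktop", "/home/ubuntu/bucket")
```

Restricting the replacement to the start of `PathName_` values avoids accidentally rewriting a matching string that appears in a metadata column.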
| 25 | + |
| 26 | +### Using LoadData.csv |
| 27 | + |
| 28 | +To use a LoadData.csv with submitJobs, put the path to the LoadData.csv in **data_file:**. |
| 29 | + |
| 30 | +To use a LoadData.csv with run_batch_general.py, enter the name of the LoadData.csv under **#project specific stuff** in `{STEP}name`. |
| 31 | +At the bottom of the file, make sure the command for the step you are running either has no arguments or has `batch=False`. |
| 32 | +(e.g. `MakeAnalysisJobs()` or `MakeAnalysisJobs(batch=False)`) |
| 33 | +Note that if you do not follow our standard file organization, under **#not project specific, unless you deviate from the structure** you will also need to edit `datafilepath`. |
| 34 | + |
| 35 | +## Batch Files |
| 36 | + |
| 37 | +Batch files are an easy way to transition from running locally to distributed. |
| 38 | +A batch file is an `.h5` file created by CellProfiler which captures all the data needed to run your workflow - pipeline and file information are packaged together. |
| 39 | +To use a batch file, your data needs to have the same structure in the cloud as on your local machine. |
| 40 | + |
| 41 | +### Creating batch files |
| 42 | + |
| 43 | +To create a batch file, load all your images into a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups. |
| 44 | +More written and video information about using the input modules can be found [here](https://broad.io/CellProfilerInput). |
| 45 | +Put the `CreateBatchFiles` module at the end of your pipeline and ensure that it is selected. |
| 46 | +Add a path mapping and edit the `Local root path` and `Cluster root path`. |
| 47 | +Run the CellProfiler pipeline by pressing the `Analyze Images` button; note that it won't actually run your pipeline but will instead create a batch file. |
| 48 | +More information on the `CreateBatchFiles` module can be found [here](https://cellprofiler-manual.s3.amazonaws.com/CellProfiler-4.2.4/modules/fileprocessing.html). |
| 49 | + |
| 50 | +### Using batch files |
| 51 | + |
| 52 | +To use a batch file with submitJobs, put the path to the `.h5` file in **data_file:** and **pipeline:**. |
| 53 | + |
| 54 | +To use a batch file with run_batch_general.py, enter the name of the batch file under **#project specific stuff** in `batchpipename{STEP}`. |
| 55 | +At the bottom of the file, set `batch=True` in the command for the step you are running. |
| 56 | +(e.g. `MakeAnalysisJobs(batch=True)`) |
| 57 | +Note that if you do not follow our standard file organization, under **#not project specific, unless you deviate from the structure** you will also need to edit `batchpath`. |
| 58 | + |
| 59 | +## File lists |
| 60 | + |
| 61 | +You can also simply pass a list of absolute file paths (not relative paths) with one file per row in `.txt` format. |
| 62 | +Note that file lists themselves do not associate metadata with file paths (in contrast to LoadData.csv files, where you can enter any metadata columns you desire). |
| 63 | +Therefore, any metadata that Distributed-CellProfiler needs for grouping must be extracted from file and folder names in the Metadata module of your CellProfiler pipeline. |
| 64 | +You can pass additional metadata to CellProfiler by clicking `Add another extraction method`, setting the method to `Import from file`, and setting Metadata file location to `Default Input Folder`. |
| 65 | + |
| 66 | +### Creating File Lists |
| 67 | + |
| 68 | +Use any text editing software to create a `.txt` file where each line of the file is a path to a single image that you want to process. |
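For more than a handful of images, the list can be generated instead of typed. This is a minimal sketch, assuming your images are reachable under `/home/ubuntu/bucket` and that you already know their paths relative to that root; `make_file_list` and the example file names are hypothetical.

```python
from pathlib import PurePosixPath


def make_file_list(relative_paths, bucket_root="/home/ubuntu/bucket"):
    """Return file-list text: one absolute path per line, ending with a newline."""
    # Relative paths must not start with "/", or the join would discard bucket_root.
    lines = [str(PurePosixPath(bucket_root) / rel) for rel in relative_paths]
    return "\n".join(lines) + "\n"


# Hypothetical image locations relative to the bucket root.
file_list_text = make_file_list(
    [
        "projects/example/images/A01_s1_DNA.tiff",
        "projects/example/images/A01_s1_Actin.tiff",
    ]
)
```

Saving `file_list_text` to a `.txt` file gives the one-path-per-row layout described above.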
| 69 | + |
| 70 | +### Using File Lists |
| 71 | + |
| 72 | +To use a file list with submitJobs, put the path to the `.txt` file in **data_file:**. |