Merge branch 'master' into pycytominer/issues/160

shntnu · web-flow · commit 877914181377 · 2024-03-21T06:16:18.000-04:00
diff --git a/05-create-profiles.md b/05-create-profiles.md
@@ -73,6 +73,8 @@ cd pycytominer
 python3 -m pip install -e .
 ```
 
+Note that if your system does not already have `cytominer-database` installed, you can install it at the same time as pycytominer by changing the final command above to `python3 -m pip install -e .[collate]`.
+
 The command below first calls `cytominer-database ingest` to create the SQLite backend, and then pycytominer's `aggregate_profiles` to create per-well profiles. Once complete, all files are uploaded to S3 and the local cache are deleted. This step takes several hours, but metadata creation and GitHub setup can be done in this time.
 
 [collate.py](https://github.com/cytomining/pycytominer/blob/master/pycytominer/cyto_utils/collate.py) ingests and indexes the database. [collate_command.py](https://github.com/cytomining/pycytominer/blob/master/pycytominer/cyto_utils/collate_cmd.py) exposes this functionality to the command line. 
@@ -89,17 +91,22 @@ parallel \
 --results ../../log/${BATCH_ID}/collate \
 --files \
 --keep-order \
-python3 pycytominer/cyto_utils/collate_cmd.py ${BATCH_ID}  pycytominer/cyto_utils/ingest_config.ini {1} \
---temp ~/ebs_tmp \
---remote=s3://${BUCKET}/projects/${PROJECT_NAME}/workspace :::: ${PLATES}
+python3 pycytominer/cyto_utils/collate_cmd.py ${BATCH_ID}  pycytominer/cyto_utils/database_config/ingest_config.ini {1} \
+--tmp-dir ~/ebs_tmp \
+--aws-remote=s3://${BUCKET}/projects/${PROJECT_NAME}/workspace :::: ${PLATES}
+```
+
+```{note}
+`collate_cmd.py` does not recreate the SQLite backend if it already exists in the local cache. Add the `--overwrite` flag to recreate it.
+If your SQLite creation succeeded but you ran into issues during aggregation, rerunning with `--aggregate-only` will allow you to rerun just that sub-step.
 ```
 
 ```{note}
-`collate.py` does not recreate the SQLite backend if it already exists in the local cache. Add `--overwrite` flag to recreate.
+`collate_cmd` assumes that you will have image measurements in the following categories: Count and Threshold (both generated by any Identify modules present in the pipeline); Granularity and Texture (both generated if, in their respective modules, you use the "both" setting when asked for measurements of images, objects, or both); and ImageQuality (generated by the MeasureImageQuality module). If any or all of these are missing in your data, or you wish to add other image measurements, you may pass the `--image-feature-categories` flag to `collate_cmd`, e.g. `--image-feature-categories="Granularity,Texture,Count,Threshold"`. We currently believe these features provide value, but you can also skip adding them to profiles by passing the `--dont-add-image-features` flag to `collate_cmd`.
 ```
 
 ```{note}
-or pipelines that use FlagImage to skip the measurements modules if the image failed QC, the failed images will have Image.csv files with fewer columns that the rest (because columns corresponding to aggregated measurements will be absent). The ingest command will show a warning related to sqlite: `expected X columns but found Y - filling the rest with NULL`. This is expected behavior.
+In pipelines that use FlagImage to skip the measurement modules if the image failed QC, the failed images will have Image.csv files with fewer columns than the rest (because columns corresponding to aggregated measurements will be absent). The ingest command will show a warning related to SQLite: `expected X columns but found Y - filling the rest with NULL`. This is expected behavior.
 ```
 
 ```{note}
@@ -180,8 +187,6 @@ Once and only once - fork the [profiling recipe](https://github.com/cytomining/p
 
 Once per new PROJECT, not new batch - make a copy of the [template repository](https://github.com/cytomining/profiling-template) into your preferred organization with a project name that is similar OR identical to its project tag on S3 and elsewhere.
 
-Once per new PROJECT, not new batch - make a copy of the [template repository](https://github.com/cytomining/profiling-template) into your preferred organization with a project name that is similar OR identical to its project tag on S3 and elsewhere.
-
 ## Make Profiles
 
 ### Optional - set up compute environment
@@ -285,7 +290,7 @@ This needs to happen once per project, not per batch.
 Skip this step if not using DVC.
 ```
 # Navigate
-cd ~/work/projects/${PROJECT_NAME}/workspace/software/${DATA}/profiling-recipe
+cd ~/work/projects/${PROJECT_NAME}/workspace/software/${DATA}
 # Initialize DVC
 dvc init
 # Set up remote storage
@@ -295,8 +300,8 @@ git add .dvc/.gitignore .dvc/config
 git commit -m "Setup DVC"
 ```
 
-
-### If a first batch in this project, create the necessary directories
+```{note}
+If you have multiple AWS profiles on your machine and do not want to use the default one for DVC, you can specify which profile to use by running `dvc remote modify S3storage profile PROFILE_NAME` at any point between adding the remote and performing the final DVC push. 
 ```
 
 ### If a first batch in this project, create the necessary directories
@@ -346,21 +351,24 @@ rsync -arzv --include="*/" --include="*.gz" --exclude "*" ../../backend/${BATCH_
 Especially for large number of plates, this will take some time.  Output will be logged to the console as different steps proceed.
 
 ```
-python profiling-recipe/profiles/profiling_pipeline.py --config config_files/{$CONFIG_FILE}.yml
+python profiling-recipe/profiles/profiling_pipeline.py --config config_files/${CONFIG_FILE}.yml
 ```
 
 ### Push resulting files back up to GitHub
 If using a data repository, push the newly created profiles to DVC and the .dvc files and other files to GitHub as follows
 ```
 dvc add profiles/${BATCH} --recursive
 dvc push
-git add profiles/${BATCH}/*.dvc profiles/*.gitignore
+git add profiles/${BATCH}/*/*.dvc profiles/${BATCH}/*/*.gitignore
 git commit -m 'add profiles'
 git add *
 git commit -m 'add files made in profiling'
 git push
 ```
-If not using DVC but using a data repository, push all new files to GitHub as follows
+
+
+```{note}
+If you have multiple AWS profiles on your machine and do not want to use the default one for DVC, you can specify which profile to use by running `dvc remote modify S3storage profile PROFILE_NAME` at any point between adding the remote and performing the final DVC push. 
 ```
 
 If not using DVC but using a data repository, push all new files to GitHub as follows
diff --git a/06-appendix.md b/06-appendix.md
@@ -17,7 +17,7 @@ this handbook
         image data
     -   The `illum` folder is identical to the `images` folder in terms
         of structure
-        -   `illum` is an output of the first stage of cell profiler
+        -   `illum` is an output of the first stage of CellProfiler
             pipeline that stores a function to adjust the plates in
             `images`
 -   `workspace` also has subdirectories
@@ -87,12 +87,11 @@ this handbook
     ├── 2016_04_01_a549_48hr_batch1
     │   ├── illum
     │   │   └── SQ00015167
-    │   │       ├── SQ00015167_IllumAGP.mat
-    │   │       ├── SQ00015167_IllumDNA.mat
-    │   │       ├── SQ00015167_IllumER.mat
-    │   │       ├── SQ00015167_IllumMito.mat
-    │   │       ├── SQ00015167_IllumRNA.mat
-    │   │       └── SQ00015167.stderr
+    │   │       ├── SQ00015167_IllumAGP.npy
+    │   │       ├── SQ00015167_IllumDNA.npy
+    │   │       ├── SQ00015167_IllumER.npy
+    │   │       ├── SQ00015167_IllumMito.npy
+    │   │       └── SQ00015167_IllumRNA.npy
     │   └── images
     │       └── SQ00015167__2016-04-21T03_34_00-Measurement1
     │         ├── Assaylayout