Commit e5e41aa: first draft
docs/pages/contribution.md

@@ -7,7 +7,7 @@ title: CoderData
## Contribute to CoderData

CoderData is a data assembly pipeline that pulls from original data
sources of drug sensitivity and omics datasets and assembles them so
they can be integrated into a Python package for AI/ML applications.
@@ -41,23 +41,31 @@ process is depicted below.
![Coderdata Build](coderDataBuild.jpg?raw=true "Modular build
process")

The build process is slow, partly due to our querying of PubChem and
partly because of our extensive curve fitting. However, it can be run
locally so that you can still leverage the Python package
functionality with your own datasets.

If you want to add a new dataset, you must create a Docker image that
contains all the scripts to pull the data and reformat it into our
[LinkML Schema](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml). Once complete, you can modify `build_dataset.py` to
call your Docker image and associated scripts. More details are below.

## Adding your own dataset

To add your own data, you must add a Docker image that satisfies the
following constraints:

1. It must be named `Dockerfile.[dataset_name]` and reside in the
   `/build/docker` directory.
2. It must contain scripts called `build_omics.sh`, `build_samples.sh`,
   `build_drugs.sh`, `build_exp.sh`, and, if needed, a
   `build_misc.sh`. These will all be called directly by
   `build_dataset.py`.
3. It must create tables that mirror the schema described by the
   [LinkML YAML file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

Files are generated in the order described above.
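For orientation, here is a minimal sketch of what such a Dockerfile could look like. The `mydataset` name, base image, file layout, and `requirements.txt` are illustrative assumptions, not project conventions:

```dockerfile
# Dockerfile.mydataset -- hypothetical sketch; only the build_*.sh
# entry points called by build_dataset.py are actually required.
FROM python:3.10

WORKDIR /usr/src/app

# Copy this dataset's build scripts (and any helpers they call) into the image.
COPY build/mydataset/ ./
RUN pip install --no-cache-dir -r requirements.txt

# build_dataset.py calls these scripts directly, so they must be
# present and executable inside the image.
RUN chmod +x build_samples.sh build_omics.sh build_drugs.sh build_exp.sh
```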
### Sample generation
@@ -69,9 +77,9 @@ file. We recommend following these steps:
1. Build a python script that pulls the sample identifier information
   from a stable repository and generates Improve identifiers for each
   sample, while also ensuring that no sample identifiers clash with
   prior samples (a sketch follows this list). Examples can be found
   [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/mpnst/00_sample_gen.R)
   and [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/broad_sanger/01-broadSangerSamples.R).
   If you are using the Genomic Data Commons, you can leverage our
   existing scripts [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/hcmi/01-createHCMISamplesFile.py).
2. Create a `build_samples.sh` script that calls your script with an
   existing sample file as the first argument.
3. Test the `build_samples.sh` script with a [test sample
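A minimal sketch of step 1, assuming the prior samples file has an `improve_sample_id` column; the other column names here are placeholders, so consult the LinkML schema for the real fields:

```python
# Hypothetical sample-generation script: continue Improve IDs past the
# maximum found in the prior samples file so no identifiers clash.
import sys
import pandas as pd

def build_samples(prev_samples_path: str, out_path: str) -> None:
    prev = pd.read_csv(prev_samples_path)
    next_id = int(prev["improve_sample_id"].max()) + 1  # avoid clashes with prior samples

    # In a real script, pull these rows from your dataset's stable repository.
    new = pd.DataFrame({"other_id": ["S1", "S2"],
                        "common_name": ["sample 1", "sample 2"]})
    new["improve_sample_id"] = range(next_id, next_id + len(new))
    new.to_csv(out_path, index=False)

if __name__ == "__main__":
    build_samples(sys.argv[1], "mydataset_samples.csv")
```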
@@ -92,7 +100,8 @@ a few caveats.
   clashing with prior samples. Examples can be found here and here.
   If you are using the Genomic Data Commons, you can leverage our
   existing scripts here. For each type of omics data (see below), a
   single file is created. It might take more than one script, but you
   can combine those in step 2 (see the sketch after this list).
2. Create a `build_omics.sh` script that calls your script with the
   `genes.csv` file as the first argument and the
   `[dataset_name]_samples.csv` file as the second argument.
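A sketch of what a `build_omics.sh` combining several scripts might look like; the python script names and flags below are invented for illustration:

```bash
#!/bin/bash
# build_omics.sh -- $1 = genes.csv, $2 = [dataset_name]_samples.csv
set -euo pipefail

# One output file per omics type; multiple scripts can be chained here.
python 02a-getTranscriptomics.py --genes "$1" --samples "$2"
python 02b-getMutations.py --genes "$1" --samples "$2"
```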
@@ -132,11 +141,14 @@ drug data file, it's possible to shorten this process.
   process. To standardize this we encourage using our [standard drug
   lookup script](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/pubchem_retrieval.py)
   that retrieves drug structure and information by name or
   identifier. [This file of NCI60
   drugs](https://github.com/PNNL-CompBio/coderdata/blob/main/build/broad_sanger/03a-nci60Drugs.py)
   is our most comprehensive script, as it pulls over 50k drugs.
2. Create a `build_drugs.sh` script that takes as its first argument
   an existing drug file and calls the script created in step 1 above.
   Once the drugs for a dataset are retrieved, we have a second
   utility script that [builds the drug descriptor
   table](https://github.com/PNNL-CompBio/coderdata/blob/cbf017326b83771c55f12317189f4b2dbd9d900a/schema/coderdata.yaml#L94).
   Add this to the shell script to generate the drug descriptor file
   (see the sketch after this list).
3. Test the `build_drugs.sh` script with the [test drugs
   file] (TBD).
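A hypothetical `build_drugs.sh` following these steps; the wrapper script name, the descriptor script name, and all flags are assumptions standing in for the utilities linked above:

```bash
#!/bin/bash
# build_drugs.sh -- $1 = existing drug file, extended without identifier clashes
set -euo pipefail

# Step 1 script, built on the standard PubChem lookup utility.
python 03-createDrugFile.py --prevDrugFile "$1" --output mydataset_drugs.tsv

# Second utility: build the drug descriptor table from the drug file.
python build_drug_desc.py --drugtable mydataset_drugs.tsv \
    --desctable mydataset_drug_descriptors.tsv
```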
@@ -155,17 +167,32 @@ varies based on the type of system:
  tool](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/fit_curve.py)
  that maps doses of drugs (in moles) to drug response measurements
  (in percent) to a variety of curve fitting metrics described in our
  [schema file](https://github.com/PNNL-CompBio/coderdata/blob/8000968dc5f19fbb986a700862c5035a0230b656/schema/coderdata.yaml#L200).
- Patient-derived xenografts require an alternate script that [creates
  PDX-specific metrics](https://github.com/PNNL-CompBio/coderdata/blob/main/build/utils/calc_pdx_metrics.py).

Otherwise the steps for building an experiment file are similar (a
sketch of `build_exp.sh` follows this list):

1. Build a python script that maps the drug information and sample
   information to the DOSE and GROWTH values, then calls the
   appropriate curve fitting tool described above.
2. Create a `build_exp.sh` script that takes the samples file as its
   first argument and the drug file as its second.
3. Test the `build_exp.sh` script with the drug and samples files.
4. Validate the files generated with the [linkML validation
   tool](https://linkml.io/linkml/cli/validate.html) and our [schema
   file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).
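A sketch of a matching `build_exp.sh`; the mapping script's name, all flags, and the validation target class are assumptions (check `fit_curve.py` and the linkML docs for the real interfaces):

```bash
#!/bin/bash
# build_exp.sh -- $1 = samples file, $2 = drugs file
set -euo pipefail

# Map drug and sample identifiers onto DOSE/GROWTH measurements.
python 04-createDoseResponse.py --samples "$1" --drugs "$2" \
    --output mydataset_doserep.tsv

# Fit curves with the shared utility to produce the experiment metrics.
python fit_curve.py --input mydataset_doserep.tsv --output mydataset_experiments.tsv

# Validate against the schema (the "experiment" class name is an assumption).
linkml-validate -s coderdata.yaml -C experiment mydataset_experiments.tsv
```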
### Dockerize and test
All scripts described above go into a single directory named after
the dataset under the [build](http://github.com/pnnl-compbio/coderdata/tree/main/build)
directory, with the Docker build instructions added to the
[docker](http://github.com/pnnl-compbio/coderdata/tree/main/build/docker)
directory. Make sure to include any build requirements in that
folder and in the Docker image as well.
Once the Dockerfile builds and runs, you can modify the
`build_dataset.py` script so that it runs and validates your dataset.
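To test locally, something along these lines should work (the image name and mounted paths are illustrative):

```bash
# Build the image from the repository root, then smoke-test one entry point.
docker build -f build/docker/Dockerfile.mydataset -t mydataset .
docker run -v "$PWD/local:/tmp" mydataset bash build_samples.sh /tmp/prior_samples.csv
```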
Check out examples! We have numerous Docker files in our [Dockerfile
directory](http://github.com/pnnl-compbio/coderdata/tree/main/build/docker),
and multiple datasets in our [build
