Commit e5e41aa: first draft
docs/pages/contribution.md

@@ -7,7 +7,7 @@ title: CoderData
## Contribute to CoderData

CoderData is a data assembly pipeline that pulls from original data
sources of drug sensitivity and omics datasets and assembles them so
they can be integrated into a Python package for AI/ML applications.
@@ -41,23 +41,31 @@ process is depicted below.
![Coderdata Build](coderDataBuild.jpg?raw=true "Modular build
process")

The build process is slow, partly due to our querying of PubChem and
partly because of our extensive curve fitting. However, it can be run
locally so that you can still leverage the Python package
functionality with your own datasets.

If you want to add a new dataset, you must create a Docker image that
contains all the scripts to pull the data and reformat it into our
[LinkML Schema](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml). Once complete, you can modify `build_dataset.py` to
call your Docker image and associated scripts. More details are below.

## Adding your own dataset

To add your own data, you must add a Docker image that satisfies the
following constraints:

1. It must be named `Dockerfile.[dataset_name]` and reside in the
   `/build/docker` directory.
2. It must contain scripts called `build_omics.sh`, `build_samples.sh`,
   `build_drugs.sh`, `build_exp.sh`, and, if needed, a
   `build_misc.sh`. These will all be called directly by
   `build_dataset.py`.
3. It must create tables that mirror the schema described by the
   [LinkML YAML file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

Files are generated in the order described above.
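For orientation, here is a minimal sketch of what such a Dockerfile could look like. The `mydataset` name, base image, file layout, and `requirements.txt` are illustrative assumptions, not project conventions:

```dockerfile
# Dockerfile.mydataset -- hypothetical sketch; only the build_*.sh
# entry points called by build_dataset.py are actually required.
FROM python:3.10

WORKDIR /usr/src/app

# Copy this dataset's build scripts (and any helpers they call) into the image.
COPY build/mydataset/ ./
RUN pip install --no-cache-dir -r requirements.txt

# build_dataset.py calls these scripts directly, so they must be
# present and executable inside the image.
RUN chmod +x build_samples.sh build_omics.sh build_drugs.sh build_exp.sh
```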
### Sample generation
@@ -69,9 +77,9 @@ file. We recommend following these steps:
1. Build a python script that pulls the sample identifier information
   from a stable repository and generates Improve identifiers for each
   sample, while also ensuring that no sample identifiers clash with
   prior samples (a sketch follows this list). Examples can be found
   [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/mpnst/00_sample_gen.R)
   and [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/broad_sanger/01-broadSangerSamples.R).
   If you are using the Genomic Data Commons, you can leverage our
   existing scripts [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/hcmi/01-createHCMISamplesFile.py).
2. Create a `build_samples.sh` script that calls your script with an
   existing sample file as the first argument.
3. Test the `build_samples.sh` script with a [test sample
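A minimal sketch of step 1, assuming the prior samples file has an `improve_sample_id` column; the other column names here are placeholders, so consult the LinkML schema for the real fields:

```python
# Hypothetical sample-generation script: continue Improve IDs past the
# maximum found in the prior samples file so no identifiers clash.
import sys
import pandas as pd

def build_samples(prev_samples_path: str, out_path: str) -> None:
    prev = pd.read_csv(prev_samples_path)
    next_id = int(prev["improve_sample_id"].max()) + 1  # avoid clashes with prior samples

    # In a real script, pull these rows from your dataset's stable repository.
    new = pd.DataFrame({"other_id": ["S1", "S2"],
                        "common_name": ["sample 1", "sample 2"]})
    new["improve_sample_id"] = range(next_id, next_id + len(new))
    new.to_csv(out_path, index=False)

if __name__ == "__main__":
    build_samples(sys.argv[1], "mydataset_samples.csv")
```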
@@ -92,7 +100,8 @@ a few caveats.
   clashing with prior samples. Examples can be found here and here.
   If you are using the Genomic Data Commons, you can leverage our
   existing scripts here. For each type of omics data (see below), a
   single file is created. It might take more than one script, but you
   can combine those in step 2 (see the sketch after this list).
2. Create a `build_omics.sh` script that calls your script with the
   `genes.csv` file as the first argument and the
   `[dataset_name]_samples.csv` file as the second argument.
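A sketch of what a `build_omics.sh` combining several scripts might look like; the python script names and flags below are invented for illustration:

```bash
#!/bin/bash
# build_omics.sh -- $1 = genes.csv, $2 = [dataset_name]_samples.csv
set -euo pipefail

# One output file per omics type; multiple scripts can be chained here.
python 02a-getTranscriptomics.py --genes "$1" --samples "$2"
python 02b-getMutations.py --genes "$1" --samples "$2"
```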
@@ -132,11 +141,14 @@ drug data file, it's possible to shorten this process.
   process. To standardize this we encourage using our [standard drug
   lookup script](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/pubchem_retrieval.py)
   that retrieves drug structure and information by name or
   identifier. [This file of NCI60
   drugs](https://github.com/PNNL-CompBio/coderdata/blob/main/build/broad_sanger/03a-nci60Drugs.py)
   is our most comprehensive script, as it pulls over 50k drugs.
2. Create a `build_drugs.sh` script that takes as its first argument
   an existing drug file and calls the script created in step 1 above.
   Once the drugs for a dataset are retrieved, we have a second
   utility script that [builds the drug descriptor
   table](https://github.com/PNNL-CompBio/coderdata/blob/cbf017326b83771c55f12317189f4b2dbd9d900a/schema/coderdata.yaml#L94).
   Add this to the shell script to generate the drug descriptor file
   (see the sketch after this list).
3. Test the `build_drugs.sh` script with the [test drugs
   file] (TBD).
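A hypothetical `build_drugs.sh` following these steps; the wrapper script name, the descriptor script name, and all flags are assumptions standing in for the utilities linked above:

```bash
#!/bin/bash
# build_drugs.sh -- $1 = existing drug file, extended without identifier clashes
set -euo pipefail

# Step 1 script, built on the standard PubChem lookup utility.
python 03-createDrugFile.py --prevDrugFile "$1" --output mydataset_drugs.tsv

# Second utility: build the drug descriptor table from the drug file.
python build_drug_desc.py --drugtable mydataset_drugs.tsv \
    --desctable mydataset_drug_descriptors.tsv
```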
@@ -155,17 +167,32 @@ varies based on the type of system:
  tool](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/fit_curve.py)
  that maps doses of drugs (in moles) to drug response measurements
  (in percent) to a variety of curve fitting metrics described in our
  [schema file](https://github.com/PNNL-CompBio/coderdata/blob/8000968dc5f19fbb986a700862c5035a0230b656/schema/coderdata.yaml#L200).
- Patient-derived xenografts require an alternate script that [creates
  PDX-specific metrics](https://github.com/PNNL-CompBio/coderdata/blob/main/build/utils/calc_pdx_metrics.py).

Otherwise the steps for building an experiment file are similar (a
sketch of `build_exp.sh` follows this list):

1. Build a python script that maps the drug information and sample
   information to the DOSE and GROWTH values, then calls the
   appropriate curve fitting tool described above.
2. Create a `build_exp.sh` script that takes the samples file as its
   first argument and the drug file as its second.
3. Test the `build_exp.sh` script with the drug and samples files.
4. Validate the files generated with the [linkML validation
   tool](https://linkml.io/linkml/cli/validate.html) and our [schema
   file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).
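A sketch of a matching `build_exp.sh`; the mapping script's name, all flags, and the validation target class are assumptions (check `fit_curve.py` and the linkML docs for the real interfaces):

```bash
#!/bin/bash
# build_exp.sh -- $1 = samples file, $2 = drugs file
set -euo pipefail

# Map drug and sample identifiers onto DOSE/GROWTH measurements.
python 04-createDoseResponse.py --samples "$1" --drugs "$2" \
    --output mydataset_doserep.tsv

# Fit curves with the shared utility to produce the experiment metrics.
python fit_curve.py --input mydataset_doserep.tsv --output mydataset_experiments.tsv

# Validate against the schema (the "experiment" class name is an assumption).
linkml-validate -s coderdata.yaml -C experiment mydataset_experiments.tsv
```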
### Dockerize and test
All scripts described above go into a single directory named after
the dataset under the [build](http://github.com/pnnl-compbio/coderdata/tree/main/build)
directory, with the Docker build instructions added to the
[docker](http://github.com/pnnl-compbio/coderdata/tree/main/build/docker)
directory. Make sure to include any build requirements in that
folder and in the Docker image as well.
Once the Dockerfile builds and runs, you can modify the
`build_dataset.py` script so that it runs and validates your dataset.
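To test locally, something along these lines should work (the image name and mounted paths are illustrative):

```bash
# Build the image from the repository root, then smoke-test one entry point.
docker build -f build/docker/Dockerfile.mydataset -t mydataset .
docker run -v "$PWD/local:/tmp" mydataset bash build_samples.sh /tmp/prior_samples.csv
```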
Check out examples! We have numerous Docker files in our [Dockerfile
directory](http://github.com/pnnl-compbio/coderdata/tree/main/build/docker),
and multiple datasets in our [build
