
Commit 886690c

created detailed contribution instructions.
1 parent 8000968

File tree

1 file changed: 137 additions, 32 deletions

docs/pages/contribution.md

## Contribute to CoderData

CoderData is, at its core, a data assembly pipeline: it pulls drug
sensitivity and omics datasets from their original sources and
assembles them so they can be integrated into a Python package for
AI/ML applications.

CoderData is indeed a work in progress. If you have specific requests
or bugs, please file an issue on our [GitHub
repository](https://github.com/PNNL-CompBio/coderdata). If you would
like to create a new feature to address the issue, you are welcome to
fork the repository and create a pull request to discuss it in more
detail. These will be triaged by the CoderData team as they are received.

The rest of this document is focused on how to contribute to and
augment CoderData, either for use by the community or for your own
purposes.

### CoderData build process

To build your own internal CoderData dataset, or to augment it, it is
important to understand how the package is built.

The build process is managed in the [build
directory](https://github.com/PNNL-CompBio/coderdata/tree/main/build),
primarily by the [`build_all.py` script](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_all.py). This script calls the
[`build_dataset.py` script](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_dataset.py) for each dataset in CoderData in
order. Because our sample and drug identifiers must be unique across
datasets, we must finish generating one dataset before we move to the
next. This process is depicted below.

![Coderdata Build](coderDataBuild.jpg?raw=true "Modular build process")

Therefore, to add a new dataset, you must create a Docker image that
contains all the scripts to pull the data and reformat it into our
[LinkML Schema](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml). Once complete, you can modify `build_dataset.py` to
call your Docker image and associated scripts.
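
To make the flow concrete, here is a minimal sketch of that
sequential pattern. The dataset names, image tag convention, and
script list are illustrative assumptions, not the actual contents of
`build_all.py` or `build_dataset.py`.

```python
# Hypothetical sketch of the sequential build loop described above;
# the real logic lives in build/build_all.py and build/build_dataset.py.
import subprocess

DATASETS = ["dataset_one", "dataset_two"]  # illustrative names
STEPS = ["build_samples.sh", "build_omics.sh", "build_drugs.sh", "build_exp.sh"]

def build_dataset(name: str) -> None:
    """Run each build script inside the dataset's Docker image."""
    image = f"coderdata-{name}"  # assumed image naming convention
    for script in STEPS:
        # Each script sees the outputs of previously built datasets,
        # which is why datasets are built one at a time.
        subprocess.run(["docker", "run", "--rm", image, "bash", script],
                       check=True)

for name in DATASETS:
    build_dataset(name)  # sequential: identifiers must stay unique
```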

## Adding your own dataset

To add your own data, you must add a Docker image with the following
constraints:

1. Be named `Dockerfile.[yourdataset]` and reside in the
   `/build/docker` directory.
2. Possess scripts called `build_omics.sh`, `build_samples.sh`,
   `build_drugs.sh`, `build_exp.sh`, and, if needed, a
   `build_misc.sh`. These will all be called directly by
   `build_dataset.py`.
3. Create tables that mirror the schema described by the [LinkML YAML file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

### Sample generation

The first step of any dataset build is to create a unique set of
sample identifiers and store them in a `[dataset_name]_samples.csv`
file. We recommend following these steps:

1. Build a Python script that pulls the sample identifier information
   from a stable repository and generates Improve identifiers for
   each sample while also ensuring that no sample identifiers clash
   with prior samples (a sketch follows this list). Examples can be
   found here and here. If you are using the Genomic Data Commons,
   you can leverage our existing scripts here.
2. Create a `build_samples.sh` script that calls your script with an
   existing sample file as the first argument.
3. Test the `build_samples.sh` script with a [test sample
   file](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_test/test_samples.csv).
4. Validate the file with the [linkML validation tool](https://linkml.io/linkml/cli/validate.html) and our
   [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).
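
As a rough illustration of step 1, the sketch below continues the
identifier sequence from an existing sample file so that new
identifiers never clash. The column names and ID scheme are
illustrative assumptions; consult the schema file for the
authoritative `Sample` table definition.

```python
# Hypothetical sketch: assign non-clashing Improve sample identifiers.
# Column names (improve_sample_id, etc.) are illustrative; see the
# LinkML schema for the authoritative Sample table.
import sys
import pandas as pd

def build_samples(prev_samples_path: str, out_path: str) -> None:
    prev = pd.read_csv(prev_samples_path)
    next_id = int(prev["improve_sample_id"].max()) + 1  # avoid clashes

    # In a real script these rows would be pulled from a stable
    # repository (e.g. a portal API or a versioned flat file).
    new = pd.DataFrame({
        "other_id": ["SAMPLE-001", "SAMPLE-002"],
        "common_name": ["sample one", "sample two"],
    })
    new["improve_sample_id"] = range(next_id, next_id + len(new))
    new.to_csv(out_path, index=False)

if __name__ == "__main__":
    # build_samples.sh passes the existing sample file as argument 1
    build_samples(sys.argv[1], "mydataset_samples.csv")
```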

### Omics data generation

The overall omics generation process is the same as for samples, with
a few caveats:

1. Build a Python script that maps the omics data and gene data to
   the standardized identifiers and aligns them to the schema (a
   sketch follows this list). For each type of omics data (see
   below), a single file is created.
2. Create a `build_omics.sh` script that calls your script with the
   `genes.csv` file as the first argument and the
   `[dataset_name]_samples.csv` file as the second argument.
3. Test the `build_omics.sh` script with your sample file and the [test genes
   file](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_test/test_genes.csv).
4. Validate the files generated with the [linkML validation tool](https://linkml.io/linkml/cli/validate.html) and our
   [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).
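
Here is a rough sketch of step 1, assuming long-format expression
data as input. The file layouts and column names are illustrative
assumptions rather than the package's actual formats.

```python
# Hypothetical sketch: align raw expression values to Improve sample
# identifiers and standardized gene identifiers. Column names are
# illustrative; see the LinkML schema for the real table definitions.
import sys
import pandas as pd

def build_transcriptomics(genes_path, samples_path, raw_path, out_path):
    genes = pd.read_csv(genes_path)      # e.g. gene_symbol -> entrez_id
    samples = pd.read_csv(samples_path)  # other_id -> improve_sample_id
    raw = pd.read_csv(raw_path)          # columns: sample, gene, value

    df = (raw
          .merge(samples, left_on="sample", right_on="other_id")
          .merge(genes, left_on="gene", right_on="gene_symbol"))
    df = df[["improve_sample_id", "entrez_id", "value"]]
    df["source"] = "mydataset"  # provenance column, assumed per schema
    df.to_csv(out_path, index=False)

if __name__ == "__main__":
    # build_omics.sh passes genes.csv then [dataset_name]_samples.csv
    build_transcriptomics(sys.argv[1], sys.argv[2],
                          "raw_expression.csv",
                          "mydataset_transcriptomics.csv")
```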

The precise data files have varying standards, as described below:

- *Mutation data:* In addition to matching gene identifiers, each gene
  mutation should be mapped to a specific schema of variations. The
  list of allowed variations can be found [in our linkML
  file](https://github.com/PNNL-CompBio/coderdata/blob/8000968dc5f19fbb986a700862c5035a0230b656/schema/coderdata.yaml#L247).
- *Transcriptomic data:* Transcript data is mapped to the same gene
  identifiers and samples but is converted to transcripts per million,
  or TPM (a conversion sketch follows this list).
- *Copy number data:* Copy number is assumed to be a value
  representing the number of copies of that gene in a particular
  sample; 2 is assumed to be diploid.
- *Proteomic data:* Proteomic measurements are generally log ratio
  values of the abundance measurements normalized to an internal
  control.

The resulting files are then stored as `[dataset_name]_[datatype].csv`.
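
For reference, a minimal sketch of the standard TPM conversion:
counts are normalized by gene length in kilobases, then each sample
is scaled so its values sum to one million. The input values here are
illustrative.

```python
# Minimal TPM conversion: normalize counts by gene length (kb), then
# scale each sample so its values sum to one million.
import pandas as pd

def counts_to_tpm(counts: pd.DataFrame, gene_length_bp: pd.Series) -> pd.DataFrame:
    """counts: genes x samples matrix; gene_length_bp indexed by gene."""
    rpk = counts.div(gene_length_bp / 1_000, axis=0)  # reads per kilobase
    return rpk.div(rpk.sum(axis=0), axis=1) * 1_000_000

counts = pd.DataFrame({"s1": [10, 200], "s2": [5, 50]},
                      index=["geneA", "geneB"])
lengths = pd.Series({"geneA": 1_500, "geneB": 70_000})
print(counts_to_tpm(counts, lengths))
```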

### Drug data generation

The drug generation process can be slow depending on how many drugs
require querying from PubChem. However, with the use of an existing
drug data file, it's possible to shorten this process.

1. Build a Python script that maps the drug information to a SMILES
   string and an IMPROVE identifier (a sketch follows this list). All
   drugs are given an Improve identifier based on the canonical
   SMILES string to ensure that each drug has a unique structure to
   be used in the modeling process. To standardize this we encourage
   using our [standard drug lookup
   script](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/pubchem_retrieval.py)
   that retrieves drug structure and information by name or identifier.
2. Create a `build_drugs.sh` script that takes as its first argument
   an existing drug file and calls the script created in step 1
   above. Once the drugs for a dataset are retrieved, we have a
   second utility script that [builds the drug descriptor table]().
   Add this to the shell script to generate the drug descriptor file.
3. Test the `build_drugs.sh` script with the test drugs file (TBD).
4. Validate the files generated with the [linkML validation tool](https://linkml.io/linkml/cli/validate.html) and our
   [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

The resulting files should be `[dataset_name]_drugs.tsv` and
`[dataset_name]_drug_descriptors.tsv`.
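
The sketch below illustrates the ID-assignment idea from step 1:
identifiers are keyed on canonical SMILES so one structure never
receives two identifiers. The `SMI_` prefix and column names are
assumptions for illustration; real builds should rely on the
`pubchem_retrieval.py` utility linked above.

```python
# Hypothetical sketch: one Improve drug ID per canonical SMILES string.
# The SMI_ prefix and column names are illustrative assumptions; real
# builds should use build/utils/pubchem_retrieval.py.
import pandas as pd

def assign_drug_ids(existing_path: str, new_drugs: pd.DataFrame) -> pd.DataFrame:
    existing = pd.read_csv(existing_path, sep="\t")
    # Reuse an ID when the canonical SMILES has been seen before.
    smiles_to_id = dict(zip(existing["canSMILES"], existing["improve_drug_id"]))
    next_num = len(smiles_to_id) + 1

    ids = []
    for smiles in new_drugs["canSMILES"]:
        if smiles not in smiles_to_id:
            smiles_to_id[smiles] = f"SMI_{next_num}"  # assumed format
            next_num += 1
        ids.append(smiles_to_id[smiles])
    return new_drugs.assign(improve_drug_id=ids)
```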

### Experiment data generation

The experiment file maps the sample information to the drugs of
interest with various drug response metrics. The experiment data
varies based on the type of system:

- Cell line and organoid data use the [drug curve fitting
  tool](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/fit_curve.py)
  that maps doses of drugs (in moles) and drug response measurements
  (in percent) to a variety of curve fitting metrics described in our
  [schema file]().
- Patient derived xenografts require an alternate script that [creates
  PDX-specific metrics]().

Otherwise the steps for building an experiment file are similar to
the sections above: build a Python script that maps doses and
responses to the Improve sample and drug identifiers, call it from a
`build_exp.sh` script, test it against the test files, and validate
the output against the schema.
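
To make the dose-response fitting concrete, here is a generic
four-parameter Hill fit. It illustrates the idea only; it is not the
implementation or the metric set of `fit_curve.py`.

```python
# Generic dose-response illustration (not fit_curve.py itself): fit a
# four-parameter Hill curve to doses (molar) vs. response (percent).
import numpy as np
from scipy.optimize import curve_fit

def hill(dose, bottom, top, ic50, slope):
    # Response decreases from `top` toward `bottom` as dose rises.
    return bottom + (top - bottom) / (1 + (dose / ic50) ** slope)

doses = np.array([1e-9, 1e-8, 1e-7, 1e-6, 1e-5])    # moles
response = np.array([98.0, 90.0, 55.0, 20.0, 5.0])  # percent viability

params, _ = curve_fit(hill, doses, response,
                      p0=[0.0, 100.0, 1e-7, 1.0],
                      bounds=([0, 50, 1e-12, 0.1], [50, 120, 1e-3, 5]))
bottom, top, ic50, slope = params
print(f"IC50 = {ic50:.2e} M")
```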
### Dockerize and test

Lastly, check out examples! We have numerous Docker files in our
[Dockerfile
directory](http://github.com/pnnl-compbio/coderdata/tree/main/build/docker),
and multiple datasets in our [build
directory](http://github.com/pnnl-compbio/coderdata/tree/main/build).

---

Your contributions are essential to the growth and improvement of CoderData. We look forward to collaborating with you!
