## Contribute to CoderData

CoderData is, at its core, a data assembly pipeline that pulls from
the original sources of drug sensitivity and omics datasets and
assembles them so they can be integrated into a Python package for
AI/ML applications.

CoderData is indeed a work in progress. If you have specific requests
or bugs, please file an issue on our [GitHub
repository](https://github.com/PNNL-CompBio/coderdata). If you
would like to create a new feature to address the issue, you are
welcome to fork the repository and create a pull request to discuss
it in more detail. These will be triaged by the CoderData team as
they are received.

The rest of this document is focused on how to contribute to and
augment CoderData, either for use by the community or for your own
purposes.

### CoderData build process

To build your own internal CoderData dataset, or to augment it, it is
important to understand how the package is built.

The build process is managed in the [build
directory](https://github.com/PNNL-CompBio/coderdata/tree/main/build),
primarily by the [`build_all.py` script](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_all.py). This script calls the
[`build_dataset.py` script](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_dataset.py) for each dataset in CoderData in
order. Because sample and drug identifiers must be unique across
datasets, the generation of one dataset must finish before the next
begins. This process is depicted below.

![CoderData Build](coderDataBuild.jpg?raw=true "Modular build process")

Therefore, to add a new dataset, you must create a Docker image that
contains all the scripts to pull the data and reformat it into our
[LinkML schema](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml). Once complete, you can modify `build_dataset.py` to
call your Docker image and associated scripts.

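To make the ordering concrete, below is a simplified, hypothetical sketch of the per-dataset build loop. The real `build_dataset.py` does considerably more (Docker orchestration, validation, error handling), and the dataset and file names here are placeholders, not actual CoderData inputs.

```python
import subprocess

# Hypothetical sketch only: the real build_dataset.py runs these
# scripts inside each dataset's Docker image and handles many more
# details. Dataset and file names below are placeholders.
datasets = ["dataset_a", "dataset_b"]

for name in datasets:
    # Samples first: new IMPROVE sample ids must not clash with ids
    # already assigned to earlier datasets, hence the sequential build.
    subprocess.run(["bash", "build_samples.sh", "prior_samples.csv"], check=True)
    # Omics: map measurements to the gene and sample identifiers.
    subprocess.run(["bash", "build_omics.sh", "genes.csv", f"{name}_samples.csv"], check=True)
    # Drugs: look up structures and assign IMPROVE drug ids.
    subprocess.run(["bash", "build_drugs.sh", "prior_drugs.tsv"], check=True)
    # Experiments: join samples and drugs with dose-response metrics.
    subprocess.run(["bash", "build_exp.sh"], check=True)
```
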
## Adding your own dataset

To add your own data, you must add a Docker image that meets the
following constraints:

1. Be named `Dockerfile.[yourdataset]` and reside in the
   `/build/docker` directory.
2. Possess scripts called `build_omics.sh`, `build_samples.sh`,
   `build_drugs.sh`, `build_exp.sh`, and, if needed, a
   `build_misc.sh`. These will all be called directly by
   `build_dataset.py`.
3. Create tables that mirror the schema described by the [LinkML YAML
   file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

### Sample generation

The first step of any dataset build is to create a unique set of
sample identifiers and store them in a `[dataset_name]_samples.csv`
file. We recommend following these steps:

1. Build a Python script that pulls the sample identifier information
   from a stable repository and generates IMPROVE identifiers for
   each sample, while also ensuring that no sample identifiers clash
   with those of prior datasets (a minimal sketch follows this
   list). Examples can be found here and here. If you are using the
   Genomic Data Commons, you can leverage our existing scripts here.
2. Create a `build_samples.sh` script that calls your script with an
   existing sample file as the first argument.
3. Test the `build_samples.sh` script with a [test sample
   file](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_test/test_samples.csv).
4. Validate the file with the [LinkML validation tool](https://linkml.io/linkml/cli/validate.html) and our
   [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

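As a minimal sketch of step 1: assuming, as in the CoderData schema, an integer `improve_sample_id` column (the source column name and output file name below are placeholders), new identifiers simply continue after the largest one already assigned.

```python
import sys
import pandas as pd

def assign_improve_ids(existing_file: str, new_samples: pd.DataFrame) -> pd.DataFrame:
    """Continue IMPROVE sample ids after those used by prior datasets."""
    existing = pd.read_csv(existing_file)
    # Start numbering after the largest id already in use so the new
    # dataset cannot clash with samples from earlier builds.
    start = int(existing["improve_sample_id"].max()) + 1 if len(existing) else 1
    out = new_samples.copy()
    out["improve_sample_id"] = range(start, start + len(out))
    return out

if __name__ == "__main__":
    # build_samples.sh passes the existing sample file as the first argument.
    prior_file = sys.argv[1]
    # Placeholder rows: in practice these come from your stable repository.
    new = pd.DataFrame({"other_id": ["SRC-0001", "SRC-0002"]})
    assign_improve_ids(prior_file, new).to_csv("mydataset_samples.csv", index=False)
```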

### Omics data generation

The overall omics generation process is the same as for samples, with
a few caveats.

1. Build a Python script that maps the omics data and gene data to the
   standardized identifiers and aligns them to the schema. For each
   type of omics data (see below), a single file is created.
2. Create a `build_omics.sh` script that calls your script with the
   `genes.csv` file as the first argument and the
   `[dataset_name]_samples.csv` file as the second argument.
3. Test the `build_omics.sh` script with your sample file and the [test genes
   file](https://github.com/PNNL-CompBio/coderdata/blob/main/build/build_test/test_genes.csv).
4. Validate the files generated with the [LinkML validation tool](https://linkml.io/linkml/cli/validate.html) and our
   [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

The precise data files have varying standards, as described below:

- *Mutation data:* In addition to matching gene identifiers, each gene
  mutation should be mapped to a specific schema of variations. The
  list of allowed variations can be found [in our LinkML
  file](https://github.com/PNNL-CompBio/coderdata/blob/8000968dc5f19fbb986a700862c5035a0230b656/schema/coderdata.yaml#L247).
- *Transcriptomic data:* Transcript data is mapped to the same gene
  identifiers and samples but is converted to transcripts per million
  (TPM).
- *Copy number data:* Copy number is assumed to be a value
  representing the number of copies of that gene in a particular
  sample, with 2 assumed to be diploid.
- *Proteomic data:* Proteomic measurements are generally log-ratio
  values of the abundance measurements normalized to an internal
  control.

The resulting files are then stored as `[dataset_name]_[datatype].csv`.

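For instance, if a transcriptomic source provides raw read counts, they must be converted to TPM before being written out. A minimal sketch, assuming a genes-by-samples count matrix and gene lengths in kilobases (the example values are illustrative only):

```python
import pandas as pd

def counts_to_tpm(counts: pd.DataFrame, gene_length_kb: pd.Series) -> pd.DataFrame:
    """Convert a genes x samples matrix of raw read counts to TPM."""
    rpk = counts.div(gene_length_kb, axis=0)   # length-normalized counts
    scale = rpk.sum(axis=0) / 1_000_000        # per-sample scaling factor
    return rpk.div(scale, axis=1)              # each column now sums to 1e6

# Two genes, two samples (illustrative values only).
counts = pd.DataFrame({"s1": [100, 300], "s2": [50, 150]}, index=["g1", "g2"])
lengths = pd.Series([2.0, 3.0], index=["g1", "g2"])  # gene lengths in kb
print(counts_to_tpm(counts, lengths))
```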

### Drug data generation

The drug generation process can be slow depending on how many drugs
require querying from PubChem. However, with the use of an existing
drug data file, it's possible to shorten this process.

1. Build a Python script that maps the drug information to a SMILES
   string and an IMPROVE identifier. All drugs are given an IMPROVE
   identifier based on the canonical SMILES string to ensure that
   each drug has a unique structure to be used in the modeling
   process (a minimal sketch follows this list). To standardize this
   we encourage using our [standard drug lookup
   script](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/pubchem_retrieval.py)
   that retrieves drug structure and information by name or identifier.
2. Create a `build_drugs.sh` script that takes as its first argument
   an existing drug file and calls the script created in step 1
   above. Once the drugs for a dataset are retrieved, we have a second utility
   script that [builds the drug descriptor table](). Add this to the
   shell script to generate the drug descriptor file.
3. Test the `build_drugs.sh` script with the [test drugs
   file](TBD).
4. Validate the files generated with the [LinkML validation tool](https://linkml.io/linkml/cli/validate.html) and our
   [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

The resulting files should be `[dataset_name]_drugs.tsv` and
`[dataset_name]_drug_descriptors.tsv`.

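To illustrate the identifier logic in step 1 (the `SMI_` id format and column names here are assumptions for illustration, not necessarily the exact output of the lookup script), drugs are deduplicated on canonical SMILES so that two names resolving to the same structure share one id:

```python
import pandas as pd

def assign_drug_ids(drugs: pd.DataFrame, existing: pd.DataFrame) -> pd.DataFrame:
    """Assign one IMPROVE drug id per unique canonical SMILES string."""
    known = dict(zip(existing["canSMILES"], existing["improve_drug_id"]))
    # Simplified numbering; production code should continue from the
    # largest existing id rather than from the count.
    next_num = len(known) + 1
    ids = []
    for smiles in drugs["canSMILES"]:
        if smiles not in known:
            known[smiles] = f"SMI_{next_num}"  # hypothetical id format
            next_num += 1
        ids.append(known[smiles])
    out = drugs.copy()
    out["improve_drug_id"] = ids
    return out
```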

### Experiment data generation

The experiment file maps the sample information to the drugs of
interest with various drug response metrics. The experiment data
varies based on the type of system:

- Cell line and organoid data use the [drug curve fitting
  tool](http://github.com/pnnl-compbio/coderdata/tree/main/build/utils/fit_curve.py)
  that maps doses of drugs (in moles) to drug response measurements
  (in percent) to a variety of curve fitting metrics described in our
  [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).
- Patient-derived xenografts require an alternate script that [creates
  PDX-specific metrics]().

Otherwise the steps for building an experiment file are similar:

1. Build a Python script that maps your dose and response data to the
   sample and drug identifiers generated above and computes the
   response metrics with the appropriate tool listed above.
2. Create a `build_exp.sh` script that calls your script with the
   samples and drugs files generated in the previous steps as
   arguments.
3. Test the `build_exp.sh` script with your sample and drug files.
4. Validate the files generated with the [LinkML validation tool](https://linkml.io/linkml/cli/validate.html) and our
   [schema file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

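As a rough illustration of what the curve fitting step consumes and produces (a generic Hill-curve fit, not the actual implementation in `fit_curve.py`), each sample/drug pair contributes dose points in moles and response points in percent, which the fit reduces to summary metrics:

```python
import numpy as np
from scipy.optimize import curve_fit

def hill(dose_m, einf, ec50, hs):
    """Viability (%) as a function of dose (M) for a descending Hill curve."""
    return einf + (100.0 - einf) / (1.0 + (dose_m / ec50) ** hs)

# Dose/response points for one (sample, drug) pair: doses in moles,
# viability in percent (illustrative values only).
doses = np.array([1e-9, 1e-8, 1e-7, 1e-6, 1e-5])
resp = np.array([98.0, 90.0, 55.0, 20.0, 8.0])

(einf, ec50, hs), _ = curve_fit(
    hill, doses, resp, p0=[10.0, 1e-7, 1.0],
    bounds=([0.0, 1e-12, 0.1], [100.0, 1e-3, 10.0]),
)
print(f"Einf={einf:.1f}%  EC50={ec50:.2e} M  Hill slope={hs:.2f}")
```
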
### Dockerize and test

Lastly, check out examples! We have numerous Docker files in our
[Dockerfile
directory](http://github.com/pnnl-compbio/coderdata/tree/main/build/docker),
and multiple datasets in our [build
directory](http://github.com/pnnl-compbio/coderdata/tree/main/build).

---
Your contributions are essential to the growth and improvement of CoderData. We look forward to collaborating with you!