The build process is slow, partially due to our querying of PubChem,
and also because of our extensive curve fitting. However, it can be
run locally so that you can still leverage the Python package
functionality with your own datasets.

If you want to add a new dataset, you must create a Docker image that
contains all the scripts to pull the data and reformat it into our
[LinkML Schema](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml). Once complete, you can modify `build_dataset.py` to
call your Docker image and associated scripts. More details are below.

## Adding your own dataset

To add your own data, you must add a Docker image with the following
constraints:

1. Be named `Dockerfile.[dataset_name]` and reside in the
   `/build/docker` directory
2. Possess scripts called `build_omics.sh`, `build_samples.sh`,
   `build_drugs.sh`, `build_exp.sh`, and if needed, a
   `build_misc.sh`. These will all be called directly by
   `build_dataset.py`.
3. Create tables that mirror the schema described by the [LinkML YAML file](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/coderdata.yaml).

Files are generated in the following order as described above.

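The layout constraints above can be sanity-checked before wiring a dataset into the build. This is a minimal sketch, assuming only the file names listed above; the `check_dataset_layout` helper is hypothetical, not part of the repository:

```python
# Sanity-check a proposed dataset's build files against the constraints above.
import pathlib
import tempfile

REQUIRED_SCRIPTS = ["build_samples.sh", "build_omics.sh",
                    "build_drugs.sh", "build_exp.sh"]

def check_dataset_layout(repo_root, dataset_name):
    """Return a list of missing files for a proposed dataset (empty = OK)."""
    root = pathlib.Path(repo_root)
    problems = []
    dockerfile = root / "build" / "docker" / f"Dockerfile.{dataset_name}"
    if not dockerfile.is_file():
        problems.append(str(dockerfile))
    dataset_dir = root / "build" / dataset_name
    for script in REQUIRED_SCRIPTS:
        if not (dataset_dir / script).is_file():
            problems.append(str(dataset_dir / script))
    return problems

# Demonstrate against a throwaway tree that satisfies the constraints.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "build" / "docker").mkdir(parents=True)
    (root / "build" / "docker" / "Dockerfile.mydataset").touch()
    (root / "build" / "mydataset").mkdir()
    for s in REQUIRED_SCRIPTS:
        (root / "build" / "mydataset" / s).touch()
    print(check_dataset_layout(root, "mydataset"))  # → []
```

An empty result means the Dockerfile and all four required shell scripts are in place; anything returned is a path still missing.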
### Sample generation

file. We recommend following these steps:

1. Build a python script that pulls the sample identifier information
   from a stable repository and generates Improve identifiers for
   each sample, while also ensuring that no sample identifiers
   clash with prior samples. Examples can be found [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/mpnst/00_sample_gen.R) and [here](https://github.com/PNNL-CompBio/coderdata/blob/main/build/broad_sanger/01-broadSangerSamples.R). If
   you are using the Genomic Data Commons, you can leverage our
   is our most comprehensive script as it pulls over 50k drugs.
2. Create a `build_drugs.sh` script that takes as its first argument
   an existing drug file and calls the script created in step 1
   above. Once the drugs for a dataset are retrieved, we have a second
   utility script that [builds the drug descriptor table](https://github.com/PNNL-CompBio/coderdata/blob/cbf017326b83771c55f12317189f4b2dbd9d900a/schema/coderdata.yaml#L94). Add this
   to the shell script to generate the drug descriptor file.
3. Test the `build_drugs.sh` script with the [test drugs file](TBD).
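The role of the existing-drug-file argument in step 2 can be sketched as follows. This is an illustration only: the tab-separated layout and the `improve_drug_id` column name are assumptions, not the actual schema.

```python
# Sketch: only drugs whose identifiers are absent from the existing drug
# file should be appended by a dataset's build_drugs.sh step.
import csv
import io

def new_drugs_only(existing_drug_file_text, candidate_rows):
    """Drop candidates whose IDs already appear in the existing drug file."""
    reader = csv.DictReader(io.StringIO(existing_drug_file_text), delimiter="\t")
    existing_ids = {row["improve_drug_id"] for row in reader}
    return [row for row in candidate_rows
            if row["improve_drug_id"] not in existing_ids]

existing = "improve_drug_id\tchem_name\nSMI_1\taspirin\n"
candidates = [
    {"improve_drug_id": "SMI_1", "chem_name": "aspirin"},      # already known
    {"improve_drug_id": "SMI_2", "chem_name": "gemcitabine"},  # new to this build
]
print([r["chem_name"] for r in new_drugs_only(existing, candidates)])  # → ['gemcitabine']
```

Passing the existing file first is what lets each dataset's build avoid re-registering drugs that a previous dataset already assigned identifiers to.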
varies based on the type of system:

All scripts described above go into a single directory with the name
of the dataset under the [build](http://github.com/pnnl-compbio/coderdata/tree/main/build) directory, with instructions to add everything in the [docker](http://github.com/pnnl-compbio/coderdata/tree/main/build/docker)
directory. Make sure to include any requirements for building in the
folder and docker image as well.

Once the Dockerfile builds and runs, you can modify the
`build_dataset.py` script so that it runs and validates.
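As a rough mental model of that wiring, the commands `build_dataset.py` issues for one dataset can be sketched like this. Everything here is hypothetical: the `plan_build` helper, the image tag, and the exact `docker` invocations are illustrative, and the real script's interface may differ.

```python
# Illustrative plan of the docker commands needed to build one dataset:
# build the image from its Dockerfile, then run each build script in order.
def plan_build(dataset_name):
    """Return the docker command lines for one dataset, in build order."""
    image = f"{dataset_name}-build"  # hypothetical tag
    commands = [
        ["docker", "build", "-f", f"build/docker/Dockerfile.{dataset_name}",
         "-t", image, "."],
    ]
    for script in ["build_samples.sh", "build_omics.sh",
                   "build_drugs.sh", "build_exp.sh"]:
        commands.append(["docker", "run", "--rm", image, "bash", script])
    return commands

for cmd in plan_build("mydataset"):
    print(" ".join(cmd))
```

The ordering mirrors the file-generation order described above: samples first, then omics, drugs, and experiments.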

Check out examples! We have numerous Docker files in our