Commit 779d4fd

new changes to datasets included rst and static files associated with that
1 parent e14562b commit 779d4fd

4 files changed: +1948 -727 lines changed

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+dataset,num_drugs,aac,abc,auc,dss,fit_auc,fit_ec50,fit_ec50se,fit_einf,fit_hs,fit_ic50,fit_r2,lmm,mRESCIST,published_auc,TGI
+beataml,164,X,,X,X,X,X,X,X,X,X,X,,,,
+bladder,50,X,,X,X,X,X,X,X,X,X,X,,,,
+ccle,24,X,,X,X,X,X,X,X,X,X,X,,,,
+colorectal,10,X,,X,X,X,X,X,X,X,X,X,,,,
+ctrpv2,459,X,,X,X,X,X,X,X,X,X,X,,,,
+fimm,52,X,,X,X,X,X,X,X,X,X,X,,,,
+gcsi,44,X,,X,X,X,X,X,X,X,X,X,,,,
+gdscv1,294,X,,X,X,X,X,X,X,X,X,X,,,,
+gdscv2,171,X,,X,X,X,X,X,X,X,X,X,,,,
+liver,76,X,,X,X,X,X,X,X,X,X,X,,,,
+mpnst,30,X,X,X,X,X,X,X,X,X,X,X,X,X,,X
+nci60,55157,X,,X,X,X,X,X,X,X,X,X,,,,
+novartis,25,,X,,,,,,,,,,X,X,,X
+pancreatic,25,X,,X,X,X,X,X,X,X,X,X,,,,
+prism,1419,X,,X,X,X,X,X,X,X,X,X,,,,
+sarcoma,34,,,,,,,,,,,,,,X,
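
The wide curve-metrics table above marks each available metric with an `X`. As a minimal sketch (not part of the commit), the rows can be read with Python's standard `csv` module to list which metrics each dataset provides. The three rows embedded below are copied from the diff; the helper name `metrics_by_dataset` is hypothetical, and in practice the file would be read from disk rather than from a string.

```python
import csv
import io

# Three rows copied from the diff above; embedding them keeps the sketch self-contained.
WIDE_CSV = """dataset,num_drugs,aac,abc,auc,dss,fit_auc,fit_ec50,fit_ec50se,fit_einf,fit_hs,fit_ic50,fit_r2,lmm,mRESCIST,published_auc,TGI
mpnst,30,X,X,X,X,X,X,X,X,X,X,X,X,X,,X
novartis,25,,X,,,,,,,,,,X,X,,X
sarcoma,34,,,,,,,,,,,,,,X,
"""


def metrics_by_dataset(text):
    """Map each dataset name to the list of metric columns marked with 'X'."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        name = row.pop("dataset")
        row.pop("num_drugs")  # a count, not a metric flag
        result[name] = [metric for metric, mark in row.items() if mark == "X"]
    return result


available = metrics_by_dataset(WIDE_CSV)
print(available["novartis"])
print(available["sarcoma"])
```

Keeping the metric names as column headers means a newly added metric needs no code change; only the CSV grows a column.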
Lines changed: 19 additions & 19 deletions
@@ -1,19 +1,19 @@
-dataset,num_samples,num_drugs,num_sample_drug_pairs,num_sample_drug_transcript_pairs,num_sample_drug_transcript_mutation_pairs,num_sample_drug_transcript_copynum_pairs,num_sample_drug_mutation_copynum_pairs
-hcmi,886,,,,,,
-beataml,1022,164,23662,3033,2905,,
-mpnst,50,25,212,163,163,163,163
-pancpdo,70,25,290,180,175,175,285
-cptac,1139,,,,,,
-sarcpdo,36,34,276,234,187,,
-colorectal,61,10,140,60,60,60,140
-bladderpdo,134,50,3300,840,640,640,3100
-liver,62,76,4453,4453,4453,4453,4453
-novartis,386,25,1766,1734,1734,1723,1723
-ccle,502,24,11543,10887,10792,10887,11118
-ctrpv2,846,460,310564,301263,296487,300452,301373
-fimm,52,52,2663,2457,2457,2457,2611
-gdscv1,984,293,246807,244282,241074,240318,241644
-gdscv2,806,169,113964,112911,111387,111085,111687
-gcsi,569,43,13229,12320,12155,12320,12919
-prism,478,1418,638684,631784,630379,631784,635929
-nci60,83,54654,2933857,2307990,2307977,2307990,2759211
+dataset,sample_drug_pairs,sample_drug_transcript_pairs,sample_drug_transcriptomics_mutation_pairs,sample_drug_transcriptomics_copynumber_pairs,sample_drug_mutation_copynumber_pairs
+beataml,31926.0,4137.0,3958.0,,
+bladder,3300.0,840.0,640.0,640.0,3100.0
+ccle,11543.0,10887.0,10792.0,10887.0,11118.0
+colorectal,140.0,60.0,60.0,60.0,140.0
+cptac,,,,,
+ctrpv2,309401.0,300507.0,295742.0,299698.0,300616.0
+fimm,2663.0,2457.0,2457.0,2457.0,2611.0
+gcsi,13398.0,12506.0,12338.0,12506.0,13112.0
+gdscv1,247753.0,245220.0,241999.0,241240.0,242570.0
+gdscv2,115440.0,114373.0,112829.0,112523.0,113133.0
+hcmi,,,,,
+liver,4453.0,4453.0,4453.0,4453.0,4453.0
+mpnst,272.0,193.0,184.0,191.0,184.0
+nci60,2960756.0,2329149.0,2329132.0,2329149.0,2784474.0
+novartis,1766.0,1734.0,1734.0,1723.0,1723.0
+pancreatic,190.0,190.0,185.0,185.0,185.0
+prism,638983.0,632078.0,630672.0,632078.0,636226.0
+sarcoma,275.0,234.0,187.0,,
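
The new summary file stores counts as floats (for example `31926.0`) and leaves cells empty where a dataset has no such pairs. The following sketch, which is an illustration rather than anything in the commit, parses a few rows copied from the diff and computes the share of sample-drug pairs that also have transcriptomics. The function names `to_count` and `transcript_coverage` are hypothetical.

```python
import csv
import io

# Rows copied from the added lines of the diff above.
SUMMARY_CSV = """dataset,sample_drug_pairs,sample_drug_transcript_pairs,sample_drug_transcriptomics_mutation_pairs,sample_drug_transcriptomics_copynumber_pairs,sample_drug_mutation_copynumber_pairs
beataml,31926.0,4137.0,3958.0,,
cptac,,,,,
liver,4453.0,4453.0,4453.0,4453.0,4453.0
"""


def to_count(cell):
    """Convert a cell such as '31926.0' to an int; empty cells become None."""
    return int(float(cell)) if cell else None


def transcript_coverage(text):
    """Fraction of sample-drug pairs that also have transcriptomics, per dataset."""
    coverage = {}
    for row in csv.DictReader(io.StringIO(text)):
        pairs = to_count(row["sample_drug_pairs"])
        with_rna = to_count(row["sample_drug_transcript_pairs"])
        if pairs and with_rna is not None:
            coverage[row["dataset"]] = with_rna / pairs
        else:
            # No drug-response pairs recorded for this dataset.
            coverage[row["dataset"]] = None
    return coverage


coverage = transcript_coverage(SUMMARY_CSV)
for name, share in coverage.items():
    print(name, share)
```

Returning `None` instead of `0.0` for datasets such as `cptac` keeps "no drug data" distinguishable from "drug data but no transcriptomics overlap".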

docs/source/datasets_included.rst

Lines changed: 59 additions & 27 deletions
@@ -1,44 +1,58 @@
 Datasets Included
 =================
 
-This page provides an overview of the datasets included in CoderData version 2.1.0.
+This page provides an overview of the datasets included in CoderData version 2.2.0. This package collects 18 diverse sets of paired molecular datasets with corresponding drug sensitivity data. All data here is reprocessed and standardized so it can be easily used as a benchmark dataset for machine learning models.
 
 Figshare record: https://api.figshare.com/v2/articles/28823159
-Version: 2.1.0
+Version: 2.2.0
 
 ---------------------------
 Dataset Overview
 ---------------------------
 .. csv-table:: Datasets and Modalities
-   :header: "Dataset", "References", "Sample", "Transcriptomics", "Proteomics", "Mutations", "Copy Number", "Drug", "Drug Descriptor", "Experiments"
-   :widths: 12, 10, 6, 12, 12, 12, 12, 8, 15, 12
-
-   "BeatAML", "[1]_, [2]_", "X", "X", "X", "X", "", "X", "X", "X"
-   "BladderPDO", "[3]_", "X", "X", "", "X", "X", "X", "X", "X"
-   "CCLE", "[4]_", "X", "X", "X", "X", "X", "X", "X", "X"
-   "CPTAC", "[5]_", "X", "X", "X", "X", "X", "", "", ""
-   "CTRPv2", "[6]_, [7]_, [8]_", "X", "X", "", "X", "X", "X", "X", "X"
-   "FIMM", "[9]_, [10]_", "X", "X", "", "", "", "X", "X", "X"
-   "HCMI", "[11]_", "X", "X", "", "X", "X", "", "", ""
-   "MPNST", "[12]_", "X", "X", "X", "X", "X", "X", "X", "X"
-   "NCI60", "[13]_", "X", "X", "X", "X", "", "X", "X", "X"
-   "Pancreatic PDO", "[14]_", "X", "X", "", "X", "X", "X", "X", "X"
-   "PRISM", "[15]_, [16]_", "X", "X", "", "", "", "X", "X", "X"
-   "Sarcoma PDO", "[17]_", "X", "X", "", "X", "", "X", "X", "X"
-   "CRC PDO", "[18]_", "X", "X", "", "X", "X", "X", "X", ""
-   "Liver PDO", "[19]_", "X", "X", "", "X", "X", "X", "X", ""
-   "Novartis PDX", "[20]_", "X", "X", "", "X", "X", "X", "X", ""
-   "gCSI", "[21]_, [22]_", "X", "X", "X", "X", "X", "X", "X", ""
-   "GDSC v1", "[23]_, [24]_, [25]_", "X", "X", "X", "X", "X", "X", "X", ""
-   "GDSC v2", "[23]_, [24]_, [25]_", "X", "X", "X", "X", "X", "X", "X", ""
-
-The table above lists the datasets included in CoderData version 2.1.0, along with references to their original publications and the types of data available for each dataset. An "X" indicates the presence of a particular data type for the corresponding dataset.
+   :header: "Dataset", "References", "Sample", "Drug", "Drug Descriptor", "Experiments", "Transcriptomics", "Proteomics", "Mutations", "Copy Number"
+   :widths: 14, 12, 6, 8, 15, 12, 12, 12, 12, 12
+
+   "BeatAML", "[1]_, [2]_", "1022", "164", "X", "X", "X", "X", "X", ""
+   "Bladder", "[3]_", "134", "50", "X", "X", "X", "", "X", "X"
+   "CCLE", "[4]_", "502", "24", "X", "X", "X", "X", "X", "X"
+   "Colorectal", "[18]_", "61", "10", "X", "", "X", "", "X", "X"
+   "CPTAC", "[5]_", "1139", "", "", "", "X", "X", "X", "X"
+   "CTRPv2", "[6]_, [7]_, [8]_", "846", "459", "X", "X", "X", "", "X", "X"
+   "FIMM", "[9]_, [10]_", "52", "52", "X", "X", "X", "", "", ""
+   "GDSC v1", "[23]_, [24]_, [25]_", "984", "294", "X", "", "X", "X", "X", "X"
+   "GDSC v2", "[23]_, [24]_, [25]_", "806", "171", "X", "", "X", "X", "X", "X"
+   "gCSI", "[21]_, [22]_", "569", "44", "X", "", "X", "X", "X", "X"
+   "HCMI", "[11]_", "886", "", "", "", "X", "", "X", "X"
+   "Liver", "[19]_", "62", "76", "X", "", "X", "", "X", "X"
+   "MPNST", "[12]_", "50", "30", "X", "X", "X", "X", "X", "X"
+   "NCI60", "[13]_", "83", "55157", "X", "X", "X", "X", "X", ""
+   "Novartis", "[20]_", "386", "25", "X", "", "X", "", "X", "X"
+   "Pancreatic", "[14]_", "70", "25", "X", "X", "X", "", "X", "X"
+   "PRISM", "[15]_, [16]_", "478", "1419", "X", "X", "X", "", "", ""
+   "Sarcoma", "[17]_", "36", "34", "X", "X", "X", "", "X", ""
+
+
+The table above lists the datasets included in CoderData version 2.2.0, along with references to their original publications, counts of samples and drugs, and the types of data available for each dataset.
+
+CoderData includes the following data:
+
+- Sample - cell lines, patient-derived samples, or patient-derived organoids
+- Drug - compounds tested for sensitivity
+- Drug Descriptor - molecular descriptors for each drug (computed using RDKit)
+- Experiments - dose-response experiments (various metrics such as AUC, IC50, etc.)
+- Transcriptomics - gene expression (in transcripts per million, TPM)
+- Proteomics - protein expression (in log2 ratio to reference)
+- Mutations - gene mutations (variant calls)
+- Copy Number - gene copy number variations (number of copies of each gene, 2 being diploid)
+
+An "X" indicates the presence of a particular data type for the corresponding dataset.
 
 
 ---------------------------
 Dataset Summary Statistics
 ---------------------------
-The following table summarizes key statistics for each dataset, including the number of samples, drugs, and various combinations of sample-drug pairs with different molecular data types.
+The following table summarizes combination counts for each dataset: the number of experimental sample-drug pairs, alone and in combination with different molecular data types. Each column gives the number of unique sample-drug combinations for which the specified molecular data types are available. For example, the "Sample-Drug-Transcriptomics-Mutations" column gives the number of unique sample-drug pairs that have both transcriptomics and mutation data available.
 
 .. csv-table:: Dataset Summary Statistics
    :file: _static/dataset_summary_statistics.csv
@@ -51,9 +65,27 @@ Drug Curve Metrics Collected
 The following table summarizes the number of drugs associated with each dose-response metric across the datasets.
 
 .. csv-table:: Drug Curve Metrics Summary
-   :file: _static/dataset_curve_metric_summary.csv
+   :file: _static/dataset_curve_metrics_wide.csv
    :header-rows: 0
 
+Types of dose-response metrics collected include:
+
+- AAC - Area above the response curve; the complement of AUC.
+- ABC - Area between curves; the difference between the AUC of the control and that of the treated cells.
+- AUC - Area under the fitted Hill curve across all doses present. Lower AUC signifies lower levels of growth.
+- DSS - A multiparametric dose-response value that takes into account control and treated cells.
+- fit_auc - Area under the fitted Hill curve across the common interval of −log10[M], where the molar concentration ranges from 10⁻⁴ to 10⁻¹⁰.
+- fit_ec50 - The fitted curve prediction of the −log10[M] concentration at which 50% of the maximal effect is observed.
+- fit_ec50se - Standard error of the fit_ec50 estimate.
+- fit_einf - The fraction of cells that are unaffected even at an infinite dose concentration. Calculated as the lower asymptote of the Hill function.
+- fit_hs - The estimated Hill slope (binding cooperativity), calculated as the slope of the sigmoidal Hill curve.
+- fit_ic50 - The fitted curve prediction of the −log10[M] concentration required to reduce tumor growth by 50%.
+- fit_r2 - Coefficient of determination between observed growth and the fitted Hill curve, indicating goodness of fit.
+- lmm - The “time and treatment interaction” term of a linear mixed model with time and treatment as fixed effects and patient as a random effect. Indicates how much the treatment changes the slope of log(volume) over time compared to the control.
+- mRESCIST - Disease status classified into PD (progressive disease), SD (stable disease), PR (partial response), and CR (complete response), based on percent volume change and cumulative average response.
+- published_auc - Published area under the curve.
+- TGI - Tumor growth inhibition between the control and treatment time-volume curves.
+
 
 
 ---------------------------
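
The metric definitions in the diff above relate `fit_ec50`, `fit_hs`, `fit_einf`, `fit_auc`, and AAC through a fitted Hill curve. The sketch below illustrates that relationship under stated assumptions: a Hill form with an upper asymptote of 1, doses expressed in −log10[M], and an AUC normalized by the width of the 4 to 10 interval. The exact parameterization and normalization used by CoderData are not shown in this commit, so the function names and formula here are illustrative only.

```python
def hill_response(x, ec50, hs, einf):
    """Fraction of growth remaining at dose x, with x and ec50 in -log10(M).

    Assumes an upper asymptote of 1 (no effect at vanishing dose) and a
    lower asymptote of einf; this form is an assumption for illustration.
    """
    return einf + (1.0 - einf) / (1.0 + 10.0 ** (hs * (ec50 - x)))


def normalized_auc(ec50, hs, einf, lo=4.0, hi=10.0, steps=600):
    """Trapezoidal area under the curve over [lo, hi], divided by the interval width."""
    width = (hi - lo) / steps
    xs = [lo + i * width for i in range(steps + 1)]
    ys = [hill_response(x, ec50, hs, einf) for x in xs]
    area = sum((ys[i] + ys[i + 1]) / 2.0 * width for i in range(steps))
    return area / (hi - lo)


auc = normalized_auc(ec50=7.0, hs=1.0, einf=0.1)
aac = 1.0 - auc              # AAC as the complement of the normalized AUC
molar_ec50 = 10.0 ** (-7.0)  # a -log10(M) value of 7 corresponds to 1e-7 M
print(auc, aac, molar_ec50)
```

With the EC50 placed at the midpoint of the interval, the curve is symmetric about it, so the normalized AUC equals the midpoint response `(1 + einf) / 2`, which makes a convenient sanity check.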
