
Commit 5d4876a

Merge pull request #157 from daisybio/development
New version: v1.2.4
2 parents 20dd7f6 + 604908d commit 5d4876a

33 files changed, +771 -448 lines changed

.github/workflows/run_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -66,7 +66,7 @@ jobs:
           print("::set-output name=result::{}".format(result))

       - name: Restore pre-commit cache
-        uses: actions/[email protected].1
+        uses: actions/[email protected].2
         if: matrix.session == 'pre-commit'
         with:
           path: ~/.cache/pre-commit
@@ -129,6 +129,6 @@ jobs:
         run: nox --force-color --session=coverage -- xml -i

       - name: Upload coverage report
-        uses: codecov/codecov-action@v5.3.1
+        uses: codecov/codecov-action@v5.4.0
         with:
           token: ${{ secrets.CODECOV_TOKEN }}

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -5,7 +5,8 @@ data/mapping
 data/GDSC1
 data/GDSC2
 data/CCLE
-data/Toy_Data
+data/TOYv1
+data/TOYv2
 data/CTRPv1
 data/CTRPv2

docs/conf.py

Lines changed: 2 additions & 2 deletions
@@ -56,9 +56,9 @@
 # the built documents.
 #
 # The short X.Y version.
-version = "1.2.3"
+version = "1.2.4"
 # The full version, including alpha/beta/rc tags.
-release = "1.2.3"
+release = "1.2.4"

 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

docs/quickstart.rst

Lines changed: 2 additions & 2 deletions
@@ -3,12 +3,12 @@ Quickstart

 Make sure you have installed DrEvalPy and its dependencies (see `Installation <./installation.html>`_).

-To make sure the pipeline runs, you can use the fast models NaiveDrugMeanPredictor and NaivePredictor on the Toy_Data
+To make sure the pipeline runs, you can use the fast models NaiveDrugMeanPredictor and NaivePredictor on the TOYv1 (subset of CTRPv2) or TOYv2 (subset of GDSC2)
 dataset with the LPO test mode.

 .. code-block:: bash

-    python run_suite.py --run_id my_first_run --models NaiveDrugMeanPredictor --baselines NaivePredictor --dataset Toy_Data --test_mode LPO
+    python run_suite.py --run_id my_first_run --models NaiveDrugMeanPredictor --baselines NaivePredictor --dataset TOYv1 --test_mode LPO

 This will train the two baseline models on a subset of gene expression features and drug fingerprint features to
 predict IC50 values of the GDSC1 database. It will evaluate in "LPO" which is the leave-pairs-out splitting strategy

docs/usage.rst

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -156,14 +156,21 @@ We provide commonly used datasets to evaluate your model on (GDSC1, GDSC2, CCLE,
156156
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
157157
| Dataset Name | Number of Drugs | Number of Cell Lines| Description |
158158
+===================+=================+=====================+=======================================================================================================================+
159-
| GDSC1 | 345 | 987 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 1. |
159+
| GDSC1 | 378 | 970 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 1. |
160160
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
161-
| GDSC2 | 192 | 809 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 2. |
161+
| GDSC2 | 287 | 969 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 2. |
162162
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
163-
| CCLE | 18 | 471 | The Cancer Cell Line Encyclopedia (CCLE) dataset. The response data will soon be replaced with the data from CTRPv2. |
163+
| CCLE | 24 | 503 | The Cancer Cell Line Encyclopedia (CCLE) dataset. |
164164
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
165-
| Toy_Data | 40 | 98 | A toy dataset for testing purposes. |
165+
| CTRPv1 | 354 | 243 | The Cancer Therapeutics Response Portal (CTRP) dataset version 1. |
166166
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
167+
| CTRPv2 | 546 | 886 | The Cancer Therapeutics Response Portal (CTRP) dataset version 2. |
168+
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
169+
| TOYv1 | 36 | 90 | A toy dataset for testing purposes subsetted from CTRPv2. |
170+
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
171+
| TOYv2 | 36 | 90 | A second toy dataset for cross-study testing purposes. 80 cell lines and 32 drugs overlap with TOYv1. |
172+
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
173+
167174
168175
If using the ``--curve_curator`` option with these datasets, the desired measure provided with the ``--measure`` option is appended with "_curvecurator", e.g. "IC50_curvecurator".
169176
In the provided datasets, these are the measures calculated with the same fitting procedure using CurveCurator. To use the measures reported from the original publications of the
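The naming rule described in this hunk (the measure passed via ``--measure`` gets "_curvecurator" appended when ``--curve_curator`` is set) can be sketched in a few lines of Python. The helper name `resolve_measure` is hypothetical and not part of DrEvalPy; this is only an illustration of the suffix rule, not the library's implementation.

```python
CURVECURATOR_SUFFIX = "_curvecurator"


def resolve_measure(measure: str, curve_curator: bool) -> str:
    """Return the response column name implied by the two options.

    Hypothetical helper for illustration; not DrEvalPy's own function.
    """
    # Append the suffix only once, and only when CurveCurator measures are requested.
    if curve_curator and not measure.endswith(CURVECURATOR_SUFFIX):
        return measure + CURVECURATOR_SUFFIX
    return measure


print(resolve_measure("IC50", curve_curator=True))   # IC50_curvecurator
print(resolve_measure("IC50", curve_curator=False))  # IC50
```

The `endswith` guard is an assumption of this sketch: it keeps the helper idempotent if a caller already passes the suffixed name.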

drevalpy/datasets/dataset.py

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,7 @@
 import numpy as np
 import pandas as pd
 from sklearn.base import TransformerMixin
+from sklearn.feature_selection import VarianceThreshold
 from sklearn.model_selection import GroupKFold, train_test_split

 from ..pipeline_function import pipeline_function
@@ -1002,6 +1003,9 @@ def fit_transform_features(self, train_ids: np.ndarray, transformer: Transformer
         # Collect all features of the view for fitting the scaler
         train_features = np.vstack([self.features[identifier][view] for identifier in train_ids])
         transformer.fit(train_features)
+        if isinstance(transformer, VarianceThreshold):
+            mask = transformer.get_support()
+            self.meta_info[view] = self.meta_info[view][mask]

         # Apply transformation and scaling to each feature vector
         for identifier in self.features:
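The added lines keep the per-view feature names in step with the columns that survive variance filtering. A minimal standard-library sketch of that idea follows. It deliberately swaps scikit-learn's `VarianceThreshold` for a hand-written population-variance check, and the names `variance_mask`, `apply_mask`, and the gene list are illustrative assumptions rather than DrEvalPy code.

```python
from statistics import pvariance


def variance_mask(rows: list[list[float]], threshold: float = 0.0) -> list[bool]:
    """Mark the columns whose population variance exceeds `threshold`."""
    columns = zip(*rows)  # transpose: one tuple per feature column
    return [pvariance(col) > threshold for col in columns]


def apply_mask(values: list, mask: list[bool]) -> list:
    """Keep only the entries whose mask position is True."""
    return [value for value, keep in zip(values, mask) if keep]


# Toy feature matrix: the second column is constant, so it is dropped.
train_features = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.4], [3.0, 5.0, 0.9]]
gene_names = ["TP53", "EGFR", "KRAS"]  # stand-in for meta_info[view]

mask = variance_mask(train_features)
gene_names = apply_mask(gene_names, mask)
train_features = [apply_mask(row, mask) for row in train_features]

print(mask)        # [True, False, True]
print(gene_names)  # ['TP53', 'KRAS']
```

Without the second `apply_mask` call on the names, the feature vectors would have two columns while the metadata still listed three, which is the mismatch the hunk appears to guard against.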

drevalpy/datasets/loader.py

Lines changed: 38 additions & 11 deletions
@@ -23,7 +23,7 @@ def load_gdsc1(

     :param path_data: Path to the dataset.
     :param file_name: File name of the dataset.
-    :param measure: The name of the column containing the measure to predict, default = "LN_IC50"
+    :param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

     :param dataset_name: Name of the dataset.
     :return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
@@ -49,7 +49,7 @@ def load_gdsc2(path_data: str = "data", measure: str = "LN_IC50_curvecurator", f

     :param path_data: Path to the dataset.
     :param file_name: File name of the dataset.
-    :param measure: The name of the column containing the measure to predict, default = "LN_IC50"
+    :param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

     :return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
     """
@@ -64,7 +64,7 @@ def load_ccle(

     :param path_data: Path to the dataset.
     :param file_name: File name of the dataset.
-    :param measure: The name of the column containing the measure to predict, default = "LN_IC50"
+    :param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

     :return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
     """
@@ -84,17 +84,19 @@ def load_ccle(
     )


-def load_toy(path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
+def _load_toy(
+    path_data: str = "data", measure: str = "LN_IC50_curvecurator", dataset_name="TOYv1"
+) -> DrugResponseDataset:
     """
-    Loads small Toy dataset, subsampled from GDSC1.
+    Loads small Toy dataset, subsampled from CTRPv2 or GDSC2.

     :param path_data: Path to the dataset.
-    :param measure: The name of the column containing the measure to predict, default = "response"
+    :param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"
+    :param dataset_name: Name of the dataset. Either "TOYv1" or "TOYv2".

     :return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
     """
-    dataset_name = "Toy_Data"
-    path = os.path.join(path_data, dataset_name, "toy_data.csv")
+    path = os.path.join(path_data, dataset_name, f"{dataset_name}.csv")
     if not os.path.exists(path):
         download_dataset(dataset_name, path_data, redownload=True)
     response_data = pd.read_csv(path, dtype={"pubchem_id": str})
@@ -107,13 +109,37 @@ def load_toy(path_data: str = "data", measure: str = "LN_IC50_curvecurator") ->
     )


+def load_toyv1(path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
+    """
+    Loads small Toy dataset, subsampled from CTRPv2.
+
+    :param path_data: Path to the dataset.
+    :param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"
+
+    :return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
+    """
+    return _load_toy(path_data, measure, "TOYv1")
+
+
+def load_toyv2(path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
+    """
+    Loads small Toy dataset, subsampled from GDSC2. Can be used to test cross study prediction.
+
+    :param path_data: Path to the dataset.
+    :param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"
+
+    :return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
+    """
+    return _load_toy(path_data, measure, "TOYv2")
+
+
 def _load_ctrpv(version: str, path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
     """
     Load CTRPv1 dataset.

     :param version: The version of the CTRP dataset to load.
     :param path_data: Path to location of CTRPv1 dataset
-    :param measure: The name of the column containing the measure to predict, default = "response"
+    :param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

     :return: DrugResponseDataset containing response, cell line IDs, and drug IDs
     """
@@ -171,7 +197,8 @@ def load_custom(path_data: str | Path, measure: str = "response") -> DrugRespons
     "GDSC1": load_gdsc1,
     "GDSC2": load_gdsc2,
     "CCLE": load_ccle,
-    "Toy_Data": load_toy,
+    "TOYv1": load_toyv1,
+    "TOYv2": load_toyv2,
     "CTRPv1": load_ctrpv1,
     "CTRPv2": load_ctrpv2,
 }
@@ -184,7 +211,7 @@ def load_dataset(
     """
     Load a dataset based on the dataset name.

-    :param dataset_name: The name of the dataset to load. Can be one of ('GDSC1', 'GDSC2', 'CCLE', or 'Toy_Data')
+    :param dataset_name: The name of the dataset to load. Can be one of ('GDSC1', 'GDSC2', 'CCLE', 'TOYv1', or 'TOYv2')
         to download provided datasets, or any other name to allow for custom datasets.
     :param path_data: The parent path in which custom or downloaded datasets should be located, or in which raw
         viability data is to be found for fitting with CurveCurator (see param curve_curator for details).
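The loader change follows a common pattern: one private helper parameterised by dataset name, thin public wrappers per dataset, and a name-to-function registry. The sketch below reproduces only the path logic and the dispatch, so it runs without pandas or any downloaded data. The function names `toy_path_v1`, `toy_path_v2`, and `resolve` are hypothetical, and raising `ValueError` for unknown names is a simplification (the `load_dataset` docstring says unknown names are treated as custom datasets).

```python
import os
from typing import Callable


def _toy_path(path_data: str, dataset_name: str) -> str:
    """Build <path_data>/<dataset_name>/<dataset_name>.csv, mirroring the new path rule."""
    return os.path.join(path_data, dataset_name, f"{dataset_name}.csv")


def toy_path_v1(path_data: str = "data") -> str:
    """Thin wrapper that pins the dataset name, like load_toyv1 in the diff."""
    return _toy_path(path_data, "TOYv1")


def toy_path_v2(path_data: str = "data") -> str:
    """Thin wrapper that pins the dataset name, like load_toyv2 in the diff."""
    return _toy_path(path_data, "TOYv2")


# Name -> function registry; adding a dataset means adding one entry here.
TOY_PATHS: dict[str, Callable[[str], str]] = {
    "TOYv1": toy_path_v1,
    "TOYv2": toy_path_v2,
}


def resolve(dataset_name: str, path_data: str = "data") -> str:
    """Dispatch on the dataset name (simplified: unknown names are an error here)."""
    try:
        return TOY_PATHS[dataset_name](path_data)
    except KeyError:
        raise ValueError(f"Unknown toy dataset: {dataset_name!r}") from None


print(resolve("TOYv2"))
```

Deriving the file name from the dataset name removes the hard-coded "toy_data.csv" special case, so both toy datasets share one code path.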

drevalpy/datasets/utils.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ def download_dataset(
     """
     Download the latest dataset from Zenodo.

-    :param dataset_name: dataset name, e.g., "GDSC1", "GDSC2", "CCLE" or "Toy_Data"
+    :param dataset_name: dataset name, from "GDSC1", "GDSC2", "CCLE", "CTRPv1", "CTRPv2", "TOYv1", "TOYv2"
     :param data_path: where to save the data
     :param redownload: whether to redownload the data
     :raises HTTPError: if the download fails

drevalpy/experiment.py

Lines changed: 6 additions & 6 deletions
@@ -282,7 +282,7 @@ def drug_response_experiment(
         models=models,
         n_cv_splits=n_cv_splits,
         results_path=result_path,
-        cross_study_datasets=cross_study_datasets,
+        cross_study_datasets=[cs.dataset_name for cs in cross_study_datasets],
         randomization_mode=randomization_mode,
         n_trials_robustness=n_trials_robustness,
         out_path=result_path,
@@ -295,7 +295,7 @@ def consolidate_single_drug_model_predictions(
     models: list[type[DRPModel]],
     n_cv_splits: int,
     results_path: str,
-    cross_study_datasets: list[DrugResponseDataset],
+    cross_study_datasets: list[str],
     randomization_mode: list[str] | None = None,
     n_trials_robustness: int = 0,
     out_path: str = "",
@@ -357,10 +357,10 @@ def consolidate_single_drug_model_predictions(
             # Cross study predictions
             for cross_study_dataset in cross_study_datasets:
                 cross_study_prediction_path = os.path.join(single_drug_prediction_path, "cross_study")
-                f = f"cross_study_{cross_study_dataset.dataset_name}_split_{split}.csv"
-                if cross_study_dataset.dataset_name not in predictions["cross_study"]:
-                    predictions["cross_study"][cross_study_dataset.dataset_name] = []
-                predictions["cross_study"][cross_study_dataset.dataset_name].append(
+                f = f"cross_study_{cross_study_dataset}_split_{split}.csv"
+                if cross_study_dataset not in predictions["cross_study"]:
+                    predictions["cross_study"][cross_study_dataset] = []
+                predictions["cross_study"][cross_study_dataset].append(
                     pd.read_csv(
                         os.path.join(cross_study_prediction_path, f),
                         index_col=0,
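After this change the consolidation step only needs dataset names, because the names alone determine which prediction files to read. A small sketch of that file-name grouping, using the `cross_study_<name>_split_<n>.csv` pattern from the hunk, is given below. The function `cross_study_files` is a hypothetical helper that lists file names without reading anything from disk.

```python
from collections import defaultdict


def cross_study_files(dataset_names: list[str], n_cv_splits: int) -> dict[str, list[str]]:
    """Group the expected prediction file names by cross-study dataset name."""
    files: dict[str, list[str]] = defaultdict(list)
    for split in range(n_cv_splits):
        for name in dataset_names:
            # Same naming pattern as in consolidate_single_drug_model_predictions.
            files[name].append(f"cross_study_{name}_split_{split}.csv")
    return dict(files)


print(cross_study_files(["TOYv2"], n_cv_splits=2))
# {'TOYv2': ['cross_study_TOYv2_split_0.csv', 'cross_study_TOYv2_split_1.csv']}
```

Passing plain strings also means the caller does not have to keep full dataset objects alive just to build file names.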

drevalpy/models/DIPK/dipk.py

Lines changed: 21 additions & 4 deletions
@@ -20,7 +20,7 @@

 from drevalpy.datasets.dataset import DrugResponseDataset, FeatureDataset
 from drevalpy.models.drp_model import DRPModel
-from drevalpy.models.utils import load_and_reduce_gene_features
+from drevalpy.models.utils import load_and_select_gene_features

 from .data_utils import CollateFn, DIPKDataset, get_data, load_bionic_features
 from .gene_expression_encoder import GeneExpressionEncoder, encode_gene_expression, train_gene_expession_autoencoder
@@ -263,13 +263,28 @@ def predict(
         :param cell_line_input: input data associated with the cell line
         :param drug_input: input data associated with the drug
         :return: predicted response values
-        :raises ValueError: if drug_input is None or if the model is not initialized
+        :raises ValueError: if drug_input is None or if the model is not initialized or
+            if the gene expression encoder is not initialized
         """
         if drug_input is None:
             raise ValueError("DIPK model requires drug features.")
         if not isinstance(self.model, Predictor):
             raise ValueError("DIPK model not initialized.")

+        # Encode gene expression data if this has not been done yet (e.g., for cross-study predictions)
+        if self.gene_expression_encoder is None:
+            raise ValueError("Gene expression encoder is not initialized.")
+        random_cell_line = next(iter(cell_line_input.features.keys()))
+        if (
+            len(cell_line_input.features[random_cell_line]["gene_expression"])
+            != self.gene_expression_encoder.latent_dim
+        ):
+            print("Encoding gene expression data for cross study prediction")
+            cell_line_input.apply(
+                lambda x: encode_gene_expression(x, self.gene_expression_encoder),  # type: ignore[arg-type]
+                view="gene_expression",
+            )  # type: ignore[arg-type]
+
         # Load data
         collate = CollateFn(train=False)
         test_samples = get_data(
@@ -310,9 +325,11 @@ def load_cell_line_features(self, data_path: str, dataset_name: str) -> FeatureD
         :param dataset_name: path to the dataset
         :returns: cell line features
         """
-        gene_expression = load_and_reduce_gene_features(
+        # we use the intersection of all genes that are present
+        # in the gene expression features of all datasets
+        gene_expression = load_and_select_gene_features(
             feature_type="gene_expression",
-            gene_list=None,
+            gene_list="gene_expression_intersection",
             data_path=data_path,
             dataset_name=dataset_name,
         )
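The new block in `predict` decides whether cross-study features still need encoding by comparing one vector's length with the encoder's latent dimension. The sketch below shows that check in isolation. The chunk-averaging `encode` function is a toy stand-in for the autoencoder, and `ensure_encoded` is a hypothetical name, so this illustrates the control flow only and says nothing about DIPK's actual encoder.

```python
def encode(vector: list[float], latent_dim: int) -> list[float]:
    """Toy stand-in for an encoder: average the input in `latent_dim` equal chunks."""
    chunk = len(vector) // latent_dim
    return [sum(vector[i * chunk:(i + 1) * chunk]) / chunk for i in range(latent_dim)]


def ensure_encoded(features: dict[str, list[float]], latent_dim: int) -> dict[str, list[float]]:
    """Encode every vector unless the first one already has `latent_dim` entries."""
    first = next(iter(features.values()))  # any entry works; all share one length
    if len(first) == latent_dim:
        return features  # already encoded, nothing to do
    return {name: encode(vec, latent_dim) for name, vec in features.items()}


raw = {"cell_a": [1.0, 3.0, 5.0, 7.0], "cell_b": [2.0, 2.0, 4.0, 4.0]}
encoded = ensure_encoded(raw, latent_dim=2)
print(encoded)  # {'cell_a': [2.0, 6.0], 'cell_b': [2.0, 4.0]}

# A second call is a no-op because the length now matches latent_dim.
assert ensure_encoded(encoded, latent_dim=2) is encoded
```

One caveat of a length-based check: if the raw feature count happened to equal the latent dimension, unencoded input would be passed through unchanged.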
