Merged
Changes from 3 commits
Commits
48 commits
1710f72
hashes missing
PascalIversen Mar 3, 2025
59d78e7
fix hashing, the check was wrong
PascalIversen Mar 4, 2025
12c2c9d
why is this test failing
JudithBernett Mar 4, 2025
427a894
updated tests
JudithBernett Mar 4, 2025
f618bdd
multiomics fixed
JudithBernett Mar 4, 2025
02d035c
fix dipk be selecting intercept genes for autoencoder
PascalIversen Mar 4, 2025
3ad6b3a
comment added mainly to rerun tests
PascalIversen Mar 4, 2025
1ca076b
Trigger GitHub Actions rerun
PascalIversen Mar 4, 2025
b55054d
toy data does not have all genes. Either we add all genes or this har…
PascalIversen Mar 5, 2025
fccd54c
Toy_Data is hereby renamed to TOYv1, consistent with the other datase…
PascalIversen Mar 5, 2025
b457741
fix order of genes when selecting
PascalIversen Mar 5, 2025
cd343ac
test order
PascalIversen Mar 5, 2025
585b8c1
dataset: also subsetting meta info when transformed via variancethres…
JudithBernett Mar 5, 2025
34bf7ca
new updates
JudithBernett Mar 5, 2025
3cb11fb
Update usage.rst
JudithBernett Mar 5, 2025
48aea09
Update .gitignore
PascalIversen Mar 5, 2025
79b80ad
Update docs/quickstart.rst
PascalIversen Mar 5, 2025
3c1e0cf
Update drevalpy/datasets/loader.py
PascalIversen Mar 5, 2025
db14550
Update tests/test_available_data.py
PascalIversen Mar 5, 2025
48c28e6
Update tests/test_available_data.py
PascalIversen Mar 5, 2025
805bcd6
Update tests/test_available_data.py
PascalIversen Mar 5, 2025
b1b1ad1
Update tests/test_available_data.py
PascalIversen Mar 5, 2025
2650b6c
replace ctrpv1 with toyv2, toy with toyv1
JudithBernett Mar 5, 2025
caec0b1
Update drevalpy/models/MOLIR/molir.py
JudithBernett Mar 5, 2025
db01ce2
Merge branch 'hash_missing' of github.com:daisybio/drevalpy into hash…
JudithBernett Mar 5, 2025
3109aeb
Merge branch 'DIPK_fix_genes_for_cs' into hash_missing
JudithBernett Mar 5, 2025
ab552bf
Merge branch 'hash_missing' into cross_toy
JudithBernett Mar 5, 2025
a34780d
Merge branch 'hash_missing' into fix_gene_select
JudithBernett Mar 5, 2025
1ab5406
Update drevalpy/models/DIPK/dipk.py
JudithBernett Mar 5, 2025
5eec38e
mypy fixes in the tests and molir, sperfeltr and random fignerprint f…
PascalIversen Mar 5, 2025
b214bef
new tests for new toy datasets
JudithBernett Mar 5, 2025
f9f6096
Merge remote-tracking branch 'origin/cross_toy' into hash_missing
JudithBernett Mar 5, 2025
35a503f
Merge remote-tracking branch 'origin/DIPK_fix_genes_for_cs' into hash…
JudithBernett Mar 5, 2025
e46bb29
fix selection bug. its an array not a df
PascalIversen Mar 5, 2025
f1bc70b
Merge remote-tracking branch 'origin/fix_gene_select' into hash_missing
JudithBernett Mar 5, 2025
895b1db
everything works except for dipk
JudithBernett Mar 5, 2025
b300851
fix dipk cs
PascalIversen Mar 5, 2025
d8e75d4
fixed mypy
JudithBernett Mar 5, 2025
7770e0f
Merge branch 'DIPK_fix_genes_for_cs' of github.com:daisybio/drevalpy …
JudithBernett Mar 5, 2025
81f90b8
Merge branch 'hash_missing' into DIPK_fix_genes_for_cs
JudithBernett Mar 5, 2025
34653cd
new version
JudithBernett Mar 5, 2025
daa18fd
Merge branch 'cross_toy' of github.com:daisybio/drevalpy into cross_toy
JudithBernett Mar 5, 2025
d5f0720
Merge branch 'DIPK_fix_genes_for_cs' into cross_toy
JudithBernett Mar 5, 2025
cc3556c
Merge branch 'fix_gene_select' of github.com:daisybio/drevalpy into f…
JudithBernett Mar 5, 2025
d631d1d
Merge branch 'DIPK_fix_genes_for_cs' into fix_gene_select
JudithBernett Mar 5, 2025
1b04de5
fix tests
PascalIversen Mar 6, 2025
df91e04
renaming
PascalIversen Mar 6, 2025
e05f770
merge
PascalIversen Mar 6, 2025
2 changes: 1 addition & 1 deletion .gitignore
@@ -5,7 +5,7 @@ data/mapping
data/GDSC1
data/GDSC2
data/CCLE
data/Toy_Data
data/TOYv1
data/CTRPv1
data/CTRPv2

4 changes: 2 additions & 2 deletions docs/quickstart.rst
@@ -3,12 +3,12 @@ Quickstart

Make sure you have installed DrEvalPy and its dependencies (see `Installation <./installation.html>`_).

To make sure the pipeline runs, you can use the fast models NaiveDrugMeanPredictor and NaivePredictor on the Toy_Data
To make sure the pipeline runs, you can use the fast models NaiveDrugMeanPredictor and NaivePredictor on the TOYv1
dataset with the LPO test mode.

.. code-block:: bash

python run_suite.py --run_id my_first_run --models NaiveDrugMeanPredictor --baselines NaivePredictor --dataset Toy_Data --test_mode LPO
python run_suite.py --run_id my_first_run --models NaiveDrugMeanPredictor --baselines NaivePredictor --dataset TOYv1 --test_mode LPO

This will train the two baseline models on a subset of gene expression features and drug fingerprint features to
predict IC50 values of the GDSC1 database. It will evaluate in "LPO" which is the leave-pairs-out splitting strategy
15 changes: 11 additions & 4 deletions docs/usage.rst
@@ -156,14 +156,21 @@ We provide commonly used datasets to evaluate your model on (GDSC1, GDSC2, CCLE,
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
| Dataset Name | Number of Drugs | Number of Cell Lines| Description |
+===================+=================+=====================+=======================================================================================================================+
| GDSC1 | 345 | 987 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 1. |
| GDSC1 | 378 | 970 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 1. |
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
| GDSC2 | 192 | 809 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 2. |
| GDSC2 | 287 | 969 | The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 2. |
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
| CCLE | 18 | 471 | The Cancer Cell Line Encyclopedia (CCLE) dataset. The response data will soon be replaced with the data from CTRPv2. |
| CCLE | 24 | 503 | The Cancer Cell Line Encyclopedia (CCLE) dataset. |
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
| Toy_Data | 40 | 98 | A toy dataset for testing purposes. |
| CTRPv1 | 354 | 243 | The Cancer Therapeutics Response Portal (CTRP) dataset version 1. |
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
| CTRPv2 | 546 | 886 | The Cancer Therapeutics Response Portal (CTRP) dataset version 2. |
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
| TOYv1 | 36 | 90 | A toy dataset for testing purposes subsetted from CTRPv2. |
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+
| TOYv2 | 36 | 90 | A second toy dataset for cross study testing purposes. 80 cell lines and 32 drugs overlap with TOYv1. |
+-------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------+


If using the ``--curve_curator`` option with these datasets, the desired measure provided with the ``--measure`` option is appended with "_curvecurator", e.g. "IC50_curvecurator".
In the provided datasets, these are the measures calculated with the same fitting procedure using CurveCurator. To use the measures reported from the original publications of the
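
Reviewer note: the naming rule in the usage.rst paragraph above (the measure gets "_curvecurator" appended when ``--curve_curator`` is used) can be sketched as below. This is a minimal illustration, not drevalpy's implementation: `resolve_measure` is a hypothetical helper, and only the suffix itself comes from the docs.

```python
def resolve_measure(measure: str, curve_curator: bool) -> str:
    """Return the response column name to read (hypothetical helper)."""
    suffix = "_curvecurator"  # suffix taken from the usage docs
    # Append the suffix only once, and only when CurveCurator measures are requested.
    if curve_curator and not measure.endswith(suffix):
        return measure + suffix
    return measure


print(resolve_measure("IC50", curve_curator=True))
print(resolve_measure("IC50", curve_curator=False))
```

The `endswith` guard is an assumption added here so that passing an already suffixed name does not double the suffix.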
49 changes: 38 additions & 11 deletions drevalpy/datasets/loader.py
@@ -23,7 +23,7 @@ def load_gdsc1(

:param path_data: Path to the dataset.
:param file_name: File name of the dataset.
:param measure: The name of the column containing the measure to predict, default = "LN_IC50"
:param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

:param dataset_name: Name of the dataset.
:return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
@@ -49,7 +49,7 @@ def load_gdsc2(path_data: str = "data", measure: str = "LN_IC50_curvecurator", f

:param path_data: Path to the dataset.
:param file_name: File name of the dataset.
:param measure: The name of the column containing the measure to predict, default = "LN_IC50"
:param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

:return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
"""
@@ -64,7 +64,7 @@ def load_ccle(

:param path_data: Path to the dataset.
:param file_name: File name of the dataset.
:param measure: The name of the column containing the measure to predict, default = "LN_IC50"
:param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

:return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
"""
@@ -84,17 +84,19 @@ def load_ccle(
)


def load_toy(path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
def _load_toy(
path_data: str = "data", measure: str = "LN_IC50_curvecurator", dataset_name="TOYv1"
) -> DrugResponseDataset:
"""
Loads small Toy dataset, subsampled from GDSC1.
Loads small Toy dataset, subsampled from CTRPv2 or GDSC2.

:param path_data: Path to the dataset.
:param measure: The name of the column containing the measure to predict, default = "response"
:param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"
:param dataset_name: Name of the dataset. Either "TOYv1" or "TOYv2".

:return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
"""
dataset_name = "Toy_Data"
path = os.path.join(path_data, dataset_name, "toy_data.csv")
path = os.path.join(path_data, dataset_name, f"{dataset_name}.csv")
if not os.path.exists(path):
download_dataset(dataset_name, path_data, redownload=True)
response_data = pd.read_csv(path, dtype={"pubchem_id": str})
@@ -107,13 +109,37 @@ def load_toy(path_data: str = "data", measure: str = "LN_IC50_curvecurator") ->
)


def load_toyv1(path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
"""
Loads small Toy dataset, subsampled from CTRPv2.

:param path_data: Path to the dataset.
:param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

:return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
"""
return _load_toy(path_data, measure, "TOYv1")


def load_toyv2(path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
"""
Loads small Toy dataset, subsampled from GDSC2. Can be used to test cross study prediction.

:param path_data: Path to the dataset.
:param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

:return: DrugResponseDataset containing response, cell line IDs, and drug IDs.
"""
return _load_toy(path_data, measure, "TOYv2")


def _load_ctrpv(version: str, path_data: str = "data", measure: str = "LN_IC50_curvecurator") -> DrugResponseDataset:
"""
Load a CTRP dataset of the given version.

:param version: The version of the CTRP dataset to load.
:param path_data: Path to location of the CTRP dataset
:param measure: The name of the column containing the measure to predict, default = "response"
:param measure: The name of the column containing the measure to predict, default = "LN_IC50_curvecurator"

:return: DrugResponseDataset containing response, cell line IDs, and drug IDs
"""
@@ -171,7 +197,8 @@ def load_custom(path_data: str | Path, measure: str = "response") -> DrugRespons
"GDSC1": load_gdsc1,
"GDSC2": load_gdsc2,
"CCLE": load_ccle,
"Toy_Data": load_toy,
"TOYv1": load_toyv1,
"TOYv2": load_toyv2,
"CTRPv1": load_ctrpv1,
"CTRPv2": load_ctrpv2,
}
@@ -184,7 +211,7 @@ def load_dataset(
"""
Load a dataset based on the dataset name.

:param dataset_name: The name of the dataset to load. Can be one of ('GDSC1', 'GDSC2', 'CCLE', or 'Toy_Data')
:param dataset_name: The name of the dataset to load. Can be one of ('GDSC1', 'GDSC2', 'CCLE', 'CTRPv1',
'CTRPv2', 'TOYv1', 'TOYv2')
to download provided datasets, or any other name to allow for custom datasets.
:param path_data: The parent path in which custom or downloaded datasets should be located, or in which raw
viability data is to be found for fitting with CurveCurator (see param curve_curator for details).
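
Reviewer note: the loader.py change above replaces the single `load_toy` with a private `_load_toy` plus thin public wrappers that are registered by name. A minimal sketch of that dispatch pattern follows. It is a simplification under stated assumptions: the loaders here return the CSV path as a stand-in for a `DrugResponseDataset`, nothing is downloaded, and the registry name `AVAILABLE` is illustrative. In the real `load_dataset`, unknown names fall through to custom datasets instead of raising.

```python
import os
from typing import Callable


def _load_toy(path_data: str = "data", dataset_name: str = "TOYv1") -> str:
    """Shared implementation; returns the CSV path as a stand-in for the dataset."""
    if dataset_name not in ("TOYv1", "TOYv2"):
        raise ValueError(f"Unknown toy dataset: {dataset_name}")
    # Same layout as in the diff: <path_data>/<dataset_name>/<dataset_name>.csv
    return os.path.join(path_data, dataset_name, f"{dataset_name}.csv")


def load_toyv1(path_data: str = "data") -> str:
    return _load_toy(path_data, "TOYv1")


def load_toyv2(path_data: str = "data") -> str:
    return _load_toy(path_data, "TOYv2")


# Name -> loader registry (illustrative name; stands in for the dict in loader.py).
AVAILABLE: dict[str, Callable[..., str]] = {
    "TOYv1": load_toyv1,
    "TOYv2": load_toyv2,
}


def load_dataset(dataset_name: str, path_data: str = "data") -> str:
    loader = AVAILABLE.get(dataset_name)
    if loader is None:
        # Simplification: the real function treats other names as custom datasets.
        raise KeyError(f"No loader registered for {dataset_name}")
    return loader(path_data=path_data)


print(load_dataset("TOYv2", path_data="data"))
```

Keeping the shared logic in one private function means a third toy dataset would only need a new wrapper and a new registry entry.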
2 changes: 1 addition & 1 deletion drevalpy/datasets/utils.py
@@ -21,7 +21,7 @@ def download_dataset(
"""
Download the latest dataset from Zenodo.

:param dataset_name: dataset name, e.g., "GDSC1", "GDSC2", "CCLE" or "Toy_Data"
:param dataset_name: dataset name, from "GDSC1", "GDSC2", "CCLE", "CTRPv1", "CTRPv2", "TOYv1", "TOYv2"
:param data_path: where to save the data
:param redownload: whether to redownload the data
:raises HTTPError: if the download fails
9 changes: 6 additions & 3 deletions drevalpy/models/SuperFELTR/hyperparameters.yaml
@@ -11,21 +11,24 @@ SuperFELTR:
expression_var_threshold:
GDSC1: 0.1
GDSC2: 0.1
Toy_Data: 0.03
TOYv1: 0.03
TOYv2: 0.03
CCLE: 0.1
CTRPv1: 0.1
CTRPv2: 0.1
mutation_var_threshold:
GDSC1: 0.1
GDSC2: 0.1
Toy_Data: 0.05
TOYv1: 0.05
TOYv2: 0.05
CCLE: 0.1
CTRPv1: 0.1
CTRPv2: 0.1
cnv_var_threshold:
GDSC1: 0.7
GDSC2: 0.7
Toy_Data: 0.6
TOYv1: 0.6
TOYv2: 0.6
CCLE: 0.7
CTRPv1: 0.7
CTRPv2: 0.7
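
Reviewer note: the YAML above keys each variance threshold by dataset name, so every new dataset (here TOYv1 and TOYv2) needs its own entry. The sketch below shows one way such a lookup could drive feature selection while keeping the gene names in sync with the reduced vectors, as commit 585b8c1 describes for the meta info. It is an assumption-laden illustration in plain Python, not SuperFELTR's code: `select_by_variance` and the data layout are hypothetical, and the strict `>` comparison on population variance is a choice made here.

```python
from statistics import pvariance

# Subset of the expression thresholds from the YAML above.
EXPRESSION_VAR_THRESHOLD = {"GDSC1": 0.1, "TOYv1": 0.03, "TOYv2": 0.03}


def select_by_variance(
    features: dict[str, list[float]], gene_names: list[str], dataset_name: str
) -> tuple[dict[str, list[float]], list[str]]:
    """Drop low-variance genes; `features` maps cell line -> one value per gene."""
    threshold = EXPRESSION_VAR_THRESHOLD[dataset_name]
    columns = list(zip(*features.values()))  # one tuple of values per gene
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = {cl: [vec[i] for i in keep] for cl, vec in features.items()}
    kept_names = [gene_names[i] for i in keep]  # subset the meta info as well
    return reduced, kept_names


features = {
    "CL1": [1.0, 5.0, 0.0],
    "CL2": [1.1, 3.0, 0.0],
    "CL3": [0.9, 1.0, 0.0],
}
reduced, kept = select_by_variance(features, ["A", "B", "C"], "TOYv1")
print(kept, reduced)
```

A dataset name missing from the mapping raises `KeyError` here, which is the failure this part of the PR avoids by adding the TOYv1 and TOYv2 entries.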
5 changes: 3 additions & 2 deletions drevalpy/utils.py
@@ -341,8 +341,9 @@ def get_datasets(
"""
Load the response data and cross-study datasets.

:param dataset_name: The name of the dataset to load. Can be one of ('GDSC1', 'GDSC2', 'CCLE', or 'Toy_Data')
to download provided datasets, or any other name to allow for custom datasets.
:param dataset_name: The name of the dataset to load. Can be one of ('GDSC1', 'GDSC2', 'CCLE', 'CTRPv1',
'CTRPv2', 'TOYv1', 'TOYv2')
to download provided datasets, or any other name to use a custom dataset.
:param cross_study_datasets: list of cross-study datasets. CurveCurator is not applicable to these. If you wish
to provide custom cross_study_datasets, you have to invoke curve fitting manually using
drevalpy.datasets.curvecurator.fit_curves
12 changes: 6 additions & 6 deletions tests/individual_models/conftest.py
@@ -3,7 +3,7 @@
import pytest

from drevalpy.datasets.dataset import DrugResponseDataset, FeatureDataset
from drevalpy.datasets.loader import load_toy
from drevalpy.datasets.loader import load_toyv1
from drevalpy.models.utils import (
get_multiomics_feature_dataset,
load_cl_ids_from_csv,
@@ -20,13 +20,13 @@ def sample_dataset() -> tuple[DrugResponseDataset, FeatureDataset, FeatureDatase
:returns: drug_response, cell_line_input, drug_input
"""
path_data = "../data"
drug_response = load_toy(path_data)
drug_response = load_toyv1(path_data)
drug_response.remove_nan_responses()
cell_line_input = get_multiomics_feature_dataset(data_path=path_data, dataset_name="Toy_Data", gene_lists=None)
cell_line_ids = load_cl_ids_from_csv(path=path_data, dataset_name="Toy_Data")
cell_line_input = get_multiomics_feature_dataset(data_path=path_data, dataset_name="TOYv1", gene_lists=None)
cell_line_ids = load_cl_ids_from_csv(path=path_data, dataset_name="TOYv1")
cell_line_input.add_features(cell_line_ids)
# Load the drug features
drug_ids = load_drug_ids_from_csv(data_path=path_data, dataset_name="Toy_Data")
drug_input = load_drug_fingerprint_features(data_path=path_data, dataset_name="Toy_Data")
drug_ids = load_drug_ids_from_csv(data_path=path_data, dataset_name="TOYv1")
drug_input = load_drug_fingerprint_features(data_path=path_data, dataset_name="TOYv1")
drug_input.add_features(drug_ids)
return drug_response, cell_line_input, drug_input
4 changes: 2 additions & 2 deletions tests/individual_models/test_literature_models.py
@@ -123,8 +123,8 @@ def test_dipk(
hpam_combi["epochs"] = 1
hpam_combi["epochs_autoencoder"] = 1
model.build_model(hpam_combi)
drug_input = model.load_drug_features(data_path="../data", dataset_name="Toy_Data") # type: ignore
cell_line_input = model.load_cell_line_features(data_path="../data", dataset_name="Toy_Data")
drug_input = model.load_drug_features(data_path="../data", dataset_name="TOYv1") # type: ignore
cell_line_input = model.load_cell_line_features(data_path="../data", dataset_name="TOYv1")

cell_lines_to_keep = cell_line_input.identifiers
drugs_to_keep = drug_input.identifiers
17 changes: 12 additions & 5 deletions tests/test_available_data.py
@@ -10,7 +10,7 @@ def test_factory() -> None:
assert "GDSC1" in AVAILABLE_DATASETS
assert "GDSC2" in AVAILABLE_DATASETS
assert "CCLE" in AVAILABLE_DATASETS
assert "Toy_Data" in AVAILABLE_DATASETS
assert "TOYv1" in AVAILABLE_DATASETS
assert "CTRPv1" in AVAILABLE_DATASETS
assert "CTRPv2" in AVAILABLE_DATASETS
assert len(AVAILABLE_DATASETS) == 6
@@ -51,8 +51,15 @@ def test_ctrpv2():
assert len(ctrpv2) == 395024


def test_toy_data():
"""Test the Toy_Data dataset."""
def test_toyv1():
"""Test the TOYv1 dataset."""
tempdir = tempfile.TemporaryDirectory()
toy_data = AVAILABLE_DATASETS["Toy_Data"](path_data=tempdir.name)
assert len(toy_data) == 3426
toyv1 = AVAILABLE_DATASETS["TOYv1"](path_data=tempdir.name)
assert len(toyv1) == 2680


def test_toyv2():
"""Test the TOYv2 dataset."""
tempdir = tempfile.TemporaryDirectory()
toyv2 = AVAILABLE_DATASETS["TOYv2"](path_data=tempdir.name)
assert len(toyv2) == 2837
17 changes: 17 additions & 0 deletions tests/test_drp_model.py
@@ -147,6 +147,23 @@ def test_load_and_reduce_gene_features(gene_list: Optional[str]) -> None:
assert "The following genes are missing from the dataset GDSC1_small" in str(valerr.value)


def test_order_load_and_reduce_gene_features() -> None:
"""Test that the order of the features is maintained after loading and reducing gene features."""
# TODO move to cross study tests where TOYv1 and TOYv2 are available!!!
gene_list = "gene_expression_genes_intersection.csv"
a = load_and_reduce_gene_features("gene_expression", gene_list, "data", "TOYv1")
b = load_and_reduce_gene_features("gene_expression", gene_list, "data", "TOYv2")
# assert the meta info (=gene names) are the same
assert np.all(a.meta_info["gene_expression"] == b.meta_info["gene_expression"])
# assert the shape of the features for a random cell line is actually the same
random_cell_line_a = np.random.choice(a.identifiers)
random_cell_line_b = np.random.choice(b.identifiers)
assert (
a.features[random_cell_line_a]["gene_expression"].shape
== b.features[random_cell_line_b]["gene_expression"].shape
)


def test_iterate_features() -> None:
"""Test the iteration over features."""
df = pd.DataFrame({"GeneA": [1, 2, 3, 2], "GeneB": [4, 5, 6, 2], "GeneC": [7, 8, 9, 2]})
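
Reviewer note: `test_order_load_and_reduce_gene_features` above checks that TOYv1 and TOYv2 end up with the same gene order after reduction, which is what cross-study prediction needs. One way to guarantee this is to select columns in the order of the gene list rather than in each dataset's own column order. The sketch below illustrates that idea only: `reduce_to_gene_list` is a hypothetical stand-in, not `load_and_reduce_gene_features`, and the gene names are made up for the example.

```python
def reduce_to_gene_list(
    values: list[float], dataset_genes: list[str], gene_list: list[str]
) -> tuple[list[float], list[str]]:
    """Select `gene_list` genes from one cell line's vector, in gene-list order."""
    index = {gene: i for i, gene in enumerate(dataset_genes)}
    missing = [g for g in gene_list if g not in index]
    if missing:
        raise ValueError(f"The following genes are missing from the dataset: {missing}")
    # Iterate over gene_list (not dataset_genes) so the output order is dataset-independent.
    return [values[index[g]] for g in gene_list], list(gene_list)


gene_list = ["EGFR", "TP53"]
a_vals, a_genes = reduce_to_gene_list([1.0, 2.0, 3.0], ["TP53", "BRCA1", "EGFR"], gene_list)
b_vals, b_genes = reduce_to_gene_list([9.0, 8.0], ["EGFR", "TP53"], gene_list)
print(a_genes == b_genes, a_vals, b_vals)
```

Because the output order follows `gene_list`, two datasets with different column layouts yield feature vectors whose positions refer to the same genes.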
4 changes: 2 additions & 2 deletions tests/test_run_suite.py
@@ -15,7 +15,7 @@
[
{
"run_id": "test_run",
"dataset_name": "Toy_Data",
"dataset_name": "TOYv1",
"models": ["NaiveCellLineMeanPredictor"],
"baselines": ["NaiveDrugMeanPredictor"],
"test_mode": ["LPO"],
@@ -53,7 +53,7 @@ def test_run_suite(args):
evaluation_results_per_drug,
evaluation_results_per_cell_line,
true_vs_pred,
) = parse_results(path_to_results=os.path.join(temp_dir.name, args.run_id), dataset="Toy_Data")
) = parse_results(path_to_results=os.path.join(temp_dir.name, args.run_id), dataset="TOYv1")

(
evaluation_results,