restructured data loading, included PDX_Bruna, updated documentation #307

JudithBernett · 2025-11-13T13:49:48Z

PR Checklist for all PRs

This comment contains a description of changes (with reason)
Referenced issue is linked
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated. If you've created a new file, add it to the API documentation pages.

Changes

Run tests: set fail-fast: true to save GitHub Actions resources
Restructured loader.py to reduce duplicated code
Gene lists are now in the meta directory of Zenodo to avoid duplication. Had to adapt the loaders to always download the meta/ directory and adapt load_and_select_gene_features to load the gene lists from meta/.
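A minimal sketch of what reading a gene list from the shared meta/ directory might look like. The helper name read_gene_list, the meta/gene_lists/<name>.csv layout, and the Symbol column are assumptions made for illustration. This is not the actual implementation of load_and_select_gene_features.

```python
import csv
from pathlib import Path


def read_gene_list(data_root: str, gene_list_name: str) -> list[str]:
    """Return gene symbols from <data_root>/meta/gene_lists/<gene_list_name>.csv.

    Hypothetical helper: the file layout and the 'Symbol' column are assumptions.
    """
    path = Path(data_root) / "meta" / "gene_lists" / f"{gene_list_name}.csv"
    if not path.is_file():
        # Fail early with the full path so a missing meta/ download is easy to spot.
        raise FileNotFoundError(f"Gene list not found: {path}")
    with path.open(newline="") as handle:
        return [row["Symbol"] for row in csv.DictReader(handle)]
```

Because every dataset resolves gene lists from the same meta/ location, the lists only need to be stored once rather than once per dataset.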

Bug fixes

Added pydantic as a dependency until the next ray release because of ray-project/ray#58354 ("Add pydantic to Ray Tune requirements"): ray.tune automatically imports ray.train, which requires pydantic.

New features

PDX_Bruna and BeatAML2 dataset added

Maintenance

Updated cachecontrol and protobuf

Run 'poetry update' to get the latest package versions. This will update the poetry.lock file.
Run 'poetry export --without-hashes --without development -f requirements.txt -o requirements.txt' to update the requirements.txt file.

PascalIversen

I walked through the full set of changes across the nine modified files, and the restructuring you’ve done is a clear improvement both in maintainability and clarity. The introduction of the new _load_zenodo_dataset parent loader is especially welcome. Consolidating the duplicated dataset-loading logic into a single pathway dramatically reduces code redundancy and the risk of dataset-specific drift in future updates. It also makes the semantics of “dataset families” much clearer, since GDSC1, GDSC2, CCLE, CTRPv1, CTRPv2, TOYv1, TOYv2, BeatAML2, and now PDX_Bruna can all flow through a unified mechanism.
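The idea of one shared loading path with thin per-dataset entry points could be sketched roughly as follows. All names here (load_generic, make_loader, LOADERS, the downloader callback, and the <dataset>/<dataset>.csv file layout) are hypothetical and only illustrate the pattern. They are not the real _load_zenodo_dataset signature.

```python
from pathlib import Path
from typing import Callable

# Dataset names as listed in this PR.
DATASET_NAMES = (
    "GDSC1", "GDSC2", "CCLE", "CTRPv1", "CTRPv2",
    "TOYv1", "TOYv2", "BeatAML2", "PDX_Bruna",
)


def load_generic(
    dataset_name: str, data_root: str, downloader: Callable[[str, str], None]
) -> Path:
    """Shared path: fetch the dataset folder and meta/ if missing, return the response file path."""
    root = Path(data_root)
    # meta/ is always required, since the gene lists live there.
    for folder in (dataset_name, "meta"):
        if not (root / folder).exists():
            downloader(folder, data_root)
    # Assumed file layout: <root>/<dataset>/<dataset>.csv
    return root / dataset_name / f"{dataset_name}.csv"


def make_loader(dataset_name: str) -> Callable[[str, Callable[[str, str], None]], Path]:
    """Build a thin per-dataset wrapper around the shared loader."""
    def loader(data_root: str, downloader: Callable[[str, str], None]) -> Path:
        return load_generic(dataset_name, data_root, downloader)
    return loader


# Registry mapping each dataset name to its wrapper.
LOADERS = {name: make_loader(name) for name in DATASET_NAMES}
```

With this shape, adding a dataset means adding one name to the registry instead of copying a loader function.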

The naming is consistent and the docstrings reflect the new architecture accurately. I also appreciate the correction of several previously misaligned parameters — some of those inherited defaults (e.g., file_name vs. dataset_name) were confusing before, and the new form is cleaner and more predictable. The switch to explicitly setting dtypes including both pubchem_id and cell_line_name is a good catch and avoids downstream string-handling inconsistencies.
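To illustrate why explicit dtypes matter for identifier columns, here is a small sketch with toy data. The column names pubchem_id and cell_line_name come from the PR. The response column and all values are made up for the example, and the snippet assumes pandas is installed.

```python
import io

import pandas as pd

# Toy CSV: identifiers that happen to look numeric.
raw = "pubchem_id,cell_line_name,response\n0123,5637,0.42\n4567,697,0.77\n"

# Without dtypes, pandas infers integers and drops the leading zero.
inferred = pd.read_csv(io.StringIO(raw))

# With explicit dtypes, both identifier columns stay strings.
explicit = pd.read_csv(
    io.StringIO(raw),
    dtype={"pubchem_id": str, "cell_line_name": str},
)

print(inferred["pubchem_id"].tolist())
print(explicit["pubchem_id"].tolist())
```

Keeping identifiers as strings avoids mismatches when they are later compared or joined against string-typed feature tables.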

The PDX_Bruna integration looks very solid. The dataset behaves slightly differently than the others in that its “cell line” dimension corresponds to mouse passages, and your handling of that (including the tissue override and the distinctions in the documentation table) makes the semantics explicit. The shallow mutation & methylation notes are also correctly captured. Having this dataset available is going to be very helpful for anyone doing cross-study comparisons with more clinically proximate data.

The documentation updates are extensive but warranted. The big tables with DRP curve counts, drug counts, and cell-line/patient/passages counts are accurate and much easier to scan. The addition of BeatAML2 and PDX_Bruna to both the high-level table and the detailed subsection is consistent, clearly written, and matches the known sources (BeatAML2 website, Bruna figshare, etc.). Also nice to see the clarification of which omics are shared vs. selectively available across datasets — that’s been a recurring source of confusion for new users.

In models/utils.py, the fix to gene list loading (moving everything into the meta/gene_lists directory) aligns with the new folder structure and with how we already treat other meta-level omics resources. This was a common stumbling point in end-to-end workflows, so the correction is very welcome.

The CI change (switching the workflow to fail-fast: true) is a reasonable choice now that our test suite is more stable, and it will save quite a bit of wasted compute time on GitHub runners. The adjustments to requirements (cachecontrol bump, extras updates, pydantic optional flag) all look consistent with the lockfile diff.

Tests were updated properly:
• factory test now includes PDX_Bruna
• dataset count incremented to 9
• explicit checks for BeatAML2 and PDX_Bruna row counts
• gene list path changes reflected in model tests

The fact that all tests pass under these structural changes is a strong sign that the refactor was done carefully.

Overall, this PR substantially improves the clarity of the dataset-loading pipeline, brings in a valuable new dataset, corrects lingering path inconsistencies, and refreshes the documentation in a way that will immediately help users. Everything is well-structured, internally consistent, and aligns perfectly with the long-term direction of the data layer.

Fantastic work — merging this is an easy decision.

JudithBernett · 2025-11-14T10:25:33Z

pls stop using AI for the review messages 😭

JudithBernett added 7 commits November 13, 2025 14:38

restructured data loading, included PDX_Bruna, updated documentation

7d23f92

Merge branch 'development' of github.com:daisybio/drevalpy into devel…

2852ae8

…opment

Merge branch 'development' into pdx_beataml

d24bb62

adapt: gene lists are now located in meta

995708f

fix: added tissue, comment raytune out for the moment

14096d0

added pydantic because of ray-project/ray#58354 until new ray version

47befe3

forfot to add pydantic to the project dependencies

edb124f

JudithBernett requested a review from PascalIversen November 14, 2025 10:12

PascalIversen approved these changes Nov 14, 2025

JudithBernett merged commit 1e6e713 into development Nov 14, 2025
30 checks passed

JudithBernett deleted the pdx_beataml branch November 14, 2025 10:25

JudithBernett mentioned this pull request Nov 20, 2025

v1.4.0 #264

Merged

9 tasks
