Skip to content

Conversation

@JudithBernett
Copy link
Contributor

@JudithBernett JudithBernett commented Nov 13, 2025

PR Checklist for all PRs

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated. If you've created a new file, add it to the API documentation pages.

Changes

  • Run tests: fail-fast: true to save some GitHub action resources
  • Restructured loader.py to reduce duplicated code
  • Gene lists are now in the meta directory of Zenodo to avoid duplication. Had to adapt the loaders to always download the meta/ directory and adapt load_and_select_gene_features to load the gene lists from meta/.

Bug fixes

New features

  • PDX_Bruna and BeatAML2 dataset added

Maintenance

  • Updated cachecontrol and protobuf
  • Run ‚poetry update‘ to get the latest package versions. This will update the poetry.lock file
  • Run ‚poetry export --without-hashes --without development -f requirements.txt -o requirements.txt‘ to update the requirements.txt file

Copy link
Collaborator

@PascalIversen PascalIversen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I walked through the full set of changes across the nine modified files, and the restructuring you’ve done is a clear improvement both in maintainability and clarity. The introduction of the new _load_zenodo_dataset parent loader is especially welcome. Consolidating the duplicated dataset-loading logic into a single pathway dramatically reduces code redundancy and the risk of dataset-specific drift in future updates. It also makes the semantics of “dataset families” much clearer, since GDSC1, GDSC2, CCLE, CTRPv1, CTRPv2, TOYv1, TOYv2, BeatAML2, and now PDX_Bruna can all flow through a unified mechanism.

The naming is consistent and the docstrings reflect the new architecture accurately. I also appreciate the correction of several previously misaligned parameters — some of those inherited defaults (e.g., file_name vs. dataset_name) were confusing before, and the new form is cleaner and more predictable. The switch to explicitly setting dtypes including both pubchem_id and cell_line_name is a good catch and avoids downstream string-handling inconsistencies.

The PDX_Bruna integration looks very solid. The dataset behaves slightly differently than the others in that its “cell line” dimension corresponds to mouse passages, and your handling of that (including the tissue override and the distinctions in the documentation table) makes the semantics explicit. The shallow mutation & methylation notes are also correctly captured. Having this dataset available is going to be very helpful for anyone doing cross-study comparisons with more clinically proximate data.

The documentation updates are extensive but warranted. The big tables with DRP curve counts, drug counts, and cell-line/patient/passages counts are accurate and much easier to scan. The addition of BeatAML2 and PDX_Bruna to both the high-level table and the detailed subsection is consistent, clearly written, and matches the known sources (BeatAML2 website, Bruna figshare, etc.). Also nice to see the clarification of which omics are shared vs. selectively available across datasets — that’s been a recurring source of confusion for new users.

In models/utils.py, the fix to gene list loading (moving everything into the meta/gene_lists directory) aligns with the new folder structure and with how we already treat other meta-level omics resources. This was a common stumbling point in end-to-end workflows, so the correction is very welcome.

The CI change (switching the workflow to fail-fast: true) is a reasonable choice now that our test suite is more stable, and it will save quite a bit of wasted compute time on GitHub runners. The adjustments to requirements (cachecontrol bump, extras updates, pydantic optional flag) all look consistent with the lockfile diff.

Tests were updated properly:
• factory test now includes PDX_Bruna
• dataset count incremented to 9
• explicit checks for BeatAML2 and PDX_Bruna row counts
• gene list path changes reflected in model tests

The fact that all tests pass under these structural changes is a strong sign that the refactor was done carefully.

Overall, this PR substantially improves the clarity of the dataset-loading pipeline, brings in a valuable new dataset, corrects lingering path inconsistencies, and refreshes the documentation in a way that will immediately help users. Everything is well-structured, internally consistent, and aligns perfectly with the long-term direction of the data layer.

Fantastic work — merging this is an easy decision.

@JudithBernett
Copy link
Contributor Author

pls stop using AI for the review messages 😭

@JudithBernett JudithBernett merged commit 1e6e713 into development Nov 14, 2025
30 checks passed
@JudithBernett JudithBernett deleted the pdx_beataml branch November 14, 2025 10:25
@JudithBernett JudithBernett mentioned this pull request Nov 20, 2025
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants