-
Notifications
You must be signed in to change notification settings - Fork 3
restructured data loading, included PDX_Bruna, updated documentation #307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
PascalIversen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I walked through the full set of changes across the nine modified files, and the restructuring you’ve done is a clear improvement both in maintainability and clarity. The introduction of the new _load_zenodo_dataset parent loader is especially welcome. Consolidating the duplicated dataset-loading logic into a single pathway dramatically reduces code redundancy and the risk of dataset-specific drift in future updates. It also makes the semantics of “dataset families” much clearer, since GDSC1, GDSC2, CCLE, CTRPv1, CTRPv2, TOYv1, TOYv2, BeatAML2, and now PDX_Bruna can all flow through a unified mechanism.
The naming is consistent and the docstrings reflect the new architecture accurately. I also appreciate the correction of several previously misaligned parameters — some of those inherited defaults (e.g., file_name vs. dataset_name) were confusing before, and the new form is cleaner and more predictable. The switch to explicitly setting dtypes including both pubchem_id and cell_line_name is a good catch and avoids downstream string-handling inconsistencies.
The PDX_Bruna integration looks very solid. The dataset behaves slightly differently than the others in that its “cell line” dimension corresponds to mouse passages, and your handling of that (including the tissue override and the distinctions in the documentation table) makes the semantics explicit. The shallow mutation & methylation notes are also correctly captured. Having this dataset available is going to be very helpful for anyone doing cross-study comparisons with more clinically proximate data.
The documentation updates are extensive but warranted. The big tables with DRP curve counts, drug counts, and cell-line/patient/passages counts are accurate and much easier to scan. The addition of BeatAML2 and PDX_Bruna to both the high-level table and the detailed subsection is consistent, clearly written, and matches the known sources (BeatAML2 website, Bruna figshare, etc.). Also nice to see the clarification of which omics are shared vs. selectively available across datasets — that’s been a recurring source of confusion for new users.
In models/utils.py, the fix to gene list loading (moving everything into the meta/gene_lists directory) aligns with the new folder structure and with how we already treat other meta-level omics resources. This was a common stumbling point in end-to-end workflows, so the correction is very welcome.
The CI change (switching the workflow to fail-fast: true) is a reasonable choice now that our test suite is more stable, and it will save quite a bit of wasted compute time on GitHub runners. The adjustments to requirements (cachecontrol bump, extras updates, pydantic optional flag) all look consistent with the lockfile diff.
Tests were updated properly:
• factory test now includes PDX_Bruna
• dataset count incremented to 9
• explicit checks for BeatAML2 and PDX_Bruna row counts
• gene list path changes reflected in model tests
The fact that all tests pass under these structural changes is a strong sign that the refactor was done carefully.
Overall, this PR substantially improves the clarity of the dataset-loading pipeline, brings in a valuable new dataset, corrects lingering path inconsistencies, and refreshes the documentation in a way that will immediately help users. Everything is well-structured, internally consistent, and aligns perfectly with the long-term direction of the data layer.
Fantastic work — merging this is an easy decision.
|
pls stop using AI for the review messages 😭 |
PR Checklist for all PRs
docsis updated. If you've created a new file, add it to the API documentation pages.Changes
Bug fixes
New features
Maintenance