Add cohort downsampling support to PCA and update tests #803
base: master
Conversation
- Introduced `cohorts`, `cohort_size`, `min_cohort_size`, and `max_cohort_size` parameters in the PCA method.
- Updated the PCA docstring to reflect the new parameters.
- Added example usage for cohort downsampling in the notebook.
- Implemented tests for cohort downsampling functionality, including validation of parameter combinations.
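For context, the kind of per-cohort downsampling this PR describes can be sketched in plain Python. This is a hypothetical, self-contained illustration of the general technique (group samples by cohort, drop cohorts below a minimum size, randomly subsample cohorts above a maximum size), not the actual implementation in `pca.py`; the function name `downsample_cohorts` is invented for this sketch.

```python
import random


def downsample_cohorts(cohort_by_sample, min_cohort_size=None, max_cohort_size=None, seed=42):
    """Return sorted sample indices after per-cohort downsampling.

    Cohorts smaller than min_cohort_size are dropped entirely; cohorts
    larger than max_cohort_size are randomly subsampled down to it.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    cohorts = {}
    for idx, cohort in enumerate(cohort_by_sample):
        cohorts.setdefault(cohort, []).append(idx)
    selected = []
    for indices in cohorts.values():
        if min_cohort_size is not None and len(indices) < min_cohort_size:
            continue  # drop undersized cohorts
        if max_cohort_size is not None and len(indices) > max_cohort_size:
            indices = rng.sample(indices, max_cohort_size)
        selected.extend(indices)
    return sorted(selected)


# Cohort "B" (size 2) is dropped; cohort "C" is downsampled from 10 to 6.
labels = ["A"] * 5 + ["B"] * 2 + ["C"] * 10
kept = downsample_cohorts(labels, min_cohort_size=3, max_cohort_size=6)
```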
Pull Request Overview
This PR extends the PCA method to support cohort-based downsampling and updates related tests and documentation.
- Add `cohorts`, `cohort_size`, `min_cohort_size`, and `max_cohort_size` parameters to `pca` and implement downsampling logic.
- Update tests to cover cohort downsampling behavior and invalid parameter combinations.
- Include a notebook example demonstrating PCA with cohort downsampling.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/anoph/test_pca.py` | Add `test_pca_cohort_downsampling` to verify downsampling and error cases for new parameters. |
| `notebooks/plot_pca.ipynb` | Insert a markdown section and code cell showing how to run PCA with cohort downsampling. |
| `malariagen_data/anoph/pca.py` | Extend `pca` signature and docstring, implement cohort downsampling, adjust result unpacking, and update `_pca` signature to accept extra kwargs. |
Comments suppressed due to low confidence (3)
malariagen_data/anoph/pca.py:46

The docstring notes the new `cohorts` parameter but does not document the `cohort_size` and `min_cohort_size` parameters in the parameter list; please update the method's argument documentation to include all new parameters.

```python
.. versionchanged:: 9.0.0
```
malariagen_data/anoph/pca.py:223

[nitpick] Introducing `**kwargs` into `_pca` can silently swallow unexpected keyword arguments and make the API less explicit; consider defining explicit parameters or forwarding only intended arguments.

```python
**kwargs,
```
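The trade-off Copilot is pointing at can be shown with a small standalone sketch (the `_pca_*` names here are illustrative, not the library's functions): with explicit parameters, a misspelled keyword fails fast; with a `**kwargs` catch-all, the typo is silently accepted and the default is used instead.

```python
def _pca_explicit(data, n_components=2):
    # Only known keywords are accepted; typos raise TypeError immediately.
    return f"PCA({n_components} components on {len(data)} samples)"


def _pca_kwargs(data, n_components=2, **kwargs):
    # Unknown keyword arguments are silently accepted and ignored.
    return f"PCA({n_components} components on {len(data)} samples)"


data = list(range(10))

# A typo in a keyword is caught immediately with explicit parameters...
try:
    _pca_explicit(data, n_compnents=3)  # misspelled on purpose
except TypeError as exc:
    caught = str(exc)

# ...but silently ignored with **kwargs: the default n_components=2 is used.
result = _pca_kwargs(data, n_compnents=3)
```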
tests/anoph/test_pca.py:297

The `random` module is used here but not imported in this file; this will raise a NameError unless `import random` is added at the top.

```python
sample_sets = random.sample(all_sample_sets, 2)
```
```python
@parametrize_with_cases("fixture,api", cases=".")
def test_pca_cohort_downsampling(fixture, api: AnophelesPca):
    # Parameters for selecting input data.
    all_sample_sets = api.sample_sets()["sample_set"].to_list()
```
[nitpick] Using `random.sample` and `random.choice` introduces non-determinism into the test, which can lead to flaky failures; consider using fixed test inputs or seeding the random module at the start of the test.

Suggested change:

```python
all_sample_sets = api.sample_sets()["sample_set"].to_list()
random.seed(random_seed)  # Seed the random module for deterministic sampling
```
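The effect of seeding is easy to demonstrate in isolation. The sample-set names below are illustrative placeholders; the point is only that after `random.seed(...)` with the same value, `random.sample` returns the same selection on every run, so a failure becomes reproducible.

```python
import random

all_sample_sets = ["AG1000G-AO", "AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-CD"]

# Unseeded, each test run may pick a different pair, so a failure is hard
# to reproduce. Seeding at the start of the test makes the selection
# deterministic for a given seed value:
random.seed(42)
first = random.sample(all_sample_sets, 2)

random.seed(42)
second = random.sample(all_sample_sets, 2)
```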
Just to note that these "flaky failures" seem to be a design decision, rather than an oversight, in order to cover a greater range of variables than would be covered by fixed test inputs.
For what it's worth, I do find it a little annoying when these random test failures occur on PRs that aren't modifying anything related to the code involved in the failing tests. The natural response is often to log the unrelated issue separately and then re-run the tests in the hope of a green light, which doesn't seem ideal.
```python
sample_query_options: Optional[base_params.sample_query_options] = None,
sample_indices: Optional[base_params.sample_indices] = None,
cohorts: Optional[base_params.cohorts] = None,
cohort_size: Optional[base_params.cohort_size] = None,
```
[nitpick] The `cohort_size` and `min_cohort_size` parameters are accepted but not implemented beyond raising errors; consider either implementing their logic or removing them to avoid confusion.
Just to note that Copilot's nitpick above does not appear to be true, because `cohort_size` and `min_cohort_size` are either used to retrieve the cached results (which are based on certain params, including these) or they are passed to the `_pca` function, i.e.
```python
params = dict(
    region=region_prepped,
    n_snps=n_snps,
    thin_offset=thin_offset,
    sample_sets=sample_sets_prepped,
    sample_indices=sample_indices_prepped,
    site_mask=site_mask_prepped,
    site_class=site_class,
    min_minor_ac=min_minor_ac,
    max_missing_an=max_missing_an,
    n_components=n_components,
    cohorts=cohorts,
    cohort_size=cohort_size,
    min_cohort_size=min_cohort_size,
    max_cohort_size=max_cohort_size,
    exclude_samples=exclude_samples,
    fit_exclude_samples=fit_exclude_samples,
    random_seed=random_seed,
)
# Try to retrieve results from the cache.
try:
    results = self.results_cache_get(name=name, params=params)
except CacheMiss:
    results = self._pca(chunks=chunks, inline_array=inline_array, **params)
    self.results_cache_set(name=name, params=params, results=results)
```
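The cache-get / compute-on-miss / cache-set pattern in that snippet can be sketched generically. The names below (`CacheMiss`, `ResultsCache`, `expensive_pca`) are hypothetical stand-ins mirroring the shape of the snippet, not the library's actual cache implementation: because the params dict is part of the cache key, parameters such as `cohort_size` affect which cached result is retrieved even when the cached branch never calls the computation.

```python
import json


class CacheMiss(Exception):
    pass


class ResultsCache:
    """A dict-backed results cache keyed by (name, params)."""

    def __init__(self):
        self._store = {}

    def _key(self, name, params):
        # Serialize params deterministically so equal dicts map to one key.
        return (name, json.dumps(params, sort_keys=True))

    def get(self, name, params):
        try:
            return self._store[self._key(name, params)]
        except KeyError:
            raise CacheMiss(name)

    def set(self, name, params, results):
        self._store[self._key(name, params)] = results


def expensive_pca(**params):
    # Stand-in for the real computation.
    return {"n_components": params.get("n_components", 2)}


cache = ResultsCache()
params = {"region": "3L", "n_snps": 1000, "n_components": 3}
try:
    results = cache.get(name="pca", params=params)
except CacheMiss:
    results = expensive_pca(**params)
    cache.set(name="pca", params=params, results=results)
# A second lookup with identical params now hits the cache.
```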
Yes, thanks @leehart. We decided on Friday that Copilot was less nitpicking and more plain old wrong ;).
LGTM
This is the first time I've seen version change notes being added to the docs. Should we try to keep these in the release notes rather than the codebase, to avoid accumulating clutter and allow easy modification? Putting this in might set a precedent and encourage future changes to be noted in this way.
Is this also the first time we've added example code usage to the inline docs?
Again, this might set a precedent so might need some higher-level design or policy decision. Unless that's already been discussed?
You are right @leehart. I don't think we have discussed it and it is more of a high-level decision. I think we should keep the status quo (until we decide otherwise) where the changes are listed when a new version of the package is released and the example code can be found in the notebooks.
@DeepakSilaych, would you mind removing these comments (at least for now)?
Looks like this is just waiting for the requested changes and then a re-review.