Add cohort downsampling support to PCA and update tests #803
base: master
Conversation
- Introduced `cohorts`, `cohort_size`, `min_cohort_size`, and `max_cohort_size` parameters in the PCA method.
- Updated the PCA docstring to reflect the new parameters.
- Added example usage for cohort downsampling in the notebook.
- Implemented tests for cohort downsampling functionality, including validation of parameter combinations.
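For context, the kind of per-cohort downsampling this PR describes can be sketched in plain Python. This is a hypothetical, self-contained illustration of the general technique (group samples by cohort, drop cohorts below a minimum size, randomly subsample cohorts above a maximum size), not the actual implementation in `pca.py`; the function name `downsample_cohorts` is invented for this sketch.

```python
import random


def downsample_cohorts(cohort_by_sample, min_cohort_size=None, max_cohort_size=None, seed=42):
    """Return sorted sample indices after per-cohort downsampling.

    Cohorts smaller than min_cohort_size are dropped entirely; cohorts
    larger than max_cohort_size are randomly subsampled down to it.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    cohorts = {}
    for idx, cohort in enumerate(cohort_by_sample):
        cohorts.setdefault(cohort, []).append(idx)
    selected = []
    for indices in cohorts.values():
        if min_cohort_size is not None and len(indices) < min_cohort_size:
            continue  # drop undersized cohorts
        if max_cohort_size is not None and len(indices) > max_cohort_size:
            indices = rng.sample(indices, max_cohort_size)
        selected.extend(indices)
    return sorted(selected)


# Cohort "B" (size 2) is dropped; cohort "C" is downsampled from 10 to 6.
labels = ["A"] * 5 + ["B"] * 2 + ["C"] * 10
kept = downsample_cohorts(labels, min_cohort_size=3, max_cohort_size=6)
```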
Pull Request Overview
This PR extends the PCA method to support cohort-based downsampling and updates related tests and documentation.
- Add `cohorts`, `cohort_size`, `min_cohort_size`, and `max_cohort_size` parameters to `pca` and implement downsampling logic.
- Update tests to cover cohort downsampling behavior and invalid parameter combinations.
- Include a notebook example demonstrating PCA with cohort downsampling.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/anoph/test_pca.py` | Add `test_pca_cohort_downsampling` to verify downsampling and error cases for new parameters. |
| `notebooks/plot_pca.ipynb` | Insert a markdown section and code cell showing how to run PCA with cohort downsampling. |
| `malariagen_data/anoph/pca.py` | Extend `pca` signature and docstring, implement cohort downsampling, adjust result unpacking, and update `_pca` signature to accept extra kwargs. |
Comments suppressed due to low confidence (3)
malariagen_data/anoph/pca.py:46

The docstring notes the new `cohorts` parameter but does not document the `cohort_size` and `min_cohort_size` parameters in the parameter list; please update the method's argument documentation to include all new parameters.

```python
.. versionchanged:: 9.0.0
```
malariagen_data/anoph/pca.py:223

[nitpick] Introducing `**kwargs` into `_pca` can silently swallow unexpected keyword arguments and make the API less explicit; consider defining explicit parameters or forwarding only intended arguments.

```python
**kwargs,
```
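The trade-off Copilot is pointing at can be shown with a small standalone sketch (the `_pca_*` names here are illustrative, not the library's functions): with explicit parameters, a misspelled keyword fails fast; with a `**kwargs` catch-all, the typo is silently accepted and the default is used instead.

```python
def _pca_explicit(data, n_components=2):
    # Only known keywords are accepted; typos raise TypeError immediately.
    return f"PCA({n_components} components on {len(data)} samples)"


def _pca_kwargs(data, n_components=2, **kwargs):
    # Unknown keyword arguments are silently accepted and ignored.
    return f"PCA({n_components} components on {len(data)} samples)"


data = list(range(10))

# A typo in a keyword is caught immediately with explicit parameters...
try:
    _pca_explicit(data, n_compnents=3)  # misspelled on purpose
except TypeError as exc:
    caught = str(exc)

# ...but silently ignored with **kwargs: the default n_components=2 is used.
result = _pca_kwargs(data, n_compnents=3)
```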
tests/anoph/test_pca.py:297

The `random` module is used here but not imported in this file; this will raise a NameError unless `import random` is added at the top.

```python
sample_sets = random.sample(all_sample_sets, 2)
```
```python
@parametrize_with_cases("fixture,api", cases=".")
def test_pca_cohort_downsampling(fixture, api: AnophelesPca):
    # Parameters for selecting input data.
    all_sample_sets = api.sample_sets()["sample_set"].to_list()
```
[nitpick] Using `random.sample` and `random.choice` introduces non-determinism into the test, which can lead to flaky failures; consider using fixed test inputs or seeding the random module at the start of the test.

Suggested change:

```python
all_sample_sets = api.sample_sets()["sample_set"].to_list()
random.seed(random_seed)  # Seed the random module for deterministic sampling
```
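The effect of seeding is easy to demonstrate in isolation. The sample-set names below are illustrative placeholders; the point is only that after `random.seed(...)` with the same value, `random.sample` returns the same selection on every run, so a failure becomes reproducible.

```python
import random

all_sample_sets = ["AG1000G-AO", "AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-CD"]

# Unseeded, each test run may pick a different pair, so a failure is hard
# to reproduce. Seeding at the start of the test makes the selection
# deterministic for a given seed value:
random.seed(42)
first = random.sample(all_sample_sets, 2)

random.seed(42)
second = random.sample(all_sample_sets, 2)
```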
Just to note that these "flaky failures" seem to be a design decision, rather than an oversight, in order to cover a greater range of variables than would be covered by fixed test inputs.
For what it's worth, I do find it a little annoying when these random test failures occur on PRs that aren't modifying anything related to the code involved in the failing tests. The natural response is often to log the unrelated issue separately and then re-run the tests in the hope of a green light, which doesn't seem ideal.
```python
sample_query_options: Optional[base_params.sample_query_options] = None,
sample_indices: Optional[base_params.sample_indices] = None,
cohorts: Optional[base_params.cohorts] = None,
cohort_size: Optional[base_params.cohort_size] = None,
```
[nitpick] The `cohort_size` and `min_cohort_size` parameters are accepted but not implemented beyond raising errors; consider either implementing their logic or removing them to avoid confusion.
Just to note that Copilot's nitpick above does not appear to be true, because `cohort_size` and `min_cohort_size` are either used to retrieve the cached results (which are based on certain params, including these) or they are passed to the `_pca` function, i.e.
```python
params = dict(
    region=region_prepped,
    n_snps=n_snps,
    thin_offset=thin_offset,
    sample_sets=sample_sets_prepped,
    sample_indices=sample_indices_prepped,
    site_mask=site_mask_prepped,
    site_class=site_class,
    min_minor_ac=min_minor_ac,
    max_missing_an=max_missing_an,
    n_components=n_components,
    cohorts=cohorts,
    cohort_size=cohort_size,
    min_cohort_size=min_cohort_size,
    max_cohort_size=max_cohort_size,
    exclude_samples=exclude_samples,
    fit_exclude_samples=fit_exclude_samples,
    random_seed=random_seed,
)
# Try to retrieve results from the cache.
try:
    results = self.results_cache_get(name=name, params=params)
except CacheMiss:
    results = self._pca(chunks=chunks, inline_array=inline_array, **params)
    self.results_cache_set(name=name, params=params, results=results)
```
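The cache-get / compute-on-miss / cache-set pattern in that snippet can be sketched generically. The names below (`CacheMiss`, `ResultsCache`, `expensive_pca`) are hypothetical stand-ins mirroring the shape of the snippet, not the library's actual cache implementation: because the params dict is part of the cache key, parameters such as `cohort_size` affect which cached result is retrieved even when the cached branch never calls the computation.

```python
import json


class CacheMiss(Exception):
    pass


class ResultsCache:
    """A dict-backed results cache keyed by (name, params)."""

    def __init__(self):
        self._store = {}

    def _key(self, name, params):
        # Serialize params deterministically so equal dicts map to one key.
        return (name, json.dumps(params, sort_keys=True))

    def get(self, name, params):
        try:
            return self._store[self._key(name, params)]
        except KeyError:
            raise CacheMiss(name)

    def set(self, name, params, results):
        self._store[self._key(name, params)] = results


def expensive_pca(**params):
    # Stand-in for the real computation.
    return {"n_components": params.get("n_components", 2)}


cache = ResultsCache()
params = {"region": "3L", "n_snps": 1000, "n_components": 3}
try:
    results = cache.get(name="pca", params=params)
except CacheMiss:
    results = expensive_pca(**params)
    cache.set(name="pca", params=params, results=results)
# A second lookup with identical params now hits the cache.
```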
Yes, thanks @leehart. We decided on Friday that Copilot was less nitpicking and more plain old wrong ;).
LGTM
This is the first time I've seen version change notes being added to the docs. Should we try to keep these in the release notes rather than the codebase, to avoid accumulating clutter and allow easy modification? Putting this in might set a precedent and encourage future changes to be noted in this way.
Is this also the first time we've added example code usage to the inline docs?
Again, this might set a precedent so might need some higher-level design or policy decision. Unless that's already been discussed?
You are right @leehart. I don't think we have discussed it and it is more of a high-level decision. I think we should keep the status quo (until we decide otherwise) where the changes are listed when a new version of the package is released and the example code can be found in the notebooks.
@DeepakSilaych, would you mind removing these comments (at least for now)?
Looks like this is just waiting for the requested changes and then a re-review.