Skip to content

Conversation

DeepakSilaych
Copy link
Collaborator

  • Introduced cohorts, cohort_size, min_cohort_size, and max_cohort_size parameters in the PCA method.
  • Updated PCA docstring to reflect new parameters.
  • Added example usage for cohort downsampling in the notebook.
  • Implemented tests for cohort downsampling functionality, including validation of parameter combinations.

- Introduced `cohorts`, `cohort_size`, `min_cohort_size`, and `max_cohort_size` parameters in the PCA method.
- Updated PCA docstring to reflect new parameters.
- Added example usage for cohort downsampling in the notebook.
- Implemented tests for cohort downsampling functionality, including validation of parameter combinations.
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR extends the PCA method to support cohort-based downsampling and updates related tests and documentation.

  • Add cohorts, cohort_size, min_cohort_size, and max_cohort_size parameters to pca and implement downsampling logic.
  • Update tests to cover cohort downsampling behavior and invalid parameter combinations.
  • Include a notebook example demonstrating PCA with cohort downsampling.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/anoph/test_pca.py Add test_pca_cohort_downsampling to verify downsampling and error cases for new parameters.
notebooks/plot_pca.ipynb Insert a markdown section and code cell showing how to run PCA with cohort downsampling.
malariagen_data/anoph/pca.py Extend pca signature and docstring, implement cohort downsampling, adjust result unpacking, and update _pca signature to accept extra kwargs.
Comments suppressed due to low confidence (3)

malariagen_data/anoph/pca.py:46

  • The docstring notes the new cohorts parameter but does not document the cohort_size and min_cohort_size parameters in the parameter list; please update the method’s argument documentation to include all new parameters.
            .. versionchanged:: 9.0.0

malariagen_data/anoph/pca.py:223

  • [nitpick] Introducing **kwargs into _pca can silently swallow unexpected keyword arguments and make the API less explicit; consider defining explicit parameters or forwarding only intended arguments.
        **kwargs,

tests/anoph/test_pca.py:297

  • The random module is used here but not imported in this file; this will raise a NameError unless import random is added at the top.
    sample_sets = random.sample(all_sample_sets, 2)

@parametrize_with_cases("fixture,api", cases=".")
def test_pca_cohort_downsampling(fixture, api: AnophelesPca):
# Parameters for selecting input data.
all_sample_sets = api.sample_sets()["sample_set"].to_list()
Copy link
Preview

Copilot AI Jun 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using random.sample and random.choice introduces non-determinism into the test, which can lead to flaky failures; consider using fixed test inputs or seeding the random module at the start of the test.

Suggested change
all_sample_sets = api.sample_sets()["sample_set"].to_list()
all_sample_sets = api.sample_sets()["sample_set"].to_list()
random.seed(random_seed) # Seed the random module for deterministic sampling

Copilot uses AI. Check for mistakes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note that these "flaky failures" seem to be a design decision, rather than an oversight, in order to cover a greater range of variables than would be covered by fixed test inputs.

For what it's worth, I do find it a little annoying when these random tests failures occur on PRs that aren't modifying anything related to the code involved in the random test failures. The natural response is often to log the unrelated issue separately and then to try re-running the tests in hope of green light, which doesn't seem ideal.

sample_query_options: Optional[base_params.sample_query_options] = None,
sample_indices: Optional[base_params.sample_indices] = None,
cohorts: Optional[base_params.cohorts] = None,
cohort_size: Optional[base_params.cohort_size] = None,
Copy link
Preview

Copilot AI Jun 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The cohort_size and min_cohort_size parameters are accepted but not implemented beyond raising errors; consider either implementing their logic or removing them to avoid confusion.

Copilot uses AI. Check for mistakes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note that Copilot's nitpick above does not appear to be true because cohort_size and min_cohort_size are either used to retrieve the cached results (which are based on certain params, including these) or they are passed to the _pca function, i.e.

        params = dict(
            region=region_prepped,
            n_snps=n_snps,
            thin_offset=thin_offset,
            sample_sets=sample_sets_prepped,
            sample_indices=sample_indices_prepped,
            site_mask=site_mask_prepped,
            site_class=site_class,
            min_minor_ac=min_minor_ac,
            max_missing_an=max_missing_an,
            n_components=n_components,
            cohorts=cohorts,
            cohort_size=cohort_size,
            min_cohort_size=min_cohort_size,
            max_cohort_size=max_cohort_size,
            exclude_samples=exclude_samples,
            fit_exclude_samples=fit_exclude_samples,
            random_seed=random_seed,
        )

        # Try to retrieve results from the cache.
        try:
            results = self.results_cache_get(name=name, params=params)

        except CacheMiss:
            results = self._pca(chunks=chunks, inline_array=inline_array, **params)
            self.results_cache_set(name=name, params=params, results=results)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks @leehart. We decided on Friday that Copilot was less nitpicking and more plain old wrong ;).

Copy link
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@leehart
Copy link
Collaborator

leehart commented Jun 30, 2025

This is the first time I've seen version change notes being added to the docs. Should we try to keep these in the release notes rather than the codebase, to avoid accumulating clutter and allow easy modification? Putting this in might set a precedent and encourage future changes to be noted in this way.

From pca.py in this PR:

            .. versionchanged:: 9.0.0
               The `cohorts` parameter has been added to enable cohort-based
               downsampling via the `max_cohort_size` parameter.

Is this also the first time we've added example code usage to the inline docs?

        examples="""
            Run a PCA, downsampling to a maximum of 20 samples per country::
                >>> import malariagen_data
                >>> ag3 = malariagen_data.Ag3()
                >>> df_pca, evr = ag3.pca(
                ...     region="3R",
                ...     n_snps=1000,
                ...     cohorts="country",
                ...     max_cohort_size=20,
                ... )
        """,

Again, this might set a precedent so might need some higher-level design or policy decision. Unless that's already been discussed?

Copy link
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right @leehart. I don't think we have discussed it and it is more of a high-level decision. I think we should keep the status quo (until we decide otherwise) where the changes are listed when a new version of the package is released and the example code can be found in the notebooks.

@DeepakSilaych, would you mind removing these comments (at least for now)?

@leehart
Copy link
Collaborator

leehart commented Aug 8, 2025

Looks like this is just waiting for the requested changes and then a re-review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants