Add cohort downsampling support to PCA and update tests #803
```diff
@@ -288,3 +288,82 @@ def test_pca_fit_exclude_samples(fixture, api: AnophelesPca):
         len(pca_df.query(f"sample_id in {exclude_samples} and not pca_fit"))
         == n_samples_excluded
     )
+
+
+@parametrize_with_cases("fixture,api", cases=".")
+def test_pca_cohort_downsampling(fixture, api: AnophelesPca):
+    # Parameters for selecting input data.
+    all_sample_sets = api.sample_sets()["sample_set"].to_list()
```
**Copilot (review comment):** [nitpick] Using randomly selected inputs here makes the test non-deterministic and can lead to flaky failures; consider using fixed test inputs instead.

**Reply:** Just to note that these "flaky failures" seem to be a design decision, rather than an oversight, in order to cover a greater range of variables than would be covered by fixed test inputs. For what it's worth, I do find it a little annoying when these random test failures occur on PRs that aren't modifying anything related to the code involved in the random test failures. The natural response is often to log the unrelated issue separately and then to try re-running the tests in the hope of a green light, which doesn't seem ideal.
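As an aside on the trade-off discussed in this thread, a common middle ground (purely illustrative, not something this PR does) is to keep randomized test inputs but pin and report the seed, so that a flaky failure can be reproduced exactly:

```python
import os
import random

# Use a pinned seed when reproducing a failure (via a hypothetical
# TEST_RANDOM_SEED environment variable), otherwise pick a fresh one.
seed = int(os.environ.get("TEST_RANDOM_SEED", random.randrange(2**32)))
random.seed(seed)

# Report the seed so a failing run can be replayed with identical inputs,
# e.g. TEST_RANDOM_SEED=12345 pytest tests/anoph/test_pca.py
print(f"Random seed for test inputs: {seed}")
```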
```diff
+    sample_sets = random.sample(all_sample_sets, 2)
+    data_params = dict(
+        region=random.choice(api.contigs),
+        sample_sets=sample_sets,
+        site_mask=random.choice((None,) + api.site_mask_ids),
+    )
+
+    # Test cohort downsampling.
+    cohort_col = "country"
+    max_cohort_size = 10
+    random_seed = 42
+
+    # Try to run the PCA with cohort downsampling.
+    try:
+        pca_df, pca_evr = api.pca(
+            n_snps=100,  # Use a small number to avoid "Not enough SNPs" errors.
+            n_components=2,
+            cohorts=cohort_col,
+            max_cohort_size=max_cohort_size,
+            random_seed=random_seed,
+            **data_params,
+        )
+    except ValueError as e:
+        if "Not enough SNPs" in str(e):
+            pytest.skip("Not enough SNPs available after downsampling to run test.")
+        else:
+            raise
+
+    # Check types.
+    assert isinstance(pca_df, pd.DataFrame)
+    assert isinstance(pca_evr, np.ndarray)
+
+    # Check basic structure.
+    assert len(pca_df) > 0
+    assert "PC1" in pca_df.columns
+    assert "PC2" in pca_df.columns
+    assert "pca_fit" in pca_df.columns
+    assert pca_df["pca_fit"].all()
+    assert pca_evr.ndim == 1
+    assert pca_evr.shape[0] == 2
+
+    # Check cohort counts.
+    final_cohort_counts = pca_df[cohort_col].value_counts()
+    for cohort, count in final_cohort_counts.items():
+        assert count <= max_cohort_size
+
+    # Test bad parameter combinations.
+    with pytest.raises(ValueError):
+        api.pca(
+            n_snps=100,
+            n_components=2,
+            cohorts=cohort_col,
+            # max_cohort_size is missing.
+            **data_params,
+        )
+    with pytest.raises(ValueError):
+        api.pca(
+            n_snps=100,
+            n_components=2,
+            cohorts=cohort_col,
+            max_cohort_size=max_cohort_size,
+            sample_indices=[0, 1, 2],
+            **data_params,
+        )
+    with pytest.raises(ValueError):
+        api.pca(
+            n_snps=100,
+            n_components=2,
+            cohorts=cohort_col,
+            max_cohort_size=max_cohort_size,
+            cohort_size=10,
+            **data_params,
+        )
```
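For context on what the "Check cohort counts" assertions verify: cohort downsampling caps the number of samples each cohort contributes before the PCA is fitted. A minimal sketch of that behaviour, assuming a pandas sample-metadata frame (the function name and signature here are hypothetical, not malariagen_data's actual implementation):

```python
import pandas as pd


def downsample_cohorts(
    samples: pd.DataFrame,
    cohorts: str,
    max_cohort_size: int,
    random_seed: int = 42,
) -> pd.DataFrame:
    # Group the samples by cohort, then randomly keep at most
    # max_cohort_size rows per cohort; smaller cohorts are kept whole.
    return samples.groupby(cohorts, group_keys=False).apply(
        lambda g: g.sample(n=min(len(g), max_cohort_size), random_state=random_seed)
    )


# With cohorts="country" and max_cohort_size=10, every country ends up
# contributing at most 10 samples, which is what the test's value_counts()
# loop asserts.
```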
**Copilot (review comment):** [nitpick] The `cohort_size` and `min_cohort_size` parameters are accepted but not implemented beyond raising errors; consider either implementing their logic or removing them to avoid confusion.
**Reply (@leehart):** Just to note that Copilot's nitpick above does not appear to be true, because `cohort_size` and `min_cohort_size` are either used to retrieve cached results (which are keyed on certain params, including these) or they are passed to the `_pca` function.
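A minimal sketch of the pattern described above, with hypothetical names (not malariagen_data's actual code): the downsampling parameters form part of the key under which results are cached, and are also forwarded to the internal `_pca` computation.

```python
class PcaCachingSketch:
    """Illustrative only: parameters both key a results cache and are
    forwarded to an internal computation."""

    def __init__(self):
        self._results_cache: dict = {}

    def _pca(self, **params):
        # Stand-in for the real PCA computation.
        return ("pca_df", "pca_evr")

    def pca(
        self,
        *,
        cohort_size=None,
        min_cohort_size=None,
        max_cohort_size=None,
        **params,
    ):
        # cohort_size and min_cohort_size form part of the cache key, so a
        # cached result is reused only for an identical parameter combination...
        key = repr(
            dict(
                cohort_size=cohort_size,
                min_cohort_size=min_cohort_size,
                max_cohort_size=max_cohort_size,
                **params,
            )
        )
        if key not in self._results_cache:
            # ...and they are also passed through to the internal _pca function.
            self._results_cache[key] = self._pca(
                cohort_size=cohort_size,
                min_cohort_size=min_cohort_size,
                max_cohort_size=max_cohort_size,
                **params,
            )
        return self._results_cache[key]
```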
**Reply:** Yes, thanks @leehart. We decided on Friday that Copilot was less nitpicking and more plain old wrong ;).