
Conversation

leehart (Collaborator) commented Feb 7, 2025

Re: issue #716

leehart changed the title from "Support unrestricted_use_only and surveillance_use_only" to "Support unrestricted_use_only and surveillance_use_only constructor params" on Feb 14, 2025
leehart marked this pull request as ready for review on February 20, 2025 15:36
leehart (Collaborator Author) commented Feb 20, 2025

Opening this for a WIP review, to potentially avoid going too far down the wrong path.

@alimanfoo @cclarkson @ahernank @jonbrenas I'm trying to identify other functions that we need to filter results for, according to:

  • unrestricted_use_only, which I've so far applied to sample_sets(), and
  • surveillance_use_only, which I've so far applied to sample_metadata().

Those seem like the main ones, and many other functions draw on them, but I want to make sure we plug all potential leaks.
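
For context, a minimal sketch of how I expect these constructor params to be used from the user's side (the object names and the AG1000G-FR sample set are just for illustration; the filtering behaviour itself is what this PR implements):

import malariagen_data

# Default object: no filtering applied.
ag3_default = malariagen_data.Ag3()

# Filtered objects: the constructor params restrict what the public API returns.
ag3_unrestricted = malariagen_data.Ag3(unrestricted_use_only=True)
ag3_surveillance = malariagen_data.Ag3(surveillance_use_only=True)

# sample_sets() should omit restricted-use sample sets for the unrestricted object...
df_sets = ag3_unrestricted.sample_sets()

# ...and sample_metadata() should omit non-surveillance samples for the surveillance object.
df_samples = ag3_surveillance.sample_metadata(sample_sets="AG1000G-FR")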

leehart (Collaborator Author) commented Feb 25, 2025

  • To do: add _prep_sample_query_param()

leehart marked this pull request as draft on March 18, 2025 09:23
leehart (Collaborator Author) commented Mar 21, 2025

  • To do: check that all public functions honour the unrestricted_use_only and surveillance_use_only constructor params.

leehart (Collaborator Author) commented Apr 8, 2025

It looks like Ag3 currently has about 137 public methods... 🤔

leehart (Collaborator Author) commented Apr 8, 2025

It also looks like about 119 of Ag3's public methods cannot be called without specifying params (i.e. they cannot rely on defaults), which makes the testing of these constructor params somewhat difficult to automate. This also smells a lot like a "god object" anti-pattern: https://en.wikipedia.org/wiki/God_object

We should probably consider re-organising all of those methods, despite the inconvenience, but that will probably have to wait. In the meantime, I hope to be able to figure out which functions are vulnerable to leaking unfiltered data, relating to either the surveillance-only or unrestricted-use-only flags.

leehart (Collaborator Author) commented Apr 8, 2025

@ahernank @jonbrenas Checkmarks indicate that the function gets its data from an upstream public function, a file, or a param, or otherwise looks covered. Unchecked functions indicate some doubt and require further investigation, discussion or coding.

  • aa_allele_frequencies - gets its data from snp_allele_frequencies
  • aa_allele_frequencies_advanced - gets its data from snp_allele_frequencies_advanced
  • add_extra_metadata - checks data from sample_metadata
  • aim_calls - uses _prep_sample_query_param and sample_metadata
  • aim_metadata - gets its data from samples.species_aim.csv
  • aim_variants - gets its data from aim_defs_{analysis}/{aims}.zarr
  • average_fst - uses snp_allele_counts
  • biallelic_diplotype_pairwise_distances - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • biallelic_diplotypes - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • biallelic_snp_calls - uses snp_allele_counts and snp_calls
  • biallelic_snps_to_plink - uses biallelic_snp_calls
  • clear_extra_metadata - is simply self._extra_metadata = []
  • cnv_coverage_calls - TO INVESTIGATE: uses _cnv_coverage_calls_dataset
  • cnv_discordant_read_calls - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_discordant_read_calls_dataset
  • cnv_hmm - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_hmm_dataset
  • cohort_diversity_stats - uses sample_metadata and snp_allele_counts
  • cohorts - gets its data from cohorts_{cohort_set}.csv
  • cohorts_metadata - gets its data from samples.cohorts.csv
  • count_samples - uses sample_metadata
  • cross_metadata - gets its data from crosses.fam
  • diplotype_pairwise_distances - uses _prep_sample_query_param
  • diversity_stats - TO INVESTIGATE: uses _setup_cohort_queries and cohort_diversity_stats not _prep_sample_query_param
  • fst_gwss - uses _prep_sample_query_param
  • g123_calibration - uses _prep_sample_query_param
  • g123_gwss - uses _prep_sample_query_param
  • gene_cnv - TO INVESTIGATE: uses _gene_cnv not _prep_sample_query_param
  • gene_cnv_frequencies - TO INVESTIGATE: uses _gene_cnv_frequencies not _prep_sample_query_param
  • gene_cnv_frequencies_advanced - TO INVESTIGATE: uses _gene_cnv_frequencies_advanced not _prep_sample_query_param
  • general_metadata - TO CONSIDER: gets its data from samples.meta.csv (NOT FILTERED)
  • geneset - uses genome_features
  • genome_feature_children - TO INVESTIGATE: uses _genome_features
  • genome_features - TO INVESTIGATE: uses _genome_features_for_contig and _genome_features
  • genome_sequence - TO INVESTIGATE: uses _genome_sequence_for_contig
  • h12_calibration - uses _prep_sample_query_param
  • h12_gwss - uses _prep_sample_query_param
  • h1x_gwss - uses _prep_sample_query_param
  • haplotype_pairwise_distances - uses _prep_sample_query_param
  • haplotype_sites - uses _haplotype_sites_for_region
  • haplotypes - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _haplotypes_for_contig
  • haplotypes_frequencies - uses sample_metadata and locate_cohorts and haplotypes and haplotype_frequencies
  • haplotypes_frequencies_advanced - uses sample_metadata and haplotypes and haplotype_frequencies
  • igv - uses _igv_config and igv_notebook (LOOKS COVERED)
  • ihs_gwss - TO INVESTIGATE: uses _prep_sample_query_param but also _ihs_gwss
  • is_accessible - uses genome_sequence and snp_sites and _site_filters_for_region
  • karyotype - TO INVESTIGATE: uses load_inversion_tags and snp_calls but not _prep_sample_query_param
  • load_inversion_tags - gets its data from self._inversion_tag_path
  • lookup_release - uses sample_sets
  • lookup_sample - uses sample_metadata
  • lookup_study - uses sample_sets
  • lookup_study_info - uses sample_sets
  • lookup_terms_of_use_info - uses sample_sets
  • njt - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • open_cnv_coverage_calls - gets its data from coverage_calls/{analysis}/zarr
  • open_cnv_discordant_read_calls - gets its data from cnv/{sample_set}/{calls_version}/zarr
  • open_cnv_hmm - gets its data from cnv/{sample_set}/hmm/zarr
  • open_file - simply self._fs.open(f"{self._base_path}/{path}")
  • open_genome - gets its data from {self._base_path}/{self._genome_zarr_path}
  • open_haplotype_sites - gets its data from snp_haplotypes/sites/{analysis}/zarr
  • open_haplotypes - gets its data from snp_haplotypes/{sample_set}/{analysis}/zarr
  • open_site_annotations - gets its data from {self._base_path}/{self._site_annotations_zarr_path}
  • open_site_filters - gets its data from site_filters/{self._site_filters_analysis}/{mask_prepped}/
  • open_snp_genotypes - gets its data from snp_genotypes/all/{sample_set}/
  • open_snp_sites - gets its data from snp_genotypes/all/sites/
  • pairwise_average_fst - TO INVESTIGATE: uses _setup_cohort_queries not _prep_sample_query_param
  • pca - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • plot_aim_heatmap - uses aim_calls and go_make_subplots
  • plot_cnv_hmm_coverage - uses plot_cnv_hmm_coverage_track and plot_genes
  • plot_cnv_hmm_coverage_track - uses cnv_hmm
  • plot_cnv_hmm_heatmap - uses plot_cnv_hmm_heatmap_track and plot_genes
  • plot_cnv_hmm_heatmap_track - uses cnv_hmm
  • plot_diplotype_clustering - uses sample_metadata and diplotype_pairwise_distances
  • plot_diplotype_clustering_advanced - TO INVESTIGATE: uses plot_diplotype_clustering but also _dipclust_het_bar_trace and _dipclust_cnv_bar_trace and _insert_dipclust_snp_trace and _dipclust_concat_subplots
  • plot_diversity_stats - gets its data from df_stats param
  • plot_frequencies_heatmap - gets its data from df param
  • plot_frequencies_interactive_map - gets its data from ds param
  • plot_frequencies_map_markers - gets its data from ds param
  • plot_frequencies_time_series - gets its data from ds param
  • plot_fst_gwss - uses plot_fst_gwss_track and plot_genes
  • plot_fst_gwss_track - uses fst_gwss
  • plot_g123_calibration - uses g123_calibration
  • plot_g123_gwss - uses plot_g123_gwss_track and plot_genes
  • plot_g123_gwss_track - uses g123_gwss
  • plot_genes - TO INVESTIGATE: uses genome_sequence and _plot_genes_setup_data
  • plot_h12_calibration - uses h12_calibration
  • plot_h12_gwss - uses plot_h12_gwss_track and plot_genes
  • plot_h12_gwss_multi_overlay - uses plot_h12_gwss_multi_overlay_track and plot_genes
  • plot_h12_gwss_multi_overlay_track - TO INVESTIGATE: uses h12_gwss and _setup_cohort_queries not _prep_sample_query_param
  • plot_h12_gwss_multi_panel - TO INVESTIGATE: uses plot_h12_gwss_track and plot_genes and _setup_cohort_queries not _prep_sample_query_param
  • plot_h12_gwss_track - uses h12_gwss
  • plot_h1x_gwss - uses plot_h1x_gwss_track and plot_genes
  • plot_h1x_gwss_track - uses h1x_gwss
  • plot_haplotype_clustering - TO INVESTIGATE: uses sample_metadata and haplotype_pairwise_distances but also _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly and plot_dendrogram
  • plot_haplotype_network - uses haplotypes and sample_metadata and median_joining_network and mjn_graph and plotly_discrete_legend (I expect _setup_sample_colors_plotly is OK)
  • plot_heterozygosity - uses plot_heterozygosity_track and plot_genes
  • plot_heterozygosity_track - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track
  • plot_ihs_gwss - uses plot_ihs_gwss_track and plot_genes
  • plot_ihs_gwss_track - uses ihs_gwss
  • plot_njt - uses njt and sample_metadata (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
  • plot_pairwise_average_fst - gets its data from fst_df param
  • plot_pca_coords - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
  • plot_pca_coords_3d - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
  • plot_pca_variance - gets its data from evr param (simple bar plot)
  • plot_roh - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track and _roh_hmm_predict and plot_roh_track and plot_genes
  • plot_roh_track - gets its data from df_roh param
  • plot_sample_location_geo - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_sample_location_mapbox - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_samples_bar - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_samples_interactive_map - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
  • plot_snps - uses plot_snps_track and plot_genes
  • plot_snps_track - uses snp_allele_counts and snp_variants and genome_sequence
  • plot_transcript - uses genome_features and genome_feature_children
  • plot_xpehh_gwss - uses plot_xpehh_gwss_track and plot_genes
  • plot_xpehh_gwss_track - uses xpehh_gwss
  • read_files - gets its data from its paths param
  • results_cache_get - gets its data from results.zarr.zip or results.npz
  • results_cache_set - gets its data from its params and results params and writes params.json and results.zarr.zip
  • roh_hmm - TO INVESTIGATE: uses _sample_count_het and _roh_hmm_predict
  • sample_metadata - uses _prep_sample_query_param and general_metadata and sequence_qc_metadata and surveillance_flags and aim_metadata and cohorts_metadata
  • sample_sets - uses _read_sample_sets and _unrestricted_use_only
  • sequence_qc_metadata - gets its data from sequence_qc_stats.csv
  • site_annotations - uses _site_annotations_raw and snp_sites
  • site_filters - uses _site_filters_for_region
  • snp_allele_counts - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
  • snp_allele_frequencies - uses sample_metadata and snp_calls
  • snp_allele_frequencies_advanced - uses sample_metadata and snp_calls
  • snp_calls - uses _prep_sample_query_param and _snp_calls
  • snp_dataset - uses snp_calls
  • snp_effects - uses snp_variants and _snp_df_melt and _snp_effect_annotator
  • snp_genotype_allele_counts - uses snp_calls
  • snp_genotypes - uses _prep_sample_query_param and _snp_genotypes_for_contig
  • snp_sites - uses _snp_sites_for_region and site_filters
  • snp_variants - uses _snp_variants_for_contig
  • surveillance_flags - gets its data from surveillance.flags.csv
  • view_alignments - uses _igv_view_alignments_tracks and igv
  • wgs_data_catalog - TO CONSIDER: gets its data from wgs_snp_data.csv (NOT FILTERED)
  • wgs_run_accessions - TO CONSIDER: gets its data from wgs_accession_data.csv (NOT FILTERED)
  • xpehh_gwss - uses _prep_sample_query_param and _xpehh_gwss

leehart changed the title from "Support unrestricted_use_only and surveillance_use_only constructor params" to "Add unrestricted_use_only and surveillance_use_only constructor params" on Apr 24, 2025
leehart (Collaborator Author) commented Apr 24, 2025

Now investigating unexpected test failures after merging with master branch.

Local pytest:

446 passed, 38 warnings, 224 errors

CI pytest:

474 passed, 36 warnings, 224 errors

jonbrenas (Collaborator) commented:

Thanks @leehart. For IGV to work, it needed to have access to a 'public' version of the reference genomes, e.g., at gs://vo_anoph_temp_us_central1/vo_afun_release/reference/genome/idAnoFuneDA-416_04. This, in turn, needed to be passed to the AnophelesBase constructor. The tests use simulated data so the link needs to be specified (but None should work).

leehart (Collaborator Author) commented Jun 3, 2025

At this point, I reckon we only need to talk about and resolve:

  1. the cohort_size returned by the cohorts function, which comes directly from cohorts_{cohort_set}.csv; and
  2. the sample_indices parameter, which can be used as a mutually exclusive alternative to the sample_query parameter in about 11 functions, which is presenting a problem.

leehart marked this pull request as ready for review on June 5, 2025 08:10
leehart (Collaborator Author) commented Jun 5, 2025

Submitting this for review while we figure out sample_indices.

leehart (Collaborator Author) commented Jun 5, 2025

  • Testing this on Colab is throwing out a bunch of these ugly warnings, which would be good to resolve:

/usr/local/lib/python3.11/dist-packages/malariagen_data/anoph/sample_metadata.py:780: RuntimeWarning: Engine has switched to 'python' because numexpr does not support extension array dtypes. Please set your engine to python manually.
df_samples = df_samples.query(prepared_sample_query, **sample_query_options)

anoph/sample_metadata.py:780 currently (in this PR) has:

        # Apply the sample_query or sample_indices, if specified.
        if prepared_sample_query is not None:
            # Assume a pandas query string.
            sample_query_options = sample_query_options or {}
            df_samples = df_samples.query(prepared_sample_query, **sample_query_options)
            df_samples = df_samples.reset_index(drop=True)

anoph/base.py:951 currently (in this PR) has:

        # Determine which samples match the sample query.
        if sample_query != "":
            loc_samples = df_samples.eval(sample_query, **sample_query_options)

anoph/sample_metadata.py:1071 currently (in this PR) has:

        if prepared_sample_query is not None:
            # Resolve query to a list of integers for more cache hits - we
            # do this because there are different ways to write the same pandas
            # query, and so it's better to evaluate the query and use a list of
            # integer indices instead.
            df_samples = self.sample_metadata(sample_sets=prepared_sample_sets)
            sample_query_options = sample_query_options or {}
            loc_samples = df_samples.eval(
                prepared_sample_query, **sample_query_options
            ).values
            sample_indices = np.nonzero(loc_samples)[0].tolist()

Basically, it seems we're using Pandas query or eval in a way that's incompatible with the more efficient numexpr engine. I gather this is because the DataFrame in question uses extension array dtypes, which we'd want to avoid if we want to keep using the numexpr engine, rather than specifying the python engine just to quash this warning. We'll need to track down wherever these "extension array dtypes" are being used and see whether they can be avoided, for the sake of performance. If they're unavoidable, then I guess we'll be forced to use the less efficient python engine.
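
For reference, here's a minimal reproduction of the warning outside our codebase (assuming pandas with numexpr installed, which is the default engine for DataFrame.query/eval; the column name is just one of ours used as a toy example):

import pandas as pd

# One of our extension-dtype columns, reduced to a toy example.
df = pd.DataFrame({"mean_cov": pd.array([30.1, None, 28.5], dtype="Float64")})

# Emits: RuntimeWarning: Engine has switched to 'python' because numexpr
# does not support extension array dtypes. Please set your engine to
# python manually.
print(df.query("mean_cov > 29"))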

leehart (Collaborator Author) commented Jun 5, 2025

Since the numexpr engine doesn't support dtypes like Int64, Float64 and even boolean (judging from test_extension_array_eval in https://github.com/pandas-dev/pandas/blob/main/pandas/tests/frame/test_query_eval.py), which we currently use in the sample_metadata DataFrame in order to support missing values, e.g.:

mean_cov                           Float64
median_cov                           Int64
modal_cov                            Int64
mean_cov_2L                        Float64
median_cov_2L                        Int64
mode_cov_2L                          Int64
mean_cov_2R                        Float64
median_cov_2R                        Int64
mode_cov_2R                          Int64
mean_cov_3L                        Float64
median_cov_3L                        Int64
mode_cov_3L                          Int64
mean_cov_3R                        Float64
median_cov_3R                        Int64
mode_cov_3R                          Int64
mean_cov_X                         Float64
median_cov_X                         Int64
mode_cov_X                           Int64
frac_gen_cov                       Float64
divergence                         Float64
contam_pct                         Float64
contam_LLR                         Float64

I'm tempted to force the use of the less efficient python engine, despite this not being recommended by the Pandas docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

You can change the semantics of the expression by passing the keyword argument parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.
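
If we do go down that route, the change would amount to passing engine="python" explicitly to query/eval at the affected call sites, e.g. (a sketch, not committed code):

import pandas as pd

df = pd.DataFrame({"median_cov": pd.array([31, None, 27], dtype="Int64")})

# Explicitly selecting the python engine quashes the RuntimeWarning, at the
# cost of slower evaluation than numexpr.
selected = df.query("median_cov > 28", engine="python")
mask = df.eval("median_cov > 28", engine="python")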

leehart (Collaborator Author) commented Jun 10, 2025

Hi @jonbrenas @ahernank - I've tried to elucidate the problem with sample_indices (for this PR) below. I'm looking into resolving it now, so I'll put this PR back into draft mode.

What is the sample_indices parameter used for?

The sample_indices parameter is available on some functions as a way to select the samples returned by the function, by providing the positional indices of the required samples with reference to the rows returned by the sample_metadata function. It is intended as an alternative to the sample_query parameter, which selects samples based on Pandas query criteria. Some functions prevent the sample_query and sample_indices parameters being used at the same time. Some functions internally convert the results of the sample_query parameter into an equivalent sample_indices parameter.
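
To make that concrete, a sketch of the equivalence I have in mind (illustrative only; assumes a default Ag3 object, and the taxon query is purely an example):

import malariagen_data
import numpy as np

ag3 = malariagen_data.Ag3()
df_samples = ag3.sample_metadata(sample_sets="AG1000G-FR")

# Selecting by query...
by_query = ag3.sample_metadata(
    sample_sets="AG1000G-FR", sample_query="taxon == 'gambiae'"
)

# ...should give the same rows as the equivalent positional indices, which are
# relative to the ordering of the unqueried sample_metadata result.
sample_indices = np.nonzero(df_samples.eval("taxon == 'gambiae'").values)[0].tolist()
by_indices = ag3.sample_metadata(
    sample_sets="AG1000G-FR", sample_indices=sample_indices
)

assert by_query["sample_id"].tolist() == by_indices["sample_id"].tolist()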

Public functions that support the sample_indices param are:

  • biallelic_diplotype_pairwise_distances
  • njt
  • plot_njt
  • pca
  • sample_metadata
  • snp_genotypes
  • snp_calls
  • snp_allele_counts
  • biallelic_snp_calls
  • biallelic_diplotypes
  • biallelic_snps_to_plink

Functions that use _prep_sample_selection_cache_params to convert sample_query to sample_indices:

  • biallelic_diplotype_pairwise_distances
  • njt
  • pca
  • snp_allele_counts
  • biallelic_diplotypes

Functions that use base_params.validate_sample_selection_params to prevent both sample_query and sample_indices:

  • sample_metadata
  • snp_genotypes
  • snp_calls

What is the problem?

One of the mechanisms being used to honour the surveillance_use_only setting is to internally inject is_surveillance == True into the sample_query parameter via the _prep_sample_query_param function. This means that functions originally called with no sample_query parameter gain one, regardless of whether they were also passed a sample_indices parameter. For the sample_metadata function, which uses validate_sample_selection_params to ensure the two selection methods are not used together, using sample_indices with a surveillance-only object will currently (in this PR) result in a silent failure, e.g.

sample_sets = "AG1000G-FR"
sample_indices = [0, 1, 2, 3]
default_sample_metadata_df = ag3_default.sample_metadata(sample_sets=sample_sets, sample_indices=sample_indices)
surveillance_sample_metadata_df = ag3_surveillance.sample_metadata(sample_sets=sample_sets, sample_indices=sample_indices)
print('Number of samples from default object:', len(default_sample_metadata_df))
print('Number of samples from surveillance object:', len(surveillance_sample_metadata_df))
Number of samples from default object: 4
Number of samples from surveillance object: 23

In this case, sample_indices are ignored due to logic in the sample_metadata function:

        if prepared_sample_query is not None:
            # ...
        elif sample_indices is not None:
            # ...

The pca function doesn't use validate_sample_selection_params, but it does use _prep_sample_selection_cache_params, which internally converts any sample_query into sample_indices. So using sample_indices with a surveillance-only object will also currently (in this PR) result in a silent failure, e.g.

sample_indices = list(range(0, 500))
n_snps = 10_000
pca_region = '3L:15,000,000-16,000,000'
ag3_pca_df, evr = ag3_default.pca(region=pca_region, sample_sets=mixed_sample_set, n_snps=n_snps, sample_indices=sample_indices)
ag3_surveillance_pca_df, evr = ag3_surveillance.pca(region=pca_region, sample_sets=mixed_sample_set, n_snps=n_snps, sample_indices=sample_indices)
print('Samples returned by the default object:', len(ag3_pca_df['sample_id'].unique()))
print('Samples returned by the surveillance object:', len(ag3_surveillance_pca_df['sample_id'].unique()))
Samples returned by the default object: 500
Samples returned by the surveillance object: 86

In this case, sample_indices are ignored due to logic in the _prep_sample_selection_cache_params function:

        if prepared_sample_query is not None:
            # ...
            sample_indices = np.nonzero(loc_samples)[0].tolist()

        return prepared_sample_sets, sample_indices

For the functions that use neither validate_sample_selection_params nor _prep_sample_selection_cache_params:

  • plot_njt passes its sample_indices to both njt and sample_metadata, so is reliant on their behaviour.
  • biallelic_snp_calls uses snp_allele_counts and snp_calls, so is reliant on their behaviour.
  • biallelic_snps_to_plink uses biallelic_snp_calls, so is reliant on its behaviour.

What is the proposed solution in principle?

To restore proper support for the sample_indices param for affected functions in this PR, when surveillance_use_only is set:

  • Ensure that the validate_sample_selection_params check is used at the beginning of any function that accepts both the sample_query and sample_indices params. (This doesn't directly solve the problem but tightens up the user interface and protects the assumptions of downstream code.)
  • Ensure that wherever sample_query is not provided by the user (before being "prepared" and hijacked internally) and sample_indices is provided, the sample_indices are still applied appropriately. For instance, if a sample_query starts as None but becomes is_surveillance == True via _prep_sample_query_param, then any sample_indices should be interpreted relative to the results of the is_surveillance == True query, rather than relative to None. In other words, sample_indices should be relative to whatever samples the function returns (or uses, in the case of plots) when the user passes neither sample_query nor sample_indices. There's a sketch of this principle just below.
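
To illustrate the second point with a toy sketch (select_samples, injected_query and the toy DataFrame here are hypothetical, showing the principle rather than the committed code):

from typing import List, Optional

import pandas as pd

def select_samples(
    df_samples: pd.DataFrame,
    injected_query: Optional[str],       # e.g. "is_surveillance == True"
    sample_indices: Optional[List[int]],
) -> pd.DataFrame:
    # Apply any internally injected query first, so the frame matches what the
    # user would see when calling the function with no selection params at all.
    if injected_query is not None:
        df_samples = df_samples.query(injected_query, engine="python")
        df_samples = df_samples.reset_index(drop=True)
    # User-supplied indices are then positional with respect to that
    # already-filtered frame.
    if sample_indices is not None:
        df_samples = df_samples.iloc[sample_indices].reset_index(drop=True)
    return df_samples

# With the surveillance flag injected, indices [0, 1] refer to the first two
# surveillance samples, not the first two samples overall.
df = pd.DataFrame(
    {
        "sample_id": ["a", "b", "c", "d"],
        "is_surveillance": [False, True, True, False],
    }
)
print(select_samples(df, "is_surveillance == True", [0, 1])["sample_id"].tolist())
# ['b', 'c']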

leehart marked this pull request as draft on June 10, 2025 10:07
leehart (Collaborator Author) commented Jun 10, 2025

Plan to amend the behaviour of the following functions, with regard to using the sample_indices param with a surveillance-only object:

  • sample_metadata - fixed the logic to allow sample_indices after prepared_sample_query.
  • pca - uses _prep_sample_selection_cache_params. Fixed the logic to handle sample_indices, which went down to the level of _snp_calls.
  • biallelic_diplotype_pairwise_distances - uses _prep_sample_selection_cache_params. Now also uses validate_sample_selection_params.
  • njt - uses _prep_sample_selection_cache_params. Now also uses validate_sample_selection_params.
  • plot_njt - uses njt and sample_metadata.
  • snp_genotypes - fixed the logic to allow sample_indices.
    Uses validate_sample_selection_params. This was unusually tricky because the snp_genotypes data is an unfiltered Dask array and does not contain a sample_id field. I needed to use both the filtered sample_metadata (to process any sample query) and the unfiltered general_metadata in order to determine the appropriate filter.
  • snp_calls - uses _snp_calls, which needed amends to its logic in a similar way to sample_metadata, except it uses an Xarray Dataset rather than a Pandas DataFrame.
  • snp_allele_counts - uses _prep_sample_selection_cache_params. Now also uses validate_sample_selection_params.
  • biallelic_snp_calls - uses snp_allele_counts and snp_calls. Uses validate_sample_selection_params.
  • biallelic_diplotypes - uses _prep_sample_selection_cache_params. Uses validate_sample_selection_params.
  • biallelic_snps_to_plink - uses biallelic_snp_calls which uses snp_allele_counts and snp_calls

jonbrenas (Collaborator) commented:

Thanks @leehart. It is kind of wild (and also wrong) that we had three different potential behaviours for the interaction between sample_query and sample_indices. I agree that we need to remediate that but I am not sure I agree with the chosen solution.

As far as I am concerned, surveillance_use_only and unrestricted_use_only are queries (and, as you say, that is how they are handled internally), so IMHO going with the middle road isn't working. The easiest solution would be to simply ban the use of sample_indices if any of the "query" options is used, but that might be too restrictive. A better solution would be to end up with a process like what is done for min_cohort_size, where any ID explicitly listed in sample_indices that doesn't make the cut of the sample_query/surveillance_use_only/unrestricted_use_only is listed in a warning (if possible with an explanation of why it failed, but that may be harder to produce).

leehart (Collaborator Author) commented Jun 10, 2025

Thanks @jonbrenas. I think I understand some of your concerns, but I reckon I still need some clarification.

I guess it might boil down to how the users are (or should be) determining the sample_indices to pass into these functions in the first place. One assumption I had was that the full list of sample indices would be aligned with the results of the sample_metadata without any kind of query. So for an object with no surveillance_use_only setting, the full list for a particular sample set would be given by something like:

samples_df = ag3_default.sample_metadata(sample_sets=sample_set)

Then it would seem natural to the user to expect the same behaviour for an object with surveillance_use_only set to True, e.g.

samples_df = ag3_surveillance.sample_metadata(sample_sets=sample_set)

In this context, the user is not explicitly passing any sample_query to the function, regardless of whether we inject is_surveillance == True into sample_query behind the scenes. In other words, as far as the user is concerned, there is no sample_query, so they have no obvious reason to expect that sample_indices cannot or should not be used. (I don't personally like the hijacking of sample_query for this purpose, but that's already water under the bridge.)

However, if the user is getting their sample_indices from some other source of truth, besides sample_metadata, then I can see how things can go wrong. For instance, sample_indices of [0, 1, 2, 3] in the example above is going to return different samples for ag3_default.sample_metadata(sample_sets=sample_set) compared to ag3_surveillance.sample_metadata(sample_sets=sample_set). However, I see this as similar to the misalignment that we might see from sample_indices for ag3_default.sample_metadata(sample_sets=sample_set) not aligning with the same samples from ag3_default.sample_metadata(), for example. That's an explainable mistake or misuse.

As far as I can tell, these sample_indices are meant to be used in a similar context to how df_samples = df_samples.iloc[sample_indices] is used within sample_metadata, and the documentation clearly says:

A list of indices of samples to select, corresponding to the order in which the samples are found within the sample metadata. Either provide this parameter or sample_query, not both.

Banning the use of sample_indices wherever we have surveillance_use_only switched on would mean that we'd effectively have different function signatures depending on constructor params, which seems to confuse matters and deprives the user of previously available features. But it also seems unnecessary, since it looks like we can restore this functionality (which is only temporarily lost in this PR) with a few tweaks to the internal logic, as demonstrated in the updated sample_metadata function in this PR, which I believe can also be achieved for the other affected functions, as detailed above.

I'm not sure I understand the solution regarding min_cohort_size. Sample IDs are not explicitly determined by sample_indices, which is part of the complication here. sample_indices only identify samples with reference to some original full list of samples, represented by df_samples in the df_samples.iloc[sample_indices] example. Going by the existing sample_metadata function and the documentation for the sample_indices param, the user-level expectation seems to be that the full list of samples, upon which sample_indices must be based, ought to come from the "sample metadata", which would naturally seem to be functionally equivalent to (or semantically synonymous with) the sample_metadata function, depending on its parameters (such as sample_sets) or the higher-level constructor params (such as surveillance_use_only).

In any case, it would be good to have a clear understanding and position on this. Ideally, I think we want to preserve existing functionality and expectations, and minimise potential confusion and complexity, while keeping the user experience as unsurprising and intuitive as possible.

I feel like I'm missing something or overlooking something important here!

jonbrenas (Collaborator) commented:

Thanks @leehart.

Sorry, I had misinterpreted the way these parameters were going to be used and, as a result, my comment didn't make much sense. I thought that unrestricted_use_only and surveillance_use_only could be used when the AnophelesBase was defined (which is the case) and/or when a function is called (which is not the case). My objection was only applicable in the second case (e.g., I have run a PCA analysis with 100 samples chosen at random and I want to do it again with the same samples, but only if they are surveillance, in which case I would have asked for surveillance_use_only and sample_indices when calling pca). In that case, one would have to use a convoluted query with Ag3 defined using surveillance_use_only. My bad!

leehart (Collaborator Author) commented Jun 10, 2025

Ah, I see, thanks @jonbrenas . Yeah, these params are only going to be used (AFAIK) during object construction, e.g. ag3_unrestricted_surveillance = malariagen_data.Ag3(unrestricted_use_only=True, surveillance_use_only=True). I hadn't even considered these params being used for individual functions... Interesting / mind-boggling! No worries, I'll continue on this path then, unless there are other objections. Thanks again for your thoughts on this, which made me think more deeply about the problem and consider other angles. 👍

leehart (Collaborator Author) commented Jun 17, 2025

OK, I think I've sorted out the sample_indices support now.

snp_genotypes was a little tricky, and the solution here might be a little fragile since it relies (in this PR) on the unfiltered general_metadata, which we should probably make private using an underscore prefix, along with a bunch of other functions, in a follow-up PR. (Issue #796.)

It seems good to at least have an internal method of getting an unfiltered list of samples, e.g. via _general_metadata, for cases such as this, where we need to align with unfiltered, unlabelled SNP data.

leehart marked this pull request as ready for review on June 17, 2025 09:46
leehart (Collaborator Author) commented Jun 17, 2025

Investigating CI test failures, which didn't fail locally:

=========================== short test summary info ============================
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_period_by[af1_sim-year] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_sample_query[af1_sim] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 18 (11.1%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'amp', 'del',
       'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp'],
      dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'amp', 'amp',
       'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp'],
      dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_min_cohort_size[af1_sim-0] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_min_cohort_size[af1_sim-10] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_nobs_mode[af1_sim-fixed] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
=========== 5 failed, 693 passed, 125 warnings in 857.50s (0:14:17) ============
Error: Process completed with exit code 1.

[ It looks like that was one of those transient random failures, due to simulated test data. Resolved by re-running. ]

leehart (Collaborator Author) commented Jun 20, 2025

@jonbrenas Just to note that this code hasn't been reviewed or merged in yet, so work on the advanced course might need to be provisional, or be revised if something here changes. But I expect the general behaviour will be the same.

jonbrenas (Collaborator) commented:

Thanks @leehart. Yes, the work on the advanced course is going to be provisional; it is also going to be mostly textual (i.e., explaining why some sample sets are restricted, why some samples might be biased in a way that means they are not to be used for surveillance, ...).

I don't remember if we had agreed that we would meet for you to show me what has changed, how and why, or if you thought I could figure it out on my own. If the latter, I'll get to reviewing this PR asap.

leehart (Collaborator Author) commented Jun 20, 2025

Hi @jonbrenas . Maybe take a look and take some notes, and then we can meet to discuss, if you like. There's quite a bit, so we might need to do a few rounds. Or you might think it all looks OK! 🚀

leehart removed the request for review from alimanfoo on August 8, 2025 08:48
leehart (Collaborator Author) commented Aug 8, 2025

@jonbrenas Let's try to meet to go through this PR in September.
