diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst index 10f7421..842eaae 100644 --- a/CONTRIBUTING.rst +++ b/CONTRIBUTING.rst @@ -70,7 +70,7 @@ advanced editor to help you with some of the common steps described below, e.g. $ cd acore/ $ python -m venv .env $ source .env/bin/activate - $ pip install -e .[dev] + $ pip install -e ".[dev]" If you work on a Windows shell, see the docs for instructions: `How venvs work `_ @@ -125,7 +125,7 @@ Before you submit a pull request, check that it meets these guidelines: 3. The pull request should pass the workflows on GitHub. See for example the PR-Template for a module: -`Add module PR template `_. +`Add module PR template `_. diff --git a/docs/api_examples/permutation_testing.ipynb b/docs/api_examples/permutation_testing.ipynb new file mode 100644 index 0000000..cb92f75 --- /dev/null +++ b/docs/api_examples/permutation_testing.ipynb @@ -0,0 +1,262 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e9bdb7c9", + "metadata": {}, + "source": [ + "# Tutorial: Permutation Testing using `acore`\n", + "\n", + "In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).\n", + "\n", + "The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).\n", + "\n", + "For this demo we look at the GO term abundance tables generated by the Mgnify pipeline. The values in the table are the absolute abundance of selected GO terms for each sample, which we then transform to relative abundances and centred-log ratios. \n" + ] + }, + { + "cell_type": "markdown", + "id": "dc9f17b2", + "metadata": {}, + "source": [ + "## Data preparation details\n", + "\n", + "### Downloading\n", + "The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). 
The influent (INF) and effluent (EFF) datasets contain paired samples, so we also needed to download the sample metadata (also available via the MGnify API) to assign the correct pairing.\n", + "\n", + "### Preprocessing of abundances\n", + "- To account for technical variation arising from sequencing technology limitations, we first transform the abundance values so they are relative to the total reads for the sample (i.e., relative abundances). \n", + "- The relative abundances are compositional data (CoDa), so we map them to unconstrained vectors using the centred log-ratio transformation `acore.microbiome.internal_functions.calc_clr()`, so that the assumptions of the frequentist statistics applied downstream are not violated.\n", + "\n", + "### Preprocessing of the metadata \n", + "- The sample metadata needed for this demo (sampling location) were available in each sample's \"sample-desc\" field. \n", + "- The sample-desc for each sample in both INF and EFF was parsed and used to pair the samples.\n", + "\n", + "### Subset of data for demo\n", + "- For this demo we only look at [GO term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001).\n", + "- Antibiotic catabolic processes are expected to be higher in INF than in EFF.\n", + "\n", + "### Saving the demo dataset\n", + "This example subset of the data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below:\n", + "\n", + "| column | description | dtype |\n", + "|-------------------|-------------------------------------------------------------------------------------------------------------------|-------|\n", + "| eff_id | The run ID for the MGnify analysis of the effluent sample. | str |\n", + "| inf_id | The run ID for the MGnify analysis of the influent sample. | str |\n", + "| sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str |\n", + "| sampling_read | The sequencing read and taxonomy annotation (e.g., READ2 Taxonomy ID:256318) for the sample.
| str |\n", + "| eff_abundance | The CLR-transformed relative abundance of the GO term for a given effluent sample after preprocessing | float |\n", + "| inf_abundance | The CLR-transformed relative abundance of the GO term for a given influent sample after preprocessing | float |\n", + "\n", + "-----\n", + "\n", + "We will now proceed with reading in the prepared dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "00d8f038", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
eff_idinf_idsampling_locationsampling_readeff_abundanceinf_abundance
0ERR2985255ERR2814663TGREAD2 Taxonomy ID:2563183.2572834.226819
1ERR2985256ERR2814664MNREAD2 Taxonomy ID:2563182.5728413.847191
2ERR2985257ERR2814651AHREAD1 Taxonomy ID:2563184.2987774.086841
3ERR2985258ERR2814667TEREAD1 Taxonomy ID:2563182.7589823.436752
4ERR2985259ERR2814660FDREAD1 Taxonomy ID:2563183.3646753.486673
\n", + "
" + ], + "text/plain": [ + " eff_id inf_id sampling_location sampling_read \\\n", + "0 ERR2985255 ERR2814663 TG READ2 Taxonomy ID:256318 \n", + "1 ERR2985256 ERR2814664 MN READ2 Taxonomy ID:256318 \n", + "2 ERR2985257 ERR2814651 AH READ1 Taxonomy ID:256318 \n", + "3 ERR2985258 ERR2814667 TE READ1 Taxonomy ID:256318 \n", + "4 ERR2985259 ERR2814660 FD READ1 Taxonomy ID:256318 \n", + "\n", + " eff_abundance inf_abundance \n", + "0 3.257283 4.226819 \n", + "1 2.572841 3.847191 \n", + "2 4.298777 4.086841 \n", + "3 2.758982 3.436752 \n", + "4 3.364675 3.486673 " + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd \n", + "\n", + "df_data = pd.read_csv(\n", + " 'https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv'\n", + ")\n", + "# sanity check \n", + "df_data.head()" + ] + }, + { + "cell_type": "markdown", + "id": "7cf6c3d0", + "metadata": {}, + "source": [ + "## The permutation test\n", + "\n", + "Since these are paird samples we will proceed with paired sample permutation test using `acore.perumutation_test.paired_permutation()`. \n", + "\n", + "The permutation test compares the actual observed chosen metric (e.g., t-statistic, mean difference) with metrics calculated when the dataset values are randomly shuffled permutations of the dataset. \n", + "\n", + "If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5eefb720", + "metadata": {}, + "outputs": [], + "source": [ + "from acore.permutation_test import paired_permutation\n", + "\n", + "# optional choice of random number generatorfor repro\n", + "import numpy as np\n", + "rng = np.random.default_rng(12345)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "9eabb911", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'metric': , 'observed': (np.float64(6.7389860601792275), np.float64(7.122287781830209e-07)), 'p_value': 0.0}\n", + "{'metric': , 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n", + "{'metric': , 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n" + ] + } + ], + "source": [ + "# trying diff metrics to demo functionality also\n", + "for metric in ['t-statistic', 'mean', np.mean]:\n", + " result = paired_permutation(\n", + " df_data['inf_abundance'].to_numpy(),\n", + " df_data['eff_abundance'].to_numpy(), \n", + " metric=metric, \n", + " n_permutations=10000, \n", + " rng=rng\n", + " )\n", + " # verbosity\n", + " print(result)" + ] + }, + { + "cell_type": "markdown", + "id": "1ad22530", + "metadata": {}, + "source": [ + "## Result\n", + "\n", + "Based on the permutation tests by test statistic and mean difference, the probability of the observed metrics (t=6.739 and mean diff=0.535) occurring at random would be <0.00001." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/api_examples/permutation_testing.py b/docs/api_examples/permutation_testing.py new file mode 100644 index 0000000..e920df2 --- /dev/null +++ b/docs/api_examples/permutation_testing.py @@ -0,0 +1,101 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.17.3 +# kernelspec: +# display_name: .venv +# language: python +# name: python3 +# --- + +# %% [markdown] +# # Permutation Tests +# +# In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8). +# +# The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058). +# +# For this demo we look at the GO term abundance tables generated by the MGnify pipeline. The values in the table are the absolute abundances of selected GO terms for each sample, which we then transform to relative abundances and centred log-ratios. +# + +# %% [markdown] +# ## Data preparation details +# +# ### Downloading +# The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The influent (INF) and effluent (EFF) datasets contain paired samples, so we also needed to download the sample metadata (also available via the MGnify API) to assign the correct pairing. 
+# +# ### Preprocessing of abundances +# - To account for technical variation arising from sequencing technology limitations, we first transform the abundance values so they are relative to the total reads for the sample (i.e., relative abundances). +# - The relative abundances are compositional data (CoDa), so we map them to unconstrained vectors using the centred log-ratio transformation `acore.microbiome.internal_functions.calc_clr`, so that the assumptions of the frequentist statistics applied downstream are not violated. +# +# ### Preprocessing of the metadata +# - The sample metadata needed for this demo (sampling location) were available in each sample's "sample-desc" field. +# - The sample-desc for each sample in both INF and EFF was parsed and used to pair the samples. +# +# ### Subset of data for demo +# - For this demo we only look at [GO term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001). +# - Antibiotic catabolic processes are expected to be higher in INF than in EFF. +# +# ### Saving the demo dataset +# This example subset of the data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below: +# +# | column | description | dtype | +# |-------------------|-------------------------------------------------------------------------------------------------------------------|-------| +# | eff_id | The run ID for the MGnify analysis of the effluent sample. | str | +# | inf_id | The run ID for the MGnify analysis of the influent sample. | str | +# | sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str | +# | sampling_read | The sequencing read and taxonomy annotation (e.g., READ2 Taxonomy ID:256318) for the sample. 
| str | +# | eff_abundance | The CLR-transformed relative abundance of the GO term for a given effluent sample after preprocessing | float | +# | inf_abundance | The CLR-transformed relative abundance of the GO term for a given influent sample after preprocessing | float | +# +# ----- +# +# We will now proceed with reading in the prepared dataset. + +# %% +import pandas as pd + +df_data = pd.read_csv( + "https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv" +) +# sanity check +df_data.head() + +# %% [markdown] +# ## The permutation test +# +# Since these are paired samples, we will proceed with a paired-sample permutation test using `acore.permutation_test.paired_permutation()`. +# +# The permutation test compares the observed value of a chosen metric (e.g., t-statistic, mean difference) with the values of that metric calculated on randomly shuffled permutations of the dataset. +# +# If we run 100 permutations of our data (although in practice we should run many more) and only 1 of those permutations shows a larger effect size than the one actually observed, then this suggests there is a 1/100 chance (p-value of 0.01) of the observed effect size having occurred by chance. + +# %% +from acore.permutation_test import paired_permutation + +# optional choice of random number generator for reproducibility +import numpy as np + +rng = np.random.default_rng(12345) + +# %% +# try different metrics to demonstrate the functionality +for metric in ["t-statistic", "mean", np.mean]: + result = paired_permutation( + df_data["inf_abundance"].to_numpy(), + df_data["eff_abundance"].to_numpy(), + metric=metric, + n_permutations=10000, + rng=rng, + ) + # print each result for inspection + print(result) + +# %% [markdown] +# ## Result +# +# Based on the permutation tests using the t-statistic and the mean difference, the probability of the observed metrics (t = 6.739 and mean difference = 0.535) occurring by chance is below 0.0001: none of the 10,000 permutations exceeded the observed values. 
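The sign-flipping idea behind the paired permutation test described in the tutorial can be sketched in a few lines of NumPy. This is a minimal illustration using the mean difference as the metric, not the `acore` implementation; the function name and inputs are hypothetical:

```python
import numpy as np


def paired_permutation_sketch(cond1, cond2, n_permutations=10_000, seed=12345):
    """Two-sided paired permutation test on the mean of paired differences.

    Under the null hypothesis the two members of each pair are
    exchangeable, so every paired difference keeps its magnitude
    but receives a random sign in each permutation.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(cond1, dtype=float) - np.asarray(cond2, dtype=float)
    observed = abs(diff.mean())
    # one row of random signs per permutation
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diff.size))
    permuted = np.abs((signs * diff).mean(axis=1))
    # p-value: fraction of permutations at least as extreme as observed
    p_value = float(np.mean(permuted >= observed))
    return observed, p_value


# a constant shift of +1 between conditions gives a tiny p-value,
# while identical conditions give p = 1.0
obs, p = paired_permutation_sketch(np.arange(24) + 1.0, np.arange(24))
```

Note that with 10,000 permutations the smallest resolvable p-value is 1/10,000, which is why a reported `p_value` of 0.0 is best read as p < 0.0001 rather than as exactly zero.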
diff --git a/docs/index.rst b/docs/index.rst index 35f5751..af0844f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -19,6 +19,7 @@ Acore documentation api_examples/normalization_analysis api_examples/ANCOVA_analysis api_examples/enrichment_analysis + api_examples/permutation_testing .. toctree:: :maxdepth: 1 diff --git a/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv b/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv new file mode 100644 index 0000000..ca3c6b6 --- /dev/null +++ b/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv @@ -0,0 +1,25 @@ +eff_id,inf_id,sampling_location,sampling_read,eff_abundance,inf_abundance +ERR2985255,ERR2814663,TG,READ2 Taxonomy ID:256318,3.2572833167019404,4.226818522502555 +ERR2985256,ERR2814664,MN,READ2 Taxonomy ID:256318,2.5728406558389647,3.8471905353288207 +ERR2985257,ERR2814651,AH,READ1 Taxonomy ID:256318,4.298777376040054,4.08684074171776 +ERR2985258,ERR2814667,TE,READ1 Taxonomy ID:256318,2.758981522176203,3.4367515193484355 +ERR2985259,ERR2814660,FD,READ1 Taxonomy ID:256318,3.3646753829076825,3.4866734935494126 +ERR2985260,ERR2814652,AD,READ1 Taxonomy ID:256318,3.122830873754367,3.1959892495165043 +ERR2985261,ERR2814655,EM,READ2 Taxonomy ID:256318,2.8111912227938394,3.6796595362263957 +ERR2985262,ERR2814647,MN,READ1 Taxonomy ID:256318,2.64787572226993,3.854349549040254 +ERR2985263,ERR2814649,ZR,READ2 Taxonomy ID:256318,3.5745130928438336,3.898617360802433 +ERR2985264,ERR2814646,FD,READ2 Taxonomy ID:256318,3.1932134470444886,3.4511721120494254 +ERR2985265,ERR2814658,AD,READ2 Taxonomy ID:256318,2.9090652423743357,3.3030032029809693 +ERR2985266,ERR2814650,ZR,READ1 Taxonomy ID:256318,3.3891206874721362,3.84895839035568 +ERR2985267,ERR2814648,WT,READ2 Taxonomy ID:256318,2.8069435917365073,3.5322169927895697 +ERR2985268,ERR2814645,TG,READ1 Taxonomy ID:256318,3.241378979031497,4.052892167654379 +ERR2985269,ERR2814656,TU,READ2 Taxonomy ID:256318,2.929680670135557,3.616255621958018 
+ERR2985270,ERR2814659,AH,READ2 Taxonomy ID:256318,4.400504997765587,4.149306529774245 +ERR2985271,ERR2814657,TE,READ2 Taxonomy ID:256318,2.901002904935801,3.75827367299984 +ERR2985272,ERR2814666,BE,READ1 Taxonomy ID:256318,4.125262599471004,4.398085076927169 +ERR2985273,ERR2814665,BE,READ2 Taxonomy ID:256318,4.076901525292074,4.39475485971623 +ERR2985274,ERR2814662,WT,READ1 Taxonomy ID:256318,3.1091661937084183,3.5130581102452503 +ERR2985275,ERR2814653,FR,READ2 Taxonomy ID:256318,3.0000096382997192,3.594857935168328 +ERR2985276,ERR2814668,EM,READ1 Taxonomy ID:256318,2.752778722206287,3.547368740393445 +ERR2985277,ERR2814661,TU,READ1 Taxonomy ID:256318,2.8771702480026953,3.5132674150221757 +ERR2985278,ERR2814654,FR,READ1 Taxonomy ID:256318,2.7336315345563857,3.310422165293325 diff --git a/src/acore/decomposition/umap.py b/src/acore/decomposition/umap.py index 2a05dd7..6d5ab21 100644 --- a/src/acore/decomposition/umap.py +++ b/src/acore/decomposition/umap.py @@ -1,4 +1,5 @@ """Run UMAP on DataFrame and return result.""" + import pandas as pd import umap diff --git a/src/acore/enrichment_analysis/__init__.py b/src/acore/enrichment_analysis/__init__.py index b0a09a6..b3ae590 100644 --- a/src/acore/enrichment_analysis/__init__.py +++ b/src/acore/enrichment_analysis/__init__.py @@ -552,13 +552,13 @@ def run_ssgsea( index.index = name df = df.set_index("Name").transpose() - if not annotation_col in annotation: + if annotation_col not in annotation: raise ValueError( f"Missing Annotation Column: {annotation_col}" " as specified by `annotation_col`" ) - if not identifier_col in annotation: + if identifier_col not in annotation: raise ValueError( f"Missing Identifier Column: {identifier_col}" " as specified by `identifier_col`" diff --git a/src/acore/io/uniprot/uniprot.py b/src/acore/io/uniprot/uniprot.py index c9abf05..f359eb1 100644 --- a/src/acore/io/uniprot/uniprot.py +++ b/src/acore/io/uniprot/uniprot.py @@ -177,8 +177,6 @@ def get_id_mapping_results_stream(url): 
return decode_results(request, file_format, compressed) - - if __name__ == "__main__": # id mapping is used to create a link to a query (you can see the json in the browser) # UniProtKB is the knowleadgebase integrating all kind of other databases diff --git a/src/acore/microbiome/internal_functions.py b/src/acore/microbiome/internal_functions.py new file mode 100644 index 0000000..637bc21 --- /dev/null +++ b/src/acore/microbiome/internal_functions.py @@ -0,0 +1,53 @@ +# preprocessing +from sklearn.preprocessing import FunctionTransformer +import numpy as np +from sklearn.pipeline import make_pipeline +import pandas as pd + + +def calc_clr(x): + """ + Calculate the centered log-ratio (CLR) transformation. + + Parameters + ---------- + x : array-like + Input data to transform. + + Returns + ------- + array-like + CLR transformed data. + """ + # replace zeros with small + new_x = np.where(x > 0, x, 1e-10) + return np.log(new_x) - np.mean(np.log(new_x), axis=0) + + +def coda_clr(df: pd.DataFrame) -> pd.DataFrame: + """ + Apply CoDA and CLR transformations to the numeric columns of a DataFrame. + Non-numeric columns are retained without transformation. + + Parameters + ---------- + df : pd.DataFrame + Input DataFrame with numeric and non-numeric columns. + + Returns + ------- + pd.DataFrame + DataFrame with transformed numeric columns and original non-numeric columns. 
+ """ + # Define the CoDA and CLR transformers + coda = FunctionTransformer( + lambda x: x / x.sum(axis=0), feature_names_out="one-to-one" + ).set_output(transform="pandas") + clr = FunctionTransformer(calc_clr, feature_names_out="one-to-one").set_output( + transform="pandas" + ) + # Create a pipeline with CoDA and CLR transformations + pipe = make_pipeline(coda, clr).set_output(transform="pandas") + # Apply the pipeline to numeric columns and concatenate with non-numeric columns + transformed = pipe.fit_transform(df.select_dtypes(include="number")) + return pd.concat([df.select_dtypes(include="object"), transformed], axis=1) diff --git a/src/acore/multiple_testing/__init__.py b/src/acore/multiple_testing/__init__.py index 3ca9316..1949daa 100644 --- a/src/acore/multiple_testing/__init__.py +++ b/src/acore/multiple_testing/__init__.py @@ -207,7 +207,7 @@ def correct_pairwise_ttest(df, alpha, correction="fdr_bh"): required_col = ["group1", "group2", "posthoc pvalue"] for _col in required_col: - if not _col in df: + if _col not in df: raise KeyError(f"Did not find '{_col}' in columns of data.") for comparison in df.groupby(["group1", "group2"]).groups: diff --git a/src/acore/permutation_test/__init__.py b/src/acore/permutation_test/__init__.py new file mode 100644 index 0000000..b6bc69c --- /dev/null +++ b/src/acore/permutation_test/__init__.py @@ -0,0 +1,268 @@ +import numpy as np +from scipy.stats import ( + chi2_contingency, + ttest_rel, + ttest_ind, + f_oneway, +) +from .internal_functions import _permute, _contingency_table, _check_degeneracy +import warnings + +from acore.types.permutation_test import PermutationResult + +warnings.simplefilter("always", UserWarning) + + +def paired_permutation( + cond1: np.ndarray, + cond2: np.ndarray, + metric: str = "t-statistic", + n_permutations: int = 10000, + rng: np.random.Generator = np.random.default_rng(seed=12345), + **kwargs, +) -> dict: + """ + Perform a permutation test for paired samples. 
+ + Parameters + ---------- + cond1 : np.ndarray + First condition (paired samples). + cond2 : np.ndarray + Second condition (paired samples). + metric : str or callable, optional + Metric to compute ('t-statistic', 'mean', 'median', or a custom function). + n_permutations : int, optional + Number of permutations to perform (default is 10000). + rng : np.random.Generator, optional + Random number generator (default is np.random.default_rng(seed=12345)). + **kwargs + Additional arguments passed to the metric function. + + Returns + ------- + dict + Dictionary with keys: + - 'metric': Metric function used. + - 'observed': Observed metric value. + - 'p_value': Permutation test p-value (np.nan if degenerate). + """ + if not isinstance(cond1, np.ndarray) or not isinstance(cond2, np.ndarray): + raise TypeError("Input must be numpy arrays.") + if cond1.shape != cond2.shape: + raise ValueError("Input arrays must have the same shape.") + + # paired differences + diff = cond1 - cond2 + args = [diff] + + # compute observed metric + if metric == "t-statistic": + calculator = ttest_rel + args = [cond1, cond2] + elif metric == "mean": + calculator = np.mean + elif metric == "median": + calculator = np.median + elif callable(metric): + calculator = metric + else: + raise ValueError( + "Invalid metric specified. Acceptable metrics are: " + "'t-statistic', 'mean', 'median', or a custom function " + "that takes `cond1-cond2` as input." 
+ ) + + observed_metric = calculator(*args, **kwargs) + if metric == "t-statistic": + abs_met = abs(observed_metric.statistic) + else: + abs_met = abs(observed_metric) + + # Perform permutations + permuted_f = [] + for _ in range(n_permutations): + # randomly flip direction of differences + permuted_diff = diff * rng.choice([-1, 1], size=diff.shape) + new_cond1 = np.where((permuted_diff == diff), cond1, cond2) + new_cond2 = np.where((permuted_diff == diff), cond2, cond1) + # postcondition check + if not (permuted_diff == (new_cond1 - new_cond2)).all(): + raise ArithmeticError( + "Postcondition failed: Issue with permuted differences" + ) + # prep args + if metric == "t-statistic": + new_args = [new_cond1, new_cond2] + new_result = abs(calculator(*new_args, **kwargs).statistic) + else: + new_args = [permuted_diff] + new_result = abs(calculator(*new_args, **kwargs)) + # compute permuted metric + permuted_f.append(new_result) + + if _check_degeneracy(diff): + identical_warn = ( + "Degenerate conditions detected (the data are identical). " + "Consider using a different statistical test. " + "Results may be unreliable." + ) + warnings.warn(identical_warn) + + val_result = PermutationResult.model_validate( + { + "metric": calculator, + "observed": observed_metric, + "p_value": np.nan, + } + ) + + else: + # Compute p-value + p_value = np.mean(permuted_f >= abs_met) + + val_result = PermutationResult.model_validate( + {"metric": calculator, "observed": observed_metric, "p_value": p_value} + ) + + return val_result.model_dump() + + +def chi2_permutation( + *groups, + n_permutations: int = 10000, + rng: np.random.Generator = np.random.default_rng(seed=12345), +) -> dict: + """ + Perform a permutation test for categorical data using the chi-squared statistic. + + Parameters + ---------- + *groups : array-like + Arrays representing categorical groups. + n_permutations : int, optional + Number of permutations to perform (default is 10000). 
+ rng : np.random.Generator, optional + Random number generator (default is np.random.default_rng(seed=12345)). + + Returns + ------- + dict + Dictionary with keys: + - 'observed': Observed chi-squared test result. + - 'p_value': Permutation test p-value. + """ + # generate contingency table + cont_table = _contingency_table(*groups, to_np=True) + + # Compute observed chi-squared statistic + observed_test = chi2_contingency(cont_table) + observed_chi2 = observed_test.statistic + + # Perform permutations + permuted_chi2 = [] + for _ in range(n_permutations): + # shuffle + perm_cont_table = _contingency_table(*_permute(*groups, rng=rng)) + + # calculate permuted chi-squared statistic + permuted_chi2.append(chi2_contingency(perm_cont_table).statistic) + + # Compute p-value + p_value = np.mean(permuted_chi2 >= observed_chi2) + + val_result = PermutationResult.model_validate( + {"observed": observed_test, "p_value": p_value} + ) + + return val_result.model_dump(exclude_none=True) + + +def indep_permutation( + group1: np.ndarray, + group2: np.ndarray, + metric: str = "t-statistic", + n_permutations: int = 10000, + rng: np.random.Generator = np.random.default_rng(seed=12345), + **kwargs, +) -> dict: + """ + Perform a permutation test for independent samples. + + Parameters + ---------- + group1 : np.ndarray + First group of samples. + group2 : np.ndarray + Second group of samples. + metric : str or callable, optional + Metric to compute ('t-statistic', 'anova', 'mean', 'median', or a custom function). + n_permutations : int, optional + Number of permutations to perform (default is 10000). + rng : np.random.Generator, optional + Random number generator (default is np.random.default_rng(seed=12345)). + **kwargs + Additional arguments passed to the metric function. + + Returns + ------- + dict + Dictionary with keys: + - 'metric': Metric function used. + - 'observed': Observed metric value. + - 'p_value': Permutation test p-value. 
+ """ + + ## PRECONDITIONS + if not isinstance(group1, np.ndarray) or not isinstance(group2, np.ndarray): + raise TypeError("Input must be numpy arrays.") + + stat = False + # what metric to use + if metric == "t-statistic": + calculator = ttest_ind + stat = True + elif metric == "anova": + calculator = f_oneway + stat = True + elif metric == "mean": + calculator = np.mean + elif metric == "median": + calculator = np.median + elif callable(metric): + calculator = metric + else: + raise ValueError( + "Invalid metric specified. Acceptable metrics are: " + "'t-statistic', 'anova', 'mean', 'median', or a custom function " + "that takes each group as input." + ) + + # compute observed metric + if stat: + observed_metric = calculator(group1, group2, **kwargs) + else: + observed_metric = abs(calculator(group1) - calculator(group2)) + + # Perform permutations + permuted_f = [] + for _ in range(n_permutations): + new_group1, new_group2 = _permute(group1, group2, rng=rng) + # prep args + if stat: + new_result = abs(calculator(new_group1, new_group2, **kwargs).statistic) + abs_met = abs(observed_metric.statistic) + else: + new_result = abs(calculator(new_group1) - calculator(new_group2)) + abs_met = abs(observed_metric) + # compute permuted metric + permuted_f.append(new_result) + + # Compute p-value + p_value = np.mean(permuted_f >= abs_met) + + val_result = PermutationResult.model_validate( + {"metric": calculator, "observed": observed_metric, "p_value": p_value} + ) + + return val_result.model_dump() diff --git a/src/acore/permutation_test/internal_functions.py b/src/acore/permutation_test/internal_functions.py new file mode 100644 index 0000000..80fdddd --- /dev/null +++ b/src/acore/permutation_test/internal_functions.py @@ -0,0 +1,84 @@ +import numpy as np +import pandas as pd + + +def _permute(*groups, rng=np.random.default_rng(seed=12345)): + """ + Perform a single permutation of the groups. + + Parameters + ---------- + *groups : iterable + The groups to permute. 
+ rng : np.random.Generator, optional + The random number generator to use for shuffling. + Defaults to a new random number generator. + + Returns + ------- + list + The permuted groups. + """ + # shuffle the combined array + combined = np.concatenate(groups) + rng.shuffle(combined) + # split the shuffled array into new groups of og size + new_groups = [] + start = 0 + for group in groups: + end = start + len(group) + new_groups.append(combined[start:end]) + start = end + return new_groups + + +def _contingency_table(*groups, to_np: bool = True) -> np.array: + """ + Create a contingency table from the provided groups. + + Parameters + ---------- + *groups : iterable + The groups to include in the contingency table. + to_np : bool, optional + Whether to return the table as a NumPy array (default is True). + + Returns + ------- + np.array + The contingency table as a NumPy array or a pandas DataFrame. + """ + # PRECONDITIONS + for group in groups: + if not isinstance(group, (list, pd.Series, np.ndarray)): + raise TypeError("Each group must be a list, pandas Series, or numpy array.") + + # MAIN FUNCTION + # Create contingency table + table = ( + pd.concat([pd.Series(group).value_counts() for group in groups], axis=1) + .fillna(0) + .T + ) + + if to_np: + return table.to_numpy() + else: + return table + + +def _check_degeneracy(array: np.ndarray) -> bool: + """ + Check for degeneracy in the differences of paired samples. + + Parameters + ---------- + array : np.ndarray + The differences array. e.g. cond1 - cond2 + + Returns + ------- + bool + True if the conditions are degenerate, False otherwise. 
+    """
+    return bool(np.all(array == array[0]))
diff --git a/src/acore/permutation_test/jaccard.py b/src/acore/permutation_test/jaccard.py
new file mode 100644
index 0000000..f8ad37c
--- /dev/null
+++ b/src/acore/permutation_test/jaccard.py
@@ -0,0 +1,132 @@
+import itertools
+from collections.abc import Iterable
+
+import numpy as np
+
+
+def jaccard_similarity(set1: set, set2: set) -> float:
+    """
+    Compute the Jaccard similarity between two sets.
+
+    Parameters
+    ----------
+    set1 : set
+        First set of nodes.
+    set2 : set
+        Second set of nodes.
+
+    Returns
+    -------
+    float
+        Jaccard similarity coefficient,
+        which is the size of the intersection divided by
+        the size of the union of the two sets.
+
+    Example
+    -------
+    >>> set1 = {1, 2, 3}
+    >>> set2 = {2, 3, 4}
+    >>> jaccard_similarity(set1, set2)
+    0.5
+    """
+    intersection = len(set1 & set2)
+    union = len(set1 | set2)
+
+    if union != 0:
+        return intersection / union
+    else:
+        # two empty sets share no elements; define their similarity as 0
+        return 0.0
+
+
+def avg_jaccard(group: Iterable) -> tuple:
+    """
+    Compute the average pairwise Jaccard similarity within a group of graphs.
+
+    Parameters
+    ----------
+    group : Iterable
+        An iterable of iterable objects, such as a list of sets.
+
+    Returns
+    -------
+    tuple
+        A tuple containing the average pairwise Jaccard similarity and its
+        standard deviation.
+
+    Example
+    -------
+    >>> group = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
+    >>> mean, std = avg_jaccard(group)
+    >>> print(f"{mean:.3f}, {std:.3f}")
+    0.400, 0.141
+    """
+    # PRECONDITIONS
+    if not isinstance(group, Iterable):
+        raise TypeError("Input must be an iterable of iterables.")
+    if len(group) == 0:
+        raise ValueError("Input group is empty. Cannot compute Jaccard similarity.")
+    for g in group:
+        if not isinstance(g, Iterable):
+            raise TypeError(
+                "Each element in the group must be an iterable (set, list, or tuple)."
+            )
+
+    # collect the pairwise similarities
+    scores = []
+    for g1, g2 in itertools.combinations(group, 2):
+        scores.append(jaccard_similarity(set(g1), set(g2)))
+
+    if scores:
+        # at least one pair: return the average and its standard deviation
+        return np.mean(scores), np.std(scores)
+    else:
+        # a single-element group has no pairs to compare
+        return 0.0, 0.0
+
+
+def btwn_jaccard(group1: Iterable, group2: Iterable) -> tuple:
+    """
+    Compute the average Jaccard similarity between two groups of graphs.
+
+    Parameters
+    ----------
+    group1 : Iterable
+        An iterable of iterable objects, such as a list of sets.
+    group2 : Iterable
+        Another iterable of iterable objects.
+
+    Returns
+    -------
+    tuple
+        A tuple containing the average between-group Jaccard similarity
+        and its standard deviation.
+
+    Example
+    -------
+    >>> group1 = [{1, 2, 3}, {2, 3, 4}]
+    >>> group2 = [{3, 4, 5}, {4, 5, 6}]
+    >>> mean, std = btwn_jaccard(group1, group2)
+    >>> print(f"{mean:.3f}, {std:.3f}")
+    0.225, 0.179
+    """
+    # PRECONDITIONS
+    if not isinstance(group1, Iterable) or not isinstance(group2, Iterable):
+        raise TypeError("Both inputs must be iterables of iterables.")
+    if len(group1) == 0 or len(group2) == 0:
+        raise ValueError("Both input groups must be non-empty.")
+
+    # collect the similarities for every cross-group pair
+    scores = []
+    for g1, g2 in itertools.product(group1, group2):
+        if not isinstance(g1, Iterable) or not isinstance(g2, Iterable):
+            raise TypeError(
+                "Each element in the groups must be an iterable (set, list, or tuple)."
+            )
+        scores.append(jaccard_similarity(set(g1), set(g2)))
+
+    if scores:
+        return np.mean(scores), np.std(scores)
+    else:
+        # no pairs to compare
+        return 0.0, 0.0
diff --git a/src/acore/types/permutation_test.py b/src/acore/types/permutation_test.py
new file mode 100644
index 0000000..42d3831
--- /dev/null
+++ b/src/acore/types/permutation_test.py
@@ -0,0 +1,12 @@
+# using pydantic since these results are plain dictionaries, not DataFrames
+from typing import Any, Callable, Optional, Union
+
+from pydantic import BaseModel, Field
+
+
+class PermutationResult(BaseModel):
+    metric: Optional[Union[str, Callable]] = Field(
+        default=None, description="Name of the metric used in the permutation test"
+    )
+    observed: Any = Field(default=None, description="Observed value of the metric")
+    p_value: float = Field(description="p-value from the permutation test")
diff --git a/tests/test_permutation_test.py b/tests/test_permutation_test.py
new file mode 100644
index 0000000..c755cbc
--- /dev/null
+++ b/tests/test_permutation_test.py
@@ -0,0 +1,90 @@
+import numpy as np
+import pytest
+from scipy.stats import ttest_rel
+
+from acore import permutation_test as pt
+
+
+@pytest.fixture
+def set_array():
+    return np.arange(1, 10)
+
+
+@pytest.fixture
+def random_low():
+    # seeded generator keeps the stochastic tests reproducible
+    rng = np.random.default_rng(42)
+    return rng.negative_binomial(10, 0.3, 100)
+
+
+@pytest.fixture
+def random_high():
+    rng = np.random.default_rng(42)
+    return rng.negative_binomial(20, 0.1, 100)
+
+
+@pytest.fixture
+def random_choice_equal():
+    rng = np.random.default_rng(42)
+    return rng.choice(
+        ["A", "B", "C", "D"], size=100, replace=True, p=[0.25, 0.25, 0.25, 0.25]
+    )
+
+
+@pytest.fixture
+def random_choice_unequal():
+    rng = np.random.default_rng(42)
+    return rng.choice(
+        ["A", "B", "C", "D"], size=100, replace=True, p=[0.1, 0.25, 0.05, 0.6]
+    )
+
+
+def test_paired_permutation_same(set_array):
+    cond1 = set_array
+    cond2 = set_array
+    result = pt.paired_permutation(cond1, cond2, metric=np.mean)
+    assert result["metric"] == np.mean
+    assert result["observed"] == 0.0
+    assert np.isnan(result["p_value"])
+
+
+def test_paired_permutation_reject(random_low, random_high):
+    cond1 = random_low
+    cond2 = random_high
+    for stat in ["t-statistic", "mean", "median"]:
+        result = pt.paired_permutation(cond1, cond2, metric=stat)
+        print(result)
+        print(ttest_rel(cond1, cond2))
+        assert result["p_value"] < 0.05
+
+
+def test_paired_permutation_degen(set_array):
+    for stat in ["t-statistic", "mean", "median"]:
+        assert np.isnan(
+            pt.paired_permutation(set_array, set_array + 30, metric=stat)["p_value"]
+        )
+
+
+def test_indep_permutation_reject(random_low, random_high):
+    group1 = random_low
+    group2 = random_high
+    for stat in ["t-statistic", "anova", "median", "mean"]:
+        result = pt.indep_permutation(group1, group2, metric=stat)
+        assert result["p_value"] < 0.05
+
+
+def test_indep_permutation_accept(random_low):
+    for stat in ["t-statistic", "anova", "median", "mean"]:
+        assert (
+            pt.indep_permutation(random_low, random_low, metric=stat)["p_value"] > 0.05
+        )
+
+
+def test_chi2_permutation_accept(random_choice_equal):
+    result = pt.chi2_permutation(random_choice_equal, random_choice_equal)
+    assert result["p_value"] >= 0.05
+
+
+def test_chi2_permutation_reject(random_choice_unequal, random_choice_equal):
+    result = pt.chi2_permutation(random_choice_equal, random_choice_unequal)
+    assert result["p_value"] < 0.05
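The pooled-shuffle scheme implemented by `_shuffle_group` in the diff above is the engine behind every test in this module. As a self-contained illustration in plain NumPy — the names `shuffle_groups` and `perm_test_mean_diff` are hypothetical and not part of the acore API — here is a sketch of how that scheme drives a two-sided permutation test on the difference of group means:

```python
import numpy as np


def shuffle_groups(groups, rng):
    """Pool all observations, shuffle them, and split back into the original group sizes."""
    combined = np.concatenate(groups)
    rng.shuffle(combined)
    out, start = [], 0
    for g in groups:
        out.append(combined[start:start + len(g)])
        start += len(g)
    return out


def perm_test_mean_diff(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    hits = 0
    for _ in range(n_perm):
        pa, pb = shuffle_groups([a, b], rng)
        if abs(pa.mean() - pb.mean()) >= observed:
            hits += 1
    # add-one smoothing: the observed labelling counts as one permutation
    return (hits + 1) / (n_perm + 1)


data_rng = np.random.default_rng(1)
low = data_rng.normal(0.0, 1.0, 50)
high = data_rng.normal(2.0, 1.0, 50)
print(perm_test_mean_diff(low, high))  # small p: groups clearly differ
print(perm_test_mean_diff(low, low))   # p == 1.0: identical groups
```

The add-one smoothing keeps the estimate away from an impossible exact zero: with `n_perm` permutations the smallest reportable p-value is `1 / (n_perm + 1)`, which is also why the degenerate paired case in the diff returns `nan` rather than a misleadingly small number.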