Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1,312 changes: 1,312 additions & 0 deletions docs/auto_examples/plot_replicability.codeobj.json

Large diffs are not rendered by default.

168 changes: 168 additions & 0 deletions docs/auto_examples/plot_replicability.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n# Determine the number of components through replicability analysis\n\nThis example shows how to select the number of components for PARAFAC2 models by checking if patterns are replicable :cite:p:`erdos2025extracting`. The process involves fitting the model to different subsets of your data to see if the results stay consistent (i.e. replicable across data subsets). To maximize explanatory power, typically, we select the highest number of components that remains replicable across the data subsets.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports and utilities\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport numpy as np\nimport tensorly as tl\nfrom tensorly.metrics import congruence_coefficient\nfrom matcouply.decomposition import parafac2_aoadmm\nfrom matcouply.coupled_matrices import CoupledMatrixFactorization\nimport sklearn\nfrom sklearn.model_selection import RepeatedKFold\n\nimport tlviz\n\nrng = np.random.default_rng(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To fit PARAFAC2 models, we need to solve a non-convex optimization problem, possibly with local minima. It is\ntherefore useful to fit several models with the same number of components using many different random\ninitialisations.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def fit_many_parafac2(X, num_components, num_inits=5):\n \n best_err = np.inf\n decomposition = None\n for i in range(num_inits):\n trial_decomposition, trial_errs = parafac2_aoadmm(\n matrices=X,\n rank=num_components,\n return_errors=True,\n non_negative=[True, True, True],\n n_iter_max=500,\n absolute_tol=1e-4,\n feasibility_tol=1e-4,\n inner_tol=1e-4,\n inner_n_iter_max=5,\n feasibility_penalty_scale=5,\n tol=1e-5,\n random_state=i,\n verbose=0,\n )\n \n if best_err < trial_errs.rec_errors[-1]:\n continue\n \n best_err = trial_errs.rec_errors[-1]\n decomposition = trial_decomposition # note, with real data, convergence should be checked\n\n (est_weights, (est_A, est_B, est_C)) = decomposition\n est_B = np.asarray(est_B)\n\n # Normalize the decomposition:\n A_norm = tl.norm(est_A, axis=0)\n B_norm = tl.norm(est_B[0], axis=0) # This is the same for all B_i because of the PARAFAC2 constraint\n C_norm = tl.norm(est_C, axis=0)\n est_weights = A_norm * B_norm * C_norm # The PARAFAC2 AO-ADMM returns None as the weights\n\n As = est_A / A_norm\n Bs = [est_B_i / B_norm for est_B_i in est_B]\n Cs = est_C / C_norm\n\n # calculate scaled B; \n # since the loadings in B are specific to levels of A, we absorb A into the corresponding B for each component\n aB = [a_i * B_i for a_i, B_i in zip(As, Bs)]\n\n return (est_weights, (aB, Cs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate simulated data\n\nSimulate noisy data with 2 components.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def truncated_normal(size):\n x = rng.standard_normal(size=size)\n x[x < 0] = 0\n return tl.tensor(x)\n\nI, J, K = 25, 20, 35\nrank = 2\n\nA = rng.uniform(size=(I, rank)) + 0.1 # Add 0.1 to ensure that there is signal for all components for all slices\nA = tl.tensor(A)\n\nB_blueprint = truncated_normal(size=(J, rank))\nB_is = [np.roll(B_blueprint, i, axis=0) for i in range(I)]\nB_is = [tl.tensor(B_i) for B_i in B_is]\n\nC = rng.uniform(size=(K, rank))\nC = tl.tensor(C)\n\ndataset = CoupledMatrixFactorization((None, (A, B_is, C)))\n\ndataset = dataset.to_tensor()\neta = 0.3 # noise level\nnoise = np.random.normal(0, 1, dataset.shape)\ndataset = dataset + tl.norm(dataset) * eta * noise / tl.norm(noise)\ndataset = dataset / tl.norm(dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The replicability analysis boils down to the following steps:\n\n1. Split the data in a (user-chosen) mode into $N$ folds (user-chosen).\n2. Create $N$ subsets by subtracting each fold from the complete dataset.\n3. Fit multiple initializations to each subset and choose the *best* run\n according to lowest loss (total of $N$ *best* runs).\n4. Compare, in terms of FMS, the best runs across the different subsets\n to evaluate the replicability of the uncovered patterns ($\\binom{N}{2}$ comparisons).\n5. Repeat the above process $M$ times (user-chosen), to find a total of\n $M \\binom{N}{2}$ comparisons.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split the data and fit PARAFAC2 on each data subset\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"splits = 5 # N\nrepeats = 5 # M\n\nmodels = {}\nsplit_indices = {} # Keeps track of which indices are used in each subset\n\nfor rank in [1, 2, 3, 4]:\n\n print(f\"{rank} components\")\n\n rskf = RepeatedKFold(n_splits=splits, n_repeats=repeats, random_state=1)\n\n models[rank] = [[] for _ in range(repeats)]\n split_indices[rank] = [[] for _ in range(repeats)]\n\n for split_no, (train_index, _) in enumerate(rskf.split(dataset)):\n repeat_no = split_no // splits\n\n # Append indices to the current repeat\n split_indices[rank][repeat_no].append(train_index)\n \n train = dataset[train_index]\n train = train / tl.norm(train)\n\n current_model = fit_many_parafac2(train, rank)\n \n # Append model to the current repeat\n models[rank][repeat_no].append(current_model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Often, the mode we will be splitting within refers to different samples,\ndepending on the use-case, it might be reasonable to retain the\ndistributions of some properties in each subset. For this goal,\n[RepeatedStratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html#sklearn.model_selection.RepeatedStratifiedKFold)\ncan be used.\n\nIf pre-processing is used, it is important to apply it to\neach subset in isolation to avoid leaking information from the omitted part of the data.\nFor example, in this case we normalize each subset to unit norm independently.\nNote, that ``for split_no, (train_index, _) in enumerate(rskf.split(dataset)):`` may be run in parallel\nfor efficiency.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compute and assess replicability I.\nSince we are subsetting the data on ``mode=0``, and the ``mode=1`` factors of PARAFAC2 are specific \nto the corresponding level in ``mode=0``, only the shared factor matrix can be compared using FMS: \n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"replicability_stability = {}\nfor rank in models.keys():\n replicability_stability[rank] = []\n for repeat, current_models in enumerate(models[rank]):\n for i, model_i in enumerate(current_models):\n for j, model_j in enumerate(current_models):\n if i >= j: # include every pair only once and omit i == j\n continue\n weights_i, (_, C_i) = model_i\n weights_j, (_, C_j) = model_j\n fms = congruence_coefficient(C_i, C_j)[0]\n replicability_stability[rank].append(fms)\n\nranks = sorted(replicability_stability.keys())\ndata = [np.ravel(replicability_stability[r]) for r in ranks]\n\nfig, ax = plt.subplots()\nax.boxplot(data,whis=(0.95,0.05), positions=ranks)\nax.set_xlabel(\"Number of components\")\nax.set_ylabel(\"FMS_C\")\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we observe that over-estimating the number of components\nresults in a loss of replicable of the patterns, indicated by low FMS.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compute and assess replicability II.\nThere is an alternative way to estimate the replicability of the uncovered patterns, \nincluding the factors corresponding to ``mode=0``, and ``mode=1`` :cite:p:`erdos2025extracting`. \nBy using only the indices present in both subsets (e.g. the factors \ncorresponding to the subjects' data included in both factorizations)\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"replicability_alt = {}\nfor rank in models.keys():\n replicability_alt[rank] = []\n for repeat in range(repeats):\n for i, model_i in enumerate(models[rank][repeat]):\n for j, model_j in enumerate(models[rank][repeat]):\n if i >= j: # include every pair only once and omit i == j\n continue\n weights_i, (aB_i, C_i) = model_i\n weights_j, (aB_j, C_j) = model_j\n\n indices_subset_i = sorted(split_indices[rank][repeat][i])\n indices_subset_j = sorted(split_indices[rank][repeat][j])\n \n common_indices = sorted(list(set(indices_subset_i).intersection(set(indices_subset_j))))\n indices2use_i = []\n indices2use_j = []\n\n for common_idx in common_indices:\n indices2use_i.append(indices_subset_i.index(common_idx))\n indices2use_j.append(indices_subset_j.index(common_idx))\n\n aB_i = np.concatenate([aB_i[idx] for idx in indices2use_i])\n aB_j = np.concatenate([aB_j[idx] for idx in indices2use_j])\n fms = tlviz.factor_tools.factor_match_score(\n (weights_i, (C_i, aB_i)), (weights_j, (C_j, aB_j)), consider_weights=False\n )\n replicability_alt[rank].append(fms)\n\nranks = sorted(replicability_alt.keys())\ndata = [np.ravel(replicability_alt[r]) for r in ranks]\n\nfig, ax = plt.subplots()\nax.boxplot(data, positions=ranks)\nax.set_xlabel(\"Number of components\")\nax.set_ylabel(\"FMS_aB\")\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"``common_indices`` contains the indices (e.g. subjects/samples) present in both subsets,\nbut since the position of each index can change (e.g. sample no 3 is not guaranteeed at\nthe third position in all subsets as the first and second samples might be omitted) we need to\nutilize the indices in the original tensor input.\n\nSimilar results can be observed with this approach in terms of the replicability of the patterns.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Loading