A Configurable Pipeline for Mass Univariate Aggregation (MUA) Methods

1. A Configurable Pipeline for Mass Univariate Aggregation Methods

We present a unified, configurable pipeline for implementing connectome-based predictive modeling (CPM; Shen et al., 2017), polyneuro risk scores (PNRS; Byington et al., 2023), and new MUA configurations defined through user-specified parameters. The pipeline is built around two classes, FeatureVectorizer and MUA, both designed to be fully compatible with the scikit-learn ecosystem.

2. Installation

pip install git+https://github.com/neuroprismlab/_MUA_Pipeline.git

3. Project Structure

_MUA_Pipeline/
├── mua_pipeline/                    # Main package directory
│   ├── __init__.py                  # Package initialization
│   ├── core.py                      # Core classes: FeatureVectorizer and MUA
│   ├── preprocessing.py             # Data format auto-detection and missing data removal
│   └── visualization.py             # Plotting and visualization utilities
├── examples/
│   ├── CPM_PNRS_tutorial.ipynb      # Notebook walkthrough tutorial of CPM and PNRS with MUA
│   └── example_usage.py             # Example usage of MUA in a Python script
├── validation/                      # Validation scripts and reference comparisons
│   ├── CPM_Validation.m             # MATLAB CPM validation
│   └── PNRS_Validation.m            # MATLAB PNRS validation
├── setup.py                         # Package installation configuration
├── requirements.txt                 # Python dependencies
└── README.md                        # This file

4. The Implementation Details

4.1 FeatureVectorizer Class

The FeatureVectorizer class handles both 2D and 3D input data, so the pipeline can apply MUA methods to neuroimaging data (connectivity matrices) as well as to any other feature-outcome data.
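
As an illustration of the 3D case, the sketch below flattens a stack of symmetric connectivity matrices into a 2D feature matrix by keeping the upper triangle of each matrix. This is only one plausible vectorization, shown under that assumption; the function name vectorize_upper_triangle is hypothetical, and the actual internals of FeatureVectorizer may differ.

```python
import numpy as np

def vectorize_upper_triangle(matrices):
    """Flatten (n_subjects, n_regions, n_regions) into (n_subjects, n_edges).

    Hypothetical helper: keeps entries strictly above the diagonal, so each
    symmetric edge is counted once and self-connections are dropped.
    """
    matrices = np.asarray(matrices)
    if matrices.ndim == 2:
        # Already a 2D feature matrix: return it unchanged
        return matrices
    n_regions = matrices.shape[1]
    rows, cols = np.triu_indices(n_regions, k=1)
    return matrices[:, rows, cols]

# Example: 5 subjects, 4 regions -> 4 * 3 / 2 = 6 edges per subject
rng = np.random.default_rng(0)
stack = rng.normal(size=(5, 4, 4))
stack = (stack + stack.transpose(0, 2, 1)) / 2  # make each matrix symmetric
features = vectorize_upper_triangle(stack)
print(features.shape)  # (5, 6)
```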

4.2 MUA Class

The MUA class is the main class of the pipeline. It extends scikit-learn's BaseEstimator and TransformerMixin classes to ensure compatibility with standard machine learning pipelines, and it operates in three steps:

Feature Selection and Organization: Determines which features are included and how they are organized (e.g., split into positive/negative networks or combined)
Feature Weighting: Specifies how selected features are weighted (e.g., binary, correlation-based, or regression-derived weights)
Feature Aggregation: Defines how weighted features are combined into predictive scores (e.g., sum or mean aggregation)
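
The three steps above can be sketched in plain NumPy. This is a minimal illustration under assumed settings (top-k selection by absolute Pearson correlation, binary weights, sum aggregation); the helper names are hypothetical and do not reflect the internals of the MUA class.

```python
import numpy as np

def feature_correlations(X, y):
    """Pearson r between each column of X and the outcome y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def three_step_score(X, y, k):
    r = feature_correlations(X, y)
    # Step 1, selection: keep the k features with the largest |r|
    selected = np.argsort(-np.abs(r))[:k]
    # Step 2, weighting: binary weights, +1 or -1 by correlation sign
    weights = np.sign(r[selected])
    # Step 3, aggregation: sum the weighted features for each subject
    scores = X[:, selected] @ weights
    return scores, selected, weights

# Synthetic data: feature 0 relates positively to y, feature 1 negatively
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

scores, selected, weights = three_step_score(X, y, k=2)
print(sorted(selected.tolist()), scores.shape)
```

In a real analysis the selection and weighting would be learned on training subjects only and then applied to held-out subjects, which is what wrapping the transformer in a scikit-learn Pipeline with cross-validation takes care of.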

4.2.1 MUA Class's Parameters

The MUA class exposes eight configurable parameters, organized across the three computational steps described above. Together they allow established methods to be reproduced and novel configurations to be explored.

Parameter	Value	Description
filter_by_sign	`True`	Separate positive/negative features (CPM-style)
	`False`	Keep all features together (single weighted score)
direction	`'difference'`	Single score = mean(pos_edges) - mean(neg_edges) (original MATLAB CPM)
	`'positive'`	Positive network score only
	`'negative'`	Negative network score only
	Ignored	When `filter_by_sign=False`
selection_method	`'all'`	Use all features
	`'pvalue'`	Select features with p < α
	`'top_k'`	Select the top k features by absolute correlation
selection_threshold	`float (0, 1)`	'pvalue' method: p-value threshold
	`integer`	'top_k' method: number of features
	Ignored	For 'all' method
weighting_method	`'binary'`	±1 based on correlation sign
	`'correlation'`	The strength of the correlation
	`'squared_correlation'`	r² preserving sign
	`'regression'`	Feature-specific regression coefficients
correlation_type	`'pearson'`	Linear correlation
	`'spearman'`	Rank-based correlation
feature_aggregation	`'sum'`	The sum of the weighted features
	`'mean'`	The mean of the weighted features
standardize_scores	`True`	Z-score normalize final scores
	`False`	Keep raw scores
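
To make the weighting_method options concrete, the sketch below computes each kind of weight for a small hand-checkable dataset. It rests on assumptions: the function feature_weights is hypothetical, and the 'regression' weights are taken here to be the slope of a univariate least-squares fit of the outcome on each feature, which may not match the package's exact definition.

```python
import numpy as np

def feature_weights(X, y, method="binary"):
    """Per-feature weights for the four weighting options (illustrative)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    if method == "binary":
        return np.sign(r)                # +1 or -1 by correlation sign
    if method == "correlation":
        return r                         # the correlation coefficient itself
    if method == "squared_correlation":
        return np.sign(r) * r ** 2       # r squared, keeping the sign of r
    if method == "regression":
        # Assumption: slope of the univariate fit y ~ x_j for each feature j
        return (Xc.T @ yc) / (Xc ** 2).sum(axis=0)
    raise ValueError(f"unknown weighting method: {method}")

X = np.array([[0.0, 3.0, 0.0],
              [1.0, 2.0, 1.0],
              [2.0, 1.0, 0.0],
              [3.0, 0.0, 1.0]])
y = np.array([0.0, 2.0, 4.0, 6.0])

for method in ("binary", "correlation", "squared_correlation", "regression"):
    print(method, feature_weights(X, y, method))
```

For this data the first two features correlate perfectly with the outcome (r = 1 and r = -1), while the third has r of about 0.447, so its signed squared correlation is 0.2.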

4.3 Prediction Strategy

The pipeline can be combined with any of the regression estimators available in scikit-learn. The example below uses LinearRegression() as the final step, but any other scikit-learn regressor can be substituted.

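from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from mua_pipeline import FeatureVectorizer, MUA

the_pipeline = Pipeline([
    ('vectorize', FeatureVectorizer()),    # Handles 2D and 3D input
    ('mua', MUA(
        ...,                               # MUA parameters (see Section 4.2.1)
        ...,
    )),
    ('regressor', LinearRegression())      # Any scikit-learn regressor can go here
])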
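
Swapping the final estimator does not require rebuilding the pipeline. The sketch below shows the pattern with scikit-learn's Pipeline.set_params; because it is meant to run on its own, a simple FunctionTransformer named row_mean stands in for the vectorize and MUA steps. That stand-in is purely hypothetical and is not part of mua_pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def row_mean(X):
    """Hypothetical stand-in for the MUA step: one score per subject."""
    return X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = 3.0 * X.mean(axis=1) + rng.normal(scale=0.1, size=40)

pipeline = Pipeline([
    ("score", FunctionTransformer(row_mean)),
    ("regressor", LinearRegression()),
])
pipeline.fit(X, y)
linear_pred = pipeline.predict(X)

# Replace only the final estimator; the earlier steps stay as configured
pipeline.set_params(regressor=Ridge(alpha=1.0))
pipeline.fit(X, y)
ridge_pred = pipeline.predict(X)
```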

5. Tutorial: Getting Started

See the examples directory for complete examples of using the MUA pipeline.

5.1 Importing the Pipeline Components

from mua_pipeline import FeatureVectorizer, MUA, preprocess, plot_results

5.2 Preprocessing Your Data

Before building a pipeline, you can use the preprocessing module to standardize the data format and remove subjects with missing data. The preprocess function automatically detects the orientation of your data and converts it to the standard format expected by the pipeline:

3D connectivity matrices are converted to (n_subjects, n_regions, n_regions) regardless of whether the input is (n_regions, n_regions, n_subjects), (n_regions, n_subjects, n_regions), or already in the standard format.
2D feature matrices are converted to (n_subjects, n_features) even if provided as (n_features, n_subjects).
Behavioral data is similarly standardized.
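
One way such orientation detection could work for 3D input is to look for the two axes of equal length (the region axes) and move the remaining axis to the front. The sketch below is a hypothetical heuristic, not the logic of preprocess itself; note that it cannot resolve the ambiguous case where n_subjects equals n_regions, and it returns such input unchanged.

```python
import numpy as np

def to_subjects_first(data):
    """Reorder a 3D array to (n_subjects, n_regions, n_regions).

    Hypothetical heuristic: the two axes with equal length are treated as
    the region axes, and the remaining axis as the subject axis.
    """
    data = np.asarray(data)
    if data.ndim != 3:
        raise ValueError("expected a 3D array")
    a, b, c = data.shape
    if b == c:
        return data                      # already subjects-first (or ambiguous)
    if a == b:
        return np.moveaxis(data, 2, 0)   # (R, R, S) -> (S, R, R)
    if a == c:
        return np.moveaxis(data, 1, 0)   # (R, S, R) -> (S, R, R)
    raise ValueError("no two axes have equal length")

regions_first = np.arange(4 * 4 * 7).reshape(4, 4, 7)
subjects_first = to_subjects_first(regions_first)
print(subjects_first.shape)  # (7, 4, 4)
```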

# Raw data — may have missing values and non-standard orientation
connectivity_matrices = ...   # e.g., shape (n_regions, n_regions, n_subjects)
behavioral_scores = ...       # e.g., shape (n_subjects,)

# Clean the data: auto-detects format, removes missing subjects, returns standard format
clean_connectivity, clean_behavioral, removed_indices = preprocess(
    connectivity_matrices,
    behavioral_scores,
    missing_strategy='any',   # Remove subjects with zeros, NaNs, or Infs
    verbose=True
)

The missing_strategy parameter controls what counts as missing data:

Strategy	Removes subjects with
`'zero'`	Behavioral values equal to 0
`'nan'`	NaN values
`'inf'`	Inf or -Inf values
`'any'`	Any of the above (default)
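
The strategies in the table can be pictured as boolean masks over subjects. The following sketch is an assumption about how such masks might be built, not the package's code: the function missing_subject_mask is hypothetical, it expects subjects on the first axis, and it checks NaN and Inf in both the features and the behavioral values while checking zeros only in the behavioral values.

```python
import numpy as np

def missing_subject_mask(features, behavior, strategy="any"):
    """Return True for each subject that would be removed (illustrative)."""
    features = np.asarray(features, dtype=float)
    behavior = np.asarray(behavior, dtype=float)
    flat = features.reshape(len(features), -1)   # one row per subject
    has_nan = np.isnan(behavior) | np.isnan(flat).any(axis=1)
    has_inf = np.isinf(behavior) | np.isinf(flat).any(axis=1)
    is_zero = behavior == 0
    masks = {
        "zero": is_zero,
        "nan": has_nan,
        "inf": has_inf,
        "any": is_zero | has_nan | has_inf,
    }
    return masks[strategy]

features = np.array([[1.0, 2.0], [np.nan, 1.0], [0.5, 0.5], [2.0, 3.0]])
behavior = np.array([1.0, 2.0, 0.0, np.inf])

mask = missing_subject_mask(features, behavior, strategy="any")
removed_indices = np.flatnonzero(mask)           # subjects 1, 2 and 3
clean_features, clean_behavior = features[~mask], behavior[~mask]
```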

5.3 Implementing CPM (Shen et al., 2017)

CPM uses binary weights, p-value-based feature selection, and splits features into positive and negative networks. A final linear regression maps the network score to the behavioral outcome.

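import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score, KFold
from sklearn.pipeline import Pipeline

# brain_data and behavior are the cleaned arrays returned by preprocess (Section 5.2)
brain_data, behavior = clean_connectivity, clean_behavioral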

# Build the CPM pipeline
cpm_pipeline = Pipeline([
    ('vectorize', FeatureVectorizer()),
    ('mua', MUA(
        filter_by_sign=True,           # Separate positive/negative networks
        direction='difference',        # mean(pos) - mean(neg), matches the original CPM
        selection_method='pvalue',     # p-value thresholding
        selection_threshold=0.05,      # p < 0.05
        weighting_method='binary',     # Binary weights (+1/−1)
        correlation_type='pearson',    # Pearson correlation
        feature_aggregation='mean',    # Mean of selected features (scale-invariant)
    )),
    ('regressor', LinearRegression())  # Final linear regression
])

# Cross-validation
cpm_scores = cross_val_score(cpm_pipeline, brain_data, behavior, cv=10)
cpm_predictions = cross_val_predict(cpm_pipeline, brain_data, behavior, cv=10)
print(f"CPM R² (10-fold CV): {cpm_scores.mean():.3f} ± {cpm_scores.std():.3f}")

# Evaluation
cpm_r, cpm_p = pearsonr(behavior, cpm_predictions)
mae = mean_absolute_error(behavior, cpm_predictions)
rmse = np.sqrt(mean_squared_error(behavior, cpm_predictions))
r2 = r2_score(behavior, cpm_predictions)

With filter_by_sign=True and direction='difference', the MUA transformer outputs a single column representing mean(pos_edges) - mean(neg_edges), matching the original MATLAB CPM implementation. The LinearRegression() then fits on this score to predict the behavioral outcome.
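
The difference score itself is simple to write out. The sketch below computes mean(pos_edges) - mean(neg_edges) per subject from two boolean edge masks that are assumed to have been selected already; the function name cpm_difference_score and the choice to treat an empty network as contributing zero are illustrative assumptions.

```python
import numpy as np

def cpm_difference_score(X, pos_mask, neg_mask):
    """Per-subject mean(pos_edges) - mean(neg_edges), as one column."""
    zeros = np.zeros(len(X))
    # Assumption: an empty network contributes a score of zero
    pos = X[:, pos_mask].mean(axis=1) if pos_mask.any() else zeros
    neg = X[:, neg_mask].mean(axis=1) if neg_mask.any() else zeros
    return (pos - neg).reshape(-1, 1)

# Two subjects, four edges; edges 0 and 1 are "positive", edge 3 is "negative"
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 2.0, 4.0, 6.0]])
pos_mask = np.array([True, True, False, False])
neg_mask = np.array([False, False, False, True])

scores = cpm_difference_score(X, pos_mask, neg_mask)
# Subject 0: mean(1, 2) - 4 = -2.5; subject 1: mean(0, 2) - 6 = -5.0
print(scores.ravel())
```

The single-column shape matters because the downstream regressor expects a 2D input with one row per subject.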

5.4 Implementing PNRS (Byington et al., 2023)

PNRS uses regression-derived weights, includes all features, and produces a single combined score.

# Build the PNRS pipeline
pnrs_pipeline = Pipeline([
    ('vectorize', FeatureVectorizer()),
    ('mua', MUA(
        filter_by_sign=False,              # Single combined score
        selection_method='all',            # Use all features
        weighting_method='regression',     # Regression-derived weights
        feature_aggregation='sum',         # Sum of weighted features
    ))
    # No regressor: the MUA score itself serves as the prediction
])

pnrs_scores = pnrs_pipeline.fit_transform(brain_data, behavior)

# Use scores directly as predictions (in-sample: fitted and scored on the same subjects)
pnrs_predictions = pnrs_scores.flatten()

# Evaluation
pnrs_r, pnrs_p = pearsonr(behavior, pnrs_predictions)
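
Because the snippet above fits and scores on the same subjects, the resulting correlation is an in-sample estimate. A held-out variant can be sketched in plain NumPy as follows. This is an illustration under assumptions rather than the package's procedure: the weights are univariate regression slopes of the outcome on each feature, the data are synthetic, and fit_univariate_betas is a hypothetical helper.

```python
import numpy as np

def fit_univariate_betas(X, y):
    """Slope of the univariate fit y ~ x_j for every feature j."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (Xc ** 2).sum(axis=0)

# Synthetic data: the first 5 of 20 features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=120)

train, test = np.arange(0, 80), np.arange(80, 120)

# Learn weights on training subjects only, then score held-out subjects
betas = fit_univariate_betas(X[train], y[train])
test_scores = X[test] @ betas                  # sum of weighted features
r_out = np.corrcoef(test_scores, y[test])[0, 1]
print(f"held-out r = {r_out:.3f}")
```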

5.5 Visualizing the Results

The visualization module provides the plot_results function for plotting predicted against observed behavioral scores:

plot_results(cpm_predictions, behavior, title="CPM")
plot_results(pnrs_predictions, behavior, title="PNRS")

6. Reference and Documentation

For the theoretical background and full details on the use and validation of this pipeline, please refer to our preprint:

Mass Univariate Aggregation Methods for Machine Learning in Neuroscience. DOI: https://doi.org/10.5281/zenodo.18436701

References

Shen X, Finn ES, Scheinost D, et al. Using connectome-based predictive modeling to predict individual behavior from brain connectivity. Nat Protoc. 2017;12(3):506-518. doi:10.1038/nprot.2016.178
Finn ES, Shen X, Scheinost D, et al. Functional connectome fingerprinting: identifying individuals using patterns of brain connectivity. Nat Neurosci. 2015;18(11):1664-1671. doi:10.1038/nn.4135
Byington N, Grimsrud G, Mooney MA, et al. Polyneuro risk scores capture widely distributed connectivity patterns of cognition. Dev Cogn Neurosci. 2023;60:101231. doi:10.1016/j.dcn.2023.101231
YaleMRRC. Connectome-based Predictive Modeling (CPM) [Code repository]. https://github.com/YaleMRRC/CPM
DCAN-Labs. BWAS: Polyneuro Risk Score [Code repository]. https://github.com/DCAN-Labs/BWAS
