---
title: "pipeML"
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

A flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction.

<!-- badges: start -->

<!-- badges: end -->

## Installation

You can install the development version of `pipeML` from [GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pkg_install("VeraPancaldiLab/pipeML")
```

## Description

`pipeML` is a flexible, leakage-aware machine learning framework for R, designed for predictive modeling on high-dimensional biological data. The package integrates the key steps of the machine learning workflow (feature selection, model training, validation, prediction, and interpretation) into a single reproducible pipeline.

A key design goal of `pipeML` is to support fold-aware feature construction: features that depend on the dataset (e.g. enrichment scores, correlation-based features, or network-derived features) can be recomputed within each cross-validation fold. Held-out samples therefore never influence the features used for training, which prevents information leakage and gives more reliable performance estimates.
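The idea can be sketched in base R, independently of `pipeML`. The example below uses z-score scaling as a stand-in for a dataset-dependent feature: the parameters are learned from the training rows of each fold and then applied to the held-out rows. The helpers `fit_scaler` and `apply_scaler` and the toy matrix `X` are hypothetical names introduced for illustration; they are not part of the package API.

``` r
# Illustrative sketch only; none of these helpers are pipeML functions.
set.seed(1)
X <- matrix(rnorm(60 * 4), nrow = 60,
            dimnames = list(NULL, paste0("gene", 1:4)))
fold_id <- sample(rep(1:5, length.out = nrow(X)))  # 5 folds of 12 samples

# Learn the dataset-dependent parameters from the training rows only.
fit_scaler <- function(x_train) {
  list(center = colMeans(x_train), scale = apply(x_train, 2, sd))
}

# Apply previously learned parameters to any set of rows.
apply_scaler <- function(x, params) {
  sweep(sweep(x, 2, params$center, "-"), 2, params$scale, "/")
}

folds <- lapply(sort(unique(fold_id)), function(k) {
  test_rows <- which(fold_id == k)
  params <- fit_scaler(X[-test_rows, , drop = FALSE])
  list(
    train = apply_scaler(X[-test_rows, , drop = FALSE], params),
    test = apply_scaler(X[test_rows, , drop = FALSE], params),
    test_rows = test_rows
  )
})
```

Scaling the full matrix once, before splitting, would let the held-out samples contribute to the means and standard deviations, which is the kind of leakage the fold-aware design is meant to avoid.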

The framework is designed to integrate naturally with R/Bioconductor workflows, making it particularly suitable for omics and biomedical machine learning applications.

<p align="center">

<img src="man/figures/pipeML.svg?raw=true"/>

</p>

<p align="center">

<i> Figure 1. General structure of the `pipeML` machine learning pipeline. </i>

</p>

## Key Features

### End-to-end ML workflow

* Integrated pipeline for feature selection, model training, validation, prediction, and interpretation

### Leakage-aware validation

* Custom cross-validation fold construction
* Support for fold-aware feature recomputation
* Prevents information leakage when using dataset-dependent features

### Flexible model evaluation

* Repeated and stratified k-fold cross-validation
* Leave-one-dataset-out (LODO) evaluation for cross-cohort generalization
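
To make the two evaluation schemes concrete, here is a minimal base R sketch of how such splits could be built. The functions `stratified_folds`, `repeated_stratified_folds`, and `lodo_splits` are hypothetical helpers written for this example and are not the functions `pipeML` uses internally.

``` r
# Illustrative sketch only; these helpers are not part of pipeML.

# Assign each sample to one of k folds while preserving class proportions.
stratified_folds <- function(y, k) {
  y <- as.character(y)
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Shuffle fold labels within the class so every fold gets its share.
    fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold
}

# Repeat the stratified assignment n_rep times with fresh shuffles.
repeated_stratified_folds <- function(y, k, n_rep) {
  lapply(seq_len(n_rep), function(r) stratified_folds(y, k))
}

# Leave-one-dataset-out: each cohort is held out once as the test set.
lodo_splits <- function(dataset) {
  all_rows <- seq_along(dataset)
  lapply(split(all_rows, dataset), function(test_rows) {
    list(train = setdiff(all_rows, test_rows), test = test_rows)
  })
}

set.seed(42)
y <- factor(rep(c("0", "1"), times = c(30, 20)))
cohort <- rep(c("A", "B", "C"), length.out = 50)

reps <- repeated_stratified_folds(y, k = 5, n_rep = 3)
table(reps[[1]], y)  # each fold holds 6 negatives and 4 positives
splits <- lodo_splits(cohort)
```

Stratification keeps the class ratio stable across folds, while LODO tests whether a model trained on some cohorts transfers to a cohort it has never seen.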

### Feature selection

* Boruta-based feature selection
* Optional correlation-based feature filtering
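
As an illustration of correlation-based filtering, the sketch below greedily drops any feature whose absolute Pearson correlation with an already retained feature reaches a cutoff. The function `drop_correlated` and its `cutoff` default are assumptions made for this example; the filter implemented in `pipeML` may differ.

``` r
# Illustrative sketch only; not the pipeML implementation.
drop_correlated <- function(x, cutoff = 0.9) {
  cor_mat <- abs(cor(x))
  keep <- character(0)
  for (feature in colnames(x)) {
    # Retain a feature only if it is not highly correlated with a kept one.
    if (all(cor_mat[feature, keep] < cutoff)) keep <- c(keep, feature)
  }
  x[, keep, drop = FALSE]
}

set.seed(7)
a <- rnorm(100)
x <- cbind(a = a, a_copy = a + rnorm(100, sd = 0.01), b = rnorm(100))
filtered <- drop_correlated(x)
colnames(filtered)  # "a" "b": the near-duplicate column is removed
```

In a leakage-aware setting, a filter like this would be fitted on the training rows of each fold and the selected columns then applied to the held-out rows.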

### Hyperparameter tuning

* Automatic optimization based on:

  * AUROC
  * AUPRC
  * Accuracy
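
For readers unfamiliar with these metrics, the following base R sketch shows one common way to compute each of them from predicted scores. The functions are hypothetical helpers for illustration, not the ones `pipeML` calls; AUPRC is approximated here by average precision, and ties in the scores are not given special treatment in that function.

``` r
# Illustrative sketch only; these helpers are not part of pipeML.

# AUROC via the rank-sum (Mann-Whitney) formulation.
auroc <- function(scores, labels, positive = "1") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  ranks <- rank(scores)  # average ranks are used for tied scores
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Average precision: mean precision at the rank of each positive sample.
average_precision <- function(scores, labels, positive = "1") {
  ord <- order(scores, decreasing = TRUE)
  is_pos <- (labels == positive)[ord]
  precision_at_k <- cumsum(is_pos) / seq_along(is_pos)
  mean(precision_at_k[is_pos])
}

# Accuracy after thresholding the scores.
accuracy <- function(scores, labels, positive = "1", threshold = 0.5) {
  mean((scores >= threshold) == (labels == positive))
}

labels <- c("0", "0", "1", "1")
scores <- c(0.1, 0.4, 0.35, 0.8)
auroc(scores, labels)              # 0.75
average_precision(scores, labels)  # 0.8333333
accuracy(scores, labels)           # 0.75
```

AUROC and AUPRC are threshold-free, whereas accuracy depends on the chosen cutoff; AUPRC is often the more informative choice when the positive class is rare.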

### Model interpretation

* SHAP-based feature importance
* Variable importance summaries
* Performance visualization (ROC and PR curves)

### Ensemble learning

* Model stacking
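
Stacking in general means training a meta-learner on the out-of-fold predictions of several base models. The sketch below shows the principle with two logistic regressions as base learners and a third as meta-learner, using only base R. The data, base models, and variable names are assumptions chosen for illustration and do not reflect how `pipeML` configures its stacked models.

``` r
# Illustrative sketch only; not the pipeML stacking implementation.
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame(y = rbinom(n, 1, plogis(x1 + x2)), x1 = x1, x2 = x2)
fold <- sample(rep(1:5, length.out = n))

# Collect out-of-fold predictions so the meta-learner never sees
# predictions made on samples the base learner was trained on.
oof <- matrix(NA_real_, nrow = n, ncol = 2,
              dimnames = list(NULL, c("m1", "m2")))
for (k in 1:5) {
  train <- dat[fold != k, ]
  test <- dat[fold == k, ]
  m1 <- glm(y ~ x1, data = train, family = binomial)
  m2 <- glm(y ~ x2, data = train, family = binomial)
  oof[fold == k, "m1"] <- predict(m1, test, type = "response")
  oof[fold == k, "m2"] <- predict(m2, test, type = "response")
}

# The meta-learner combines the base-model predictions.
meta <- glm(y ~ m1 + m2, data = data.frame(y = dat$y, oof),
            family = binomial)
coef(meta)
```

Using out-of-fold rather than in-sample predictions keeps the meta-learner from rewarding base models that merely overfit the training data.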

### Parallel computing

* Multi-core support for faster model training and cross-validation

### Custom workflows

* Users can define custom fold construction functions
* After hyperparameter optimization, these functions can receive a `bestTune` argument so that models are retrained on the full training dataset with the selected hyperparameters

## Supported Machine Learning Methods

### Classification algorithms

For classification tasks, `pipeML` benchmarks a diverse set of algorithms on the fly, relying extensively on the R package `caret`.

- Bagged classification trees
- Random forests
- C5.0 decision trees
- Regularized logistic regression (elastic net)
- k-nearest neighbors (KNN)
- Classification and regression trees (CART)
- Lasso regression
- Ridge regression
- Support vector machines with linear and radial kernels
- Extreme Gradient Boosting (XGBoost)

### Survival algorithms

For time-to-event outcomes, `pipeML` implements a unified survival modeling framework based on the `parsnip` and `workflows` ecosystems, enabling consistent training, hyperparameter tuning, and evaluation across multiple survival model families.

- Cox proportional hazards model
- Elastic net–regularized Cox regression
- Parametric accelerated failure time (AFT) models
- Conditional inference survival trees
- Bagged CART survival models
- Random survival forests
- Gradient boosting for censored outcomes

## General usage

Below are basic examples showing how to use `pipeML`.

For a detailed tutorial, see [Get started](https://VeraPancaldiLab.github.io/pipeML/articles/pipeML.html).

``` r
library(pipeML)
```

### Training models

``` r
res <- compute_features.training.ML(features_train = X_train,
                                    target_var = y_train,
                                    task_type = "classification",
                                    trait.positive = "1",
                                    metric = "AUROC",
                                    k_folds = 5,
                                    n_rep = 10,
                                    return = FALSE)
```

### Predicting on new data

``` r
pred <- compute_prediction(model = res$Model,
                           test_data = X_test,
                           target_var = y_test,
                           task_type = "classification",
                           trait.positive = "1",
                           return = FALSE)
```

### Training and Testing Workflow

``` r
res <- compute_features.ML(features_train = X_train,
                           features_test = X_test,
                           coldata = data,
                           task_type = "classification",
                           trait = "target",
                           trait.positive = "1",
                           metric = "AUROC",
                           k_folds = 5,
                           n_rep = 10,
                           ncores = 2)
```

## Issues

If you encounter any problems or have questions about the package, we encourage you to open an issue [here](https://github.com/VeraPancaldiLab/pipeML/issues). We’ll do our best to assist you!

## Authors

`pipeML` was developed by [Marcelo Hurtado](https://github.com/mhurtado13) under the supervision of [Vera Pancaldi](https://github.com/VeraPancaldi) and is part of the work of the [Pancaldi](https://github.com/VeraPancaldiLab) team. Marcelo is currently the primary maintainer of this package.

## Citing pipeML

If you use `pipeML` in a scientific publication, please cite:

Hurtado M, Pancaldi V (2026). pipeML: A flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction. R package version 0.0.1, https://verapancaldilab.github.io/pipeML