This repo is a script-based pipeline to:
- prepare screened/unscreened articles,
- optionally generate OpenAI embeddings,
- train and tune LASSO classifiers (TF‑IDF + embedding-based),
- evaluate and pick thresholds (Youden or fixed sensitivity),
- predict inclusion probabilities for previously unscreened records.
The scripts are designed to be run in numeric order:

1. `scripts/01_prepare.R`
2. `scripts/02_embeddings.R`
3. `scripts/03_training.R`
4. `scripts/04_evaluate.R`
5. `scripts/05_predict.R`
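For example, the whole pipeline can be driven from a fresh R session in the repo root (a sketch; `02_embeddings.R` is only needed if you train the embedding models):

```r
# Run the pipeline scripts in numeric order from the repo root.
pipeline <- c(
  "scripts/01_prepare.R",
  "scripts/02_embeddings.R",  # skip if training TF-IDF only
  "scripts/03_training.R",
  "scripts/04_evaluate.R",
  "scripts/05_predict.R"
)
for (script in pipeline) source(script, echo = TRUE)
```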
Your raw `.rds` file should contain (at minimum) these columns:

- `article_id` (unique identifier)
- `title`
- `abstract`
- a label column (logical `TRUE`/`FALSE`) for screened records (e.g. `include`)
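As an illustration, a minimal raw file with the expected shape could be created like this (the file name and toy values are examples only):

```r
library(tibble)

# Toy raw data: two screened records (label set) and one unscreened (label NA).
raw <- tibble(
  article_id = c("a1", "a2", "a3"),
  title      = c("Trial of drug X", "Survey of Y", "Case report Z"),
  abstract   = c("A randomised trial...", "We surveyed...", "We report..."),
  include    = c(TRUE, FALSE, NA)
)

dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
saveRDS(raw, "data/raw/example.rds")
```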
`01_prepare.R` standardises column names via its config section, so your raw columns can be renamed to the expected ones.
The scripts load packages including:

- `tidymodels`, `textrecipes`, `themis`
- `dplyr`, `readr`, `tibble`, `purrr`, `tidyr`, `stringr`
- `reticulate` (to call the Python OpenAI client)
Install in R (example):

```r
install.packages(c(
  "tidymodels","textrecipes","themis","dplyr","readr","tibble",
  "purrr","tidyr","stringr","reticulate","doParallel","tictoc"
))
```

`02_embeddings.R` uses `reticulate` to run Python code that calls the OpenAI API.
Create/choose a Python env that contains the OpenAI client:

```sh
pip install openai
```

Then set your key:

```sh
export OPENAI_API_KEY="sk-..."
```

Edit the paths at the top:

```r
RAW_PATH <- "data/raw/<yourfile>.rds"
OUT_FILE <- "<name>"
```
Then run `scripts/01_prepare.R`. It produces:

- `data/processed/<OUT_FILE>_df.rds` (screened records with labels)
- `data/processed/<OUT_FILE>_df_to_predict.rds` (unscreened records)
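A quick sanity check on the prepared files (here `"example"` is a placeholder for whatever you set as `OUT_FILE`):

```r
out_file <- "example"  # placeholder OUT_FILE

df         <- readRDS(file.path("data/processed", paste0(out_file, "_df.rds")))
to_predict <- readRDS(file.path("data/processed", paste0(out_file, "_df_to_predict.rds")))

# The prepared screened data should carry id, text, and label columns.
stopifnot(all(c("article_id", "title", "abstract") %in% names(df)))
cat("screened:", nrow(df), " unscreened:", nrow(to_predict), "\n")
```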
This file defines helpers used by training/prediction:

- `run_embed(df, model, batch_size, verbose)`
- `make_cost_table(logs, price_table)`
Models:

- `text-embedding-3-small`
- `text-embedding-3-large`
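A usage sketch for the `run_embed()` helper (the signature comes from the repo; the argument values and return handling here are assumptions, adjust to the script's actual behaviour):

```r
# Assumed usage of the repo's run_embed() helper.
df <- readRDS("data/processed/example_df.rds")  # "example" is a placeholder OUT_FILE

emb_small <- run_embed(
  df,
  model      = "text-embedding-3-small",
  batch_size = 100,   # assumption: records sent to the API in batches of 100
  verbose    = TRUE
)
```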
If you generate embeddings, they are saved in:

- `output/<OUT_FILE>/<OUT_FILE>_embed_small.rds`
- `output/<OUT_FILE>/<OUT_FILE>_embed_large.rds`

API-call logs go in `logs/<OUT_FILE>/`. The script also generates a table of the costs of running the embedding models, saved in `results/tables/<OUT_FILE>_embedding_cost_summary.csv`.
In the config section set:

```r
OUT_FILE <- "<name>"
```
Inputs:

- `data/processed/<OUT_FILE>_df.rds`
- embedding files in `output/<OUT_FILE>/` (if training embedding models)
This script:

- splits train/test (default 80/20, stratified)
- builds CV folds (5x10 repetitions)
- fits and tunes:
  - a TF‑IDF model (text recipe)
  - embedding models (small/large) using embedding columns
  - NOTE: `step_upsample` is used to handle unbalanced datasets (e.g., positive-class prevalence of 0.2%); if the dataset is balanced this step can be omitted.
- saves workflows, best params, final fits, and CV predictions to `output/<OUT_FILE>/`
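The TF‑IDF workflow is roughly this shape (a minimal sketch, not the repo's exact recipe; the column names `include`/`abstract`, the token cap, and the tuning grid are assumptions):

```r
library(tidymodels)
library(textrecipes)
library(themis)

set.seed(42)

df <- readRDS("data/processed/example_df.rds")  # placeholder OUT_FILE
df$include <- factor(df$include)

split <- initial_split(df, prop = 0.8, strata = include)
# 5 repeats of 10-fold CV (one reading of "5x10"; adjust if the repo differs)
folds <- vfold_cv(training(split), v = 10, repeats = 5, strata = include)

rec <- recipe(include ~ abstract, data = training(split)) |>
  step_tokenize(abstract) |>
  step_tokenfilter(abstract, max_tokens = 1000) |>
  step_tfidf(abstract) |>
  step_upsample(include)  # handles the rare positive class; skip if balanced

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(
    logistic_reg(penalty = tune(), mixture = 1) |>  # mixture = 1 => LASSO
      set_engine("glmnet")
  )

tuned <- tune_grid(wf, resamples = folds, grid = 20)
best  <- select_best(tuned, metric = "roc_auc")
```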
Also set:

```r
OUT_FILE <- "<name>"
```
This script:

- loads CV predictions + best params + final fits
- computes exclusion-rate curves and an optimised threshold using:
  - Youden's J (`youden_exclusion()`)
  - fixed sensitivity (default NULL) (`sens_exclusion()`)
- evaluates on the held-out test set using the chosen threshold
- writes metrics to `results/tables/` and saves plots in `figures/`
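Youden's J picks the threshold that maximises sensitivity + specificity − 1. A minimal sketch of that idea with `yardstick` (this is not the repo's `youden_exclusion()`; the column names `truth` and `.pred_TRUE` are assumptions):

```r
library(dplyr)
library(yardstick)

# cv_preds: out-of-fold predictions with a factor `truth` and the
# predicted probability of inclusion in `.pred_TRUE`.
youden_threshold <- function(cv_preds) {
  roc_curve(cv_preds, truth, .pred_TRUE,
            event_level = "second") |>   # "TRUE" is the second factor level
    mutate(j = sensitivity + specificity - 1) |>
    slice_max(j, n = 1, with_ties = FALSE) |>
    pull(.threshold)
}
```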
Set:

```r
OUT_FILE   <- "<name>"
BEST_MODEL <- "embed_small"  # or "embed_large" / "tfidf"
THRESHOLD  <- "youden"       # or "sens95"
```
Inputs:

- `data/processed/<OUT_FILE>_df_to_predict.rds`
- raw file for metadata join: `data/raw/<OUT_FILE>_full.rds`
- embedding file for unscreened records (if using embedding models): `output/<OUT_FILE>/<OUT_FILE>_to_predict_<BEST_MODEL>.rds`
This script:
- refits the chosen workflow on the full screened dataset
- loads the chosen threshold
- produces probabilities + predicted class for unscreened records
- writes `data/processed/<OUT_FILE>_predictions_on_raw.csv`
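The predict step amounts to something like this (a sketch; `final_wf` stands in for the refitted workflow saved by training, and the threshold value is a placeholder for whatever `04_evaluate.R` chose):

```r
library(tidymodels)
library(readr)

# Assumptions: final_wf is the workflow refit on the full screened dataset,
# and thr is the threshold selected in 04_evaluate.R.
to_predict <- readRDS("data/processed/example_df_to_predict.rds")
thr <- 0.12  # placeholder value

preds <- predict(final_wf, to_predict, type = "prob") |>
  dplyr::bind_cols(to_predict["article_id"]) |>
  dplyr::mutate(.pred_class = .pred_TRUE >= thr)  # probability + predicted class

write_csv(preds, "data/processed/example_predictions_on_raw.csv")
```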
```mermaid
flowchart TB
  A["Raw data<br>article_id • title • abstract • label (if exists)"] --> B["01_prepare.R<br>Clean text + standardise columns"]
  B -- Label exists --> E{"Use embeddings?"}
  B -- No label yet --> M["05_predict.R<br>Last model fit on full labelled dataset<br>Using best parameters<br>Predict on new records"]
  E -- No --> F["03_training.R<br>Train TF-IDF model<br>Cross-validation → best model parameters → final fit for evaluation"]
  E -- Yes --> G["02_embeddings.R<br>Create embeddings"]
  G --> L["03_training.R<br>Train embedding model<br>Cross-validation → best model parameters → final fit for evaluation"]
  F --> J["04_evaluate.R<br>Metrics + plots + thresholds"]
  L --> J
  J --> M
  M --> N["Predictions output<br>predictions.csv (probability + class)"]
```
- Random seeds are set in training (`SEED <- 42`).
- For consistent embedding results, keep the same input pre-processing (`clean_text()`).
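`clean_text()` is defined in the repo; for reference, a typical implementation of such a cleaner looks like this (an assumption, not the repo's code):

```r
library(stringr)

# Hypothetical cleaner: lower-case, strip non-alphanumerics, squeeze whitespace.
clean_text_sketch <- function(x) {
  x |>
    str_to_lower() |>
    str_replace_all("[^a-z0-9 ]", " ") |>
    str_squish()
}

clean_text_sketch("A Randomised  Trial (Phase III)!")
#> "a randomised trial phase iii"
```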