francivita/MedRev
Article Screening ML Pipeline (R + tidymodels + OpenAI embeddings)

This repo is a script-based pipeline to:

  1. prepare screened/unscreened articles,
  2. optionally generate OpenAI embeddings,
  3. train and tune LASSO classifiers (TF‑IDF + embedding-based),
  4. evaluate and pick thresholds (Youden or fixed sensitivity),
  5. predict inclusion probabilities for previously unscreened records.

The scripts are designed to be run in numeric order:

  • scripts/01_prepare.R
  • scripts/02_embeddings.R
  • scripts/03_training.R
  • scripts/04_evaluate.R
  • scripts/05_predict.R
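From an R session, the numeric order can be followed by sourcing each script in turn (a convenience sketch; each script also runs fine via Rscript from the shell, and each has a config section to edit first):

```r
# Source the pipeline scripts in numeric order; edit each script's config
# section (paths, OUT_FILE, etc.) before running.
scripts <- c("01_prepare", "02_embeddings", "03_training",
             "04_evaluate", "05_predict")
for (s in scripts) {
  source(file.path("scripts", paste0(s, ".R")), echo = TRUE)
}
```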

Data

Your raw .rds file should contain (at minimum) these columns:

  • article_id (unique identifier)
  • title
  • abstract
  • a label column (logical TRUE/FALSE) for screened records (e.g. include)

01_prepare.R standardises column names via its config section, so your raw columns can be mapped to the names above.
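The renaming step can be pictured like this (a hypothetical sketch; the actual variable names inside 01_prepare.R's config section may differ, and your raw column names will be your own):

```r
# Map raw column names (right) to the standardised names (left) that the
# rest of the pipeline expects. These raw names are illustrative only.
library(dplyr)

col_map <- c(
  article_id = "ID",                 # your raw unique identifier column
  title      = "Title",
  abstract   = "Abstract",
  include    = "screened_decision"   # logical TRUE/FALSE screening label
)

raw <- readRDS("data/raw/yourfile.rds")
df  <- rename(raw, all_of(col_map))
```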


Setup

R packages

The scripts load packages including:

  • tidymodels, textrecipes, themis, dplyr, readr, tibble, purrr, tidyr, stringr
  • reticulate (to call the Python OpenAI client)

Install in R (example):

install.packages(c(
  "tidymodels","textrecipes","themis","dplyr","readr","tibble",
  "purrr","tidyr","stringr","reticulate","doParallel","tictoc"
))

Python (for embeddings)

02_embeddings.R uses reticulate to run Python code that calls the OpenAI API.

Create/choose a Python env that contains:

pip install openai

Then set your key:

export OPENAI_API_KEY="sk-..."
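Before running 02_embeddings.R, it can help to check from R that reticulate sees the right environment and key (the virtualenv path below is an example, not something the repo prescribes):

```r
# Point reticulate at a Python environment that has the openai package
# installed, and fail early if the package or API key is missing.
library(reticulate)
use_virtualenv("~/.virtualenvs/medrev", required = TRUE)
openai <- import("openai")                         # errors if not installed
stopifnot(nzchar(Sys.getenv("OPENAI_API_KEY")))    # key must be set
```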

How to run

1) Prepare processed datasets (01_prepare.R)

Edit the paths at the top:

  • RAW_PATH <- "data/raw/<yourfile>.rds"
  • OUT_FILE <- "<name>"

Then run the script. It produces:

  • data/processed/<OUT_FILE>_df.rds (screened records with labels)
  • data/processed/<OUT_FILE>_df_to_predict.rds (unscreened records)

2) Generate embeddings (optional) (02_embeddings.R)

This file defines helpers used by training/prediction:

  • run_embed(df, model, batch_size, verbose)
  • make_cost_table(logs, price_table)

Models:

  • text-embedding-3-small
  • text-embedding-3-large

If you generate embeddings, they are saved in:

  • output/<OUT_FILE>/<OUT_FILE>_embed_small.rds
  • output/<OUT_FILE>/<OUT_FILE>_embed_large.rds

API-call logs are written to logs/<OUT_FILE>/. The script also generates a table of the costs of running the embedding models, saved to results/tables/<OUT_FILE>_embedding_cost_summary.csv.
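A hypothetical usage of the helpers above (the argument names follow the signatures listed; the batch size and OUT_FILE value are illustrative, and the real defaults in 02_embeddings.R may differ):

```r
# Embed the prepared screened records and save the result where the
# training script expects it.
source("scripts/02_embeddings.R")

OUT_FILE <- "myreview"   # example name; set to your own
df <- readRDS(file.path("data/processed", paste0(OUT_FILE, "_df.rds")))

emb_small <- run_embed(df,
                       model      = "text-embedding-3-small",
                       batch_size = 100,
                       verbose    = TRUE)

saveRDS(emb_small,
        file.path("output", OUT_FILE, paste0(OUT_FILE, "_embed_small.rds")))
```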

3) Train and tune models (03_training.R)

In the config section set:

  • OUT_FILE <- "<name>"

Inputs:

  • data/processed/<OUT_FILE>_df.rds
  • embedding files in output/<OUT_FILE>/ (if training embedding models)

This script:

  • splits train/test (default 80/20, stratified)
  • builds cross-validation folds (5 folds × 10 repeats)
  • fits and tunes:
    • NOTE: step_upsample is used to handle class imbalance (e.g., a positive-class prevalence of 0.2%); if the dataset is balanced, this step can be removed.
    • TF‑IDF model (text recipe)
    • embedding models (small/large) using embedding columns
  • saves workflows, best params, final fits, and CV predictions to output/<OUT_FILE>/
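A minimal sketch of the kind of TF-IDF + LASSO workflow the script tunes (the recipe steps, token limit, and grid size here are assumptions; see 03_training.R for the actual specification):

```r
# TF-IDF text recipe with upsampling, feeding a tuned LASSO
# (glmnet with mixture = 1) via tidymodels.
library(tidymodels)
library(textrecipes)
library(themis)

rec <- recipe(include ~ title, data = train) |>
  step_tokenize(title) |>
  step_tokenfilter(title, max_tokens = 1000) |>
  step_tfidf(title) |>
  step_upsample(include)   # handles the class imbalance noted above

lasso <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(lasso)

folds <- vfold_cv(train, v = 5, repeats = 10, strata = include)
res   <- tune_grid(wf, resamples = folds, grid = 20)
```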

4) Evaluate and choose thresholds (04_evaluate.R)

Also set:

  • OUT_FILE <- "<name>"

This script:

  • loads CV predictions + best params + final fits
  • computes exclusion-rate curves and an optimised threshold using:
    • Youden’s J (youden_exclusion())
    • fixed sensitivity (default NULL) (sens_exclusion())
  • evaluates on the held-out test set using the chosen threshold
  • writes metrics to results/tables/ and saves plots in figures/
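Youden's J is sensitivity + specificity − 1, and the optimised threshold is the probability cut-off that maximises it. A sketch of that search over CV predictions (the real youden_exclusion() in 04_evaluate.R may differ in interface and details):

```r
# Grid-search the probability threshold that maximises Youden's J.
# truth: logical vector of true labels; prob: predicted P(include).
youden_threshold <- function(truth, prob, grid = seq(0.01, 0.99, by = 0.01)) {
  j <- vapply(grid, function(t) {
    pred <- prob >= t
    sens <- mean(pred[truth])      # TP / (TP + FN)
    spec <- mean(!pred[!truth])    # TN / (TN + FP)
    sens + spec - 1
  }, numeric(1))
  grid[which.max(j)]
}
```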

5) Predict unscreened records (05_predict.R)

Set:

  • OUT_FILE <- "<name>"
  • BEST_MODEL <- "embed_small" | "embed_large" | "tfidf"
  • THRESHOLD <- "youden" | "sens95"

Inputs:

  • data/processed/<OUT_FILE>_df_to_predict.rds
  • raw file for metadata join: data/raw/<OUT_FILE>_full.rds
  • embedding file for unscreened records (if using embedding models):
    • output/<OUT_FILE>/<OUT_FILE>_to_predict_<BEST_MODEL>.rds

This script:

  • refits the chosen workflow on the full screened dataset
  • loads the chosen threshold
  • produces probabilities + predicted class for unscreened records
  • writes:
    • data/processed/<OUT_FILE>_predictions_on_raw.csv
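The final step can be pictured as follows (a hypothetical sketch mirroring 05_predict.R: `best_workflow`, `screened_df`, `unscreened_df`, and `threshold` stand in for the objects the script actually loads):

```r
# Refit the chosen workflow on all screened data, then classify the
# unscreened records at the chosen threshold.
library(tidymodels)

final_fit <- fit(best_workflow, data = screened_df)
probs <- predict(final_fit, new_data = unscreened_df, type = "prob")$.pred_TRUE

preds <- tibble::tibble(
  article_id   = unscreened_df$article_id,
  prob_include = probs,
  pred_include = probs >= threshold
)
readr::write_csv(preds, "data/processed/predictions_on_raw.csv")
```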

Diagram of the pipeline

flowchart TB
    A["Raw data<br>article_id • title • abstract • label (if exists)"] --> B["01_prepare.R<br>Clean text + standardise columns"]

    B -- Label exists --> E{"Use embeddings?"}
    B -- No label yet --> M["05_predict.R<br>Last model fit on full labelled dataset<br>Using best parameters<br>Predict on new records"]

    E -- No --> F["03_training.R<br>Train TF-IDF model<br>Cross-validation → best model parameters → final fit for evaluation"]
    E -- Yes --> G["02_embeddings.R<br>Create embeddings"]
    G --> L["03_training.R<br>Train embedding model<br>Cross-validation → best model parameters → final fit for evaluation"]

    F --> J["04_evaluate.R<br>Metrics + plots + thresholds"]
    L --> J

    J --> M
    M --> N["Predictions output<br>predictions.csv (probability + class)"]

Reproducibility

  • Random seeds are set in training (SEED <- 42).
  • For consistent embedding results, keep the same input pre-processing (clean_text()).
