This repo is a script-based pipeline to:
- prepare screened/unscreened articles,
- optionally generate OpenAI embeddings,
- train and tune LASSO classifiers (TF‑IDF + embedding-based),
- evaluate and pick thresholds (Youden or fixed sensitivity),
- predict inclusion probabilities for previously unscreened records.
The scripts are designed to be run in numeric order:

1. `scripts/01_prepare.R`
2. `scripts/02_embeddings.R`
3. `scripts/03_training.R`
4. `scripts/04_evaluate.R`
5. `scripts/05_predict.R`
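For example, the whole pipeline can be driven from a fresh R session in the repo root (a sketch; `02_embeddings.R` is only needed if you train the embedding models):

```r
# Run the pipeline scripts in numeric order from the repo root.
pipeline <- c(
  "scripts/01_prepare.R",
  "scripts/02_embeddings.R",  # skip if training TF-IDF only
  "scripts/03_training.R",
  "scripts/04_evaluate.R",
  "scripts/05_predict.R"
)
for (script in pipeline) source(script, echo = TRUE)
```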
Your raw `.rds` file should contain (at minimum) these columns:

- `article_id` (unique identifier)
- `title`
- `abstract`
- a label column (logical `TRUE`/`FALSE`) for screened records (e.g. `include`)
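As an illustration, a minimal raw file with the expected shape could be created like this (the file name and toy values are examples only):

```r
library(tibble)

# Toy raw data: two screened records (label set) and one unscreened (label NA).
raw <- tibble(
  article_id = c("a1", "a2", "a3"),
  title      = c("Trial of drug X", "Survey of Y", "Case report Z"),
  abstract   = c("A randomised trial...", "We surveyed...", "We report..."),
  include    = c(TRUE, FALSE, NA)
)

dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
saveRDS(raw, "data/raw/example.rds")
```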
`01_prepare.R` standardises column names via its config section, so your raw columns can be renamed to the expected ones.
The scripts load packages including:

- `tidymodels`, `textrecipes`, `themis`
- `dplyr`, `readr`, `tibble`, `purrr`, `tidyr`, `stringr`
- `reticulate` (to call the Python OpenAI client)
Install in R (example):

```r
install.packages(c(
  "tidymodels","textrecipes","themis","dplyr","readr","tibble",
  "purrr","tidyr","stringr","reticulate","doParallel","tictoc"
))
```

`02_embeddings.R` uses `reticulate` to run Python code that calls the OpenAI API.
Create/choose a Python env that contains the OpenAI client:

```sh
pip install openai
```

Then set your key:

```sh
export OPENAI_API_KEY="sk-..."
```

Edit the paths at the top:

```r
RAW_PATH <- "data/raw/<yourfile>.rds"
OUT_FILE <- "<name>"
```
Then run `scripts/01_prepare.R`. It produces:

- `data/processed/<OUT_FILE>_df.rds` (screened records with labels)
- `data/processed/<OUT_FILE>_df_to_predict.rds` (unscreened records)
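A quick sanity check on the prepared files (here `"example"` is a placeholder for whatever you set as `OUT_FILE`):

```r
out_file <- "example"  # placeholder OUT_FILE

df         <- readRDS(file.path("data/processed", paste0(out_file, "_df.rds")))
to_predict <- readRDS(file.path("data/processed", paste0(out_file, "_df_to_predict.rds")))

# The prepared screened data should carry id, text, and label columns.
stopifnot(all(c("article_id", "title", "abstract") %in% names(df)))
cat("screened:", nrow(df), " unscreened:", nrow(to_predict), "\n")
```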
This file defines helpers used by training/prediction:

- `run_embed(df, model, batch_size, verbose)`
- `make_cost_table(logs, price_table)`
Models:

- `text-embedding-3-small`
- `text-embedding-3-large`
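A usage sketch for the `run_embed()` helper (the signature comes from the repo; the argument values and return handling here are assumptions, adjust to the script's actual behaviour):

```r
# Assumed usage of the repo's run_embed() helper.
df <- readRDS("data/processed/example_df.rds")  # "example" is a placeholder OUT_FILE

emb_small <- run_embed(
  df,
  model      = "text-embedding-3-small",
  batch_size = 100,   # assumption: records sent to the API in batches of 100
  verbose    = TRUE
)
```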
If you generate embeddings, they are saved in:

- `output/<OUT_FILE>/<OUT_FILE>_embed_small.rds`
- `output/<OUT_FILE>/<OUT_FILE>_embed_large.rds`

API-call logs go in `logs/<OUT_FILE>/`. The script also generates a table of the costs of running the embedding models, saved in `results/tables/<OUT_FILE>_embedding_cost_summary.csv`.
In the config section set:

```r
OUT_FILE <- "<name>"
```
Inputs:

- `data/processed/<OUT_FILE>_df.rds`
- embedding files in `output/<OUT_FILE>/` (if training embedding models)
This script:

- splits train/test (default 80/20, stratified)
- builds CV folds (5x10 repetitions)
- fits and tunes:
  - a TF‑IDF model (text recipe)
  - embedding models (small/large) using embedding columns
  - NOTE: `step_upsample` is used to handle unbalanced datasets (e.g., positive-class prevalence of 0.2%); if the dataset is balanced this step can be omitted.
- saves workflows, best params, final fits, and CV predictions to `output/<OUT_FILE>/`
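The TF‑IDF workflow is roughly this shape (a minimal sketch, not the repo's exact recipe; the column names `include`/`abstract`, the token cap, and the tuning grid are assumptions):

```r
library(tidymodels)
library(textrecipes)
library(themis)

set.seed(42)

df <- readRDS("data/processed/example_df.rds")  # placeholder OUT_FILE
df$include <- factor(df$include)

split <- initial_split(df, prop = 0.8, strata = include)
# 5 repeats of 10-fold CV (one reading of "5x10"; adjust if the repo differs)
folds <- vfold_cv(training(split), v = 10, repeats = 5, strata = include)

rec <- recipe(include ~ abstract, data = training(split)) |>
  step_tokenize(abstract) |>
  step_tokenfilter(abstract, max_tokens = 1000) |>
  step_tfidf(abstract) |>
  step_upsample(include)  # handles the rare positive class; skip if balanced

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(
    logistic_reg(penalty = tune(), mixture = 1) |>  # mixture = 1 => LASSO
      set_engine("glmnet")
  )

tuned <- tune_grid(wf, resamples = folds, grid = 20)
best  <- select_best(tuned, metric = "roc_auc")
```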
Also set:

```r
OUT_FILE <- "<name>"
```
This script:

- loads CV predictions + best params + final fits
- computes exclusion-rate curves and an optimised threshold using:
  - Youden's J (`youden_exclusion()`)
  - fixed sensitivity (default NULL) (`sens_exclusion()`)
- evaluates on the held-out test set using the chosen threshold
- writes metrics to `results/tables/` and saves plots in `figures/`
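Youden's J picks the threshold that maximises sensitivity + specificity − 1. A minimal sketch of that idea with `yardstick` (this is not the repo's `youden_exclusion()`; the column names `truth` and `.pred_TRUE` are assumptions):

```r
library(dplyr)
library(yardstick)

# cv_preds: out-of-fold predictions with a factor `truth` and the
# predicted probability of inclusion in `.pred_TRUE`.
youden_threshold <- function(cv_preds) {
  roc_curve(cv_preds, truth, .pred_TRUE,
            event_level = "second") |>   # "TRUE" is the second factor level
    mutate(j = sensitivity + specificity - 1) |>
    slice_max(j, n = 1, with_ties = FALSE) |>
    pull(.threshold)
}
```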
Set:

```r
OUT_FILE   <- "<name>"
BEST_MODEL <- "embed_small"  # or "embed_large" / "tfidf"
THRESHOLD  <- "youden"       # or "sens95"
```
Inputs:

- `data/processed/<OUT_FILE>_df_to_predict.rds`
- raw file for metadata join: `data/raw/<OUT_FILE>_full.rds`
- embedding file for unscreened records (if using embedding models): `output/<OUT_FILE>/<OUT_FILE>_to_predict_<BEST_MODEL>.rds`
This script:
- refits the chosen workflow on the full screened dataset
- loads the chosen threshold
- produces probabilities + predicted class for unscreened records
- writes `data/processed/<OUT_FILE>_predictions_on_raw.csv`
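The predict step amounts to something like this (a sketch; `final_wf` stands in for the refitted workflow saved by training, and the threshold value is a placeholder for whatever `04_evaluate.R` chose):

```r
library(tidymodels)
library(readr)

# Assumptions: final_wf is the workflow refit on the full screened dataset,
# and thr is the threshold selected in 04_evaluate.R.
to_predict <- readRDS("data/processed/example_df_to_predict.rds")
thr <- 0.12  # placeholder value

preds <- predict(final_wf, to_predict, type = "prob") |>
  dplyr::bind_cols(to_predict["article_id"]) |>
  dplyr::mutate(.pred_class = .pred_TRUE >= thr)  # probability + predicted class

write_csv(preds, "data/processed/example_predictions_on_raw.csv")
```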
```mermaid
flowchart TB
  A["Raw data<br>article_id • title • abstract • label (if exists)"] --> B["01_prepare.R<br>Clean text + standardise columns"]
  B -- Label exists --> E{"Use embeddings?"}
  B -- No label yet --> M["05_predict.R<br>Last model fit on full labelled dataset<br>Using best parameters<br>Predict on new records"]
  E -- No --> F["03_training.R<br>Train TF-IDF model<br>Cross-validation → best model parameters → final fit for evaluation"]
  E -- Yes --> G["02_embeddings.R<br>Create embeddings"]
  G --> L["03_training.R<br>Train embedding model<br>Cross-validation → best model parameters → final fit for evaluation"]
  F --> J["04_evaluate.R<br>Metrics + plots + thresholds"]
  L --> J
  J --> M
  M --> N["Predictions output<br>predictions.csv (probability + class)"]
```
- Random seeds are set in training (`SEED <- 42`).
- For consistent embedding results, keep the same input pre-processing (`clean_text()`).
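`clean_text()` is defined in the repo; for reference, a typical implementation of such a cleaner looks like this (an assumption, not the repo's code):

```r
library(stringr)

# Hypothetical cleaner: lower-case, strip non-alphanumerics, squeeze whitespace.
clean_text_sketch <- function(x) {
  x |>
    str_to_lower() |>
    str_replace_all("[^a-z0-9 ]", " ") |>
    str_squish()
}

clean_text_sketch("A Randomised  Trial (Phase III)!")
#> "a randomised trial phase iii"
```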