R workflow for preprocessing, analyzing, and annotating Illumina HumanMethylationEPIC hydroxymethylation data.
Developed and tested on Linux. Other platforms (e.g., macOS, Windows) might also work.
- A Linux-based computer (tested on Ubuntu)
R >= 4.5Bioconductor >= 3.2ChAMP >= 2.36
This workflow is built to:
- Import and preprocess bisulfite (BS) and oxidative bisulfite (oxBS) array data.
- Normalize data using NOOB/FunNorm/RAW and filter out problematic probes.
- Estimate sex, cell type proportions and predict smoking status and age.
- Run the MLML method to quantify hydroxymethylation (5hmC) levels.
preprocess_hydroxymethylation_data(
ox_file, bs_file,
annotation_array = "IlluminaHumanMethylationEPICv2",
annotation_version = "20a1.hg38",
normalization = "NOOB",
ChAMPfilter_arraytype_bs = "EPICv2",
ChAMPfilter_ProbeCutoff_bs = 0.01,
ChAMPfilter_arraytype_ox = "EPICv2",
ChAMPfilter_ProbeCutoff_ox = 0.01,
file_inaccuracies = NULL,
low_variance_threshold_hmc = 0,
predictSex = FALSE,
predictSmoking = FALSE,
predictAge = FALSE,
calculateCellPropPCs = FALSE,
plotCellProps = FALSE,
plotPCA = FALSE,
plotSVD = FALSE,
plotHmC = FALSE,
output_dir = getwd()
)| Argument | Type / Accepted values | Default | Description |
|---|---|---|---|
ox_file |
character (path) |
required | csv of metadata for OxBS arrays. |
bs_file |
character (path) |
required | csv of metadata for BS arrays. |
annotation_array |
Valid minfi array string |
"IlluminaHumanMethylationEPICv2" |
Probe annotation. |
annotation_version |
character |
"20a1.hg38" |
Annotation version. |
normalization |
"NOOB", "FUNORM", "RAW" |
"NOOB" |
Choose normalisation. |
ChAMPfilter_arraytype_bs |
"450K", "EPIC", "EPICv2" |
"EPICv2" |
Array type for ChAMP filter (BS). |
ChAMPfilter_ProbeCutoff_bs |
numeric 0-1 |
0.01 |
ProbeCutoff (BS). |
ChAMPfilter_arraytype_ox |
As above | "EPICv2" |
Array type for ChAMP filter (OxBS). |
ChAMPfilter_ProbeCutoff_ox |
numeric 0–1 |
0.01 |
ProbeCutoff (OxBS). |
file_inaccuracies |
NULL or path |
NULL |
Inaccurancies probes list (column IlmnID). |
low_variance_threshold_hmc |
numeric ≥ 0 |
0 |
Low variance threshold 5hmc. |
predictSex |
logical |
FALSE |
Add sex prediction via minfi::getSex(). |
predictSmoking |
logical |
FALSE |
Add smoking score via EpiSmokEr. |
predictAge |
logical |
FALSE |
Add DNAm age (Horvath) via wateRmelon. |
calculateCellPropPCs |
logical |
FALSE |
Estimate blood-cell composition-PCs. |
plotCellProps |
logical |
FALSE |
Save stacked-bar chart cell-composition plot. |
plotPCA |
logical |
FALSE |
Save PCA. |
plotSVD |
logical |
FALSE |
Save ChAMP SVD plots. |
plotHmC |
logical |
FALSE |
Save 5hmC density plot. |
output_dir |
character (path) |
getwd() |
Destination folder for all outputs. |
Each csv must contain one row per array and these five columns (case-sensitive):
| Column | Description | Example |
|---|---|---|
Sample_Name |
Unique experiment ID (overwritten internally). | S01 |
Array |
Illumina barcode (last 10 digits of iDAT filenames). | 1234567890 |
Slide |
Illumina slide ID (first part of iDAT filenames). | 204905210066 |
iDAT_PATH |
Directory containing Red + Grn iDATs for that slide. | /data/iDATs/ |
status |
Custom label (case, control, etc.). |
case |
Expected folder layout
iDAT_PATH/ └── SLIDE/ ├── SLIDE_ARRAY_Red.iDAT └── SLIDE_ARRAY_Grn.iDAT
Csv with a column IlmnID listing probes to exclude.
Everything is written to output_dir (default: working directory):
output_dir/
├── phenotype_table.csv # per-sample metadata (+ optional sex, PCs, etc.)
├── filtered_hmC.csv # long-format 5hmC after variance filtering
├── cell_props.png # optional barplot of blood-cell composition
├── explained_variance.png # Explained variance
├── Hydroxymethylation Density by Sample.png # optional 5hmC densities
└── SVDsummary.pdf # created when plotSVD = TRUE
The function returns an invisible list:
phenotype_df_bs– BS sample metadatafiltered_hmC– long-format 5hmC
- Read and validate metadata
- Read OxBS iDAT
- Read BS iDAT
- (Optional) predict sex
- Normalise (NOOB / FunNorm / Raw)
- ChAMP filter - BS
- ChAMP filter - OxBS
- Remove inaccuracies probes
- Build phenotype dataframe for BS
- (Optional) estimate cell proportions
- (Optional) predict smoking score
- (Optional) predict DNAm age (Horvath)
- Estimate 5hmC via MLML2R
- Low-variance filter & write outputs
Each step frees memory with rm(); gc().
library(HydroxymethylateR)
results <- preprocess_hydroxymethylation_data(
ox_file = "metadata_oxbs.csv",
bs_file = "metadata_bs.csv",
output_dir = "results"
)
# Access outputs
head(results$phenotype_df_bs)
head(results$filtered_hmC)The following R packages (from CRAN and Bioconductor) are required:
viridis,ggplot2,reshape2
FlowSorted.Blood.EPIC,sesame,wateRmelon,MLML2R,EpiSmokEr,minfi,ChAMP
To install this workflow:
devtools::intall_github("eirinisparaki/HydroxymethylateR")Here's the Markdown text you can add to your README.md to cover both the citation of your tool and a reference to the citation list for dependencies:
If you use HydroxymethylateR in your research, please cite:
- This GitHub repository: eirinisparaki/HydroxymethylateR
For citations of the R packages used in this project, please refer to CITATIONS.md.
For questions or collaborations, feel free to contact:
Eirini Sparaki
📧 sparakiirini@gmail.com
🔗 https://github.com/eirinisparaki
