This repository contains the code, notebooks, training scripts, and figure-generation workflows used to evaluate how different MHC-I allele representations influence neoantigen prediction. We compare several pseudosequence definitions (BigMHC, NetMHCpan-4.1, and a random baseline), evaluate a range of pseudosequence lengths, test ESM-based embeddings, and introduce a graph-derived embedding built from HLA nomenclature, P-groups, and supertypes.
Preprocessing
- Annotation-Based Embedding.ipynb — generates the annotation-based embeddings
- Pseudoseqs.ipynb — generates BigMHC EL, NetMHCpan-4.1, and random pseudosequences
- Gather_Testing_Results.ipynb — collects raw BigMHC outputs into a single file for downstream analysis
Visualization
Notebooks used to generate the plots for the figures in the paper:
- Panel_1.ipynb
- Panel_2.ipynb
- Panel_3.ipynb
- Plot_Length_Values.ipynb
Training and evaluation code
- run_bigmhc.sh — script to retrain BigMHC as done in the study.
  Usage: sbatch run_bigmhc.sh pseudoseqs.csv mhclen_value
  Example: sbatch run_bigmhc.sh Original_BigMHC_Pseudosequences.csv 414
- mhcenc.py — replace the original mhcenc.py in the BigMHC source code when training with float-valued embeddings.
- esm.py — code to generate the ESM embeddings.
- compute_auprc.py — computes AUPRC to identify the best training epoch.
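
As an illustration of the epoch-selection step, the sketch below computes a step-wise AUPRC (average precision) for each epoch's predictions and picks the epoch with the highest value. It is not the code in compute_auprc.py: the function names `average_precision` and `best_epoch`, the in-memory input format, and the choice of the step-wise estimator are all assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve for binary labels (0/1)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("AUPRC is undefined without positive examples")
    # Rank predictions from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    prev_recall = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        # Treat tied scores as a single threshold so their order does not matter.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_pos += ranked[j][1]
            j += 1
        recall = true_pos / n_pos
        precision = true_pos / j  # j predictions have been called positive so far
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area


def best_epoch(
    labels: Sequence[int], scores_by_epoch: Dict[int, Sequence[float]]
) -> Tuple[int, float]:
    """Return (epoch, AUPRC) for the epoch with the highest AUPRC."""
    auprc = {
        epoch: average_precision(labels, scores)
        for epoch, scores in scores_by_epoch.items()
    }
    epoch = max(auprc, key=auprc.get)
    return epoch, auprc[epoch]


if __name__ == "__main__":
    # Toy labels and per-epoch scores, for illustration only.
    labels = [1, 0, 1, 0]
    scores_by_epoch = {
        1: [0.9, 0.8, 0.7, 0.1],
        2: [0.9, 0.2, 0.8, 0.1],
    }
    print(best_epoch(labels, scores_by_epoch))
```

In practice the labels and scores would come from the held-out predictions written at each epoch; which split the study uses for this selection is not stated here.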
MHC-I allele representations compared
- BigMHC (30 residue positions)
- NetMHCpan-4.1 (34 residue positions)
- Random pseudosequence baseline
- Frozen ESM-2 embeddings (mean-pooled)
- Node2Vec embeddings using:
  - locus
  - two-field allele
  - supertypes
  - P-groups
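
To make the graph-derived representation more concrete, here is a minimal sketch of how an annotation graph over these four node types could be assembled as an edge list before being passed to a Node2Vec implementation. It is an assumption-laden illustration, not the procedure in Annotation-Based Embedding.ipynb: the helper names `split_allele` and `build_annotation_edges`, the node-name prefixes, the choice to link every annotation directly to the two-field allele, and the example supertype and P-group mappings are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[str, str]


def split_allele(allele: str) -> Tuple[str, str]:
    """Split e.g. 'HLA-A*02:01:01' into ('HLA-A', 'HLA-A*02:01')."""
    locus, sep, fields = allele.partition("*")
    parts = fields.split(":")
    if not sep or len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"expected at least a two-field allele name, got {allele!r}")
    return locus, f"{locus}*{parts[0]}:{parts[1]}"


def build_annotation_edges(
    alleles: Iterable[str],
    supertypes: Optional[Dict[str, str]] = None,
    p_groups: Optional[Dict[str, str]] = None,
) -> List[Edge]:
    """Connect each two-field allele node to its locus, supertype, and P-group nodes.

    `supertypes` and `p_groups` are hypothetical lookup tables keyed by
    two-field allele name; alleles missing from a table get no such edge.
    """
    supertypes = supertypes or {}
    p_groups = p_groups or {}
    edges = set()
    for allele in alleles:
        locus, two_field = split_allele(allele)
        node = f"allele:{two_field}"
        edges.add((node, f"locus:{locus}"))
        if two_field in supertypes:
            edges.add((node, f"supertype:{supertypes[two_field]}"))
        if two_field in p_groups:
            edges.add((node, f"pgroup:{p_groups[two_field]}"))
    # Sort so the edge list is deterministic across runs.
    return sorted(edges)


if __name__ == "__main__":
    # Illustrative inputs only; real mappings would come from curated annotation files.
    edges = build_annotation_edges(
        ["HLA-A*02:01:01", "HLA-B*07:02"],
        supertypes={"HLA-A*02:01": "A02"},
        p_groups={"HLA-A*02:01": "A*02:01P"},
    )
    for left, right in edges:
        print(left, right)
```

Prefixing node names with their type keeps a locus, supertype, or P-group from colliding with an allele of the same spelling; the resulting edge list is undirected in intent and can be loaded into whichever graph library the Node2Vec step expects.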
Systematic evaluation of pseudosequence windows from 5 to 100 residues.