Do Pseudosequences Matter in Neoantigen Prediction?

This repository contains the code, notebooks, training scripts, and figure-generation workflows used to evaluate how different MHC-I allele representations influence neoantigen prediction. We compare several pseudosequence definitions (BigMHC, NetMHCpan-4.1, and a random baseline), evaluate a range of pseudosequence lengths, test ESM-based embeddings, and introduce a graph-derived embedding built from HLA nomenclature, P-groups, and supertypes.

Repository Structure

`nb/`

Preprocessing

Annotation-Based Embedding.ipynb — generates the annotation-based embeddings
Pseudoseqs.ipynb — generates BigMHC EL, NetMHCpan-4.1, and random pseudosequences
Gather_Testing_Results.ipynb — collects raw BigMHC outputs into a single file for downstream analysis

Visualization
Notebooks used to generate the plots for the figures in the paper:

Panel_1.ipynb
Panel_2.ipynb
Panel_3.ipynb
Plot_Length_Values.ipynb

`src/`

Training and evaluation code

run_bigmhc.sh — script to retrain BigMHC as done in the study.
Usage: `sbatch run_bigmhc.sh pseudoseqs.csv mhclen_value`
Example: `sbatch run_bigmhc.sh Original_BigMHC_Pseudosequences.csv 414`
mhcenc.py — replace the original mhcenc.py in the BigMHC source code when training with float-valued embeddings.
esm.py — code to generate the ESM embeddings.
compute_auprc.py — computes AUPRC to identify the best training epoch.
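
As a rough illustration of epoch selection by AUPRC, the sketch below estimates AUPRC with average precision and picks the epoch with the highest value. It is not the code in `compute_auprc.py`: the function names, the input layout (one list of scores per epoch, evaluated against a shared list of held-out labels), and the use of average precision as the AUPRC estimator are all assumptions made for this example. Tied scores are not grouped here, so results can differ slightly from library implementations when ties occur.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def best_epoch(labels, scores_by_epoch):
    """Return (epoch with the highest average precision, all per-epoch values)."""
    auprc = {
        epoch: average_precision(labels, scores)
        for epoch, scores in scores_by_epoch.items()
    }
    return max(auprc, key=auprc.get), auprc


# Hypothetical toy data: four held-out examples scored after two epochs.
labels = [1, 0, 1, 0]
scores_by_epoch = {
    1: [0.9, 0.8, 0.7, 0.1],
    2: [0.9, 0.2, 0.8, 0.1],
}
epoch, auprc = best_epoch(labels, scores_by_epoch)
print(epoch, auprc)
```

In this toy data, epoch 2 ranks both positives above both negatives, so it is the one selected.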

Main Experiments

1. Standard Pseudosequences

BigMHC (30 residue positions)
NetMHCpan (34 residue positions)
Random pseudosequence baseline

2. Embedding-Based Representations

Frozen ESM-2 embeddings (mean-pooled)
Node2Vec embeddings using:
- locus
- two-field allele
- supertypes
- P-groups
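
To make the graph-derived representation more concrete, the sketch below shows one way an annotation graph could be assembled and sampled. It is an assumption-laden illustration, not the notebook's implementation: the annotation values are illustrative placeholders, the node naming scheme and both function names are hypothetical, and the walks are uniform. Node2Vec proper uses walks biased by its return and in-out parameters (p and q), and uniform walks correspond to the special case p = q = 1. The sampled walks would then be passed to a skip-gram model to learn the embeddings, a step omitted here.

```python
import random
from collections import defaultdict

# Illustrative placeholder annotations; real values would come from curated
# HLA nomenclature, supertype, and P-group resources.
ANNOTATIONS = {
    "HLA-A*02:01": {"locus": "A", "supertype": "A02", "p_group": "A*02:01P"},
    "HLA-A*02:06": {"locus": "A", "supertype": "A02", "p_group": "A*02:06P"},
    "HLA-B*07:02": {"locus": "B", "supertype": "B07", "p_group": "B*07:02P"},
}


def build_annotation_graph(annotations):
    """Link each two-field allele node to its locus, supertype, and P-group nodes."""
    graph = defaultdict(set)
    for allele, fields in annotations.items():
        for kind, value in fields.items():
            node = f"{kind}:{value}"  # e.g. "supertype:A02"
            graph[allele].add(node)
            graph[node].add(allele)
    # Sorted neighbor lists keep sampling reproducible for a fixed seed.
    return {node: sorted(neighbors) for node, neighbors in graph.items()}


def random_walk(graph, start, length, rng):
    """Uniform random walk of up to `length` nodes starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


graph = build_annotation_graph(ANNOTATIONS)
rng = random.Random(0)
walks = [random_walk(graph, allele, 6, rng) for allele in ANNOTATIONS]
print(walks)
```

Because alleles are connected only through shared annotation nodes, alleles with the same locus, supertype, or P-group tend to co-occur in walks, which is what lets the learned embeddings place them close together.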

3. Length Sweep

Systematic evaluation of pseudosequence windows from 5 to 100 residues.
