KarchinLab/Do-Pseudoseqs-Matter

Do Pseudosequences Matter in Neoantigen Prediction?

This repository contains the code, notebooks, training scripts, and figure-generation workflows for a study of how different MHC-I allele representations influence neoantigen prediction. We compare several pseudosequence definitions (BigMHC, NetMHCpan-4.1, and a random baseline), evaluate a range of pseudosequence lengths, test ESM-based embeddings, and introduce a graph-derived embedding built from HLA nomenclature, P-groups, and supertypes.


Repository Structure

nb/

Preprocessing

  • Annotation-Based Embedding.ipynb — generates the annotation-based embeddings
  • Pseudoseqs.ipynb — generates BigMHC EL, NetMHCpan-4.1, and random pseudosequences
  • Gather_Testing_Results.ipynb — collects raw BigMHC outputs into a single file for downstream analysis
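
The gathering step can be pictured as concatenating per-run output files into one table. The sketch below is illustrative only and is not taken from Gather_Testing_Results.ipynb: the function name gather_results, the added source_file column, and the assumption that every run writes a CSV with the same columns are all hypothetical.

```python
import csv
from pathlib import Path


def gather_results(input_dir, output_path, pattern="*.csv"):
    """Concatenate CSV files from input_dir into one file.

    Hypothetical helper: assumes every input file has the same columns.
    Each output row gains a 'source_file' column naming its origin.
    Returns the number of data rows written.
    """
    # List the inputs before opening the output, so a freshly created
    # output file inside input_dir is never read back in.
    files = sorted(Path(input_dir).glob(pattern))
    n_rows = 0
    writer = None
    with open(output_path, "w", newline="") as out:
        for path in files:
            with open(path, newline="") as handle:
                reader = csv.DictReader(handle)
                for row in reader:
                    if writer is None:
                        # Take the header from the first file that has rows.
                        fieldnames = ["source_file"] + list(reader.fieldnames)
                        writer = csv.DictWriter(out, fieldnames=fieldnames)
                        writer.writeheader()
                    row["source_file"] = path.name
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

Keeping the source file name on each row makes it possible to group results by run in the downstream notebooks without re-reading the raw outputs.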

Visualization
Notebooks used to generate the plots for the figures in the paper:

  • Panel_1.ipynb
  • Panel_2.ipynb
  • Panel_3.ipynb
  • Plot_Length_Values.ipynb

src/

Training and evaluation code

  • run_bigmhc.sh — script to retrain BigMHC as done in the study.
    Usage: sbatch run_bigmhc.sh pseudoseqs.csv mhclen_value
    Example: sbatch run_bigmhc.sh Original_BigMHC_Pseudosequences.csv 414

  • mhcenc.py — replace the original mhcenc.py in the BigMHC source code when training with float-valued embeddings.

  • esm.py — code to generate the ESM embeddings.

  • compute_auprc.py — computes AUPRC to identify the best training epoch.
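
As a rough illustration of selecting an epoch by AUPRC, the sketch below computes average precision (a step-wise estimate of the area under the precision-recall curve) for each epoch's scores and picks the highest. It is not the contents of compute_auprc.py: the function names, the dictionary layout keyed by epoch, and the tie handling are assumptions made for this example.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at each positive's rank.

    labels are 0/1, scores are higher-is-more-positive. Tied scores are
    ranked in input order rather than grouped, which can differ slightly
    from threshold-based implementations.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def best_epoch(labels, scores_by_epoch):
    """Return (epoch with the highest AUPRC, dict of AUPRC per epoch).

    scores_by_epoch is a hypothetical layout: {epoch: list of scores}.
    """
    auprc = {
        epoch: average_precision(labels, scores)
        for epoch, scores in scores_by_epoch.items()
    }
    return max(auprc, key=auprc.get), auprc
```

For example, with labels [1, 0, 1, 0], an epoch that ranks both positives first scores 1.0 and would be chosen over an epoch that places a negative between them.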


Main Experiments

1. Standard Pseudosequences

  • BigMHC (30 residue positions)
  • NetMHCpan (34 residue positions)
  • Random pseudosequence baseline
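
One possible reading of a random pseudosequence baseline is to pick alignment columns at random, instead of curated contact positions, and read off the residues at those columns for every allele. The sketch below follows that reading as an assumption; the actual procedure in Pseudoseqs.ipynb may differ, and the function name and input layout are hypothetical.

```python
import random


def random_position_pseudosequences(aligned, length, seed=0):
    """Build pseudosequences from randomly chosen alignment columns.

    aligned: {allele name: aligned sequence}, all of equal length
             (hypothetical input layout).
    Returns (sorted column indices, {allele name: pseudosequence}).
    The same columns are used for every allele.
    """
    widths = {len(seq) for seq in aligned.values()}
    if len(widths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = widths.pop()
    if length > width:
        raise ValueError("cannot sample more positions than alignment columns")
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    positions = sorted(rng.sample(range(width), length))
    pseudo = {
        allele: "".join(seq[i] for i in positions)
        for allele, seq in aligned.items()
    }
    return positions, pseudo
```

Fixing the seed keeps the sampled columns stable across runs, so retraining uses the same baseline each time.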

2. Embedding-Based Representations

  • Frozen ESM-2 embeddings (mean-pooled)
  • Node2Vec embeddings using:
    • locus
    • two-field allele
    • supertypes
    • P-groups
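
The two embedding routes above can be sketched in a few lines. The first function shows mean pooling, which averages per-residue vectors into one fixed-size vector per allele. The second builds an undirected graph linking each allele to its locus, supertype, and P-group, which is the kind of input a Node2Vec implementation would walk over. Both are illustrative assumptions: the function names, the lookup dictionaries, and the choice of edges are hypothetical and not taken from esm.py or the annotation notebook.

```python
from collections import defaultdict


def mean_pool(residue_embeddings):
    """Average a list of equal-length per-residue vectors into one vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[j] for vec in residue_embeddings) / n for j in range(dim)]


def build_annotation_graph(alleles, supertype_of, p_group_of):
    """Return an adjacency dict linking alleles to their annotations.

    alleles:      two-field names such as "HLA-A*02:01"
    supertype_of: {allele: supertype}  (hypothetical lookup table)
    p_group_of:   {allele: P-group}    (hypothetical lookup table)
    Nodes are (kind, name) tuples so names cannot collide across kinds.
    """
    graph = defaultdict(set)

    def link(a, b):
        # Undirected edge: record it in both directions.
        graph[a].add(b)
        graph[b].add(a)

    for allele in alleles:
        node = ("allele", allele)
        locus = allele.split("*")[0]  # "HLA-A*02:01" -> "HLA-A"
        link(node, ("locus", locus))
        if allele in supertype_of:
            link(node, ("supertype", supertype_of[allele]))
        if allele in p_group_of:
            link(node, ("p_group", p_group_of[allele]))
    return dict(graph)
```

In such a graph, alleles that share a locus, supertype, or P-group sit two hops apart, so random-walk embeddings would tend to place them close together.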

3. Length Sweep

Systematic evaluation of pseudosequence windows from 5 to 100 residues.


About

Code and materials for the paper "Do Pseudosequences Matter in Neoantigen Prediction?"
