Hello,
First off, thank you for open-sourcing SegmentNT, and for the great work in “Annotating the genome at single-nucleotide resolution with DNA foundation models”.
I observed a strange boundary effect when applying SegmentNT to a long chromosome sequence.
I used the inference code from your Colab notebook (https://colab.research.google.com/#fileId=https%3A//huggingface.co/InstaDeepAI/segment_nt/blob/main/inference_segment_nt.ipynb) to run the model on chromosome 20 (NC_060944.1) from the T2T-CHM13v2.0 assembly (GCF_009914755.1). To process the full chromosome, I used a non-overlapping sliding-window approach with windows up to 49,992 bp.
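For concreteness, the tiling described above amounts to something like the sketch below. This is not the notebook's code: `window_bounds` is a hypothetical helper name, coordinates are assumed to be 0-based and half-open, and the model call itself is omitted.

```python
def window_bounds(seq_len, window=49_992):
    """Yield 0-based, half-open (start, end) bounds of non-overlapping windows.

    The last window may be shorter than `window` when seq_len is not a multiple of it.
    """
    for start in range(0, seq_len, window):
        yield start, min(start + window, seq_len)


# Toy example: a 120,000 bp sequence split into 49,992 bp windows.
bounds = list(window_bounds(120_000))
```

Each `(start, end)` pair is sliced out of the chromosome, passed to the model independently, and the per-nucleotide probabilities are concatenated in order.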
In every window, the final nucleotide shows a consistent and unexpectedly strong drop in prediction confidence. The drop persists when I reduce the window size (for example, to ~30 kb). It does not appear in the predictions shown in the SegmentNT paper; I see it only when I run inference manually with the notebook code.
In IGV, the concatenated per-nucleotide probability tracks show a clear dip exactly at every window boundary. One example is region NC_060944.1: 9,248,521–9,348,504, which spans two 49,992 bp windows. At position 9,298,512, the final base of the first window, the probabilities for most classes drop sharply. I include screenshots showing this pattern below.
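The coordinates above are consistent with windows tiled from the first base of the chromosome, so the last base of every window falls on a multiple of 49,992 in 1-based coordinates. A small sketch of that arithmetic, assuming this tiling (`window_end_positions` is a hypothetical helper, not part of the notebook):

```python
def window_end_positions(region_start, region_end, window=49_992):
    """Return the 1-based positions of the final base of each window inside
    [region_start, region_end], assuming windows are tiled from position 1."""
    # Smallest multiple of `window` that is >= region_start (integer ceiling).
    first_end = -(-region_start // window) * window
    return list(range(first_end, region_end + 1, window))


ends = window_end_positions(9_248_521, 9_348_504)
```

Under this assumption, `ends` lists the positions where the dips should be visible in the example region.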
Is this boundary effect expected, and is there a recommended way to mitigate it?