Low confidence at prediction boundaries

Hello,

First off, thank you for open-sourcing SegmentNT, and for the great work in “Annotating the genome at single-nucleotide resolution with DNA foundation models”.

I observed a strange boundary effect when applying SegmentNT to a long chromosome sequence.

I used the inference code from your Colab notebook (https://colab.research.google.com/#fileId=https%3A//huggingface.co/InstaDeepAI/segment_nt/blob/main/inference_segment_nt.ipynb) to run the model on chromosome 20 (NC_060944.1) from the T2T-CHM13v2.0 assembly (GCF_009914755.1). To process the full chromosome, I used a non-overlapping sliding-window approach with windows up to 49,992 bp.
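To make the windowing unambiguous, here is a minimal sketch of non-overlapping tiling of this kind. It is not taken from the notebook: `iter_windows` is a hypothetical helper, and the 0-based half-open coordinates and the shorter final window are assumptions made for illustration.

```python
from typing import Iterator, Tuple

WINDOW_BP = 49_992  # maximum window length used in this report


def iter_windows(seq_len: int, window: int = WINDOW_BP) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) windows that tile [0, seq_len).

    Windows do not overlap; the last one is shorter when seq_len is not
    a multiple of the window length.
    """
    for start in range(0, seq_len, window):
        yield start, min(start + window, seq_len)


# Toy example: a 120,000 bp sequence is split into three windows.
windows = list(iter_windows(120_000))
for start, end in windows:
    print(start, end, end - start)
```

Each window is passed to the model independently, so the predictions for the last base of one window and the first base of the next come from separate forward passes.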

Across all windows, the final nucleotide of each window shows a consistent and unexpectedly strong drop in prediction confidence. The drop persists when I reduce the window size (for example, to ~30 kb). It does not appear in the predictions shown in the SegmentNT paper; I see it only when I run inference myself using the code in the notebook.

In IGV, the concatenated per-nucleotide probability tracks show a clear dip exactly at every window boundary. One example is the region NC_060944.1:9,248,521–9,348,504, which spans two 49,992 bp windows. At position 9,298,512, the final base of the first window, the probabilities for most classes drop sharply. The screenshots below show this pattern.

<img width="2874" height="1596" alt="Image" src="https://github.com/user-attachments/assets/cf72fcf6-dcc8-4b45-b538-185f4338324d" />
<img width="2879" height="1622" alt="Image" src="https://github.com/user-attachments/assets/20c140ce-e8ee-46a4-a7e4-ad37d7bad4bd" />
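The coordinates in the example above can be checked with a short calculation. This is a sketch that assumes the windows are tiled from the first base of the chromosome and uses 1-based inclusive coordinates as displayed in IGV; `window_of` is a hypothetical helper, not part of the SegmentNT code.

```python
from typing import Tuple

WINDOW_BP = 49_992  # window length used in this report


def window_of(pos_1based: int, window: int = WINDOW_BP) -> Tuple[int, int, int]:
    """Return (window index, first base, last base) for a 1-based position.

    Assumes windows are tiled without overlap starting at base 1 and
    ignores truncation of the final window at the chromosome end.
    """
    index = (pos_1based - 1) // window
    first = index * window + 1
    last = first + window - 1
    return index, first, last


print(window_of(9_298_512))  # (185, 9248521, 9298512)
print(window_of(9_298_513))  # (186, 9298513, 9348504)
```

Under these assumptions, position 9,298,512 is the last base of window 185 and 9,298,513 is the first base of window 186, which matches where the dip appears in the tracks.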

Is this boundary effect expected, and is there a recommended way to mitigate it?
