Low confidence at prediction boundaries #120

@calengil

Hello,

First off, thank you for open-sourcing SegmentNT, and for the great work in “Annotating the genome at single-nucleotide resolution with DNA foundation models”.

I observed a strange boundary effect when applying SegmentNT to a long chromosome sequence.

I used the inference code from your Colab notebook (https://colab.research.google.com/#fileId=https%3A//huggingface.co/InstaDeepAI/segment_nt/blob/main/inference_segment_nt.ipynb) to run the model on chromosome 20 (NC_060944.1) from the T2T-CHM13v2.0 assembly (GCF_009914755.1). To process the full chromosome, I used a non-overlapping sliding-window approach with windows up to 49,992 bp.
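The windowing described above can be sketched roughly as follows. This is only an illustration of a non-overlapping split, not the code from the notebook: `iter_windows` is a hypothetical helper name, coordinates are assumed to be 0-based and half-open, and the last window is assumed to be simply shorter when the sequence length is not a multiple of the window size.

```python
def iter_windows(seq_len, window=49_992):
    """Yield (start, end) pairs covering [0, seq_len) without overlap.

    Coordinates are 0-based, half-open. The final window may be shorter
    than `window` (assumption: no padding is applied here).
    """
    for start in range(0, seq_len, window):
        yield start, min(start + window, seq_len)


# Example: a 100,000 bp sequence splits into two full windows and a remainder.
windows = list(iter_windows(100_000))
```

Each `(start, end)` pair would then be passed to the model independently, so every window's last position has no right-hand context from the next window.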

Across all windows, the final nucleotide of each window shows a consistent and unexpectedly strong drop in prediction confidence. The drop persists when I reduce the window size (for example, to ~30 kb). It is not visible in the predictions shown in the SegmentNT paper; I only see it when I run inference manually with the code from the notebook.

In IGV, the concatenated per-nucleotide probability tracks show a clear dip exactly at every window boundary. One example is the region NC_060944.1: 9,248,521–9,348,504, which spans two 49,992 bp windows. At position 9,298,512, the final base of the first window, the probabilities for most classes drop sharply. The screenshots below show this pattern.
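The coordinates in this example can be checked with a few lines of arithmetic. The sketch assumes 1-based inclusive coordinates (as displayed in IGV) and that the first window starts at the first base of the region; the variable names are illustrative only.

```python
window = 49_992
region_start = 9_248_521   # first base of the region (1-based, inclusive)
region_end = 9_348_504     # last base of the region (1-based, inclusive)

# Length of an inclusive interval is end - start + 1.
region_len = region_end - region_start + 1

# Last base of the first window, where the dip is observed.
first_window_last_base = region_start + window - 1
```

Under these assumptions the region is exactly two windows long and the first window ends at 9,298,512, which matches the position of the dip.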

Image Image

Is this boundary effect expected, and is there a recommended way to mitigate it?
