Skip to content

update to fail on duplicate LN tags, with different LN values#1882

Merged
daviesrob merged 2 commits intosamtools:developfrom
vasudeva8:SQ
Mar 6, 2025
Merged

update to fail on duplicate LN tags, with different LN values#1882
daviesrob merged 2 commits intosamtools:developfrom
vasudeva8:SQ

Conversation

@vasudeva8
Copy link
Contributor

@vasudeva8 vasudeva8 commented Jan 27, 2025

Fixes #1866
Updated to use 1st LN tag value as in cram handling.
Accepts duplicate LN tags as long as they have same value.
If the values differ, returns failure.

sam_hdr_create updated for sam file checks and sam_hdr_sanitise for handling bam and cram.
sam_hdr_sanitise calls sam_hdr_fill_hrecs and in turn sam_hrecs_update_hashes, where duplicate check is made.

For cram, there is no added overhead as fill_hrecs is already in use.
For sam and bam, the sanitise update adds as overhead which is required to identify the duplicates.
For sam, the check occurs during normal parsing in sam_hdr_create and sanitise update is redundant.
For bam, fill_hrecs update is required to identify the duplicates. sanitise seem to be the best common place to have this.

@daviesrob daviesrob self-assigned this Jan 30, 2025
sam.c Outdated
cp[h->l_text] = '\0';
}

if (sam_hdr_fill_hrecs(h) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to call sam_hdr_fill_hrecs() here. It will trigger a full parse of the header, which we may not want to do unless really needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

sam.c Outdated
hts_pos_t tmp = strtoll(q + 3, (char**)&q, 10);
if (ln != -1 && tmp != ln) {
//duplicate LN tag with different value
hts_log_error("111Header includes @SQ line \"%s\" with \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message looks a bit odd...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

@vasudeva8
Copy link
Contributor Author

updated to avoid hrec fill overhead.
bam has no change in processing and does not show any warning on duplicate LN tag in text part
cram shows the warning as earlier.
sam shows warning on duplicate LN tag and uses the 1st LN value as in CRAM processing.

bam processing is odd that it doesn't show the warning but avoids the overhead.

@daviesrob
Copy link
Member

I thik you may have removed too much in the last update. We still want to fail if SAM finds a header with two LN: tags.

@vasudeva8
Copy link
Contributor Author

If we have error return, then SAM and with some extra changes CRAM will fail in duplicate but BAM won't, as we will not invoke the hrec_fill due to overheads.
Is it not better to have consistent behaviour across formats, that all succeeds and just BAM won't warn?
As SAM uses the 1st LN value, the -ve value issue should not be there any more.

@vasudeva8
Copy link
Contributor Author

Updated to fail in case of duplicate LN tag with different values.

This change makes header read for sam, cram different from bam, where bam doesn't fail and sam, cram fails on header read api. This difference will be in library level but the library is mainly used in a utility (samtools) level where another operation like PG addition takes place, making the failure visible in bam as well and all formats have similar failure. (if --no-PG is used with bam it will not fail).

@daviesrob daviesrob merged commit 386de25 into samtools:develop Mar 6, 2025
9 checks passed
vasudeva8 added a commit to vasudeva8/htslib that referenced this pull request Mar 18, 2025
@vasudeva8 vasudeva8 deleted the SQ branch March 18, 2025 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Inconsistency in parsing @SQ headers with multiple LN: and SN: tags

2 participants