Skip to content

Ref alt allele#8

Merged
khetherin merged 57 commits intoEBIvariation:mainfrom
khetherin:ref_alt_allele
Oct 13, 2025
Merged

Ref alt allele#8
khetherin merged 57 commits intoEBIvariation:mainfrom
khetherin:ref_alt_allele

Conversation

@khetherin
Copy link
Collaborator

Added functionality to include REF and ALT alleles in the VCF file.
Also added functions to adjust position, to add padded bases, to convert IUPAC ambiguity codes
work done in line with REF and ALT definitions of vcf spec v4.4 (https://samtools.github.io/hts-specs/VCFv4.4.pdf)

…le and get_alt for alternative alleles, minor edits made to get_ref and generating_Symbolic alleles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth combining sequence_ontology_symbolic_alleles.tsv and svALTkeys.tsv in one file as that would give you only one file to update when adding new ALTs

attribute_dict[key] = attribute_tokens
return attribute_dict

def read_sequence_ontology_symbolic_allele(so_symbolic_allele_file):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a common aspect between all the readers where we read a csv file and one of the column is a unique key (usually the first column) and the value is a list of all the column or a formatted string. There is some potential refactoring that could help with the readability of the code and remove some duplication.
It would be good to keep that in mind in future improvement.

symbolic_allele_dict.setdefault(name, []).append(description)
return symbolic_allele_dict

def generate_angled_bracket(id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you expect this formatting to become more complicated I don't think this is worthy of a function.

Comment on lines +468 to +474
def adjust_pos(self, adjustment):
""" Adjusts the position of the variant
:param adjustment: a positive or negative number
:return: self.pos adjusted position
"""
self.pos += adjustment
return self.pos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this is worthy of a function either plus it is not used.


def add_padded_base(self, placed_before : bool, ref, alt):
""" Adds padded base to REF and ALT allele
:param placed_before: True or False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to describe what placed_before actually change in the function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it might just be personal preference but I tend to have flags the regulate behaviour of the function as last parameters rather than first:

    def add_padded_base(self, ref, alt, placed_before):

""" Checks whether a reference allele meets the requirements of the VCF specification
:param ref_allele_to_be_checked: reference allele to check
:return: checked_reference_allele: reference allele that meets the requirements of the VCF specification"""
if isinstance(ref_allele_to_be_checked, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since everything you read is from text (GVF or genome) it is unlikely you would end up with anything else than a string.

Comment on lines +546 to +547
iupac_ambiguity_dictionary = self.build_iupac_ambiguity_code()
checked_reference_allele = self.convert_iupac_ambiguity_code(iupac_ambiguity_dictionary, ref_allele_to_be_checked)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not need to be calculated on the fly for every base checked.
Since you are creating a dictionary of all the possible bases corresponding to a IUPAC code (in build_iupac_ambiguity_code) then selecting only one (in convert_iupac_ambiguity_code), you would achieve the same thing by having a simple map of IUPAC code to the selected base.

This being said selecting the smallest base out of a list is arbitrary and will end up making a statement we cannot be certain about.
A more conservative approach would be replace all IUPAC code with N.


if symbolic_allele == "<INS>":
info_end ="END=" + str( self.pos + len(self.ref) - 1 )
elif symbolic_allele == "<DEL>" or symbolic_allele == "<DUP>" or symbolic_allele == "<INV>" or symbolic_allele == "<CNV>":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer the notation

Suggested change
elif symbolic_allele == "<DEL>" or symbolic_allele == "<DUP>" or symbolic_allele == "<INV>" or symbolic_allele == "<CNV>":
elif symbolic_allele in {"<DEL>", "<DUP>", "<INV>", "<CNV>"}:

:param all_possible_INFO_lines: dictionary of all possible INFO lines
:return: symbolic_allele, self.info, lines_standard_ALT, lines_standard_INFO
"""
if "A" in self.vcf_value["Variant_seq"] or "C" in self.vcf_value["Variant_seq"] or "T" in self.vcf_value["Variant_seq"] or "G" in self.vcf_value["Variant_seq"] or "N" in self.vcf_value["Variant_seq"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if "A" in self.vcf_value["Variant_seq"] or "C" in self.vcf_value["Variant_seq"] or "T" in self.vcf_value["Variant_seq"] or "G" in self.vcf_value["Variant_seq"] or "N" in self.vcf_value["Variant_seq"]:
if any(base in self.vcf_value["Variant_seq"] for base in ["A", "C", "T", "G", "N"]):

symbolic_allele, self.info, lines_standard_ALT, lines_standard_INFO = self.generate_symbolic_allele(lines_standard_ALT, lines_standard_INFO, all_possible_ALT_lines, all_possible_INFO_lines)
if symbolic_allele is None:
alterative_allele = "."
elif (self.vcf_value["Variant_seq"] == "." or self.vcf_value["Variant_seq"] == "-") and symbolic_allele is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point we know that self.vcf_value["Variant_seq"] == ".". I'm not sure why we're testing for self.vcf_value["Variant_seq"] == "-" since we were not before.

Copy link
Member

@tcezard tcezard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge this PR and work on the next one.

@khetherin khetherin merged commit fbc110c into EBIvariation:main Oct 13, 2025
1 check failed
@khetherin khetherin deleted the ref_alt_allele branch December 8, 2025 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants