Conversation
…le and get_alt for alternative alleles, minor edits made to get_ref and generating_Symbolic alleles
There was a problem hiding this comment.
It's probably worth combining sequence_ontology_symbolic_alleles.tsv and svALTkeys.tsv in one file as that would give you only one file to update when adding new ALTs
| attribute_dict[key] = attribute_tokens | ||
| return attribute_dict | ||
|
|
||
| def read_sequence_ontology_symbolic_allele(so_symbolic_allele_file): |
There was a problem hiding this comment.
There seems to be a common aspect between all the readers where we read a csv file and one of the column is a unique key (usually the first column) and the value is a list of all the column or a formatted string. There is some potential refactoring that could help with the readability of the code and remove some duplication.
It would be good to keep that in mind in future improvement.
| symbolic_allele_dict.setdefault(name, []).append(description) | ||
| return symbolic_allele_dict | ||
|
|
||
| def generate_angled_bracket(id): |
There was a problem hiding this comment.
Unless you expect this formatting to become more complicated I don't think this is worthy of a function.
| def adjust_pos(self, adjustment): | ||
| """ Adjusts the position of the variant | ||
| :param adjustment: a positive or negative number | ||
| :return: self.pos adjusted position | ||
| """ | ||
| self.pos += adjustment | ||
| return self.pos |
There was a problem hiding this comment.
Don't think this is worthy of a function either plus it is not used.
|
|
||
| def add_padded_base(self, placed_before : bool, ref, alt): | ||
| """ Adds padded base to REF and ALT allele | ||
| :param placed_before: True or False |
There was a problem hiding this comment.
It would be useful to describe what placed_before actually change in the function.
There was a problem hiding this comment.
Also it might just be personal preference but I tend to have flags the regulate behaviour of the function as last parameters rather than first:
def add_padded_base(self, ref, alt, placed_before):| """ Checks whether a reference allele meets the requirements of the VCF specification | ||
| :param ref_allele_to_be_checked: reference allele to check | ||
| :return: checked_reference_allele: reference allele that meets the requirements of the VCF specification""" | ||
| if isinstance(ref_allele_to_be_checked, str): |
There was a problem hiding this comment.
Since everything you read is from text (GVF or genome) it is unlikely you would end up with anything else than a string.
| iupac_ambiguity_dictionary = self.build_iupac_ambiguity_code() | ||
| checked_reference_allele = self.convert_iupac_ambiguity_code(iupac_ambiguity_dictionary, ref_allele_to_be_checked) |
There was a problem hiding this comment.
This does not need to be calculated on the fly for every base checked.
Since you are creating a dictionary of all the possible bases corresponding to a IUPAC code (in build_iupac_ambiguity_code) then selecting only one (in convert_iupac_ambiguity_code), you would achieve the same thing by having a simple map of IUPAC code to the selected base.
This being said selecting the smallest base out of a list is arbitrary and will end up making a statement we cannot be certain about.
A more conservative approach would be replace all IUPAC code with N.
|
|
||
| if symbolic_allele == "<INS>": | ||
| info_end ="END=" + str( self.pos + len(self.ref) - 1 ) | ||
| elif symbolic_allele == "<DEL>" or symbolic_allele == "<DUP>" or symbolic_allele == "<INV>" or symbolic_allele == "<CNV>": |
There was a problem hiding this comment.
Prefer the notation
| elif symbolic_allele == "<DEL>" or symbolic_allele == "<DUP>" or symbolic_allele == "<INV>" or symbolic_allele == "<CNV>": | |
| elif symbolic_allele in {"<DEL>", "<DUP>", "<INV>", "<CNV>"}: |
| :param all_possible_INFO_lines: dictionary of all possible INFO lines | ||
| :return: symbolic_allele, self.info, lines_standard_ALT, lines_standard_INFO | ||
| """ | ||
| if "A" in self.vcf_value["Variant_seq"] or "C" in self.vcf_value["Variant_seq"] or "T" in self.vcf_value["Variant_seq"] or "G" in self.vcf_value["Variant_seq"] or "N" in self.vcf_value["Variant_seq"]: |
There was a problem hiding this comment.
| if "A" in self.vcf_value["Variant_seq"] or "C" in self.vcf_value["Variant_seq"] or "T" in self.vcf_value["Variant_seq"] or "G" in self.vcf_value["Variant_seq"] or "N" in self.vcf_value["Variant_seq"]: | |
| if any(base in self.vcf_value["Variant_seq"] for base in ["A", "C", "T", "G", "N"]): |
| symbolic_allele, self.info, lines_standard_ALT, lines_standard_INFO = self.generate_symbolic_allele(lines_standard_ALT, lines_standard_INFO, all_possible_ALT_lines, all_possible_INFO_lines) | ||
| if symbolic_allele is None: | ||
| alterative_allele = "." | ||
| elif (self.vcf_value["Variant_seq"] == "." or self.vcf_value["Variant_seq"] == "-") and symbolic_allele is not None: |
There was a problem hiding this comment.
At this point we know that self.vcf_value["Variant_seq"] == ".". I'm not sure why we're testing for self.vcf_value["Variant_seq"] == "-" since we were not before.
tcezard
left a comment
There was a problem hiding this comment.
We can merge this PR and work on the next one.
Added functionality to include REF and ALT alleles in the VCF file.
Also added functions to adjust position, to add padded bases, to convert IUPAC ambiguity codes
work done in line with REF and ALT definitions of vcf spec v4.4 (https://samtools.github.io/hts-specs/VCFv4.4.pdf)