Skip to content

Mergelines#15

Merged
tcezard merged 34 commits intoEBIvariation:mainfrom
khetherin:mergelines
Dec 1, 2025
Merged

Mergelines#15
tcezard merged 34 commits intoEBIvariation:mainfrom
khetherin:mergelines

Conversation

@khetherin
Copy link
Collaborator

No description provided.

@khetherin khetherin closed this Nov 12, 2025
@khetherin khetherin reopened this Nov 12, 2025
@khetherin khetherin requested a review from tcezard November 12, 2025 10:17
Major changes to convertGVFtoVCF:
- removed compare_and_merge lines and placed functions into VCFlines file
Major changes to VCF line:
- improved single responsibility principle
- added many functions.
Function moved from VCFlines to Utils.py (build_iupac_ambiguitycode)
New files added:
- Lookup.py for referencing
- New test files: test_assisting_converter, test_logger, test_utils, test_vcfline (separated out tests to match the file structure)
Optimised the function: read_in_gvf_file by removing extraneous for loop
@khetherin khetherin requested a review from tcezard November 20, 2025 17:03

self.vcf_value, self.info_string, self.format_dict = convert_gvf_attributes_to_vcf_values(gvf_feature_line_object.attributes, mapping_attribute_dict, field_lines_dictionary, all_possible_lines_dictionary)
(self.vcf_value, # used to populate the VCF fields. This is a dict of non-converted GVF attribute keys and their values.
self.info_string, # a formatted INFO string to form VCF line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should return and store a info_dict rather than an info_string and leave the formatting the the __str__ function

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed: strings have been removed

@@ -67,10 +74,11 @@ def __init__(self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the order_format_keys function although I don't thin you should construct the format string here. It should be reserved to the __str__

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed: used order_format_keys and removed strings

Comment on lines +273 to +274
self.format,
self.format_values_by_sample_string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not be using stored strings. The should be constructed in this function from the list of format tag and the format_dict.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed: removed strings

self.pos,
self.merge_and_add(self.id, other_vcf_line.id, ";"),
self.ref,
self.merge_and_add(self.alt, other_vcf_line.alt, ","),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you merge 3 lines. First contains a <DEL>, second contains a <INS> and third contains a <DEL>?

merge('<DEL>', '<INS>') --> '<DEL>,<INS>'
merge('<DEL>,<INS>', '<DEL>') --> '<DEL>,<INS>,<DEL>'

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed yet

Comment on lines +225 to +241
# The functions below relate to the VCF objects
def compare_vcf_objects(list_of_vcf_objects):
""" Compares VCF objects in the list with the VCF object before it. Returns boolean values.
:params: list_of_vcf_objects: list of vcf objects
:return: comparison_results: list of booleans. For future reference, if True, this will determine merging lines; if False, this will determine use of the previous line.
"""
sample_format_value_tokens = []
for sample in list_of_sample_names:
if sample in sample_name_dict_format_kv:
format_value = sample_name_dict_format_kv[sample]
sample_format_value_tokens.append(':'.join(format_value.values()))
comparison_results = []
# For each vcf line object, compare with the previous vcf line object in the list
for index in range(1, len(list_of_vcf_objects)):
current_vcf_object = list_of_vcf_objects[index]
previous_vcf_object = list_of_vcf_objects[index - 1]
# Determines the VCF line objects as equal based on the CHROM, POS and REF being the same (__eq__ in Vcfline)
if current_vcf_object == previous_vcf_object:
comparison_results.append(True) # This will use require merging.
else:
format_value = "." # set to missing value
sample_format_value_tokens.append(format_value)
sample_format_values_string = '\t'.join(sample_format_value_tokens)
return sample_format_values_string
comparison_results.append(False) # No merging required. Use previous object.
return comparison_results
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating the comparison and storing the results seem overly complicated and there are function that exists in python to do what you want.
You should look for itertools.groupby that will probably help

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed yet

Comment on lines +243 to +259
def merge_vcf_objects(previous, current, list_of_sample_names):
""" Merge VCF objects.
:params: previous: previous VCF line object
:params: current: current VCF line object
:params: list_of_sample_names: sample names
:return: merged_object
"""
merged_object = previous.merge(current, list_of_sample_names)
return merged_object

def keep_vcf_objects(previous, list_of_sample_names):
""" Keep VCF objects.
:params: previous VCF line object
:return: kept_object
"""
kept_object = previous.keep(list_of_sample_names)
return kept_object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions seems unnecessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed yet

)

def keep(self, list_of_sample_names):
self.format_values_by_sample_string = self.combine_format_values_by_sample(self.format_dict, list_of_sample_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be assigned to self.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed yet

return self.info_string

# MERGE OR KEEP below
def merge(self, other_vcf_line, list_of_sample_names):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merge and keep functions are returning a set of strings ready to be printed where I think you should keep the information in the VcfLine object until the end and print them using the __str__

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed: return self and formatted with str

self.format_values_by_sample_string
)

def keep(self, list_of_sample_names):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keep function does not do much except updating the list of samples in the VCF line. I think you should change the name of the function to describe what it actually does.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed yet

@khetherin khetherin requested a review from tcezard November 28, 2025 12:46
@tcezard tcezard merged commit a358048 into EBIvariation:main Dec 1, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants