Labels: enhancement (New feature or request)
Description
I have a few suggestions regarding the MoleculeLoader:
- Add `list` as a return type option. If we only want the sequences from a file (something like a pdb/mmcif file), a list of strings makes more sense for a single file, or a list of lists for multiple files. Rename `to_df_seq` in doing this, and add a parameter `return_type` to this function to choose between df and list outputs. I remember @fkiraly mentioning that in the future we may also want to output 2D feature matrices (which lists cannot handle), but since this will not apply to all file formats or most use cases, we should not limit ourselves to dfs only.
- Replace all non-pdb file translations with `SeqIO`, so that in the end we only have a pdb loader and a non-pdb loader.
- Add a `fmt` override, which can be inferred from the link, for cases where the file ending raises an error. I suggest making this a compulsory parameter rather than an override: not all file endings will be understood by `SeqIO`, and leaving the parameter optional in a few cases and compulsory in others seems like bad design.
- Remove the dispatcher. I assume the original idea was to keep it in anticipation of many dispatcher functions, but since we will end up with only two (the current one for pdb and `_load_seqio`, shown below), we can replace it with a simple if-else statement.
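To make the first suggestion concrete, here is a minimal sketch of how a `return_type` switch might shape the output. The helper name `format_sequences`, its input layout (one inner list of sequences per file), and the df column names are assumptions made for illustration, not part of the current `MoleculeLoader`.

```python
def format_sequences(seqs_per_file, return_type="df"):
    """Shape loaded sequences as a list or a DataFrame (hypothetical helper).

    seqs_per_file : list of list of str, one inner list per input file.
    return_type : "list" or "df".
    """
    if return_type == "list":
        # Single file: flat list of strings; several files: list of lists.
        if len(seqs_per_file) == 1:
            return list(seqs_per_file[0])
        return [list(seqs) for seqs in seqs_per_file]
    if return_type == "df":
        # pandas is imported lazily so the list path has no extra dependency.
        import pandas as pd

        rows = [
            {"file_index": i, "sequence": seq}  # column names are assumed
            for i, seqs in enumerate(seqs_per_file)
            for seq in seqs
        ]
        return pd.DataFrame(rows, columns=["file_index", "sequence"])
    raise ValueError(f"return_type must be 'list' or 'df', got {return_type!r}")
```

With one file the call returns a flat list of strings, and with several files a list of lists, matching the shapes described in the first bullet.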
For the user, nothing changes except one additional argument for choosing between list and df output. Internally, a new `_load_seqio` function will handle all non-pdb files:
from Bio import SeqIO


def _load_seqio(self, path, format):
    """Load any file format supported by Biopython SeqIO.

    Parameters
    ----------
    path : Path
        Path to a sequence file readable by SeqIO.
    format : str
        Biopython SeqIO format string (e.g. ``"fasta"``, ``"genbank"``).

    Returns
    -------
    list of str
        Amino-acid sequences extracted from the file.

    Raises
    ------
    ValueError
        If no sequences were found.
    """
    # SeqIO.parse yields one SeqRecord per entry; keep only the sequence text.
    seqs = [str(rec.seq) for rec in SeqIO.parse(str(path), format)]
    if not seqs:
        raise ValueError(f"No sequences found in {path}")
    return seqs
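
With `_load_seqio` in place, the dispatcher could shrink to a single if-else, roughly as sketched here. The two loaders are replaced by hypothetical stubs so the snippet runs on its own, the function name `load` is invented for illustration, and routing only `"pdb"` to the pdb loader is an assumption.

```python
from pathlib import Path


def _load_pdb(path):
    """Stand-in for the existing pdb loader (hypothetical stub)."""
    return [f"pdb:{Path(path).name}"]


def _load_seqio(path, fmt):
    """Stand-in for the SeqIO-based loader (hypothetical stub)."""
    return [f"{fmt}:{Path(path).name}"]


def load(path, fmt):
    """Route a file to one of the two loaders with a plain if-else.

    ``fmt`` is compulsory, as suggested above, so nothing is inferred
    from the file ending.
    """
    if fmt.lower() == "pdb":
        return _load_pdb(path)
    # Everything else is handed to SeqIO, which rejects unknown formats itself.
    return _load_seqio(path, fmt)
```

Keeping the branch this small means adding a further SeqIO-supported format needs no new loader function, only a different `fmt` string.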