[ENH] Extend MoleculeLoader #202

@satvshr

I have a few suggestions regarding the MoleculeLoader:

  1. Add list as a return type option. If we only want the sequences from a file (e.g. a pdb/mmcif file), a list of strings makes more sense for a single file, or a list of lists for multiple files. Rename to_df_seq as part of this, and add a return_type parameter to it to choose between df and list outputs. I remember @fkiraly mentioning that in the future we may also want to output 2D feature matrices (which lists cannot handle), but since this will not apply to all file formats or most use cases, we should not limit ourselves to dfs only.
  2. Replace all non-pdb file translations with SeqIO, so that in the end we only have a pdb loader and a non-pdb loader.
  3. Add a fmt parameter for cases where inferring the format from the file ending of the link raises an error. I suggest making this a compulsory parameter rather than an override: not all file endings will be understood by SeqIO, and keeping the parameter optional in some cases and compulsory in others seems like bad design.
  4. Remove the dispatcher. I assume the original idea was to keep it in anticipation of many dispatch functions, but given we will end up with only 2 (the current one for pdb and _load_seqio, described below), we can replace it with a simple if-else statement.
    For the user, nothing changes except the one additional argument for choosing between a list and a df. A _load_seqio function will be added and used for all non-pdb files:
    def _load_seqio(self, path, format):
        """Load any file format supported by Biopython SeqIO.

        Parameters
        ----------
        path : Path
            Path to a sequence file readable by SeqIO.
        format : str
            Biopython SeqIO format string (e.g. ``"fasta"``, ``"genbank"``).

        Returns
        -------
        list of str
            Amino-acid sequences extracted from the file.

        Raises
        ------
        ValueError
            If no sequences were found.
        """
        seqs = [str(rec.seq) for rec in SeqIO.parse(str(path), format)]
        if not seqs:
            raise ValueError(f"No sequences found in {path}")

        return seqs
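As a rough sketch of point 4, the if-else could look something like the following. The class name `LoaderSketch`, the method name `load`, and the empty `_load_pdb` placeholder are hypothetical and only there to make the example self-contained. The real pdb loader is not shown in this issue, and `fmt` is treated as compulsory, as suggested in point 3.

```python
from pathlib import Path


class LoaderSketch:
    """Hypothetical stand-in for MoleculeLoader, showing only the dispatch."""

    def _load_pdb(self, path):
        # Placeholder for the existing pdb loader, which is not shown here.
        raise NotImplementedError

    def _load_seqio(self, path, fmt):
        # Same logic as the proposed _load_seqio; Biopython is imported
        # lazily so the sketch can be imported without it installed.
        from Bio import SeqIO

        seqs = [str(rec.seq) for rec in SeqIO.parse(str(path), fmt)]
        if not seqs:
            raise ValueError(f"No sequences found in {path}")
        return seqs

    def load(self, path, fmt):
        # fmt is required, so nothing is inferred from the file ending.
        path = Path(path)
        fmt = fmt.lower()
        if fmt == "pdb":
            return self._load_pdb(path)
        return self._load_seqio(path, fmt)
```

With only two branches, a plain if-else keeps the control flow visible in one place, and a third loader could still be added as one more branch later.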

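For point 1, the return_type switch could be a small helper along these lines. The function name `format_sequences`, the default of `"df"`, and the `"sequence"` column name are assumptions made for illustration, not the existing to_df_seq behaviour.

```python
def format_sequences(seqs, return_type="df"):
    """Return sequences as a list or a single-column DataFrame (sketch)."""
    if return_type == "list":
        return list(seqs)
    if return_type == "df":
        # pandas is imported lazily so the list path has no extra dependency.
        import pandas as pd

        # The column name "sequence" is an assumption for this sketch.
        return pd.DataFrame({"sequence": list(seqs)})
    raise ValueError(
        f"return_type must be 'df' or 'list', got {return_type!r}"
    )
```

For multiple files, the same helper could be applied per file to get a list of lists.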
Labels

enhancement: New feature or request
