OPU pipeline processing large sets of files

clustering large sets of proteins.

01 Feb 2024.

Based on Python 3.11. Preferred run environment: Ubuntu / 64-bit Linux.

Installation

Dependencies:

biopython
pandas
mmseqs

To install, run:

conda create --name OPUs python=3.11
conda activate OPUs
pip install pandas biopython
mamba install -c bioconda mmseqs2
python3 main.py -h

Parameters

Usage: main.py [options]

options:

  -h, --help             show this help message and exit
  -ofolder OFOLDER       Output folder path
  -pfolder PFOLDER       Project folder path
  -annofolder ANNOFOLDER Annotation folder path
  -annoxt ANNOXT         Annotation file extension
  -minseqid MINSEQID     Minimum sequence identity (0.975 default)
  -threads THREADS       Number of threads (3 default)
  -minocc MINOCC         Minimum number of occurrences to keep a OPU (2 default)
  -minlen MINLEN         Minimum number of residues to consider true a protein (35 default)
  -xt XT                 Protein sequences fasta file extension
  -maxmem MAXMEM         Maximum memory (4G default)
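As an illustration, the options above can be assembled into a command line from Python. This is a minimal sketch: the `build_command` helper and the folder and extension values in the usage example are hypothetical, and whether extensions are passed with or without a leading dot is an assumption. Only flags listed in the help text are used.

```python
import sys


def build_command(pfolder, ofolder, annofolder, xt, annoxt,
                  minseqid=0.975, threads=3, minocc=2, minlen=35,
                  maxmem="4G", script="main.py"):
    """Return an argument list for main.py built from the documented flags.

    The defaults mirror the ones shown in the help text. This helper is
    hypothetical and is not part of the pipeline itself.
    """
    return [
        sys.executable, script,
        "-pfolder", str(pfolder),
        "-ofolder", str(ofolder),
        "-annofolder", str(annofolder),
        "-xt", str(xt),
        "-annoxt", str(annoxt),
        "-minseqid", str(minseqid),
        "-threads", str(threads),
        "-minocc", str(minocc),
        "-minlen", str(minlen),
        "-maxmem", str(maxmem),
    ]


if __name__ == "__main__":
    # Placeholder paths and extensions, chosen only for the example.
    cmd = build_command("proteins/", "results/", "annotations/",
                        xt="faa.gz", annoxt="tsv", threads=8)
    print(" ".join(cmd))
    # To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Building the arguments as a list avoids shell quoting problems when folder names contain spaces.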

NOTE: When running OPUs_pipe, set the maxmem parameter to a suitable fraction of the RAM available on your machine. You need at least 4 GB of RAM to get results; otherwise the program will fail due to insufficient memory.
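One way to pick a maxmem value is to derive it from the machine's total RAM. The sketch below is an assumption-laden example, not part of the pipeline: the `suggest_maxmem` and `total_ram_bytes` helpers are hypothetical, the default fraction of 0.5 is an arbitrary choice, and the `<N>G` format is inferred from the documented default of `4G`.

```python
import os

GIB = 2 ** 30


def suggest_maxmem(total_bytes, fraction=0.5):
    """Return a maxmem string such as '8G' for a given amount of RAM.

    Raises ValueError when the machine has less than the 4 GB that the
    README states as the minimum.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    if total_bytes < 4 * GIB:
        raise ValueError("at least 4 GB of RAM is required")
    gib = max(1, int(total_bytes * fraction) // GIB)
    return f"{gib}G"


def total_ram_bytes():
    """Total physical RAM in bytes, via POSIX sysconf (works on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    print(suggest_maxmem(total_ram_bytes()))
```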

Inputs

As inputs you need:

[i]. FASTA files with the predicted proteins per sample; they can be either gzipped or xz-compressed;
[ii]. TSV files from the emapper2 output, one for each protein sequence file.
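To check an input file before a run, a compressed FASTA file can be read with the Python standard library alone. This is a minimal sketch and not the pipeline's own reader (which may rely on biopython); the `open_text` and `read_fasta` names are hypothetical, and the compression is guessed from the file suffix.

```python
import gzip
import lzma
from pathlib import Path

# Map a file suffix to the function that opens it; plain files use open().
_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def open_text(path):
    """Open a plain, gzipped, or xz-compressed file in text mode."""
    opener = _OPENERS.get(Path(path).suffix, open)
    return opener(path, "rt")


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

For example, `sum(1 for _ in read_fasta(path))` counts the proteins in one sample file.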

Files from [i] and [ii] should be named with the same prefix, equivalent to the sample they belong to. You can place them in the same folder or, alternatively, in separate folders for sequences and annotations.
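The naming rule can be checked before a run by pairing files on their sample prefix. The sketch below assumes that the prefix is the file name with the configured extension removed; that rule, the `sample_prefix` and `pair_inputs` helpers, and the example file names are illustrative assumptions, not the pipeline's actual matching logic.

```python
from pathlib import Path


def sample_prefix(path, extension):
    """Strip a known extension from a file name to get the sample prefix."""
    name = Path(path).name
    suffix = extension if extension.startswith(".") else "." + extension
    if not name.endswith(suffix):
        raise ValueError(f"{name} does not end with {suffix}")
    return name[: -len(suffix)]


def pair_inputs(seq_files, anno_files, xt, annoxt):
    """Return (paired, unpaired).

    paired maps each sample prefix to (sequence file, annotation file);
    unpaired lists the prefixes that are missing one of the two files.
    """
    seqs = {sample_prefix(p, xt): p for p in seq_files}
    annos = {sample_prefix(p, annoxt): p for p in anno_files}
    common = sorted(seqs.keys() & annos.keys())
    paired = {s: (seqs[s], annos[s]) for s in common}
    unpaired = sorted(seqs.keys() ^ annos.keys())
    return paired, unpaired


if __name__ == "__main__":
    paired, unpaired = pair_inputs(
        ["seqs/S1.faa.gz", "seqs/S2.faa.gz"],
        ["anno/S1.emapper.tsv"],
        xt="faa.gz", annoxt="emapper.tsv",
    )
    print(paired)
    print("samples missing a file:", unpaired)
```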

Outputs

You can get a detailed list and description of files generated by following the links below:

general_OPUs_annotation.tsv.xz

OPUs_cluster_relationship.tsv.xz

OPU_table.tsv.xz

result_rep_seq.fasta.xz

summarized_OPUs_annotation.tsv.xz
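The tabular outputs are xz-compressed TSV files, so they can be inspected without decompressing them to disk. A minimal sketch using only the standard library follows; the `read_tsv_xz` helper is hypothetical, and it assumes that the first line of each file is a header row, which this README does not confirm.

```python
import csv
import lzma


def read_tsv_xz(path, limit=None):
    """Return (header, rows) from an xz-compressed TSV file.

    limit caps the number of data rows read, which is useful for a quick
    look at large tables.
    """
    with lzma.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        rows = []
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return header, rows
```

For example, `read_tsv_xz("OPU_table.tsv.xz", limit=5)` would return the column names and the first five rows, with the path adjusted to your output folder.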

Testing script

After installation, you can verify that everything works by running the script supplied in the testing folder.

You will need at least 4 GB of RAM to run the test.

cd testing/
./test_run.sh

It should return the status of your installation.
