celiosantosjr/OPUs_pipe

OPU pipeline processing large sets of files

clustering large sets of proteins.

01 Feb 2024.

Based on Python 3.11 -- Run environment: preferably Ubuntu / 64-bit Linux

Installation

Dependencies:

  • biopython
  • pandas
  • mmseqs

To install, run:

conda create --name OPUs python=3.11
conda activate OPUs
pip install pandas biopython
mamba install -c bioconda mmseqs2
python3 main.py -h

Parameters

Usage: main.py [options]

options:

  -h, --help             show this help message and exit
  -ofolder OFOLDER       Output folder path
  -pfolder PFOLDER       Project folder path
  -annofolder ANNOFOLDER Annotation folder path
  -annoxt ANNOXT         Annotation file extension
  -minseqid MINSEQID     Minimum sequence identity (0.975 default)
  -threads THREADS       Number of threads (3 default)
  -minocc MINOCC         Minimum number of occurrences to keep a OPU (2 default)
  -minlen MINLEN         Minimum number of residues to consider true a protein (35 default)
  -xt XT                 Protein sequences fasta file extension
  -maxmem MAXMEM         Maximum memory (4G default)

NOTE: When running OPUs_pipe, set the maxmem parameter to a suitable fraction of the RAM available on your machine. You need at least 4 GB of RAM to get results; otherwise the program will fail due to insufficient memory.
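One way to pick a value for maxmem is to take a fixed fraction of the machine's physical memory. The sketch below is only an illustration and is not part of OPUs_pipe: the helper names and the 0.75 default fraction are assumptions, and it assumes the `G` suffix in the help text (as in the `4G` default) means whole gibibytes.

```python
import os


def total_ram_bytes() -> int:
    """Total physical memory, queried through POSIX sysconf (Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def suggest_maxmem(total_bytes: int, fraction: float = 0.75) -> str:
    """Return a string such as '12G' covering `fraction` of `total_bytes`.

    Hypothetical helper: the fraction is a judgment call, not a value
    recommended by the pipeline's authors.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    gib = int(total_bytes * fraction) // (1024 ** 3)
    if gib < 1:
        raise ValueError("less than 1 GiB would be allocated")
    return f"{gib}G"


# Example: value to pass as `-maxmem` on the current machine
# maxmem = suggest_maxmem(total_ram_bytes())
```

The resulting string can then be passed on the command line, for example `python3 main.py ... -maxmem 12G`.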

Inputs

As inputs you need:

  • [i]. FASTA files with the predicted proteins, one per sample; they can be either gzip- or xz-compressed;
  • [ii]. TSV files from the emapper2 output, one for each protein sequence file.

Files from [i] and [ii] should be named with the same prefix, corresponding to the sample they belong to. You can place them in the same folder or, alternatively, in separate folders for sequences and annotations.
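Before launching a run, it can help to check that every sequence file has a matching annotation file. The following is a minimal sketch of such a check, not the matching logic used by OPUs_pipe itself: the function names are hypothetical, and the extensions in the usage example (`.faa`, `.emapper.annotations`) are assumptions that should be replaced by whatever you pass to `-xt` and `-annoxt`.

```python
from pathlib import Path

# Sequence files may be plain, gzipped or xz-compressed (see Inputs).
COMPRESSION_SUFFIXES = ("", ".gz", ".xz")


def sample_prefix(filename, extension, compressions=("",)):
    """Return the sample prefix of `filename`, or None if it does not match."""
    for comp in compressions:
        suffix = extension + comp
        if suffix and filename.endswith(suffix):
            return filename[: -len(suffix)]
    return None


def pair_inputs(seq_dir, anno_dir, seq_ext, anno_ext):
    """Pair sequence and annotation files that share a sample prefix.

    Returns (paired, unpaired): a dict prefix -> (sequence path, annotation
    path), and a sorted list of prefixes found on one side only.
    """
    seqs, annos = {}, {}
    for path in Path(seq_dir).iterdir():
        prefix = sample_prefix(path.name, seq_ext, COMPRESSION_SUFFIXES)
        if prefix is not None:
            seqs[prefix] = path
    for path in Path(anno_dir).iterdir():
        prefix = sample_prefix(path.name, anno_ext)
        if prefix is not None:
            annos[prefix] = path
    paired = {s: (seqs[s], annos[s]) for s in sorted(seqs.keys() & annos.keys())}
    unpaired = sorted(seqs.keys() ^ annos.keys())
    return paired, unpaired


# Example with assumed extensions; both kinds of file live in "project/":
# paired, unpaired = pair_inputs("project", "project", ".faa", ".emapper.annotations")
```

Any prefix reported in `unpaired` points to a sample whose sequences or annotations are missing or named inconsistently.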

Outputs

You can get a detailed list and description of files generated by following the links below:

general_OPUs_annotation.tsv.xz

OPUs_cluster_relationship.tsv.xz

OPU_table.tsv.xz

result_rep_seq.fasta.xz

summarized_OPUs_annotation.tsv.xz
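All of the output files listed above are xz-compressed, so they can be inspected without decompressing them to disk. The sketch below uses only the Python standard library to peek at the first rows of a tab-separated output; the helper name is hypothetical, and it makes no assumption about the column layout, which is not described in this README.

```python
import csv
import lzma
from itertools import islice


def peek_tsv_xz(path, n_rows=5):
    """Return up to `n_rows` rows (lists of strings) from an xz-compressed TSV."""
    with lzma.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return list(islice(reader, n_rows))


# Example: show the first rows of the OPU table in the output folder
# for row in peek_tsv_xz("OPU_table.tsv.xz"):
#     print(row)
```

Since pandas is already a dependency, `pandas.read_csv(path, sep="\t")` is an alternative for loading a whole table; it infers xz compression from the `.xz` extension.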

Testing script

After installation, you can verify that everything works by running the script supplied in the testing folder.

You will need at least 4 GB of RAM to run the test.

cd testing/
./test_run.sh

It should return the status of your installation.

About

Protein operational taxonomic units clustering pipeline
