DessimozLab
diff --git a/README.md b/README.md
Lines changed: 29 additions & 40 deletions
diff --git a/read2tree/Progress.py b/archive/Progress.py
read2tree/Progress.py renamed to archive/Progress.py
Lines changed: 23 additions & 3 deletions
diff --git a/archive/run_r2t.py b/archive/run_r2t.py
Lines changed: 20 additions & 0 deletions
diff --git a/archive/tests/test_use.py b/archive/tests/test_use.py
Lines changed: 3 additions & 0 deletions
diff --git a/environment.yml b/environment.yml
Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,6 @@
 # read2tree 
 
-read2tree is a software tool that allows to obtain alignment matrices for tree inference. For this purpose it makes use of the OMA database and a set of reads. Its strength lies in the fact that it bipasses the several standard steps when obtaining such a matrix in regular analysis. These steps are read filtereing, assembly, gene prediction, gene annotation, all vs all comparison, orthology prediction, alignment and concatination. 
+read2tree is a software tool for obtaining alignment matrices for tree inference. For this purpose it makes use of the OMA database and a set of reads. Its strength lies in the fact that it bypasses several standard steps needed to obtain such a matrix in a regular analysis: read filtering, assembly, gene prediction, gene annotation, all-vs-all comparison, orthology prediction, alignment and concatenation. 
 
 read2tree works in linux with  [![Python 3.10.8](https://img.shields.io/badge/python-3.10.8-blue.svg)](https://www.python.org/downloads/release/python-310/)
 
@@ -47,40 +47,21 @@ conda install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsi
 conda install -c bioconda dendropy pysam
 ```
 
-Besides, you need softwares including [mafft](http://mafft.cbrc.jp/alignment/software/) (multiple sequence aligner), [iqtree](http://www.iqtree.org/) (phylogenomic inference), [ngmlr](https://github.com/philres/ngmlr), [ngm/nextgenmap](https://github.com/Cibiv/NextGenMap) (long and short read mappers), and [samtools](http://www.htslib.org/download/) which could be installed using conda.
+Besides, you need software including [mafft](http://mafft.cbrc.jp/alignment/software/) (multiple sequence aligner), [iqtree](http://www.iqtree.org/) (phylogenomic inference), [minimap2](https://github.com/lh3/minimap2) (long and short read mapper), and [samtools](http://www.htslib.org/download/), all of which can be installed using conda.
+For this version, the `--read_type` argument accepts any minimap2 options string that defines how reads are aligned to the reference, for example `-ax sr`, `-ax map-hifi` or `-ax map-ont`. You can also pass `--threads 40` to set the number of threads used by minimap2.
 ```
-conda install -c bioconda mafft iqtree ngmlr nextgenmap samtools
+conda install -c bioconda mafft iqtree minimap2 samtools
 ```
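To illustrate how a `--read_type` options string such as `-ax sr` could end up in a minimap2 call, here is a minimal sketch. The helper `build_minimap2_command` is hypothetical and is not taken from the read2tree code base; how read2tree assembles the call internally is not shown in this diff. Only the minimap2 options `-a`, `-x` and `-t` are assumed.

```python
import shlex


def build_minimap2_command(read_type, reference, reads, threads=1):
    """Assemble a minimap2 argument list from an options string.

    Hypothetical helper for illustration only; read2tree's own
    command construction is not part of this diff.
    """
    # Split the options string the way a shell would: "-ax sr" -> ["-ax", "sr"]
    options = shlex.split(read_type)
    # minimap2 expects: minimap2 [options] <reference> <reads...>
    return ["minimap2", *options, "-t", str(threads), reference, *reads]


command = build_minimap2_command(
    "-ax sr", "dna_ref.fa", ["species1_R1.fastq", "species1_R2.fastq"], threads=40
)
print(" ".join(command))
```

Passing the options as one string keeps the read2tree interface independent of the sequencing technology: switching from short reads to ONT reads only changes the string to `-ax map-ont`.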
 
 Then, you can install the read2tree package after downloading it from this GitHub repo using
 
 ```
-git clone https://github.com/DessimozLab/read2tree.git
+git clone https://github.com/DessimozLab/read2tree.git -b minimap2
 cd read2tree
 python setup.py install
 ```
 
 
-### 2) Installation using Conda
-
-
-```
-conda create -n r2t python=3.10.8
-conda install -c bioconda  read2tree 
-
-```
-Alternatively, you could also try using [mamba](https://mamba.readthedocs.io/en/latest/). Caution: please read about compatiblity of conda and mamba in one envirnoment.
-
-### 3) Installation using Docker
-The Dockerfile is also available in this repository. There is an example how to run in the [test example](#test-example) section.
-
-A prebuild container can be loaded from dockerhub:
-```
-docker pull dessimozlab/read2tree:latest
-```
-
-
-
 
 
 ## Run
@@ -100,7 +81,7 @@ cat marker_genes/*.fna > dna_ref.fa
 
 ### output 
 
-The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as defualt.  
+The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as default.  
 
 
 ### Single species mode
@@ -109,12 +90,23 @@ read2tree --tree --standalone_path marker_genes/ --reads read_1.fastq read_2.fas
 ```
 
 ### Multiple species mode
+
+#### step1
 ```
-read2tree --standalone_path marker_genes/ --output_path output --reference --dna_reference  dna_ref.fa  # this creates just the reference folder 01 - 03
-read2tree --standalone_path marker_genes/ --output_path output --reads species1_R1.fastq species2_R2.fastq
-read2tree --standalone_path marker_genes/ --output_path output --reads species2_R1.fastq species2_R2.fastq
-read2tree --standalone_path marker_genes/ --output_path output --reads species3_R1.fastq species3_R2.fastq
-read2tree --standalone_path marker_genes/ --output_path output --merge_all_mappings --tree
+read2tree  --step 1marker  --standalone_path marker_genes  --dna_reference dna_ref.fa --output_path output  --debug 
+```
+
+#### step2
+The following could be run in parallel. 
+```
+read2tree --step 2map --standalone_path marker_genes  --dna_reference dna_ref.fa --reads species1_R1.fastq species1_R2.fastq  --output_path output --debug
+read2tree --step 2map --standalone_path marker_genes  --dna_reference dna_ref.fa --reads species2_R1.fastq species2_R2.fastq  --output_path output --debug
+read2tree --step 2map --standalone_path marker_genes  --dna_reference dna_ref.fa --reads species3_R1.fastq species3_R2.fastq  --output_path output  --debug
+```
+
+#### step3
+```
+read2tree  --step 3combine --standalone_path marker_genes  --dna_reference dna_ref.fa  --output_path output  --tree --debug
 ```
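The three steps above can also be driven from a script. The following is a minimal sketch, assuming `read2tree` is on the `PATH` and using only the flags shown in the commands above; the function names `step_commands` and `run_pipeline` are hypothetical. Step 2 is run once per sample in a thread pool, since the README states those runs are independent.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Arguments shared by all three steps, as in the README commands.
COMMON = ["--standalone_path", "marker_genes",
          "--dna_reference", "dna_ref.fa",
          "--output_path", "output"]


def step_commands(samples):
    """Return (step1, [one step2 command per sample], step3) argument lists."""
    step1 = ["read2tree", "--step", "1marker", *COMMON]
    step2 = [["read2tree", "--step", "2map", *COMMON, "--reads", *reads]
             for reads in samples]
    step3 = ["read2tree", "--step", "3combine", *COMMON, "--tree"]
    return step1, step2, step3


def run_pipeline(samples, runner=None, max_workers=3):
    """Run step 1, then all step 2 commands in parallel, then step 3."""
    if runner is None:
        # check=True raises CalledProcessError if read2tree exits non-zero.
        runner = lambda cmd: subprocess.run(cmd, check=True)
    step1, step2, step3 = step_commands(samples)
    runner(step1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every mapping run and re-raises any failure.
        list(pool.map(runner, step2))
    runner(step3)


# Example call (not executed here):
# run_pipeline([["species1_R1.fastq", "species1_R2.fastq"],
#               ["species2_R1.fastq", "species2_R2.fastq"]])
```

Threads are sufficient here because each worker only waits on an external process; on a cluster the step 2 commands would more likely be submitted as separate jobs.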
 
 ### bootstraping
@@ -141,12 +133,12 @@ The goal of this test example is to infer species tree for Mus musculus using it
 
 ```
 cd tests
-read2tree --debug --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output/  --dna_reference  dna_ref.fa 
+read2tree --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq  --output_path output --dna_reference  dna_ref.fa  
 ```
 
 
 #### Run test example using docker
-
+(to be updated)
 ```
 docker run --rm -i -v $PWD/tests:/input -v $PWD/tests/:/reads -v $PWD/output:/out -v $PWD/run:/run  dessimozlab/read2tree:latest  --tree --standalone_path /input/marker_genes --dna_reference /input/cds-marker_genes.fasta.gz --reads /reads/sample_1.fastq --output_path /out
 ```
@@ -190,33 +182,30 @@ export LANG=en_US.UTF-8
 
 ## Change log
 
-
+- version 1.5:
+  - using minimap2 as the read mapper 
 - version 0.1.5:
   - fix issue with UnknownSeq being removed in Biopython>1.80
   - removing unused modeltester wrappers
-
 - version 0.1.4:
    - allow reference folders not named marker_genes (#12)
    - update environment.yml file to contain all dependencies (#16)
    - documentation improvements
    - CI/CD pipeline
-
 - version 0.1.3: 
    - improvements of documentation
    - adding support for docker
-   - small bugfixes 
-
+   - small bugfixes
 - version 0.1.2: packaging
-
 - version 0.1.0: Adding covid analysis
-
 - version 0.0: Initial work
 
 
 ## Authors
 
-* [David Dylus](https://github.com/dvdylus), (main author)
+* [David Dylus](https://github.com/dvdylus)
 * [Adrian Altenhoff](http://people.inf.ethz.ch/adriaal).
+* [Sina Majidian](https://sinamajidian.github.io/)
 
 
 The authors would like to thank Alex Warwick for help how to initiate such a package.
 
@@ -87,7 +87,7 @@ def _extract_line_from_log(self, word, logfile):
             with open(logfile, "r") as file:
                 bestline = [line.split() for line in file if word in line]
                 if bestline:
-                    return bestline[-1]
+                    return bestline[-1] # the last line with this word is selected
                 return None
         except FileNotFoundError:
             print('File {} not accessible'.format(logfile))
@@ -101,6 +101,7 @@ def _get_number_of_OGs(self):
         '''
         log_list = self._extract_line_from_log('Gathering', 'mplog.log')
         if log_list:
+            logging.debug(' We are using the info from line #' +" ".join(log_list[:3])+".# So number of OGs is "+str(int(log_list[13])) )
             return int(log_list[13])
         else:
             return 0
@@ -113,6 +114,7 @@ def _get_number_of_appeneded_seq_to_OGs(self):
         '''
         log_list = self._extract_line_from_log('Appending', 'mplog.log')
         if log_list:
+            logging.debug(' We are using the info from line #' + " ".join(log_list[:3]) + ".# So number of appended sequences to OGs " + str(int(log_list[9])))
             return int(log_list[9])
         else:
             return 0
@@ -125,6 +127,7 @@ def _get_number_of_alignments(self):
         '''
         log_list = self._extract_line_from_log('Alignment of', 'mplog.log')
         if log_list:
+            logging.debug(' We are using the info from line #' + " ".join(log_list[:3]) + ".# So number of alignments is " + str(int(log_list[10])))
             return int(log_list[10])
         else:
             return 0
@@ -135,9 +138,10 @@ def _get_number_of_references(self):
         2018-11-23 12:13:53,691 - read2tree.ReferenceSet - INFO - ass: Extracted 6 reference species form 5 ogs took 0.0008709430694580078
         :return: Number of reference species
         '''
-        log_list = self._extract_line_from_log('ReferenceSet', 'mplog.log')
+        log_list = self._extract_line_from_log('ReferenceSet', 'mplog.log')  # the last line with ReferenceSet is selected
         if log_list:
-            return int(log_list[9])
+            logging.debug("We are using the info from line #" + " ".join(log_list[:3]) + ".# So number of references is " + str(int(log_list[9])))
+            return int(log_list[9])
         else:
             return 0
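The index-based parsing used in these methods can be checked against the example log line quoted in the docstring above. This is a standalone sketch, not the class method itself: `last_line_with` and `number_of_references` are hypothetical free functions that mirror `_extract_line_from_log` and `_get_number_of_references`, and the debug message uses the logging module's `%s` arguments instead of string concatenation.

```python
import logging


def last_line_with(word, logfile):
    """Return the whitespace-split tokens of the last line containing word, or None."""
    try:
        with open(logfile, "r") as handle:
            matches = [line.split() for line in handle if word in line]
    except FileNotFoundError:
        logging.warning("File %s not accessible", logfile)
        return None
    return matches[-1] if matches else None


def number_of_references(logfile):
    """Parse the reference count (token 9) from the last ReferenceSet log line."""
    tokens = last_line_with("ReferenceSet", logfile)
    if not tokens:
        return 0
    # Arguments are only formatted if DEBUG logging is enabled.
    logging.debug("Using log line '%s'; number of references is %s",
                  " ".join(tokens[:3]), tokens[9])
    return int(tokens[9])
```

Splitting the example line on whitespace puts the date, time and separators in tokens 0 to 6, `ass:` in token 7, `Extracted` in token 8 and the count in token 9. The fixed index therefore depends on the species name (`ass` here) containing no spaces.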
 
@@ -159,9 +163,12 @@ def _get_og_set_status(self):
         if os.path.exists(self._folder_ref_ogs_aa) and os.path.exists(self._folder_ref_ogs_dna):
             num_ogs_aa = self._count_files(self._folder_ref_ogs_aa, '*fa')
             num_ogs_dna = self._count_files(self._folder_ref_ogs_dna, '*fa')
+            logging.debug(' We are counting the number of fa files in folder _ogs_aa and _ogs_dna which are ' + str(num_ogs_aa)  + " and "+ str(num_ogs_dna) +" in folder "+str(self._folder_ref_ogs_aa) +" and "+ str(self._folder_ref_ogs_dna))
             if (num_ogs_expected-num_ogs_aa) == 0 and (num_ogs_expected-num_ogs_dna) == 0:
+                logging.debug(' We are counting the number of fa files in folder _ogs_aa and _ogs_dna, which are ' + str(num_ogs_aa) + " and " + str(num_ogs_dna) +", the same as expected "+str(num_ogs_expected) +". So this step is done")
                 return True
             else:
+                logging.debug(' We are counting the number of fa files in folder _ogs_aa and _ogs_dna, which are ' + str(num_ogs_aa) + " and " + str(num_ogs_dna) +" but not the same as expected "+str(num_ogs_expected)  +". So this step is not done")
                 return False
         else:
             return False
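The completeness check in this hunk reduces to comparing two file counts with an expected number. Here is a minimal standalone sketch of that logic, assuming `_count_files` counts glob matches in a folder (its implementation is not shown in this diff); `count_files` and `og_set_is_complete` are hypothetical names.

```python
import glob
import os


def count_files(folder, pattern):
    """Count files in folder whose names match the glob pattern."""
    return len(glob.glob(os.path.join(folder, pattern)))


def og_set_is_complete(folder_aa, folder_dna, num_expected):
    """Return True only if both folders exist and hold exactly num_expected fa files."""
    if not (os.path.exists(folder_aa) and os.path.exists(folder_dna)):
        return False
    return (count_files(folder_aa, "*fa") == num_expected
            and count_files(folder_dna, "*fa") == num_expected)
```

Note that the pattern `*fa` also matches names such as `x.fasta.fa` or `notfa`; a stricter pattern like `*.fa` would be a behavior change and is not what the original code uses.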
@@ -176,8 +183,11 @@ def _get_append_og_set_status(self):
             num_ogs_aa = self._count_files(self._folder_append_og_aa, '*fa')
             num_ogs_dna = self._count_files(self._folder_append_og_dna, '*fa')
             if (num_ogs_expected-num_ogs_aa) <= 0 and (num_ogs_expected-num_ogs_dna) <= 0:
+                logging.debug("Number of ogs expected after appending is " + str(num_ogs_expected) + " and the numbers found in " + str(self._folder_append_og_aa) + " and " + str(self._folder_append_og_dna) + " are " + str(num_ogs_aa) + " and " + str(num_ogs_dna) + ". So this step is done")
                 return True
             else:
+                logging.debug("Number of ogs expected after appending is " + str(num_ogs_expected) + " but the numbers found in " + str(self._folder_append_og_aa) + " and " + str(self._folder_append_og_dna) + " are " + str(num_ogs_aa) + " and " + str(num_ogs_dna) + ". So this step is not done")
+
                 return False
         else:
             return False
@@ -191,8 +201,10 @@ def _get_reference_status(self):
         if os.path.exists(self._folder_ref_dna):
             num_references = self._count_files(self._folder_ref_dna, '*fa')
             if (num_ref_expected-num_references) == 0:
+                logging.debug("Number of ref expected is " + str(num_ref_expected) + " and number in " + str(self._folder_ref_dna) + " is " + str(num_references) + ". So this step is done")
                 return True
             else:
+                logging.debug("Number of ref expected is " + str(num_ref_expected) + " but number in " + str(self._folder_ref_dna) + " is " + str(num_references) + ". So this step is not done")
                 return False
         else:
             return False
@@ -207,8 +219,10 @@ def _get_alignment_status(self):
             num_align_aa = self._count_files(self._folder_align_aa, '*phy')
             num_align_dna = self._count_files(self._folder_align_dna, '*phy')
             if (num_aligns_expected-num_align_aa) == 0 and (num_aligns_expected-num_align_dna) == 0:
+                logging.debug("Number of aligns expected is " + str(num_aligns_expected) + " and number in " + str(self._folder_align_dna) + " is " + str(num_align_dna) + ", similarly for aa. So this step is done")
                 return True
             else:
+                logging.debug("Number of aligns expected is " + str(num_aligns_expected) + " and number in " + str(self._folder_align_dna) + " and aa versions are " + str(num_align_dna) + " and " + str(num_align_aa) + ". So this step is not done")
                 return False
         else:
             return False
@@ -223,8 +237,10 @@ def _get_append_alignment_status(self):
             num_align_aa = self._count_files(self._folder_align_append_aa, '*phy')
             num_align_dna = self._count_files(self._folder_align_append_dna, '*phy')
             if (num_aligns_expected-num_align_aa) == 0 and (num_aligns_expected-num_align_dna) == 0:
+                logging.debug("Number of aligns expected is " + str(num_aligns_expected) + " and number in " + str(self._folder_align_append_aa) + " is " + str(num_align_aa) + ", similarly for dna. So this step is done")
                 return True
             else:
+                logging.debug("Number of aligns expected is " + str(num_aligns_expected) + " and number in " + str(self._folder_align_append_aa) + " and dna versions are " + str(num_align_aa) + " and " + str(num_align_dna) + ". So this step is not done")
                 return False
         else:
             return False
@@ -233,6 +249,7 @@ def _get_finished_mapping_folders(self, path):
         mapping_folders_finished = []
         num_expected_mappings = self._get_number_of_references()
         mapping_folders = [x for x in os.listdir(path) if '04' in x]
+        logging.debug("Number of mapping expected is " + str(num_expected_mappings) + " we are checking folders in  " + str(path))
         for folder in mapping_folders:
             # NOTE: we are calculating the number of completed mappings as the number of existing cov files,
             # because these are written even if the mapping step did not find any reads to map to a particular reference
@@ -258,10 +275,13 @@ def _get_mapping_status(self):
             if len(mapping_folders) > 0:
                 self.num_completed_mappings = len(mapping_folders)
                 # self.logger.info('{}: Mapping completed!'.format(self._species_name))
+                logging.debug("There are some mapping folders " + str(self.num_completed_mappings))
                 return True
             else:
                 self.num_completed_mappings = 0
+                logging.debug("There is no mapping folder")
                 # self.logger.info('{}: Mapping not completed!'.format(self._species_name))
                 return False
         else:
+            logging.debug("There is no mapping folder")
             return False
@@ -0,0 +1,20 @@
+
+
+import read2tree
+from read2tree.main import main
+from read2tree._utils import exe_name
+
+import sys
+
+print("start run_r2t 2  223  3 ")
+main(sys.argv[1:], exe_name=exe_name(), desc="descr")
+
+
+# --step 1marker  --standalone_path marker_genes  --dna_reference dna_ref.fa --output_path output  --debug
+# --step 2map --standalone_path marker_genes  --dna_reference dna_ref.fa --reads /work/FAC/FBM/DBC/cdessim2/read2tree/v2_test/t1/reads_20/ERR7350657__2.fastq.gz  --output_path output --debug --threads 1
+
+print("finish run_r2t   ")
+
+
+
+a=1
@@ -10,10 +10,13 @@
 class Use(unittest.TestCase):
 
     def test_OGSet(self):
+        pass
 
     def test_write_progress(self):
+        pass
 
     def test_read_progress(self):
+        pass
 
 
 if __name__ == "__main__":
 
@@ -22,5 +22,4 @@ dependencies:
   - nextgenmap
   - samtools
   - filelock
-  - pyham
-  - pysam
+  - pysam