Skip to content

Commit 195ba7b

Browse files
authored
Merge pull request #74 from DessimozLab/minimap2
new version of r2t with better logging and improved speed
2 parents 925b650 + 1b92f72 commit 195ba7b

File tree

17 files changed

+798
-684
lines changed

17 files changed

+798
-684
lines changed

README.md

Lines changed: 29 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# read2tree
22

3-
read2tree is a software tool that allows to obtain alignment matrices for tree inference. For this purpose it makes use of the OMA database and a set of reads. Its strength lies in the fact that it bipasses the several standard steps when obtaining such a matrix in regular analysis. These steps are read filtereing, assembly, gene prediction, gene annotation, all vs all comparison, orthology prediction, alignment and concatination.
3+
read2tree is a software tool that allows to obtain alignment matrices for tree inference. For this purpose it makes use of the OMA database and a set of reads. Its strength lies in the fact that it bipasses the several standard steps when obtaining such a matrix in regular analysis. These steps are read filtereing, assembly, gene prediction, gene annotation, all vs all comparison, orthology prediction, alignment and concatenation.
44

55
read2tree works in linux with [![Python 3.10.8](https://img.shields.io/badge/python-3.10.8-blue.svg)](https://www.python.org/downloads/release/python-310/)
66

@@ -47,40 +47,21 @@ conda install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsi
4747
conda install -c bioconda dendropy pysam
4848
```
4949

50-
Besides, you need softwares including [mafft](http://mafft.cbrc.jp/alignment/software/) (multiple sequence aligner), [iqtree](http://www.iqtree.org/) (phylogenomic inference), [ngmlr](https://github.com/philres/ngmlr), [ngm/nextgenmap](https://github.com/Cibiv/NextGenMap) (long and short read mappers), and [samtools](http://www.htslib.org/download/) which could be installed using conda.
50+
Besides, you need softwares including [mafft](http://mafft.cbrc.jp/alignment/software/) (multiple sequence aligner), [iqtree](http://www.iqtree.org/) (phylogenomic inference), [minimap2](https://github.com/lh3/minimap2) (long and short read mappers), and [samtools](http://www.htslib.org/download/) which could be installed using conda.
51+
For this version, the `--read_type` argument accepts any minimap2 options string that defines how reads are aligned to the reference. For example, it could be `-ax sr`, `-ax map-hifi` or `-ax map-ont`. You can also pass `--threads 40` to be used with minimap2.
5152
```
52-
conda install -c bioconda mafft iqtree ngmlr nextgenmap samtools
53+
conda install -c bioconda mafft iqtree minimap2 samtools
5354
```
5455

5556
Then, you can install the read2tree package after downlaoding the package from this GitHub repo using
5657

5758
```
58-
git clone https://github.com/DessimozLab/read2tree.git
59+
git clone https://github.com/DessimozLab/read2tree.git -b minimap2
5960
cd read2tree
6061
python setup.py install
6162
```
6263

6364

64-
### 2) Installation using Conda
65-
66-
67-
```
68-
conda create -n r2t python=3.10.8
69-
conda install -c bioconda read2tree
70-
71-
```
72-
Alternatively, you could also try using [mamba](https://mamba.readthedocs.io/en/latest/). Caution: please read about compatiblity of conda and mamba in one envirnoment.
73-
74-
### 3) Installation using Docker
75-
The Dockerfile is also available in this repository. There is an example how to run in the [test example](#test-example) section.
76-
77-
A prebuild container can be loaded from dockerhub:
78-
```
79-
docker pull dessimozlab/read2tree:latest
80-
```
81-
82-
83-
8465

8566

8667
## Run
@@ -100,7 +81,7 @@ cat marker_genes/*.fna > dna_ref.fa
10081

10182
### output
10283

103-
The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as defualt.
84+
The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as default.
10485

10586

10687
### Single species mode
@@ -109,12 +90,23 @@ read2tree --tree --standalone_path marker_genes/ --reads read_1.fastq read_2.fas
10990
```
11091

11192
### Multiple species mode
93+
94+
#### step1
11295
```
113-
read2tree --standalone_path marker_genes/ --output_path output --reference --dna_reference dna_ref.fa # this creates just the reference folder 01 - 03
114-
read2tree --standalone_path marker_genes/ --output_path output --reads species1_R1.fastq species2_R2.fastq
115-
read2tree --standalone_path marker_genes/ --output_path output --reads species2_R1.fastq species2_R2.fastq
116-
read2tree --standalone_path marker_genes/ --output_path output --reads species3_R1.fastq species3_R2.fastq
117-
read2tree --standalone_path marker_genes/ --output_path output --merge_all_mappings --tree
96+
read2tree --step 1marker --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --debug
97+
```
98+
99+
#### step2
100+
The following could be run in parallel.
101+
```
102+
read2tree --step 2map --standalone_path marker_genes --dna_reference dna_ref.fa --reads species1_R1.fastq species2_R2.fastq --output_path output --debug
103+
read2tree --step 2map --standalone_path marker_genes --dna_reference dna_ref.fa --reads species2_R1.fastq species2_R2.fastq --output_path output --debug
104+
read2tree --step 2map --standalone_path marker_genes --dna_reference dna_ref.fa --reads species3_R1.fastq species3_R2.fastq --output_path output --debug
105+
```
106+
107+
#### step3
108+
```
109+
read2tree --step 3combine --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --tree --debug
118110
```
119111

120112
### bootstraping
@@ -141,12 +133,12 @@ The goal of this test example is to infer species tree for Mus musculus using it
141133

142134
```
143135
cd tests
144-
read2tree --debug --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output/ --dna_reference dna_ref.fa
136+
read2tree --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output --dna_reference dna_ref.fa
145137
```
146138

147139

148140
#### Run test example using docker
149-
141+
(to be updated )
150142
```
151143
docker run --rm -i -v $PWD/tests:/input -v $PWD/tests/:/reads -v $PWD/output:/out -v $PWD/run:/run dessimozlab/read2tree:latest --tree --standalone_path /input/marker_genes --dna_reference /input/cds-marker_genes.fasta.gz --reads /reads/sample_1.fastq --output_path /out
152144
```
@@ -190,33 +182,30 @@ export LANG=en_US.UTF-8
190182

191183
## Change log
192184

193-
185+
- version 1.5:
186+
- using minimap2 as the read mapper
194187
- version 0.1.5:
195188
- fix issue with UnknownSeq being removed in Biopython>1.80
196189
- removing unused modeltester wrappers
197-
198190
- version 0.1.4:
199191
- allow reference folders not named marker_genes (#12)
200192
- update environment.yml file to contain all dependencies (#16)
201193
- documentation improvements
202194
- CI/CD pipeline
203-
204195
- version 0.1.3:
205196
- improvements of documentation
206197
- adding support for docker
207-
- small bugfixes
208-
198+
- small bugfixes
209199
- version 0.1.2: packaging
210-
211200
- version 0.1.0: Adding covid analysis
212-
213201
- version 0.0: Initial work
214202

215203

216204
## Authors
217205

218-
* [David Dylus](https://github.com/dvdylus), (main author)
206+
* [David Dylus](https://github.com/dvdylus)
219207
* [Adrian Altenhoff](http://people.inf.ethz.ch/adriaal).
208+
* [Sina Majidian](https://sinamajidian.github.io/)
220209

221210

222211
The authors would like to thank Alex Warwick for help how to initiate such a package.
Lines changed: 23 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -87,7 +87,7 @@ def _extract_line_from_log(self, word, logfile):
8787
with open(logfile, "r") as file:
8888
bestline = [line.split() for line in file if word in line]
8989
if bestline:
90-
return bestline[-1]
90+
return bestline[-1] # the last line with this word is selected
9191
return None
9292
except FileNotFoundError:
9393
print('File {} not accessible'.format(logfile))
@@ -101,6 +101,7 @@ def _get_number_of_OGs(self):
101101
'''
102102
log_list = self._extract_line_from_log('Gathering', 'mplog.log')
103103
if log_list:
104+
logging.debug(' We are using the info from line #' +" ".join(log_list[:3])+".# So number of OGs is "+str(int(log_list[13])) )
104105
return int(log_list[13])
105106
else:
106107
return 0
@@ -113,6 +114,7 @@ def _get_number_of_appeneded_seq_to_OGs(self):
113114
'''
114115
log_list = self._extract_line_from_log('Appending', 'mplog.log')
115116
if log_list:
117+
logging.debug(' We are using the info from line #' + " ".join(log_list[:3]) + ".# So number of appended sequences to OGs " + str(int(log_list[9])))
116118
return int(log_list[9])
117119
else:
118120
return 0
@@ -125,6 +127,7 @@ def _get_number_of_alignments(self):
125127
'''
126128
log_list = self._extract_line_from_log('Alignment of', 'mplog.log')
127129
if log_list:
130+
logging.debug(' We are using the info from line #' + " ".join(log_list[:3]) + ".# So number of alignments is " + str(int(log_list[10])))
128131
return int(log_list[10])
129132
else:
130133
return 0
@@ -135,9 +138,10 @@ def _get_number_of_references(self):
135138
2018-11-23 12:13:53,691 - read2tree.ReferenceSet - INFO - ass: Extracted 6 reference species form 5 ogs took 0.0008709430694580078
136139
:return: Number of reference species
137140
'''
138-
log_list = self._extract_line_from_log('ReferenceSet', 'mplog.log')
141+
log_list = self._extract_line_from_log('ReferenceSet', 'mplog.log') # # the last line with ReferenceSet is selected
139142
if log_list:
140-
return int(log_list[9])
143+
logging.debug("We are using the info from line #" + " ".join(log_list[:3]) + ".# So number of references is " + str(int(log_list[9])))
144+
return int(log_list[9]) #
141145
else:
142146
return 0
143147

@@ -159,9 +163,12 @@ def _get_og_set_status(self):
159163
if os.path.exists(self._folder_ref_ogs_aa) and os.path.exists(self._folder_ref_ogs_dna):
160164
num_ogs_aa = self._count_files(self._folder_ref_ogs_aa, '*fa')
161165
num_ogs_dna = self._count_files(self._folder_ref_ogs_dna, '*fa')
166+
logging.debug(' We are counting the number of fa files in folder _ogs_aa and _ogs_daa which are ' + str(num_ogs_aa) + " and "+ str(num_ogs_dna) +" in folder "+str(self._folder_ref_ogs_aa) +" and "+ str(self._folder_ref_ogs_dna))
162167
if (num_ogs_expected-num_ogs_aa) == 0 and (num_ogs_expected-num_ogs_dna) == 0:
168+
logging.debug(' We are counting the number of fa files in folder _ogs_aa and _ogs_daa, which are ' + str(num_ogs_aa) + " and " + str(num_ogs_dna) +" the same as expected"+str(num_ogs_expected) +". So this step is done")
163169
return True
164170
else:
171+
logging.debug(' We are counting the number of fa files in folder _ogs_aa and _ogs_daa, which are ' + str(num_ogs_aa) + " and " + str(num_ogs_dna) +" but not the same as expected "+str(num_ogs_expected) +". So this step is done")
165172
return False
166173
else:
167174
return False
@@ -176,8 +183,11 @@ def _get_append_og_set_status(self):
176183
num_ogs_aa = self._count_files(self._folder_append_og_aa, '*fa')
177184
num_ogs_dna = self._count_files(self._folder_append_og_dna, '*fa')
178185
if (num_ogs_expected-num_ogs_aa) <= 0 and (num_ogs_expected-num_ogs_dna) <= 0:
186+
logging.debug("Number of ogs expected after appending is"+str(num_ogs_expected)+" the same as ogs number in dna and aa " + str(self._folder_ref_ogs_dna) +". So this step is done")
179187
return True
180188
else:
189+
logging.debug("Number of ogs expected after appending is"+str(num_ogs_expected)+" but the number in dna and aa " + str(self._folder_ref_ogs_dna) +" are " + str(num_ogs_aa) + " and "+ str(num_ogs_dna)+". So this step is not done")
190+
181191
return False
182192
else:
183193
return False
@@ -191,8 +201,10 @@ def _get_reference_status(self):
191201
if os.path.exists(self._folder_ref_dna):
192202
num_references = self._count_files(self._folder_ref_dna, '*fa')
193203
if (num_ref_expected-num_references) == 0:
204+
logging.debug("Number of ref expected is " + str( num_ref_expected) + " and number in " + str( self._folder_ref_ogs_dna) +" is "+ str(num_references)+ ". So this step is done")
194205
return True
195206
else:
207+
logging.debug("Number of ref expected is " + str( num_ref_expected) + " but number in " + str( self._folder_ref_ogs_dna) +" is "+ str(num_references)+ ". So this step is not done")
196208
return False
197209
else:
198210
return False
@@ -207,8 +219,10 @@ def _get_alignment_status(self):
207219
num_align_aa = self._count_files(self._folder_align_aa, '*phy')
208220
num_align_dna = self._count_files(self._folder_align_dna, '*phy')
209221
if (num_aligns_expected-num_align_aa) == 0 and (num_aligns_expected-num_align_dna) == 0:
222+
logging.debug("Number of aligns expected is " + str( num_aligns_expected) + " and number in " + str( self._folder_ref_ogs_dna) +" is "+ str(num_align_dna)+ ", similarly for aa. So this step is done")
210223
return True
211224
else:
225+
logging.debug("Number of aligns expected is " + str( num_aligns_expected) + " and number in " + str( self._folder_ref_ogs_dna) +" and aa versions are "+ str(num_align_dna)+ " and "+ str(num_align_aa)+ ". So this step is not done")
212226
return False
213227
else:
214228
return False
@@ -223,8 +237,10 @@ def _get_append_alignment_status(self):
223237
num_align_aa = self._count_files(self._folder_align_append_aa, '*phy')
224238
num_align_dna = self._count_files(self._folder_align_append_dna, '*phy')
225239
if (num_aligns_expected-num_align_aa) == 0 and (num_aligns_expected-num_align_dna) == 0:
240+
logging.debug("Number of aligns expected is " + str(num_aligns_expected) + " and number in " + str(self._folder_align_append_aa) + " is " + str(num_align_aa) + ", similarly for aa. So this step is done")
226241
return True
227242
else:
243+
logging.debug("Number of aligns expected is " + str(num_aligns_expected) + " and number in " + str(self._folder_align_append_aa) + " and aa versions are " + str(num_align_aa) + " and " + str(num_align_dna) + ". So this step is not done")
228244
return False
229245
else:
230246
return False
@@ -233,6 +249,7 @@ def _get_finished_mapping_folders(self, path):
233249
mapping_folders_finished = []
234250
num_expected_mappings = self._get_number_of_references()
235251
mapping_folders = [x for x in os.listdir(path) if '04' in x]
252+
logging.debug("Number of mapping expected is " + str(num_expected_mappings) + " we are checking folders in " + str(path))
236253
for folder in mapping_folders:
237254
# NOTE: we are calculating the number of completed mappings as the number of existing cov files,
238255
# because these are written even if the mapping step did not find any reads to map to a particular reference
@@ -258,10 +275,13 @@ def _get_mapping_status(self):
258275
if len(mapping_folders) > 0:
259276
self.num_completed_mappings = len(mapping_folders)
260277
# self.logger.info('{}: Mapping completed!'.format(self._species_name))
278+
logging.debug("There are some mapping folders " + str(self.num_completed_mappings))
261279
return True
262280
else:
263281
self.num_completed_mappings = 0
282+
logging.debug("There is no mapping folder")
264283
# self.logger.info('{}: Mapping not completed!'.format(self._species_name))
265284
return False
266285
else:
286+
logging.debug("There is no mapping folder")
267287
return False

archive/run_r2t.py

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
2+
3+
import read2tree
4+
from read2tree.main import main
5+
from read2tree._utils import exe_name
6+
7+
import sys
8+
9+
print("start run_r2t 2 223 3 ")
10+
main(sys.argv[1:], exe_name=exe_name(), desc="descr")
11+
12+
13+
# --step 1marker --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --debug
14+
# --step 2map --standalone_path marker_genes --dna_reference dna_ref.fa --reads /work/FAC/FBM/DBC/cdessim2/read2tree/v2_test/t1/reads_20/ERR7350657__2.fastq.gz --output_path output --debug --threads 1
15+
16+
print("finish run_r2t ")
17+
18+
19+
20+
a=1

archive/tests/test_use.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,10 +10,13 @@
1010
class Use(unittest.TestCase):
1111

1212
def test_OGSet(self):
13+
pass
1314

1415
def test_write_progress(self):
16+
pass
1517

1618
def test_read_progress(self):
19+
pass
1720

1821

1922
if __name__ == "__main__":

environment.yml

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,5 +22,4 @@ dependencies:
2222
- nextgenmap
2323
- samtools
2424
- filelock
25-
- pyham
26-
- pysam
25+
- pysam

0 commit comments

Comments
 (0)