BigDataBiology
diff --git a/.github/workflows/test_gmsc_mapper.yml b/.github/workflows/test_gmsc_mapper.yml
Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest]
-        python-version: ["3.7", "3.8", "3.9", "3.10"]
+        python-version: ["3.8", "3.9"]
 
 
     steps:
 
diff --git a/README.md b/README.md
Lines changed: 147 additions & 73 deletions
@@ -1,6 +1,14 @@
 # GMSC-mapper
 
-Command line tool to query the Global Microbial smORFs Catalog (GMSC)
+GMSC-mapper is a command line tool to query the Global Microbial smORFs Catalog (GMSC).
+
+GMSC-mapper can be used to:
+- Find query smORFs (< 100 aa) homologous to sequences in the Global Microbial smORFs Catalog (GMSC) by alignment.
+  - Three types of input are supported:
+    - contigs (GMSC-mapper will predict smORFs from contigs first)
+    - amino acid sequences
+    - nucleotide gene sequences
+- Annotate query/predicted smORFs with quality, habitat and taxonomy information constructed manually in detail.
 
 ## Installation
 
@@ -27,64 +35,57 @@ During the process, we install also the following dependencies:
 - [MMseqs2](https://github.com/soedinglab/MMseqs2)
 - [Diamond](https://github.com/bbuchfink/diamond)
 
-And perform a series of tests using mock datasets to check if the installation works well:
+### Example test
+The whole GMSC database is large and takes some minutes to process.
 
-1. Input is genome contig sequences.
+To check quickly that the installation works, you can run the following tests on small mock datasets.
 
-```bash
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
-```
+- Create GMSC database index
 
-2. Input is amino acid sequences.
+The default alignment tool is Diamond.
 
 ```bash
-gmsc-mapper --aa-genes ../examples/example.faa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
+gmsc-mapper createdb -i examples/target.faa -o examples/ -m diamond
 ```
 
-3. Input is nucleotide gene sequences.
+If you want to use MMseqs2 as your alignment tool, you need to create the GMSC database index in MMseqs2 format.
 
 ```bash
-gmsc-mapper --nt-genes ../examples/example.fna --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
+gmsc-mapper createdb -i examples/target.faa -o examples/ -m mmseqs
 ```
 
-4. Check the Alignment tool: Diamond/MMseqs2 is optional
+- Input is genome contig sequences.
 
 ```bash
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --tool mmseqs
-
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --tool diamond
+gmsc-mapper -i examples/example.fa -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
 ```
 
-5. Flags to disable results from Habitat/taxonomy/quality annotation
+- Input is amino acid sequences.
 
 ```bash
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --nohabitat --notaxonomy --noquality
+gmsc-mapper --aa-genes examples/example.faa -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
 ```
 
-## Usage
+- Input is nucleotide gene sequences.
 
-### Example Usage
-The GMSC database is large,and taks some time to process all the things. If you want to know if GMSC-Mapper has been installed successfully and work well, please try the example usage with example target database as below.
+```bash
+gmsc-mapper --nt-genes examples/example.fna -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
+```
 
-#### Create GMSC database index
-`-o`: Path to database output directory.(default: `GMSC-mapper/examples`)
+- Check another alignment tool: MMseqs2
 
-`-m`: Alignment tool(Diamond/MMseqs2).
-```
-cd gmsc_mapper
-gmsc-mapper createdb -i ../examples/target.faa -o ../examples -m diamond
-```
-or
-```
-cd gmsc_mapper
-gmsc-mapper createdb -i ../examples/target.faa -o ../examples -m mmseqs
+```bash
+gmsc-mapper -i examples/example.fa -o examples_output/ --db examples/targetdb --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt --tool mmseqs
 ```
 
-### Real data Usage
-#### Create GMSC database index
+## Usage
+Please use `GMSC-mapper/gmsc_mapper` as your working directory.
+
+### Create GMSC database index
 `-o`: Path to database output directory.(default: `GMSC-mapper/db`)
 
-`-m`: Alignment tool(Diamond/MMseqs2).
+`-m`: Alignment tool (Diamond / MMseqs2).
+
 ```
 cd gmsc_mapper
 gmsc-mapper createdb -i ../db/90AA_GMSC.faa.gz -m diamond
@@ -95,11 +96,8 @@ cd gmsc_mapper
 gmsc-mapper createdb -i ../db/90AA_GMSC.faa.gz -m mmseqs
 ```
 
-#### Default
-
-Please make `GMSC-mapper/gmsc_mapper` as your work directory.
-
-GMSC database/habitat/taxonomy/quality file path and output directory path can be assigned on your own.Default is `GMSC-mapper/db` and `GMSC-mapper/output`.
+### Default
+The GMSC database, habitat, taxonomy, and quality file paths and the output directory path can be set by the user. The defaults are `GMSC-mapper/db` and `GMSC-mapper/output`.
 
 1. Input is genome contig sequences.
 
@@ -119,74 +117,150 @@ gmsc-mapper --aa-genes ../examples/example.faa
 gmsc-mapper --nt-genes ../examples/example.fna
 ```
 
-#### Alignment tool: Diamond/MMseqs2 is optional
-If you want to change alignment tool(Diamond/MMseqs2), you can use `--tool`.
+### Alignment tool: Diamond / MMseqs2 is optional
+If you want to change the alignment tool (Diamond / MMseqs2), you can use `--tool`.
+
 ```bash
 gmsc-mapper -i ../examples/example.fa --tool mmseqs
 ```
 
-#### Habitat/taxonomy/quality annotation is optional
-If you don't want to annotate habitat/taxonomy/quality you can use `--nohabitat`/`--notaxonomy`/`--noquality`.
+### Habitat / taxonomy / quality annotation is optional
+If you don't want to annotate habitat / taxonomy / quality, you can use `--nohabitat` / `--notaxonomy` / `--noquality`.
+
 ```bash
 gmsc-mapper -i ../examples/example.fa --nohabitat --notaxonomy --noquality
 ```
 
-## Example Output
+## Output files
 The output folder will contain
-- Outputs of smORF prediction.
-- Complete mapping result table, listing all the hits in GMSC, per smORF.
-- Habitat annotation of smORFs.(optional)
-- Taxonomy annotation of smORFs.(optional)
-- Quality annotation of smORFs.(optional)
 
-## Parameters
-* `-i/--input`: Path to the input genome contig sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+- Outputs of smORFs prediction (predicted.filterd.smorf.faa)
+
+  A FASTA file with the sequences of the predicted smORFs. It is generated when the input is contig sequences.
+
+- Complete alignment result table (diamond.out.smorfs.tsv / mmseqs.out.smorfs.tsv)
+
+  A file listing all hits of the query sequences in GMSC, as reported by Diamond or MMseqs2.
+
+  The file contains the following columns:
+
+  `qseqid`: Query seq id
+
+  `sseqid`: Target seq id (in GMSC)
+
+  `full_qseq`: Query sequences
 
-* `--aa-genes`: Path to the input amino acid sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+  `full_sseq`: Target sequences (in GMSC)
 
-* `--nt-genes`: Path to the input nucleotide gene sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+  `qlen`: Query sequences length
 
-* `--nofilter`: Use this if no need to filter <100aa input sequences.
+  `slen`: Target sequences length
 
-* `-o/--output`: Output directory (will be created if non-existent).
+  `length`: Alignment length
 
-* `--tool`: Sequence alignment tool(Diamond/MMseqs).
+  `qstart`: Start of alignment in query
 
-* `--db`: Path to the GMSC database file.
+  `qend`: End of alignment in query
 
-* `--id`: Minimum identity to report an alignment(range 0.0-1.0).
+  `sstart`: Start of alignment in target
 
-* `--cov`: Minimum coverage to report an alignment(range 0.0-1.0).
+  `send`: End of alignment in target
 
-* `-e/--evalue`: Maximum e-value to report alignments(default=0.00001).
+  `bitscore`: Bit score
 
-* `--outfmt`: Output format of alignment result.
+  `pident`: Percentage of identical matches
 
-(Diamond default is "6,qseqid,sseqid,full_qseq,full_sseq,qlen,slen,pident,length,evalue,qcovhsp,scovhsp".
+  `evalue`: Expect value
 
-MMseqs default is "query,target,qseq,tseq,qlen,tlen,fident,alnlen,evalue,qcov,tcov".
+  `qcovhsp`: Query Coverage
 
-The first two column in result format of Diamond/MMseqs must be "qseqid"/"query" and "sseqid"/"target".)
+  `scovhsp`: Target Coverage
 
-* `--habitat`: Path to the habitat file.
+- Total smORFs homologous to GMSC (mapped.smorfs.faa)
 
-* `--nohabitat`: Use this if no need to annotate habitat.
+  A FASTA file with the sequences of query/predicted smORFs homologous to GMSC.
 
-* `--taxonomy`: Path to the taxonomy file.
+- Habitat annotation of smORFs (optional) (habitat.out.smorfs.tsv) 
 
-* `--notaxonomy`: Use this if no need to annotate taxonomy.
+  A file listing the habitat annotation for each smORF homologous to GMSC.
 
-* `--quality`: Path to the quality file.
+  There are two columns in the file:
 
-* `--noquality`: Use this if no need to annotate quality.
+  `qseqid`: Query seq id
 
-* `-t/--threads`: Number of CPU threads(default=3).
+  `habitat`: Habitat, ',' separated if the sequence is from multiple habitats
+
+- Taxonomy annotation of smORFs (optional) (taxonomy.out.smorfs.tsv)
+
+  A file listing the taxonomy annotation for each smORF homologous to GMSC.
+
+  There are two columns in the file:
+
+  `qseqid`: Query seq id
+
+  `taxonomy`: Taxonomy, with the taxonomic ranks separated by ';'
+
+- Quality annotation of smORFs (optional) (quality.out.smorfs.tsv)
+
+  A file listing the quality annotation for each smORF homologous to GMSC.
+
+  `qseqid`: Query seq id
+
+  `quality`: Quality label
+
+- Summary (summary.txt)
+
+  A file providing a human-readable summary of the results.
+
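The output tables described above can be read with a few lines of Python. The following is a minimal sketch, not part of GMSC-mapper itself. It assumes that the alignment table is tab-separated, has no header row, and has its columns in the order listed above; none of this is confirmed by the README. The helper names (`read_alignment_table`, `split_habitats`, `split_taxonomy`) and all example values are made up for illustration.

```python
import csv
import io

# Column names as listed in the README; their order in the file is an assumption.
ALIGNMENT_COLUMNS = [
    "qseqid", "sseqid", "full_qseq", "full_sseq", "qlen", "slen",
    "length", "qstart", "qend", "sstart", "send", "bitscore",
    "pident", "evalue", "qcovhsp", "scovhsp",
]


def read_alignment_table(handle):
    """Yield one dict per hit from a tab-separated table without a header row."""
    for row in csv.reader(handle, delimiter="\t"):
        if row:  # skip empty lines
            yield dict(zip(ALIGNMENT_COLUMNS, row))


def split_habitats(value):
    """Split the ','-separated habitat column into a list of habitats."""
    return [item.strip() for item in value.split(",") if item.strip()]


def split_taxonomy(value):
    """Split the ';'-separated taxonomy column into a list of ranks."""
    return [item.strip() for item in value.split(";") if item.strip()]


# Made-up example row standing in for a line of diamond.out.smorfs.tsv.
example_row = [
    "query_1", "target_1", "MKKL", "MKKL", "4", "4",
    "4", "1", "4", "1", "4", "12.3",
    "100.0", "1e-05", "100.0", "100.0",
]
hits = list(read_alignment_table(io.StringIO("\t".join(example_row) + "\n")))
```

In real use, the `io.StringIO` object would be replaced by an opened file such as `open("diamond.out.smorfs.tsv", newline="")`. All values are kept as strings; converting fields such as `pident` or `evalue` to numbers is left to the caller.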
+## Parameters
+* `-i/--input`: Path to the input genome contig sequence FASTA file (possibly .gz compressed).
+
+* `--aa-genes`: Path to the input amino acid sequence FASTA file (possibly .gz compressed).
+
+* `--nt-genes`: Path to the input nucleotide gene sequence FASTA file (possibly .gz compressed).
+
+* `-o/--output`: Output directory (will be created if non-existent). (default: ../output)
+
+* `--tool`: Sequence alignment tool (Diamond / MMseqs). (default: diamond)
+
+* `-s/--sensitivity`: Sensitivity. (default: `--more-sensitive` for Diamond, `5.7` for MMseqs2)
+
+* `--id`: Minimum identity to report an alignment (range 0.0-1.0). (default: 0.0)
+
+* `--cov`: Minimum coverage to report an alignment (range 0.0-1.0). (default: 0.9)
+
+* `-e/--evalue`: Maximum e-value to report alignments. (default: 1e-05)
+
+* `-t/--threads`: Number of CPU threads. (default: 1)
+
+* `--filter`: Use this to filter <100 aa or <303 nt input sequences. (default: False)
+
+* `--nohabitat`: Use this if no need to annotate habitat. (default: False)
+
+* `--notaxonomy`: Use this if no need to annotate taxonomy. (default: False)
+
+* `--noquality`: Use this if no need to annotate quality. (default: False)
+
+* `--quiet`: Disable alignment console output. (default: False)
+
+* `--db`: Path to the GMSC database file. (default: ../db/targetdb.dmnd)
+
+* `--habitat`: Path to the habitat file. (default: ../db/ref_habitat.tsv.xz)
+
+* `--taxonomy`: Path to the taxonomy file. (default: ../db/ref_taxonomy.tsv.xz)
+
+* `--quality`: Path to the quality file. (default: ../db/ref_quality.tsv.xz)
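When GMSC-mapper is driven from a script, it can help to check the documented ranges before starting a run. The sketch below assembles a command line from the parameters listed above and rejects `--id` / `--cov` values outside 0.0-1.0. The function name `build_gmsc_mapper_command` is hypothetical and the validation is an illustration only; it does not reflect how GMSC-mapper checks its arguments internally.

```python
import shlex


def build_gmsc_mapper_command(aa_genes, output_dir, tool="diamond",
                              min_identity=0.0, min_coverage=0.9,
                              max_evalue=1e-05, threads=1):
    """Return an argument list for gmsc-mapper using only flags documented above."""
    # The README gives the range 0.0-1.0 for both --id and --cov.
    for flag, value in (("--id", min_identity), ("--cov", min_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{flag} must be in the range 0.0-1.0, got {value}")
    # Tool names as they appear in the README examples.
    if tool not in ("diamond", "mmseqs"):
        raise ValueError(f"--tool must be 'diamond' or 'mmseqs', got {tool!r}")
    return [
        "gmsc-mapper",
        "--aa-genes", aa_genes,
        "-o", output_dir,
        "--tool", tool,
        "--id", str(min_identity),
        "--cov", str(min_coverage),
        "-e", str(max_evalue),
        "-t", str(threads),
    ]


command = build_gmsc_mapper_command(
    "examples/example.faa", "examples_output/", tool="mmseqs", min_coverage=0.8
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the database and annotation files are in their default locations or the corresponding `--db`, `--habitat`, `--taxonomy`, and `--quality` options are added.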
 
 ### Subcommands and Parameters 
 Subcommands: `gmsc-mapper createdb`
 
-* `-i`: Path to the GMSC 90AA FASTA file.
+* `-i`: Path to the GMSC FASTA file.
+
+* `-o/--output`: Path to database output directory. (default: ../db)
+
+* `-m/--mode`: Alignment tool (Diamond / MMseqs2).
 
-* `-o/--output`: Path to database output directory.
+* `--quiet`: Disable alignment console output. (default: False)
 
-* `-m/--mode`: Alignment tool(Diamond/MMseqs2)
+## Sensitivity choices considering time and memory usage
+To be done