Commit c470a88

ENH update general_scripts 00 README
1 parent c7dd206 commit c470a88

3 files changed: 35 additions, 3 deletions

Lines changed: 10 additions & 1 deletion
@@ -1 +1,10 @@
-
+## 00_Remove_redundancy_and_cluster
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_deduplicate_sort_merge.py | Remove redundancy from the raw data (predicted smORFs from metagenomes and genomes) | GMSC10.metag_Prog_smorfs.faa.gz | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz |
+| 02_extract.py | Extract non-singletons and singletons | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz | metag_ProG_nonsingleton.faa.gz metag_ProG_singleton.faa.gz |
+| 03_linclust.sh | Cluster non-singletons at 90% amino acid identity and 90% coverage | metag_ProG_nonsingleton.faa.gz | metag_ProG_nonsingleton_0.9_clu.tsv metag_ProG_nonsingleton_0.9_clu_rep.faa 0.9clu_singleton_name |
+| 04_1_sig_select1000.py | Randomly select 1,000 clusters with only 1 sequence for cluster significance checking | 0.9clu_singleton_name metag_ProG_nonsingleton_0.9_clu_rep.faa | 0.9clu_singleton.faa 0.9clu_nonsingleton.faa selected_singleton.faa |
+| 04_1_sig_select100AA.py | Randomly select 1,000 sequences for mapping back to the representative sequences of the clusters (>1 member) they are from | all_0.9_0.5_family_sort.tsv.xz 100AA_GMSC_sort.faa.xz 90AA_GMSC_sort.faa.gz | selected_cluster.tsv selected_100AA.faa selected_90AA.faa |
+| 05_align_swipe.sh | Run Swipe to align the sequences selected above | 0.9clu_nonsingleton.faa selected_singleton.faa selected_90AA.faa selected_100AA.faa | result_singleton.tsv result_100AA.tsv |
+| 06_split_singletons.py 07_diamond.sh 08_identify_clusters.py 09_join_rescue_result.py | Align all the singletons of the raw data against cluster representatives at 90% amino acid identity and 90% coverage | metag_ProG_singleton.faa.gz metag_ProG_nonsingleton_0.9_clu_rep.faa | singleton_0.9.tsv |
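
To illustrate the first two steps in the table above (removing redundancy, then separating singletons from non-singletons), here is a minimal sketch. It is not the code in `01_deduplicate_sort_merge.py` or `02_extract.py`. It assumes that redundancy means identical amino acid sequences and that a singleton is a sequence observed exactly once. The function names and the toy input are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(lines: Iterable[str]) -> Counter:
    """Count how many times each distinct sequence occurs (assumed notion of redundancy)."""
    return Counter(seq for _, seq in read_fasta(lines))


def split_by_count(counts: Counter):
    """Split distinct sequences into singletons (seen once) and non-singletons (seen more than once)."""
    singletons = sorted(s for s, n in counts.items() if n == 1)
    nonsingletons = sorted(s for s, n in counts.items() if n > 1)
    return singletons, nonsingletons


# Toy input standing in for the raw smORF predictions (hypothetical data).
example = """>a
MKKL
>b
MKKL
>c
MAAV
""".splitlines()

counts = count_sequences(example)
singletons, nonsingletons = split_by_count(counts)
```

For input at the scale of the real catalogue, an in-memory `Counter` is unlikely to be practical. The script name `01_deduplicate_sort_merge.py` suggests a sort-and-merge strategy, which this sketch does not attempt to reproduce.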

General_Scripts/README.md

Lines changed: 23 additions & 1 deletion
@@ -1 +1,23 @@
-# smorf-catalog-project
+# General_Scripts
+This folder contains scripts to generate the GMSC resource from the raw data.
+## 00_Remove_redundancy_and_cluster
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_deduplicate_sort_merge.py | Remove redundancy from the raw data (predicted smORFs from metagenomes and genomes) | GMSC10.metag_Prog_smorfs.faa.gz | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz |
+| 02_extract.py | Extract non-singletons and singletons | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz | metag_ProG_nonsingleton.faa.gz metag_ProG_singleton.faa.gz |
+| 03_linclust.sh | Cluster non-singletons at 90% amino acid identity and 90% coverage | metag_ProG_nonsingleton.faa.gz | metag_ProG_nonsingleton_0.9_clu.tsv metag_ProG_nonsingleton_0.9_clu_rep.faa 0.9clu_singleton_name |
+| 04_1_sig_select1000.py | Randomly select 1,000 clusters with only 1 sequence for cluster significance checking | 0.9clu_singleton_name metag_ProG_nonsingleton_0.9_clu_rep.faa | 0.9clu_singleton.faa 0.9clu_nonsingleton.faa selected_singleton.faa |
+| 04_1_sig_select100AA.py | Randomly select 1,000 sequences for mapping back to the representative sequences of the clusters (>1 member) they are from | all_0.9_0.5_family_sort.tsv.xz 100AA_GMSC_sort.faa.xz 90AA_GMSC_sort.faa.gz | selected_cluster.tsv selected_100AA.faa selected_90AA.faa |
+| 05_align_swipe.sh | Run Swipe to align the sequences selected above | 0.9clu_nonsingleton.faa selected_singleton.faa selected_90AA.faa selected_100AA.faa | result_singleton.tsv result_100AA.tsv |
+| 06_split_singletons.py 07_diamond.sh 08_identify_clusters.py 09_join_rescue_result.py | Align all the singletons of the raw data against cluster representatives at 90% amino acid identity and 90% coverage | metag_ProG_singleton.faa.gz metag_ProG_nonsingleton_0.9_clu_rep.faa | singleton_0.9.tsv |
+
+## 01_Taxonomy_mapping
+## 02_Habitat_mapping
+## 03_Quality_control
+## 04_Frozen
+## 05_Rarefaction
+## 06_Compare_with_other_datasets
+## 07_GMSC_mapper_benchmark
+## 08_Conserved_domain_annotation
+## 09_Density
+## 10_Transmembrane_secreted
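
The random selection step described for `04_1_sig_select1000.py` in the table above can be sketched as follows. This is an illustration, not the repository's script. It assumes the cluster table is a two-column, tab-separated file with one `representative<TAB>member` pair per line. The function names, the fixed seed, and the toy table are all hypothetical.

```python
import random
from collections import Counter
from typing import Iterable, List


def cluster_sizes(tsv_lines: Iterable[str]) -> Counter:
    """Count members per representative in a representative<TAB>member table."""
    sizes = Counter()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, _member = line.split("\t")
        sizes[rep] += 1
    return sizes


def sample_single_member_clusters(sizes: Counter, k: int, seed: int = 0) -> List[str]:
    """Randomly pick up to k clusters that contain exactly one sequence."""
    # Sort first so that the result depends only on the seed, not on input order.
    singles = sorted(rep for rep, n in sizes.items() if n == 1)
    rng = random.Random(seed)
    return rng.sample(singles, min(k, len(singles)))


# Toy cluster table (hypothetical identifiers): r1 has two members, r2 and r3 have one each.
table = ["r1\tr1", "r1\tm2", "r2\tr2", "r3\tr3"]
sizes = cluster_sizes(table)
picked = sample_single_member_clusters(sizes, k=1000)
```

Seeding a dedicated `random.Random` instance keeps the selection reproducible across runs, which matters when the selected sequences feed a downstream alignment step such as `05_align_swipe.sh`.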

README.md

Lines changed: 2 additions & 1 deletion
@@ -20,6 +20,7 @@ The softwares are required for the scripts.
 | NGLess (v.1.3.0) | https://github.com/ngless-toolkit/ngless |
 | Prodigal (v 2.6.3) | https://github.com/hyattpd/Prodigal |
 | Macrel (v.0.5) | https://github.com/BigDataBiology/macrel |
+| JUG (v.2.1.1) | https://github.com/luispedro/jug |
 | MMseqs2 | https://github.com/soedinglab/MMseqs2 |
 | Swipe (v.2.1.1) | https://github.com/torognes/swipe |
 | DIAMOND (v.2.0.4) | https://github.com/bbuchfink/diamond |
@@ -35,7 +36,7 @@ The softwares are required for the scripts.
 
 ### Database
 
-Theese databases are used in the construction and analysis of the catalogue.
+These databases are used in the construction and analysis of the catalogue.
 
 | **Database** | **Availability** |
 | :---: | :---: |
