ENH update general_scripts 00 README

cocodyq · cocodyq · commit c470a88b3aa5 · 2024-03-19T00:19:42.000+08:00
diff --git a/General_Scripts/00_Remove_redundancy_and_cluster/README.md b/General_Scripts/00_Remove_redundancy_and_cluster/README.md
@@ -1 +1,10 @@
-
+## 00_Remove_redundancy_and_cluster
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_deduplicate_sort_merge.py | Remove redundancy of the raw data (predicted smORFs from metagenomes and genomes) | GMSC10.metag_Prog_smorfs.faa.gz | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz |
+| 02_extract.py | Extract non-singletons and singletons | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz | metag_ProG_nonsingleton.faa.gz metag_ProG_singleton.faa.gz |
+| 03_linclust.sh | Cluster non-singletons at 90% amino acid identity and 90% coverage | metag_ProG_nonsingleton.faa.gz | metag_ProG_nonsingleton_0.9_clu.tsv metag_ProG_nonsingleton_0.9_clu_rep.faa 0.9clu_singleton_name |
+| 04_1_sig_select1000.py | Randomly select 1,000 clusters with only 1 sequence for cluster significance checking | 0.9clu_singleton_name metag_ProG_nonsingleton_0.9_clu_rep.faa | 0.9clu_singleton.faa 0.9clu_nonsingleton.faa selected_singleton.faa |
+| 04_1_sig_select100AA.py | Randomly select 1,000 sequences for mapping back to the representative sequences of the clusters (>1 member) they come from | all_0.9_0.5_family_sort.tsv.xz 100AA_GMSC_sort.faa.xz 90AA_GMSC_sort.faa.gz | selected_cluster.tsv selected_100AA.faa selected_90AA.faa |
+| 05_align_swipe.sh | Run Swipe to align the sequences selected above | 0.9clu_nonsingleton.faa selected_singleton.faa selected_90AA.faa selected_100AA.faa | result_singleton.tsv result_100AA.tsv |
+| 06_split_singletons.py 07_diamond.sh 08_identify_clusters.py 09_join_rescue_result.py| Align all the singletons of raw data against cluster representatives at 90% amino acid identity and 90% coverage | metag_ProG_singleton.faa.gz metag_ProG_nonsingleton_0.9_clu_rep.faa | singleton_0.9.tsv |
diff --git a/General_Scripts/README.md b/General_Scripts/README.md
@@ -1 +1,23 @@
-# smorf-catalog-project
+# General_Scripts
+This folder contains scripts to generate the GMSC resource from the raw data.
+## 00_Remove_redundancy_and_cluster
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_deduplicate_sort_merge.py | Remove redundancy of the raw data (predicted smORFs from metagenomes and genomes) | GMSC10.metag_Prog_smorfs.faa.gz | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz |
+| 02_extract.py | Extract non-singletons and singletons | metag_ProG_dedup.faa.gz metag_ProG.raw_number.tsv.gz | metag_ProG_nonsingleton.faa.gz metag_ProG_singleton.faa.gz |
+| 03_linclust.sh | Cluster non-singletons at 90% amino acid identity and 90% coverage | metag_ProG_nonsingleton.faa.gz | metag_ProG_nonsingleton_0.9_clu.tsv metag_ProG_nonsingleton_0.9_clu_rep.faa 0.9clu_singleton_name |
+| 04_1_sig_select1000.py | Randomly select 1,000 clusters with only 1 sequence for cluster significance checking | 0.9clu_singleton_name metag_ProG_nonsingleton_0.9_clu_rep.faa | 0.9clu_singleton.faa 0.9clu_nonsingleton.faa selected_singleton.faa |
+| 04_1_sig_select100AA.py | Randomly select 1,000 sequences for mapping back to the representative sequences of the clusters (>1 member) they come from | all_0.9_0.5_family_sort.tsv.xz 100AA_GMSC_sort.faa.xz 90AA_GMSC_sort.faa.gz | selected_cluster.tsv selected_100AA.faa selected_90AA.faa |
+| 05_align_swipe.sh | Run Swipe to align the sequences selected above | 0.9clu_nonsingleton.faa selected_singleton.faa selected_90AA.faa selected_100AA.faa | result_singleton.tsv result_100AA.tsv |
+| 06_split_singletons.py 07_diamond.sh 08_identify_clusters.py 09_join_rescue_result.py| Align all the singletons of raw data against cluster representatives at 90% amino acid identity and 90% coverage | metag_ProG_singleton.faa.gz metag_ProG_nonsingleton_0.9_clu_rep.faa | singleton_0.9.tsv |
+
+## 01_Taxonomy_mapping
+## 02_Habitat_mapping
+## 03_Quality_control
+## 04_Frozen
+## 05_Rarefaction
+## 06_Compare_with_other_datasets
+## 07_GMSC_mapper_benchmark
+## 08_Conserved_domain_annotation
+## 09_Density
+## 10_Transmembrane_secreted
diff --git a/README.md b/README.md
@@ -20,6 +20,7 @@ The softwares are required for the scripts.
 | NGLess (v.1.3.0) | https://github.com/ngless-toolkit/ngless |
 | Prodigal (v 2.6.3) | https://github.com/hyattpd/Prodigal |
 | Macrel (v.0.5) | https://github.com/BigDataBiology/macrel |
+| JUG (v.2.1.1) | https://github.com/luispedro/jug |
 | MMseqs2 | https://github.com/soedinglab/MMseqs2 |
 | Swipe (v.2.1.1) | https://github.com/torognes/swipe |
 | DIAMOND (v.2.0.4) | https://github.com/bbuchfink/diamond |
@@ -35,7 +36,7 @@ The softwares are required for the scripts.
 
 ### Database
 
-Theese databases are used in the construction and analysis of the catalogue.
+These databases are used in the construction and analysis of the catalogue.
 
 | **Database** | **Availability** |
 | :---: | :---: |