Commit e594d7d

ENH add readme.md for each folder
1 parent c470a88 commit e594d7d

15 files changed

+307
-294
lines changed

Lines changed: 12 additions & 1 deletion
@@ -1 +1,12 @@
-Codes for mapping taxonomy
+## 01_Taxonomy_mapping
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_map_prog_taxa_dedup.py | Remove redundant smORFs from Progenomes and map specI taxonomy to genomes | genome_prog.tsv specI_genome_taxa.txt GMSC10.ProG_smorfs.faa.gz | prog_specI_genome_taxa.tsv prog_dedup_sort.faa.gz prog_redundant.tsv.gz |
+| 02_run_linclust.sh 03_lca_change_format.py | Cluster smORFs from Progenomes at 90% amino acid identity and 90% coverage and map taxonomy by LCA | prog_dedup.faa.gz prog_redundant.tsv.gz prog_specI_genome_taxa.tsv.gz | prog_taxonomy_change.tsv.gz |
+| 04_map_metag_taxid_full.py | Map the taxid of smORFs from metaG based on contigs and get the full name of each taxid based on GTDB | mmseqs2.lca_taxonomy.full.tsv.xz GMSC10.metag_smorfs.rename.txt.xz gtdb_taxonomy.tsv | metag_taxid.tsv.xz taxid_fullname_gtdb.tsv |
+| 05_dedup_cluster.py | Get clusters at 100% identity of raw data | GMSC10.metag_ProG_smorfs.faa.gz | dedup_cluster.tsv.gz |
+| 06_map_taxonomy.py | Map taxonomy for all the smORFs from metaG | taxid_fullname_gtdb.tsv metag_taxid.tsv.xz dedup_cluster.tsv.gz prog_taxonomy_change.tsv.gz | metag_cluster_taxonomy.tsv.xz |
+| 07_deep_lca_100.py | Map taxonomy for 100% identity smORFs with LCA | metag_cluster_taxonomy.tsv.xz all_0.5_0.9.tsv.gz | 100AA_taxonomy.tsv.xz |
+| 08_map_cluster_tax.py | Map taxonomy for 90% identity smORFs with LCA | 100AA_taxonomy.tsv.xz all_cluster_0.9.tsv.xz | 90AA_tax.tsv.xz |
+| 09_fix_prog_tax.py | Make Progenomes2 taxonomy consistent with GTDB taxonomy | 100AA_taxonomy.tsv.xz 90AA_tax.tsv.xz | GMSC10.100AA.taxonomy.tsv GMSC10.90AA.taxonomy.tsv |
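
Several rows in this table assign taxonomy to a cluster "by LCA" (lowest common ancestor). The scripts themselves are not shown in this commit, so the following is only a minimal sketch of the idea, assuming each member carries a rank-ordered, semicolon-separated lineage string in GTDB style. The function name `lowest_common_ancestor` and the example lineages are hypothetical.

```python
from typing import Iterable, List


def lowest_common_ancestor(lineages: Iterable[str], sep: str = ";") -> str:
    """Return the longest rank prefix shared by all lineages.

    Assumes every lineage lists ranks from domain downwards, separated by `sep`.
    Empty lineages are ignored; an empty string means no shared rank.
    """
    split = [lineage.split(sep) for lineage in lineages if lineage]
    if not split:
        return ""
    common: List[str] = []
    # zip stops at the shortest lineage, so the LCA is never deeper than any member.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return sep.join(common)


# Hypothetical members of one cluster.
members = [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales",
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales",
    "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria",
]
lca = lowest_common_ancestor(members)
print(lca)
```

A prefix comparison like this only works when all lineages use the same rank order and naming scheme, which may be one reason the table includes a separate step to reconcile Progenomes2 and GTDB taxonomy.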
Lines changed: 8 additions & 1 deletion
@@ -1 +1,8 @@
-Codes for mapping habitat
+## 02_Habitat_mapping
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_map_habitat.py | Map habitat for all the smORFs from metaG | metadata.tsv GMSC10.metag_smorfs.rename.txt.xz dedup_cluster.tsv.gz | metag_cluster_habitat.tsv.xz |
+| 02_multi_habitat.py | Combine multiple habitats for each smORF from the same cluster | metag_cluster_habitat.tsv.xz 100AA_rename.tsv.xz habitat_general.txt | 100AA_multi_general_habitat.tsv.xz |
+| 03_map_cluster_habitat.py | Map habitat to 90% identity smORF clusters | all_0.5_0.9.tsv.gz 100AA_multi_general_habitat.tsv.xz | cluster_multi_habitat_90.tsv.xz |
+| 04_multi_habitat_90_50.py | Combine multiple habitats for each smORF from the same 90AA cluster | cluster_multi_habitat_90.tsv.xz habitat_general.txt | 90AA_multi_general_habitat.tsv.xz |
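
The "combine multiple habitats" steps can be pictured as collapsing all habitat labels seen within a cluster into one deduplicated value. This is a rough sketch only: the input layout (pairs of cluster ID and habitat), the comma separator, and the name `combine_habitats` are assumptions, not taken from the scripts.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def combine_habitats(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse (cluster_id, habitat) pairs into one sorted, comma-joined label per cluster."""
    habitats: Dict[str, Set[str]] = defaultdict(set)
    for cluster_id, habitat in pairs:
        habitats[cluster_id].add(habitat)
    # Sorting makes the combined label independent of input order.
    return {cluster: ",".join(sorted(labels)) for cluster, labels in habitats.items()}


# Hypothetical cluster IDs and habitat labels.
rows = [
    ("cluster_1", "soil"),
    ("cluster_1", "human gut"),
    ("cluster_1", "soil"),
    ("cluster_2", "marine"),
]
combined = combine_habitats(rows)
print(combined)
```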
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+### Antifam
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_run_antifam.sh | Map 100AA smORFs to AntiFam | AntiFam.hmm 100AA_GMSC_sort.faa | antifam_result.tsv |
+| 02_assign_all_level.py | Assign AntiFam results to 90AA smORFs | antifam_result.tsv all_0.9_0.5_family.tsv.xz | antifam_90AA.tsv.gz |
+
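
The recurring `*_assign_all_level.py` steps carry a per-sequence result over to the 90AA level using the family table. How the scripts aggregate is not stated here, so the sketch below shows just one plausible rule: a 90AA family is flagged when any of its 100AA members is flagged. The two-column layout (100AA ID, 90AA ID), the function name, and the IDs are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def assign_to_families(
    flagged_100aa: Set[str], family_rows: Iterable[Tuple[str, str]]
) -> Dict[str, bool]:
    """Flag a 90AA family if at least one of its 100AA members is flagged.

    `family_rows` yields (id_100aa, id_90aa) pairs; this "any member" rule is an
    assumption for illustration, not necessarily what the repository scripts do.
    """
    result: Dict[str, bool] = {}
    for id_100aa, id_90aa in family_rows:
        result[id_90aa] = result.get(id_90aa, False) or (id_100aa in flagged_100aa)
    return result


# Hypothetical identifiers.
family_table = [("seq_a", "fam_1"), ("seq_b", "fam_1"), ("seq_c", "fam_2")]
family_flags = assign_to_families({"seq_b"}, family_table)
print(family_flags)
```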
+### Coordinates
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_allcoordinate.py | Detect whether there is a stop codon upstream of each smORF in the contigs | GMSC10.metag_smorfs.rename.txt.xz ./contigs/ | result.tsv.gz |
+| 02_assign_all_level.py | Assign terminal checking results to 100AA and 90AA smORFs | result.tsv.gz 100AA_rename.tsv.xz all_0.9_0.5_family.tsv.xz | 100AA_coordinate.tsv.gz 90AA_coordinate.tsv.gz |
+
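
The upstream stop codon check can be sketched as an in-frame scan from the smORF start back towards the contig edge. This is a simplified illustration that assumes a forward-strand smORF, a 0-based start coordinate, and the standard stop codons; the actual script may handle strands, genetic codes, and coordinates differently. `has_upstream_stop` is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_stop(contig: str, start: int) -> bool:
    """Return True if an in-frame stop codon lies upstream of `start`.

    `start` is the 0-based index of the first base of a forward-strand smORF.
    The scan steps back one codon at a time until it runs off the contig.
    """
    pos = start - 3
    while pos >= 0:
        if contig[pos:pos + 3].upper() in STOP_CODONS:
            return True
        pos -= 3
    return False


# Toy contigs: the smORF starts at the ATG in each case.
print(has_upstream_stop("CCTAAGGGATGAAATAA", 8))  # in-frame TAA two codons upstream
print(has_upstream_stop("GGGATGAAATAA", 3))       # contig edge reached without a stop
```

A smORF with no in-frame stop before the contig edge may be a fragment of a longer gene, which is presumably why this check is grouped with the other quality control steps.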
+### RNAcode
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_filter8_addfna_split.py | Select clusters (>= 8 members) | all_0.5_0.9.tsv metag_ProG_smorfs.fna.xz | ./split/*.fna |
+| 02_run_MSA.sh | Multiple sequence alignment of each .fna file | ./split/*.fna | *.aln |
+| 03_run_RNAcode.sh | Run RNAcode | *.aln | *.tsv |
+| 04_filter_RNAcode.py | Filter RNAcode result | *.tsv | smORF_0.9_RNAcode.tsv |
+| 05_assign_all_level.py | Assign RNAcode results to 100AA and 90AA smORFs | smORF_0.9_RNAcode.tsv all_0.5_0.9_filter.tsv 100AA_rename.tsv.xz 90AA_rename.tsv.xz all_0.9_0.5_family.tsv.xz | rnacode_true_100AA.tsv.xz rnacode_false_100AA.tsv.xz rnacode_true_90AA.tsv.xz rnacode_false_90AA.tsv.xz |
+
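
The first RNAcode step keeps only clusters with at least 8 members. A minimal sketch of that selection follows, assuming the cluster table is read as (representative, member) pairs; the column order of `all_0.5_0.9.tsv` is not documented in this commit, and `clusters_with_min_members` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def clusters_with_min_members(
    rows: Iterable[Tuple[str, str]], min_members: int = 8
) -> Dict[str, List[str]]:
    """Group (representative, member) rows and keep clusters of at least `min_members`."""
    members: Dict[str, List[str]] = defaultdict(list)
    for representative, member in rows:
        members[representative].append(member)
    return {rep: seqs for rep, seqs in members.items() if len(seqs) >= min_members}


# Hypothetical table: one cluster of 8 members and one of 3.
table = [("rep_big", f"m{i}") for i in range(8)] + [("rep_small", f"s{i}") for i in range(3)]
selected = clusters_with_min_members(table)
print(sorted(selected))
```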
+### metatranscriptomics
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_run_bwa_ngless.sh | Map metatranscriptome reads to smORFs | 90AA_GMSC.fna *.fastq.gz | *.tsv |
+| 02_merge_filter.py | Merge and filter mapping results | *.tsv | metaT_result_filter.tsv |
+| 03_assign_all_level.py | Assign results to 100AA and 90AA smORFs | metaT_result_filter.tsv all_0.9_0.5_family.tsv.xz | metaT_100AA.tsv.gz metaT_90AA.tsv.gz |
+
+### riboseq
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_run_bwa_ngless.sh | Map riboseq reads to smORFs | 90AA_GMSC.fna *.fastq.gz | *.tsv |
+| 02_merge_filter.py | Merge and filter mapping results | *.tsv | riboseq_result_filter.tsv |
+| 03_assign_all_level.py | Assign results to 100AA and 90AA smORFs | riboseq_result_filter.tsv all_0.9_0.5_family.tsv.xz | riboseq_100AA.tsv.gz riboseq_90AA.tsv.gz |
+
+### metaproteomics
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 00_split_100AA.py 01_map.py | For the metaproteome peptides from each project in PRIDE, find their exact matches against 100AA smORFs | 100AA_GMSC.faa.xz *.fasta | *.tsv |
+| 02_merge.py | Calculate and filter peptide coverage rate of each smORF | *.tsv | coverage_analysis.tsv |
+| 03_assign_all_level.py | Assign results to 90AA smORFs | coverage_analysis.tsv all_0.9_0.5_family.tsv.xz | metaP_90AA.tsv.gz |
+
+
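
The peptide coverage rate of a smORF can be read as the fraction of its residues covered by at least one exactly matching peptide. The sketch below illustrates that reading; the exact definition and the filtering threshold used by `02_merge.py` are not given in this commit, and `peptide_coverage` is a hypothetical name.

```python
from typing import Iterable


def peptide_coverage(protein: str, peptides: Iterable[str]) -> float:
    """Fraction of residues in `protein` covered by exact peptide matches."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        # Mark every occurrence, including overlapping ones.
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Toy example: two overlapping peptides cover residues 0-5 of an 8-residue smORF.
coverage = peptide_coverage("MKKLLPTA", ["MKKL", "KLLP"])
print(coverage)
```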
+### merge_quality_control
+
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_merge.py | Merge all the quality control results | 100AA_rename.tsv.xz rnacode_true_100AA.tsv.xz rnacode_false_100AA.tsv.xz antifam_result.tsv coverage_analysis.tsv.gz riboseq_100AA.tsv.gz 100AA_coordinate.tsv.gz metaT_100AA.tsv.gz | allquality_100AA.tsv.gz allpass_100AA.txt |

General_Scripts/04_Frozen/05_sort_fna.py

Lines changed: 0 additions & 63 deletions
This file was deleted.

General_Scripts/04_Frozen/06_sort_faa_family.py

Lines changed: 0 additions & 48 deletions
This file was deleted.

General_Scripts/04_Frozen/07_habitat_100_50_90.py

Lines changed: 0 additions & 60 deletions
This file was deleted.

General_Scripts/04_Frozen/08_taxonomy_100_50_90.py

Lines changed: 0 additions & 63 deletions
This file was deleted.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# 04_Frozen
+| **Code** | **Description** | **Input** | **Output** |
+| :---: | :---: | :---: | :---: |
+| 01_rename_list.py | Rename 100AA sequences | metag_ProG.raw_number.tsv.gz metag_ProG_nonsingleton.faa.gz singleton_0.5_0.9.tsv metag_ProG_singleton.faa.gz | nonsingleton_rename.tsv singleton_rename.tsv |
+| 02_100AA_faa_fna.py | Generate 100AA faa and fna file with new identifier | 100AA_rename.tsv.xz metag_ProG_dedup.faa.gz GMSC10.metag_smorfs.fna.xz GMSC.ProGenomes2.smorfs.fna.xz | 100AA_GMSC.faa.xz 100AA_metag.fna.xz 100AA_prog.fna.xz |
+| 03_90AA_faa_fna.py | Rename 90AA sequences and generate 90AA faa and fna file with new identifier | metag_ProG_nonsingleton_0.9_clu_rep.faa.gz metag_ProG_nonsingleton_0.9_clu.tsv.gz 100AA_rename.tsv.xz 100AA_GMSC.fna.xz | 90AA_rename.tsv.xz 90AA_rename_all.tsv.xz 90AA_GMSC.faa.xz 90AA_GMSC.fna.xz |
+| 04_family.py | Generate the cluster table | 90AA_rename_all.tsv.xz all_0.5_0.9_rename.tsv.gz | all_0.9_0.5_family.tsv.xz |
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+## 06_Compare_with_other_datasets
+
+| **Code** | **Description** |
+| :---: | :---: |
+| 01_download.sh 02_filter_sp_dedup.py | Download archaeal and bacterial proteins from RefSeq, filter sequences (<100aa) and remove redundancy |
+| 03_align.sh | Use Diamond to align sequences to GMSC |
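
The RefSeq filtering and redundancy removal can be sketched as a single pass over FASTA records. The row does not say whether "(<100aa)" means sequences under 100 residues are kept or discarded; the sketch assumes they are kept, since the comparison is against small proteins, and treats redundancy as exact sequence identity. Both function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short_unique(
    records: Iterable[Tuple[str, str]], max_len: int = 100
) -> Iterator[Tuple[str, str]]:
    """Keep the first record of each distinct sequence shorter than `max_len` residues."""
    seen = set()
    for header, seq in records:
        # Assumption: "<100aa" means sequences under 100 residues are retained.
        if len(seq) < max_len and seq not in seen:
            seen.add(seq)
            yield header, seq


# Toy FASTA input with one duplicate and one long sequence.
fasta = [">p1", "MKKLL", ">p2", "MKKLL", ">p3", "M" * 150, ">p4", "MAAT", "GG"]
kept = list(filter_short_unique(read_fasta(fasta)))
print(kept)
```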
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+## 07_GMSC_mapper_benchmark
+
+### 01_sensitivity
+
+| **Code** | **Description** |
+| :---: | :---: |
+| 01_sensitivity.sh | Test the recovery of smORFs of different lengths at different sensitivity settings of Diamond and MMseqs2 |
+
+### 02_time
+
+| **Code** | **Description** |
+| :---: | :---: |
+| 01_time.sh | Measure the running time of Diamond and MMseqs2 |
+
+### 03_identity
+
+| **Code** | **Description** |
+| :---: | :---: |
+| 01_select_mutation.py | Randomly select 10,000 sequences from 90AA smORFs at different lengths (20, 30, 40, 60, 80, all) and mutate them to different identity levels |
+| 02_identity.sh | Compare the recovery of smORFs among Blast, Diamond, and MMseqs2 at different lengths (20, 30, 40, 60, 80, all) and different identity levels |
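
Mutating a sequence to a target identity can be sketched as substituting a fixed number of randomly chosen positions. This is an illustration under stated assumptions: substitutions only (no indels), the 20 standard amino acids, and identity measured as the fraction of unchanged positions. `mutate_to_identity` is a hypothetical name and not the code in `01_select_mutation.py`.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_to_identity(seq: str, identity: float, rng: random.Random) -> str:
    """Substitute residues so that roughly `identity` of the positions stay unchanged."""
    n_mut = round(len(seq) * (1 - identity))
    positions = rng.sample(range(len(seq)), n_mut)
    residues = list(seq)
    for pos in positions:
        # Always pick a different residue so each chosen position really changes.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the example is reproducible
original = "MKKLLPTAAAGLLLLAAQPA"  # toy 20-residue sequence
mutated = mutate_to_identity(original, 0.9, rng)
n_diff = sum(a != b for a, b in zip(original, mutated))
print(mutated, n_diff)
```

Passing an explicit `random.Random` instance keeps the selection reproducible across benchmark runs.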
