Commit 30613fb

Added Joseph's code
1 parent 9bd9468 commit 30613fb

392 files changed

+9674010
-0
lines changed

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# Bacterial Genome Assembly Pipeline

A Snakemake workflow for bacterial genome assembly from paired-end Illumina reads.

## Pipeline Overview

This pipeline performs the following steps:
1. **Adapter trimming** (cutadapt)
2. **Quality filtering** (sickle)
3. **Genome assembly** (Unicycler with integrated SPAdes)
4. **Gene annotation** (Bakta)
5. **Taxonomic classification** (GTDB-Tk)
6. **Assembly statistics** (seqkit)
7. **Quality assessment** (CheckM2)

## Requirements

- Snakemake (>= 7.0)
- Conda/Mamba
- SLURM cluster (optional, for HPC execution)

## Setup

1. Clone this repository
2. Install required databases:
   - Bakta database: https://github.com/oschwengers/bakta#database
   - GTDB-Tk database: https://ecogenomics.github.io/GTDBTk/installing/index.html
   - CheckM2 database: https://github.com/chklovski/CheckM2

3. Update database paths in `config/config.yaml`:
   ```yaml
   bakta:
     db: "/path/to/bakta/db"
   gtdbtk:
     gtdb_data_path: "/path/to/gtdbtk/data"
   checkm2:
     database_path: "/path/to/checkm2/database/uniref100.KO.1.dmnd"
   ```

4. Create your sample sheet in `config/samples.csv`:
   ```csv
   isolate_id,fastq_1,fastq_2
   sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
   sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
   ```
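
Before launching a run, it can help to sanity-check the sample sheet. The following is a minimal Python sketch, not part of the pipeline itself: the function name `parse_sample_sheet` and the duplicate and empty-field checks are assumptions for illustration. Only the three column names come from the example above.

```python
import csv
import io

# Column names taken from the sample sheet example above.
REQUIRED_COLUMNS = ("isolate_id", "fastq_1", "fastq_2")


def parse_sample_sheet(text):
    """Parse sample-sheet CSV text into {isolate_id: (fastq_1, fastq_2)}.

    Hypothetical helper: raises ValueError on missing columns,
    empty fields, or duplicate isolate IDs.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")

    samples = {}
    for row in reader:
        # Short rows yield None for absent fields, so normalise to "".
        values = [(row.get(col) or "").strip() for col in REQUIRED_COLUMNS]
        if not all(values):
            raise ValueError(f"empty field in row: {row}")
        isolate, fastq_1, fastq_2 = values
        if isolate in samples:
            raise ValueError(f"duplicate isolate_id: {isolate}")
        samples[isolate] = (fastq_1, fastq_2)
    return samples


example = """isolate_id,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
"""
samples = parse_sample_sheet(example)
print(sorted(samples))
```

To check a real sheet, read the file contents first, for example `parse_sample_sheet(open("config/samples.csv").read())`.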

## Usage

### Local execution
```bash
snakemake --use-conda --cores 8
```

### SLURM cluster execution
```bash
snakemake --use-conda --profile slurm
```

Configure the `slurm` profile according to your cluster's specifications.

### Dry run
```bash
snakemake -n
```

### Generate workflow diagram
```bash
snakemake --dag | dot -Tpng > workflow.png
```

## Output

Results are organized in the `results/` directory:
- `cutadapt/`: Adapter-trimmed reads
- `sickle/`: Quality-filtered reads
- `unicycler/`: Genome assemblies
- `bakta/`: Gene annotations
- `gtdbtk/`: Taxonomic classifications
- `seqkit/`: Assembly statistics
- `checkm2/`: Quality assessment reports
- `summary/`: Combined summary tables
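
As a quick post-run check, the layout above can be compared against what is actually on disk. This is a rough Python sketch under the assumption that each step writes into its own subdirectory of `results/` exactly as listed. The helper name `missing_result_dirs` is hypothetical, and the demo uses a temporary directory rather than real pipeline output.

```python
import tempfile
from pathlib import Path

# Subdirectory names copied from the list above.
RESULT_SUBDIRS = (
    "cutadapt", "sickle", "unicycler", "bakta",
    "gtdbtk", "seqkit", "checkm2", "summary",
)


def missing_result_dirs(results_root="results"):
    """Return the expected subdirectories that do not exist under results_root."""
    root = Path(results_root)
    return [name for name in RESULT_SUBDIRS if not (root / name).is_dir()]


# Demo: simulate a run that stopped after assembly.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("cutadapt", "sickle", "unicycler"):
        (Path(tmp) / name).mkdir()
    still_missing = missing_result_dirs(tmp)

print(still_missing)
```

An empty list means every expected directory is present. It says nothing about whether the files inside are complete.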

## Configuration

Edit `config/config.yaml` to adjust parameters for each tool.
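
One easy mistake is leaving a database path unset or at its placeholder value. The sketch below shows one way to catch that before a run. It is an illustration only: the helper `unset_database_paths` is hypothetical, the `/data/bakta/db` path is made up, and the config is passed in as a plain dictionary (loading the YAML file, for example with PyYAML, is left out).

```python
# Section -> key pairs taken from the database example in the Setup section.
REQUIRED_DB_KEYS = {
    "bakta": "db",
    "gtdbtk": "gtdb_data_path",
    "checkm2": "database_path",
}


def unset_database_paths(config):
    """Return 'section.key' names whose value is missing, empty, or a placeholder."""
    problems = []
    for section, key in REQUIRED_DB_KEYS.items():
        value = (config.get(section) or {}).get(key, "")
        # "/path/to/..." is the placeholder prefix used in the Setup example.
        if not value or str(value).startswith("/path/to/"):
            problems.append(f"{section}.{key}")
    return problems


# Hypothetical config: one real-looking path, one placeholder, one missing key.
config = {
    "bakta": {"db": "/data/bakta/db"},
    "gtdbtk": {"gtdb_data_path": "/path/to/gtdbtk/data"},
    "checkm2": {},
}
problems = unset_database_paths(config)
print(problems)
```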

## Resource Requirements

The pipeline is configured with SLURM resource allocations:
- Unicycler (with SPAdes): 64GB RAM, 24 CPUs
- GTDB-Tk: 128GB RAM, 32 CPUs
- CheckM2: 32GB RAM, 16 CPUs
- Bakta: 16GB RAM, 8 CPUs
- Other tools: 2-4GB RAM, 1-4 CPUs

Adjust these in the rule definitions as needed for your system.
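
To see which allocations need adjusting for a given machine, the figures above can be compared against a node's capacity. This is a small Python sketch, not code from the pipeline: the dictionary keys and the function `rules_exceeding_node` are illustrative names, and "Other tools" is represented by its upper bounds (4GB, 4 CPUs).

```python
# (memory in GB, CPUs) copied from the list above.
# "other" uses the upper end of the stated 2-4GB / 1-4 CPU ranges.
DEFAULT_RESOURCES = {
    "unicycler": (64, 24),
    "gtdbtk": (128, 32),
    "checkm2": (32, 16),
    "bakta": (16, 8),
    "other": (4, 4),
}


def rules_exceeding_node(node_mem_gb, node_cpus):
    """Return tools whose default request does not fit on a single node."""
    return [
        name
        for name, (mem_gb, cpus) in DEFAULT_RESOURCES.items()
        if mem_gb > node_mem_gb or cpus > node_cpus
    ]


# Example: a node with 64GB RAM and 16 CPUs.
too_big = rules_exceeding_node(64, 16)
print(too_big)
```

Any tool returned here would need its memory or CPU request lowered in the corresponding rule before it can be scheduled on that node.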
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
$schema: "https://json-schema.org/draft/2020-12/schema"
type: object
properties:
  samples:
    type: string
    description: "Path to samples CSV file"

  cutadapt:
    type: object
    properties:
      adapter_r1:
        type: string
        description: "3' adapter sequence to trim from R1 reads"
      adapter_r2:
        type: string
        description: "3' adapter sequence to trim from R2 reads"
      min_length:
        type: integer
        minimum: 1
        description: "Minimum read length after trimming"
      quality_cutoff:
        type: integer
        minimum: 0
        description: "Quality score cutoff for trimming"
    required: ["adapter_r1", "adapter_r2", "min_length", "quality_cutoff"]

  sickle:
    type: object
    properties:
      quality_type:
        type: string
        enum: ["sanger", "illumina", "solexa"]
        description: "Quality score encoding type"
      quality_threshold:
        type: integer
        minimum: 0
        description: "Minimum quality score threshold"
      length_threshold:
        type: integer
        minimum: 1
        description: "Minimum read length after quality trimming"
    required: ["quality_type", "quality_threshold", "length_threshold"]

  unicycler:
    type: object
    properties:
      mode:
        type: string
        enum: ["conservative", "normal", "bold"]
        description: "Unicycler assembly mode"
      min_fasta_length:
        type: integer
        minimum: 1
        description: "Minimum contig length in output"
      kmers:
        type: string
        description: "Comma-separated list of k-mer sizes for SPAdes"
      keep:
        type: integer
        minimum: 0
        maximum: 3
        description: "Level of file retention (0-3)"
      spades_options:
        type: string
        description: "Additional options to pass to SPAdes"
    required: ["mode", "min_fasta_length", "kmers", "keep", "spades_options"]

  bakta:
    type: object
    properties:
      db:
        type: string
        description: "Path to Bakta database"
      genus:
        type: string
        description: "Genus name for annotation"
      species:
        type: string
        description: "Species name for annotation"
      min_contig_length:
        type: integer
        minimum: 1
        description: "Minimum contig length to annotate"
    required: ["db", "genus", "species", "min_contig_length"]

  gtdbtk:
    type: object
    properties:
      gtdb_data_path:
        type: string
        description: "Path to GTDB-Tk database"
    required: ["gtdb_data_path"]

  checkm2:
    type: object
    properties:
      database_path:
        type: string
        description: "Path to CheckM2 database file"
      lowmem:
        type: string
        description: "Low memory mode flag"
    required: ["database_path", "lowmem"]

required: ["samples", "cutadapt", "sickle", "unicycler", "bakta", "gtdbtk", "checkm2"]
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
samples: "config/samples.csv"

fastp:
  # Adapter sequences (auto-detect if not specified)
  adapter_r1: ""
  adapter_r2: ""
  # Quality filtering
  qualified_quality_phred: 20
  # Minimum read length after trimming
  length_required: 50
  # PolyG tail trimming for NextSeq/NovaSeq
  trim_poly_g: true
  # PolyX tail trimming
  trim_poly_x: true
  # Complexity filtering
  complexity_threshold: 30

unicycler:
  mode: "normal"
  min_fasta_length: 200
  kmers: "21,33,55,77,99,127"
  keep: 1
  spades_options: "--careful"

bakta:
  db: "/n/groups/kwon/joseph/dbs/bakta_db_v5/"
  genus: "Unknown"
  species: "sp."
  min_contig_length: 1000

gtdbtk:
  database_path: "/n/groups/kwon/joseph/dbs/gtdb"

checkm2:
  database_path: "/n/groups/kwon/joseph/dbs/checkm2/CheckM2_database/uniref100.KO.1.dmnd"
  lowmem: "--lowmem"
