Workflow

The following document describes the detailed workflow and results of the analyses.

Please refer to the following diagram for clearer understanding of the workflow and the order in which the analysis steps were performed.

Untitled Diagram (2)

The code included in this wiki is only the example commands. These are often given only for one out of three scaffolds to avoid making this text unreadable. The full job files with imports, file management (including all the cd commands) and slurm details can be found in the codefiles.

Raw data

This project used three kinds of input raw data: genomic short Illumina reads and genomic long PacBio reads, as well as transcriptomic short Illumina reads. All the input files were in .fq.gz format

The given input data belonged to three different scaffolds (6, 10 and 11). All analyses up to feature counting are performed for all three of these scaffolds, using either separate code files or one for all. For the last two steps of the analysis, due to lack of time, only results from scaffold 10 are used.

The reads represent nine different combinations of two plant cultivara (Musang King or Monthong) and four plant organs, corresponding to a specific experiment ID. These are presented below:

SRR6040092 Musang King leaf
SRR6040093 Musang King root
SRR6040094 Musang King aril
SRR6040095 Musang King aril
SRR6040096 Musang King stem
SRR6040097 Musang King aril
SRR6156066 Monthong aril
SRR6156067 Monthong aril
SRR6156069 Monthong aril

Short reads preprocessing

Raw illumina reads, both genomic and transcriptomic, were Quality Controlled, trimmed, and Quality Controlled again. The programs used were FastQC and Trimmomatic.

QC parameters:

6 threads

Initial QC code:

fastqc -o ~/GenomeAnalysis/DNA_prep_illumina/ -t 6 *

Trimming parameters:

Paired end reads
Illumina Tru-seq3 adapters removed
Leading and trailing bases below quality 3 removed
Reads shorter than 36 bases removed

Trimming code:

trimmomatic PE SRR6058604_scaffold_06.1P.fastq.gz SRR6058604_scaffold_06.2P.fastq.gz illumina_scaffold06.paired.forward.trimmed.fastq.gz illumina_scaffold06.unpaired.forward.trimmed.fastq.gz illumina_scaffold06.paired.reverse.trimmed.fastq.gz illumina_scaffold06.unpaired.reverse.trimmed.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 MINLEN:36

Post-trim QC:

fastqc -o ~/GenomeAnalysis/DNA_prep_illumina/fastq_trimmed -t 6 *

Transcriptome reads were preprocessed in the same way. Fastqc parameters: 10 threads, running from raw file folders for trimmed and untrimmed reads each. Reads from only one experiment (SRR6040095) needed trimming, trim run with the same parameters as above.

fastqc -o /home/juliasul/GenomeAnalysis/RNA_prep/fastQC_initial/fastqc_initial_repeat -t 10 *

trimmomatic PE SRR6040095_scaffold_06.1.fastq.gz SRR6040095_scaffold_06.2.fastq.gz RNA_sc06.paired.forward.trimmed.fastq.gz RNA_sc06.unpaired.forward.trimmed.fastq.gz RNA_sc06.paired.reverse.trimmed.fastq.gz RNA_sc06.unpaired.reverse.trimmed.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 MINLEN:36

fastqc -o /home/juliasul/GenomeAnalysis/RNA_prep/fastQC_trimmed -t 10 RNA_sc06* RNA_sc10* RNA_sc11*

The genomic reads passed QC well, while transcriptomic reads did not (both before and after trimming), failing among others in metrics of overrepresenteed sequences and per sequence GC content. The analysis was nevertheless continued with the same data, due to lack of any other data or time.

Assembly

The initial assembly was made using long genomic PacBio reads in software Canu. No initial preprocessing was made, because Canu includes a QC. The command outputs a .fasta file with an assembly.

Canu parameters:

max threads 10
no algorithms are run using grid control
genome size for selected scaffold (here sc11) 23775123 base pairs. (from metadata)

The following command was used:

canu -maxThreads=10 -useGrid=false -p pacbioassembly_scaffold11 -d . genomeSize=23775123 -pacbio SRR6037732_scaffold_11.fq.gz

The canu QC for scaffold 06 resulted in the following data:

amount of reads: 266898; after overlap correction: 40347
amount of bases: 2048960181 bases; after overlap correction: 116386120
coverage: 67.2 times; after overlap correction: 4.3 times
N50: 9909; after overlap correction: 8847
653 sequences assembled, total length 34 Mbp
7370 sequences unassembled, total length 47 Mbp

The initial assembly is build from data with good coverage, but more than half of the sequence length couldn't be assembled (for sc06). This is an unsatisfactory result, possibly depending on the original genome quality. An improvement of this step would possibly require obtaining new samples and redoing the sequencing, but a long read assembly can also be improved with short reads, as described below.

Assembly improvement

The assembly based on genomic long reads was further improved using genomic short reads. The reads were first indexed and aligned to the assembly using software BWA with options index and mem (10 threads), creating an alignment file in .sam format:

bwa index pacbioassembly_scaffold06.contigs.fasta

bwa mem -t 10 pacbioassembly_scaffold06.contigs.fasta /home/juliasul/GenomeAnalysis/DNA_prep_illumina/trimmomatic/illumina_scaffold06.paired.forward.trimmed.fastq.gz /home/juliasul/GenomeAnalysis/DNA_prep_illumina/trimmomatic/illumina_scaffold06.paired.reverse.trimmed.fastq.gz > /proj/genomeanalysis2023/nobackup/work/jusu_bwa/scaffold06/aligned_scaffold06.sam

The alignment was converted to a compressed format .bam, as well as sorted by position and indexed, using SAMtools (10 threads, input in SAM format):

samtools view -S -@ 10 -b scaffold06/aligned_scaffold06.sam > aligned_scaffold06.bam

samtools sort -o aligned_scaffold06_sorted.bam -@ 10 aligned_scaffold06.bam

samtools index -@ 10 aligned_scaffold06_sorted.bam

The initial assembly was subsequently corrected with the alignment using software Pilon (10 threads, vcf gile generated).

java -jar /sw/bioinfo/Pilon/1.24/rackham/pilon.jar --genome pacbioassembly_scaffold11.contigs.fasta --bam aligned_scaffold11_sorted.bam --output assembly_corrected_scaffold11 --outdir ./pilon_scaffold11 --vcf --threads 10

The final assembly was Quality Controlled using QUAST (10 threads, eukaryote genome, circos plot generated, gene finding with GeneMark-ES applied, conserved gene finding with BUSCO applied) and the sequence repeats were softmasked using RepeatMasker (10 threads, taxonomy group rosids, return html results).

quast.py -t 10 -e --circos -f -b assembly_corrected_scaffold06.fasta

RepeatMasker -pa 10 -species rosids -html -dir repeatmasker/scaffold06 assembly_corrected_scaffold06.fasta

The final assembly of scaffold 10 consists of 614 contigs, all of which more than 1000bp in size and have GC content of 31.44%. The following plots generated by QUAST describe the cumulative length of contigs and the GC content distribution in the assembly:

After the improvement the assembly quality is good and the problems with the initial assembly were successfully corrected.

Genome annotation

The transcriptome reads were aligned (splice-aware) to the final genome assembly using STAR.

Indexing parameters:

10 threads
custom value 11 of genomeSAindexNbases for very small genome (since genome size is only one scaffold), calculated using the following formula: min(14, log2(GenomeLength)/2 - 1).

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /proj/genomeanalysis2023/nobackup/work/jusu/jusu_STAR/genomeindex/sc06 --genomeFastaFiles assembly_corrected_scaffold06.fasta.masked --genomeSAindexNbases 11

Alignment parameters (the alignment command was written separately for each run for each scaffold. Only one of these is demonstrated here):

10 threads
output file type BAM to save storage
output files sorted by coordinate
output unmapped reads within the main output file
use standard output file attributes
input gzipped files -> read with zcat

STAR --genomeDir /proj/genomeanalysis2023/nobackup/work/jusu/jusu_STAR/genomeindex/sc06 --runThreadN 10 --readFilesIn SRR6040092_scaffold_06.1.fastq.gz SRR6040092_scaffold_06.2.fastq.gz --outFileNamePrefix proj/genomeanalysis2023/nobackup/work/jusu/jusu_STAR/align/sc06/SRR.92/SRR_92_sc06_aligned --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard --readFilesCommand zcat

The alignment file was compressed, sorted and indexed using SAMtools as in the previous step.

samtools view -S -b sc06/SRR.66/SRR_66_sc06_aligned--outSAMtypeAligned.out.sam > /proj/genomeanalysis2023/nobackup/work/jusu/jusu_STAR/align/bamfiles/SRR_66_sc06.bam

samtools sort -o bamfiles_sorted/SRR_66_sc06.bam SRR_66_sc06.bam

samtools index SRR_66_sc06.bam

The alingment steps produced alignment files in .bam format as well as index files in .bai format.

Next, the final genome was annotated using the alignment files, with software BRAKER.

Braker parameters:

use existing species files if already ran braker
use softmasking (repeats written in lowercase rather than replaced with N's)
10 threads
construct UTR training examples using BAM files

braker.pl --useexisting species=D_zibethinus --genome=/proj/genomeanalysis2023/nobackup/work/jusu/jusu_quast/repeatmasker/scaffold10/assembly_corrected_scaffold10.fasta.masked --softmasking --bam=$bamdir/SRR_66_sc10.bam,$bamdir/SRR_67_sc10.bam,$bamdir/SRR_69_sc10.bam,$bamdir/SRR_92_sc10.bam,$bamdir/SRR_93_sc10.bam,$bamdir/SRR_94_sc10.bam,$bamdir/SRR_95_sc10.bam,$bamdir/SRR_96_sc10.bam,$bamdir/SRR_97_sc10.bam --cores=10 --UTR=on

The annotation produced feature files in .gtf and .gff formats.

Feature counting and differential expression analysis

Genomic features were counted from the .gtf file using HTseq. With a bash pipe, the command outputs a .txt file with counts.

Htseq parameters:

input format BAM
input data is not stranded (a read is considered to be overlapping with a feature regardless of which strand it's on)
input data sorted by position (coordinate)

bam=/proj/genomeanalysis2023/nobackup/work/jusu/jusu_STAR/align/bamfiles/bamfiles_sorted gtf=/proj/genomeanalysis2023/nobackup/work/jusu/jusu_braker/sc10/braker/Sp_4/GeneMark-ET htseq-count -f bam -s no -r pos $bam/SRR_66_sc10.bam $bam/SRR_67_sc10.bam $bam/SRR_69_sc10.bam $bam/SRR_92_sc10.bam $bam/SRR_93_sc10.bam $bam/SRR_94_sc10.bam $bam/SRR_95_sc10.bam $bam/SRR_96_sc10.bam $bam/SRR_97_sc10.bam $gtf/genemark.gtf > /home/juliasul/GenomeAnalysis/count_features/sc10_count.txt

The resulting count file was used for expression comparison between the different cultivars and different plant organs using R package DEseq2. The code is too large to present here, but can be found in the codefile deSEQ.r. For each analysis, three graphs were created to represent the result: an Mean-average plot (MA-plot), Principal Component Analysis (PCA-plot) and an expression heatmap. Due to lack of time, names of the most expressed genes were not retrieved via eggNOGmapper and were instead borrowed from a classmate working on the same project. These gene names are used in heatmaps.

Results

Expression analysis between cultivars

MA-plot for expression analysis between cultivars:

In the plot, each dot represents a gene and blue dots represent genes with p-value < 0.1. The x-axis shows average expression across all samples and the y-axis is log-fold-change between experimental groups. Genes with similar expression between the different cultivars are clustered close to centerline of the plot, while the genes with with different expression are further away from the center. The plot suggests quite high differences in expression patterns between the two cultivars, because there are a lot of genes with significant p-values rather far away from the centerline. In a good MA-plot the dots should be evenly dispersed in relation to the x-axis, which values in this plot are not, which I can not explain. The linear relationships on this plot arise from one of the expression values being 0 in several columns of one of the datasets.

PCA-plot for for expression analysis between cultivars:

A PCA plot clusters samples based on their similarity, representing expression in each sample by one dot across all genes. It is built using two parameters - PC1 and PC2, each of which is actually a line for expression comparison of two genes (but for different principal components), combined into all-to-all comparisons as an axis (More detailed on this here: https://blog.bioturing.com/2018/06/14/principal-component-analysis-explained-simply/). The plot for cultivar analysis doesn't have enough samples for any clear or conclusive results, but seems to have three somewhat distinct clusters: two corresponsing to Musang King cultivar, and one to Monthong cultivar (with one Musang King sample mixed in). The plot results are meaningsful since the PCA variance proportions combined (54% and 24%) account for most variation in the data.

Heatmap for for expression analysis between cultivars:

Each column in the plot above represents a sample, and each row a specific gene. Due to time constraints, the problem with not all gene names being specified wasn't solved, leading to some of the rows being labeled "UNKNOWN". The genes represented with warm colors are up-regulated, and with cold colors down-regulated. The heatmap for cultivar comparison is the same as for plant organ comparison, with only difference being the grouping labels. It shows that E3 ubiquitin-protein ligase is highly up-regulated for certain samples corresponding to Musang King cultivar (run names ending with 94-95, last two to the right) and highly down-regulated for the run name ending with 67 (first to the left) corresponding to Monthnong cultivar. The overall view of the heatmap shows many genes being up-regulated in Musang King cultivar and down-regulated in Monthong cultivar.

Expression analysis between plant organs

MA-plot for expression analysis between plant organs:

Unlike the MA-plot for the previous analysis, this plot suggests less differences in expression patterns between plant organs due to few significant values being far away from the centerline. This graph displays somewhat even dispersion of the values around the the horizontal centerline, indicating better data. Similarly to the cultivar comparison MA-plot, the linear relationships on this plot indicate 0-values in several data columns. Interestingly enough and unlike the cultivar comparison plot, these are in the lower half of the graph.

PCA-plot for for expression analysis between plant organs:

This PCA plot also has three clusters two of which correspond to plant aril and one to a mix of leaf, root and stem. This graph would be much more useful (and have more conclusive results) if applied to more samples, for example the combined samples of all three provided scaffolds.

Heatmap for for expression analysis between plant organs:

In this plot we can see that both the above mentioned expression high- and low-spots for E3 ubiquitin-protein ligase belong to aril tissue. This shows that the different cultivars have significant functional differences when it comes to this specific gene. For the aril of the monthong cultivar, it's the only highly up-regulated gene among the ones tested. There can't be any comparison between the different plant organs in the monthong cultivar because the only monthong samples are from the aril, but such comparisons can be made for the Musang King cultivar: glutamate synthase, methionine and v-type proton ATPase subunit are not under strong regulation in any of the plant organs, while xyloglucan galactosyltransferase is strongly down-regulated in the aril. Many other genes in Musang King aril are strongly down-regulated, including some unidentified genes. More information would be available from these plots if all the genes were identified.

Conclusions

Genome assembly results are good. The expression analysis would benefit from improvements, but still gives some results, including the following:

The E3 biquitin-protein ligase gene is strongly up-regulated in aril of Musang King, and strongly down-regulated in aril of Monthong, suggesting functional differences.
The xyloglucan galactosyltransferase gene in aril of Musang King is strongly downregulated.
There are overall quite high expression differences between the cultivars, and seemingly less differences between the plant organs. This is an unexpected result that would require some verification.
Plant leaf, room and stem have overall a similar expression profile compared to aril.

The best option for continuation of this project would be redoing the differential expression analyses with more time, and preferably checking the quality of annotation results.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly