acropora_kenti_wgs/00.sample_sequencing_info.Rmd at main · bakeronit/acropora_kenti_wgs · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
---
title: "Samples and sequencing information"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.retina = 2)
library(tidyverse)
library(ggpubr)
#library(ggbreak)
library(knitr)
source("scripts/my_color.R")
```


```{r map}
include_graphics("figures/Atenius-sampling.png")
```


 *Acropora kenti* corals were collected from both 5 inshore and 4 offshore locations on the Great Barrier Reef (GBR).

- Inshore reefs
  - Magnetic Island (MI, n=28)
  - Pandora Reef (PR, n=30)
  - Pelorus Island (PI, n=30)
  - Dunk Island (DI, n=30)
  - Fitzroy Island (FI, n=30)

- Offshore reefs
  - Arlington Reef (ARL, n=20)
  - Taylor Reef (TAY, n=20)
  - Rib Reef (RIB, n=20)
  - John Brewer Reef (JB, n=20)

The extracted DNA was sequenced at shallow depth **(~2-5x)** with 100bp paired sequences. Some samples were sequenced in multiple lines or flowcells, thus, one sample has several fastq files.

> Two samples (FI-1-3, MI-1-4) were sequenced at much higher depth ~20x. Aside from some specific analyses (eg structural variant dection) these files were down-sampled to average depth based on random selection of mapped reads from their bam files.

Data were sequenced in two batches.  The first batch is from a previously [published study](https://www.science.org/doi/full/10.1126/sciadv.abc6318).  The second was specifically sequenced for the present study.

- **Data of inshore reefs from Cooke et al 2020**

  Each sample was sequenced in 2 lanes with paired-end reads, thus, 592 fastq files were generated from 148 samples.

- **Data of offshore reefs. This study**

  80 samples, each with 2 lanes and 2 batches, from 640 fastq files.

List of filenames and sample info table can be found [here](https://docs.google.com/spreadsheets/d/1sArk4d6xUXZzDHxBPvQEfcudYWERo7W7Xp-Va3FbQeE/edit?usp=sharing)

**Sequencing data summary**

Data yield per sample ranged from 0.423-2.49Gbp.Based on a genome size of 487Mb (486,812,518), samples were sequenced at 0.87X-5.12X depths (except for two high sequencing depth samples: 23.84X,11.6Gbp; 25.90X,12.6Gbp).

```{r seq-data}
df <- read_csv("data/summary_data.csv") %>% mutate(pop_order=site_order()[pop])
df %>%
  filter(!(sample_id %in% c("MI-1-4_S10","FI-1-3_S9"))) %>%
  ggplot(aes(x=reorder(pop,pop_order),y=total_bases/1e+9,color=pop)) +
  geom_boxplot() +
  geom_point(color="darkgrey",size=.5) +
  scale_color_manual(values = site_colors(),guide=F) +
  theme_pubclean()+labs(x="",y="Total base (Gb)")
```

**Figure 1:** Total sequencing data available for all shallow coverage samples by reef.