popgen_course/04_Rmd_plotting_pca.Rmd at master · Schiffels-Popgen/popgen_course · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
title: "Principal Component Plots"
output: html_document
editor_options:
  chunk_output_type: console
---

```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE)
```

```{r}
library(magrittr)
```

For this chapter, you will need the PCA results that we ran in the last chapter. I have actually included the output files of my runs into this repository, so you can just use them if something didn't work in the previous chapter.

For making plots in python, one of the most popular libaries around is ggplot2. You can load it via:

```{r}
library(ggplot2)
```

Let's have a look at the main results file from smartpca:

```{r, warning=F}
pcaDat <- readr::read_delim("pca.WestEurasia.evec", " ", trim_ws = T)
pcaDat2 <- readr::read_delim("pca.AllEurasia.evec", " ", trim_ws = T)
```

The first row contains the eigenvalues for the first 4 principal components (PCs), and all further rows contain the PC coordinates for each individual. The first column contains the name of each individual, the last row the population. To load this dataset with R, we use the readr package. To load data using readr, we used the read_delim() function. We can now change the column headers:

```{r}
colnames(pcaDat) <- colnames(pcaDat2) <- c("Name", "PC1", "PC2", "PC3", "PC4", "Group")
```

Looking at the data, we find that it is a tibble (a better data.frame), with each individual on one row, and the columns denoting the first 4 principal components. The last column contains the population for each individual:

```{r}
pcaDat
```

We can quickly plot the first two PCs for all individuals:

```{r}
pcaDat %>%
  ggplot() +
  geom_point(aes(x = PC1, y = PC2))
```

which is not very helpful, because we can't see where each population falls. We can highlight a few populations to get a bit more of a feeling:

```{r}
ggplot() +
  geom_point(
    data = pcaDat %>% dplyr::filter(!(Group %in% c("Finnish", "Sardinian", "Armenian", "BedouinB"))),
    aes(x = PC1, y = PC2)
  ) +
  geom_point(
    data = pcaDat %>% dplyr::filter(Group %in% c("Finnish", "Sardinian", "Armenian", "BedouinB")),
    aes(x = PC1, y = PC2, color = Group)
  )
```

## Showing all populations

OK, but how do we systematically show all the populations? There are too many of those to separate them all by different colors, or by different symbols, so we need to combine colours and symbols and use all the combinations of them to show all the populations.

```{r}
populations <- readr::read_csv("data/popgen_course/WestEurasia.poplist.txt", col_names = F)$X1
```

```{r, fig.height=10}
pcaDat %>%
  dplyr::filter(Group %in% populations) %>%
  ggplot() +
  geom_point(aes(
    x = PC1, y = PC2,
    color = Group, shape = Group
  )) +
  scale_shape_manual(values = rep(0:18, len = 57)) +
  theme(legend.position = "bottom")
```

## Adding ancient populations

Of course, until now we haven't yet included any of the actual ancient test individuals that we want to analyse.

We add the following ancient populations to this plot:

- Levanluhta (two individuals from Finland from the first millenium AD)
- BolshoyOleniOstrov (a group of 3500 year old individuals from Northern Russia).
- WHG (short for Western Hunter-Gatherers, about 8000 years ago)
- LBK_EN (short for Linearbandkeramik Early Neolithic, from about 6,000 years ago)
- Yamnaya_Samara, a late Neolithic population from the Russian Steppe, about 4,800 years ago.

The first two populations are from a publication on ancient Fennoscandian genomes ([Lamnidis et al. 2018](https://www.nature.com/articles/s41467-018-07483-5)), and are instructive to understand what PCA can be used for. The latter three populations are from two famous publications ([Lazaridis et al. 2014](https://www.nature.com/articles/nature13673) and [Haak et al. 2015](https://www.nature.com/articles/nature14317)). It can be shown that modern European genetic diversity is formed by a mix of three ancestries represented by these ancient groups. To highlight these ancient populations, we plot them in black and using different symbols. While we're at it, we should also add the population called "Saami.DG":

```{r, fig.height=10}
ancient_populations <- c("Levanluhta", "BolshoyOleniOstrov", "WHG", "LBK_EN", "Yamnaya_Samara", "Saami.DG")

ggplot() +
  geom_point(
    data = pcaDat %>% dplyr::filter(Group %in% populations),
    mapping = aes(
      x = PC1, y = PC2,
      color = Group, shape = Group
    )
  ) +
  geom_point(
    data = pcaDat %>% dplyr::filter(Group %in% ancient_populations),
    mapping = aes(
      x = PC1, y = PC2
    ),
    color = "black", shape = 15
  ) +
  ggrepel::geom_label_repel(
    data = pcaDat %>% dplyr::filter(Group %in% ancient_populations) %>%
      dplyr::group_by(Group) %>%
      dplyr::summarise(PC1 = mean(PC1), PC2 = mean(PC2)),
    mapping = aes(
      x = PC1, y = PC2, label = Group
    )
  ) +
  scale_shape_manual(values = rep(0:14, len = 57)) +
  theme(legend.position = "bottom")
```

OK, so what are we looking at? This is quite a rich plot, of course, and we won't discuss all the details here. I just want to highlight two things. First, you can see that most present-day Europeans are scattered in a relatively tight space in the center of a triangle span up by the WHG on the lower left, LBK_EN on the lower right (seen from European points) and by Yamnaya_Samara (top). Indeed, a widely-accepted model for present-day Europeans assumes these three ancient source populations for all Europeans ([Lazaridis et al. 2014](https://www.nature.com/articles/nature13673) and [Haak et al. 2015](https://www.nature.com/articles/nature14317)).

The second thing that is noteworthy here is that present-day people from Northeastern Europe, such as Finns, Saami and other Uralic speaking populations are "dragged" towards the ancient samples form Bolshoy Oleni Ostrov. Indeed, a recent model published by us assumes that "Siberian" genetic ancestry entered Europe around 4000 years ago as a kind of fourth genetic component on top of the three other components discusseda bove, and is nowadays found in most Uralic speakers in Europe, including Finns, Saami and Estonians.

## East-Eurasian PCA

We can make a similar plot using the all-Eurasian PCA that we have run:

```{r}
populations <- readr::read_csv("data/popgen_course/AllEurasia.poplist.txt", col_names = F)$X1
```

```{r, fig.height=12}
ggplot() +
  geom_point(
    data = pcaDat2 %>% dplyr::filter(Group %in% populations),
    mapping = aes(
      x = PC1, y = PC2,
      color = Group, shape = Group
    )
  ) +
  geom_point(
    data = pcaDat2 %>% dplyr::filter(Group %in% ancient_populations),
    mapping = aes(
      x = PC1, y = PC2
    ),
    color = "black", shape = 15
  ) +
  ggrepel::geom_label_repel(
    data = pcaDat2 %>% dplyr::filter(Group %in% ancient_populations) %>%
      dplyr::group_by(Group) %>%
      dplyr::summarise(PC1 = mean(PC1), PC2 = mean(PC2)),
    mapping = aes(
      x = PC1, y = PC2, label = Group
    )
  ) +
  scale_shape_manual(values = rep(0:14, len = 108)) +
  theme(legend.position = "bottom")
```

This PCA looks quite different. Here, we have all Western-Eurasian groups squished together on the left side of the plot, and on the right we have East-Asian populations. The plot roughly reflects Geography, with Northern East-Asian people such as the Nganasan on the top-right, and Southern East-Asian people like the Taiwanese Ami on the lower right. Here we can now see that the ancient samples from Russia and Finnland, as well as present-day Uralic populations are actually distributed between East and West, contrary to most other Europeans. This confirms that these group in Europe have quite a distinctive East-Asian genetic ancestry, and we found that it is best represented by the Nganasan ([Lamnidis et al. 2018](https://www.nature.com/articles/s41467-018-07483-5)).