# More _tidyverse_ practice
**In this chapter we will introduce new (and reinforce already familiar) _tidyverse_
concepts on another real-world data set, related to the metadata we
looked at previously. We will build the
components of a processing and filtering (and in the next session, visualization) pipeline for
a new [Identity-by-Descent
(IBD)](https://en.wikipedia.org/wiki/Identity_by_descent) data set** that I
recently got my hands on.
We will proceed step by step, developing new components of the
processing and analytical pipeline and learning how the _tidyverse_ can be useful
in a very practical way, before we learn how to wrap it all up in a
nice, structured, reproducible way in later parts of our workshop.
This is a real world example! You will be retracing the steps I had to work
through in my job in August 2025 where I got sent a completely unfamiliar
new data set of IBD segments and asked to do some science with it.
---
**Why do this _tidyverse_ stuff again? We just did this with the metadata table.**
**A few reasons:**
1. **The metadata is straightforward, easy to understand by looking at it. The
IBD segment data (you will see) contains millions of data points.** Being able
to look at the data with your own eyes makes translating concepts to _tidyverse_
operations much easier, which is why the metadata was a better starting point.
2. **However, the metadata is just that, metadata.** It's a collection of meta
information about individuals. But it's not data you can do (biological)
science with. **As such, it's much less exciting to _really_ dig into more
interesting concepts.**
3. Related to point 2. above --- **the fact that the IBD data is more complex
makes it more interesting to learn visualization on, using _ggplot2_ in the next
section. So the rest of this session is a preparation for that.**
4. The previous chapter on _tidyverse_ basics was really about fundamental application
of the most important _tidyverse_ functions. **This session is much more of a
set of "recipes" that you should be able to apply to your own data**, regardless
of what statistics or biological measures you end up working with in your
projects.
---
**Again, here is what the start of our solutions script for this session will
look like. Same as before:**
```{r}
#| message: false
library(dplyr)
library(readr)
```
---
#### Tip on doing interactive data science
**Remember that the best way to solve any R problem is by building up
a solution one little step at a time. First work in the R console, trying bits
and pieces of code interactively, and only when you're sure you understand
the problem (and solution) finalize it by writing it into your
solution script (often saving the result in a variable for later use).**
This helps a great deal to avoid feeling overwhelmed by what can initially
seem like a problem that's too hard!
---
**Let's also read in data which we will be using in these exercises.** These
are coordinates of identity-by-descent (IBD) segments between pairs of
individuals in a
[huge aDNA study](https://www.nature.com/articles/s41586-023-06865-0)
we've already talked about. This data is big and quite complex --- I
prefiltered it to contain only IBD segments for chromosome 21. The IBD
data should correspond to individuals in the metadata we worked with in the
previous session.
```{r}
ibd_segments <- read_tsv("https://tinyurl.com/simgen-ibd-segments")
```
---
And **we also need to read the corresponding metadata, with which you are already
very closely familiar**:
```{r}
metadata_all <- read_tsv("https://tinyurl.com/simgen-metadata")
```
**Note:** I call it `metadata_all` because later we will be working with
a smaller section of it, for more clarity. Therefore, `metadata_all` will
be the original, unprocessed data set. It's often very useful to keep raw
data in our session like this.
---
**Create a new R script in RStudio (`File` `->` `New file` `->` `R Script`) and
save it somewhere on your computer as `tidy-advanced.R` (`File` `->` `Save`).
Copy the chunks of code above into it (the `library()` and `read_tsv()` commands)
and let's get started!**
---
## Real world story
**So, the story is that you just got a new data set produced by a piece of
bioinformatics software, in this case the [IBDseq software](https://faculty.washington.edu/browning/ibdseq.html)
for detecting IBD segments between pairs of individuals.
Before you proceed with doing any kind of statistical analysis, you need to
do two things:**
1. **Exploring the data to see _what_ it is that you just got.** Never trust
the data you've been given blindly, and always do many sanity checks! In
the process, you will gain more confidence in it, and also learn its ins
and outs, which will be useful either way.
2. Filtering and processing it in a way which will make your work easier.
**Let's tackle step 1 first.**
---
## Exercise 1: Exploring new data
**Again, the first step is always to familiarize yourself with the basic
structure of the data.
Which columns does the IBD data have? What's the format of the data?
What are the ranges or distributions of the columns' values, and what are
their data types? Do you have information for the entire genome (all chromosomes)?**
**Hint:** Try `head()`, `colnames()`, `glimpse()`, `str()`, `table()`, `summary()`,
which are applicable either to an entire data frame or to a specific column
(column vectors can be extracted with the `$` operator, of course).
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
head(ibd_segments)
```
```{r}
colnames(ibd_segments)
```
```{r}
glimpse(ibd_segments)
```
We have information just for chromosome 21 to make the data set a little
smaller and easier for your laptops to handle.
```{r}
table(ibd_segments$chrom)
```
:::
---
**As a sanity check, find out if you have metadata information for every individual in
the `sample1` and `sample2` columns of the IBD table. What about the other
way around -- do all individuals in the metadata have some IBD relationship to
another individual in the IBD segments table? If not, find out which individuals are these.**
This is another sort of sanity checking you will be doing all the time. We can
only analyze data for which we have metadata information (population assignment,
geographical location, dating information), so let's make sure we have what we
need.
**Hint:** Another way to phrase this question is this: does every sample that
appears in either `sample1` or `sample2` column of `ibd_segments` have a corresponding row in the
`sampleId` column of `metadata_all`? Note that you can use the
function `unique()` to get all values unique in a given vector (i.e.,
`unique(ibd_segments$sample1)` gives all individuals in the `sample1`
column, which has otherwise many duplicated entries).
And remember the existence of the `all()` function, and the `%in%`
operator, which can check if `unique(ibd_segments$sample1)`
is in `metadata_all$sampleId`.
**Hint:** If this still doesn't make sense, please work through my solution
directly and try to understand it bit by bit! This is a bit more advanced
than our previous _tidyverse_ exercises.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
We can get unique values across a combination of vectors by first combining
everything into a single vector using `c(ibd_segments$sample1, ibd_segments$sample2)`,
then calling `unique()` on that:
```{r}
ibd_samples <- unique(c(ibd_segments$sample1, ibd_segments$sample2))
metadata_samples <- metadata_all$sampleId
```
Then we can use our well-known operator `%in%` to check that every sample in
the IBD data has a representative in the metadata:
```{r}
all(ibd_samples %in% metadata_samples)
```
This is a good sign! We're not missing any information about anyone we have
an IBD segment for! Otherwise, we would have trouble analyzing such a person's
IBD patterns across time, geography, populations, etc.
How about the other way around? Note that this is not the same operation,
although it might look similar superficially:
```{r}
all(metadata_samples %in% ibd_samples)
```
The `FALSE` answer tells us that there are some individuals in our metadata
who are not participating in any pairwise IBD sharing. Who are these?
```{r}
metadata_all[!metadata_samples %in% ibd_samples, ]
```
:::
With the basic sanity check above, we can be sure that our IBD segments
table can be co-analyzed with the information in the metadata and that
we won't be missing anything important.
## Exercise 2: IBD processing
**Add a new column `length` to the `ibd_segments` data frame using the
`mutate()` function, which will contain the length of each IBD segment
in centimorgans (`end - start`). Save the data frame that `mutate()`
returns back to the variable `ibd_segments`.**
**Note:** Sometimes saving new things to already-defined variables leads to
very messy code, and it's cleaner to create new variables for new results.
In this instance, though, we're basically building up a processing
_pipeline_ whose purpose is to filter / mutate / clean a data frame for
downstream use. In fact, we will add individual steps to this pipeline
in several subsequent exercises.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
ibd_segments <- ibd_segments %>% mutate(length = end - start)
ibd_segments
```
Note that you don't always have to use a _tidyverse_ approach. Sometimes doing a simple
thing using a simple method is just... simpler:
```{r}
# add the new column directly using base R
ibd_segments$length <- ibd_segments$end - ibd_segments$start
ibd_segments
```
:::
## Exercise 3: Metadata processing
We have read our IBD table and added a useful new column `length` to it, so let's
proceed with the metadata. You already became familiar with it in the
previous session---the goal now is to transform it into a form more
useful for downstream co-analysis with the genomic IBD data.
**Think of it this way: metadata contains all relevant variables for each
individual in our data (age, location, population, country, etc.). IBD contains
only genomic information. We eventually need to merge those two sets of
information together to ask questions about population history and biology.**
---
**First, there's much more information than we need for now, just take
a look again:**
```{r}
colnames(metadata_all)
```
Realistically, given that our interest here is pairwise IBD segments,
a lot of this metadata is quite useless at this stage.
**Let's make the data a bit smaller and easier to take in at a glance.**
---
**Use `select()` to extract a subset of the metadata with the following columns and
store it in a new variable `metadata` (not back in `metadata_all`! we don't want to overwrite the
original big table `metadata_all` in case we need to refer to it a bit
later): `sampleId`, `country`, `continent`, `ageAverage`. Then rename
`sampleId` to `sample` and `ageAverage` to `age` just
to save ourselves some typing later.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
metadata <- select(metadata_all, sample = sampleId, country, continent, age = ageAverage)
metadata
```
:::
---
**Just as you did in the previous chapter, use `mutate()` and `if_else()`
inside the `mutate()` call to make sure that the "modern" individuals have
the `age` set to 0, instead of `NA` (and everyone else's age stays the same).
In other words, for rows where `age` is `NA`, replace that `NA` with 0.**
Yes, you did this before, but try not to copy-paste your solution. Try doing
this on your own first.
**Hint:** Remember the `is.na()` bit we used before! Try it on the `age`
column vector if you need a reminder: `is.na(metadata$age)`.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
metadata <- mutate(metadata, age = if_else(is.na(age), 0, age))
```
Let's make sure we haven't missed anything else:
```{r}
all(!is.na(metadata$age))
```
:::
---
**Our analyses will exclusively focus on modern humans. Filter out the three
archaics in the `metadata`, saving the results into the same `metadata`
variable again. As a reminder, these are individuals whose `sample` name
is among `c("Vindija33.19", "AltaiNeandertal", "Denisova")` which you can
test in a `filter()` command using the `%in%` operator.**
**Hint:** Remember that you can get a `TRUE` / `FALSE` indexing vector
(remember our R bootcamp session!) by not only `column %in% c(... some
values...)` but you can also do the opposite test as
`!column %in% c(... some values...)` (notice the `!` operator in the
second bit of code).
Experiment with the `filter()` command in your R console first if you
want to check that your filtering result is correct.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
metadata <- filter(metadata, !sample %in% c("Vindija33.19", "AltaiNeandertal", "Denisova"))
```
:::
## Status of our data so far
We now have a cleaner IBD table looking like this:
```{r}
head(ibd_segments)
```
And here's our metadata information:
```{r}
head(metadata)
```
---
**In our original `metadata_all` table we have the `groupAge` column with these
values:**
```{r}
table(metadata_all$groupAge)
```
**Note that in the new `metadata` we removed it because it's not actually useful
for most data analysis purposes (it has only three values).**
For instance, later we might want to do some fancier analyses, looking at
IBD as a time series, not just across basically two temporal categories,
"young" and "old".
**To this end, let's create more useful time-bin categories! This will
require a bit more code than the solutions above. Don't feel overwhelmed! I will
first introduce a useful function using a couple of examples for you
to play around with in your R console. Only then will we move on to an exercise
in which you will implement this on the full `metadata` table! For
now, keep on reading and experimenting in your R console!**
It will be worth it, because this kind of binning is something we do all the
time in computational biology, in every project.
---
**Let's introduce an incredibly useful function called `cut()`.
Take a look at the `?cut` help page and skim through it to figure out what it does.
As a bit of a hint, we will want to add a new metadata column which will indicate
which age bin (say, in steps of 5000 years) each of our individuals
belongs to.**
Here's a small example to help us get started.
**Let's pretend for a moment that the `df` variable created below is a small representative example of our full-scale big metadata table.**
**Copy this into your script, so that you can experiment with the `cut()`
technique in this section.**
```{r}
# a toy example data frame mimicking the age column in our huge metadata table
df <- data.frame(
sample = c("a", "b", "c", "d", "x", "y", "z", "q", "w", "e", "r", "t"),
age = c(0, 0, 1000, 0, 5000, 10000, 7000, 13000, 18000, 21000, 27000, 30000)
)
df
```
**Let's do the time-binning now! Think about what the `cut()` function does here based on the result it
gives you in the little example code below.** You can pretend for now that the
`age` column corresponds to the ages of our samples in the huge `metadata` table.
**Again, try running this on your own to see what happens when you run this
code in your R console. Particularly, take a look at the format of the new
column `age_bin`:**
```{r}
#| collapse: true
# let's first generate the breakpoints for our bins (check out `?seq` if
# you're confused by this!)
breakpoints <- seq(0, 50000, by = 5000)
# take a look at the breakpoints defined:
breakpoints
# create a new column of the data frame, containing bins of age using our breakpoints
df$age_bin <- cut(df$age, breaks = breakpoints)
# take a look at the data frame (with the new column `age_bin` added)
df
```
**The function `cut()` is extremely useful whenever you want to discretize
some continuous variable into bins (basically, a little similar to what
a histogram does in the context of plotting).** Doing statistics on this kind of
binned data is something we do very often. So never forget that `cut()` is
there to help you! For example, you could also use the same concept to
partition samples into categories based on "low coverage", "medium coverage",
"high coverage", etc.
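As a quick sketch of that last idea, here's a minimal toy example (the `coverage` values and cutoffs are made up purely for illustration):

```{r}
# hypothetical per-sample sequencing coverages
coverage <- c(0.3, 1.2, 5.8, 12.1, 0.9, 3.4)

# cut() can label the bins directly via its labels = argument,
# instead of using the default interval notation
cut(
  coverage,
  breaks = c(0, 1, 5, Inf),
  labels = c("low coverage", "medium coverage", "high coverage")
)
```

The same pattern works for any continuous variable you want to discretize.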
---
**In the example of the `cut()` function right above, what is the data type
of the column `age_bin` created by the `cut()` function? Use `glimpse(df)` to see
this data type, then skim through the documentation of `?factor`. What is a
"factor" according to this documentation?**
**Note:** This is a challenging topic. Please don't stress. When we're done
I'll walk through everything myself, interactively, and will explain things
in detail. Take this information on "factors" as something to be aware of
and a concept to be introduced to, but _not_ something to necessarily be an
expert on!
---
**Having learned about `?factor` above, consider the two following vectors
and use them for experimentation when coming up with answers.
What do you see when you print them out in your R console (by typing `x1`
and `x2`)? And what happens when you apply the
`typeof()` function on both of them? `x2` gives you a strange result --- why?
What do you get when you run the following command `levels(x2)`? What do you
get when you run `as.character(x2)`?**
```{r}
x1 <- c("hello", "hello", "these", "are", "characters/strings")
x2 <- factor(x1)
```
**In applying `cut()` to the toy data frame `df`, why is the `age_bin`
value equal to `NA` for some of the rows?**
**Note:** The discussion of "factors" technically should have been part of our
R bootcamp chapter, on the topic of "data types". However, that section was
already too technical, so I decided to move it here to the data analysis section,
because I think it makes more sense to explain it in this wider context.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
You can see that both vectors look the same, but the second has additional
information about "levels":
```{r}
x1
x2
```
The reason you get `typeof(x1)` as "character" but `typeof(x2)` as "integer",
although both are "basically strings", is that factors are a special data
type for encoding categorical variables which can only take a fixed set of
possible values. Internally, for efficiency, the levels of a factor variable
are stored as integers, and their values are just "labels" for those integers.
Importantly, factor levels are ordered, which is very useful for plotting,
as we will see later, because plots maintain the order of factor levels when visualized.
For now, don't worry about this too much. We needed to introduce the concept
of factors because they are very frequently used whenever we need to bin
continuous data, like we will now for assigning samples into bins based
on their `age` value.
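If you're curious, here's a tiny aside you can try in your console to see this internal integer encoding for yourself:

```{r}
# the integer codes stored internally for each factor value
as.integer(x2)
# ...and the level labels those integers point to
levels(x2)
```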
---
Oh, and the answer to why `cut()` gave us `NA` for every 0 value is
that its `include.lowest` argument defaults to `FALSE`. Compare these two:
```{r}
cut(df$age, breaks = seq(0, 50000, by = 10000))
```
```{r}
cut(df$age, breaks = seq(0, 50000, by = 10000), include.lowest = TRUE)
```
**However, this isn't what we want, because we want to treat present-day individuals
separately as their own category, not include them with other ancient people
less than 5000 years old.**
---
The `levels()` function is useful for getting the "labels" of the categories
as they are presented to you in various printouts in the console (rather
than their internal numerical encoding).
Consider the original vector `x2` (note the duplicated "hello" value):
```{r}
x2
```
And the `levels()` output:
```{r}
levels(x2)
```
When we run `as.character()`, we effectively convert the (categorical) factor
values into normal strings (i.e., basically convert those values into their
plain labels). This is very useful whenever we want to manipulate factors,
as we'll see below:
```{r}
as.character(x2)
```
:::
---
**The scientific notation format of bin labels such as `5e+03` is very
annoying to look at. How can you use the `dig.lab =` argument of the `cut()`
function to make this prettier?
Experiment in the R console to figure this out,
then modify the `df$age_bin <- cut(df$age, breaks = breakpoints)` command
in your script accordingly.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
Let's experiment first:
```{r}
# this is where we started
cut(df$age, breaks = breakpoints)
```
```{r}
# this doesn't do it
cut(df$age, breaks = breakpoints, dig.lab = 3)
```
```{r}
# but this does!
cut(df$age, breaks = breakpoints, dig.lab = 10)
```
Our solution to get rid of the ugly scientific notation in our labels is
therefore:
```{r}
df$age_bin <- cut(df$age, breaks = breakpoints, dig.lab = 10)
df
```
Much nicer to look at and immediately readable!
:::
---
**You have now learned that `cut()` has an optional argument called
`include.lowest =`, which includes the lowest value of 0 (representing
the "present-day" age of our samples) in the lowest bin `[0, 5000]`. However, in the
case of our present-day samples, this is not what we want.
We want present-day individuals to have their own category called "present-day".**
Here's a useful bit of code I use often for this exact purpose, represented
as a complete self-contained chunk of the time-binning code. If we start
from the original toy example data frame (with the `NA` values assigned to
present-day ages of 0):
```{r}
df
```
We can fix this with the following process, each step described in the comments.
_Don't stress about any of this!_ Just run this code on your own and try
to relate the `# text description of each step in comments` to the code
itself. Everything being done here uses bits and pieces introduced above.
```{r}
# 0. assign each row to a bin based on given breakpoints
df$age_bin <- cut(df$age, breaks = breakpoints, dig.lab = 10)
# 1. extract labels (no "present-day" category yet)
bin_levels <- levels(df$age_bin)
df <-
df %>%
mutate(
age_bin = as.character(age_bin), # 2. convert factors back to "plain strings"
age_bin = if_else(is.na(age_bin), "present-day", age_bin), # 3. replace NA with "present-day",
age_bin = factor(age_bin, levels = c("present-day", bin_levels)) # 4. convert to factor (to maintain order of levels)
)
```
**Please run the code above bit by bit, inspecting the variables you create
(or modify) after each step. Remember the `CTRL / CMD + Enter` shortcut
for stepping through code!**
When we print the modified `df` table, we see all bins properly assigned now,
including the present-day samples with ages at 0:
```{r}
df
```
**This is a very useful pattern which you will get to practice now on the full
`metadata` table, by literally applying the same code on the big table.**
---
**Now that you're familiar with the `cut()` technique for binning values,
use the functions `mutate()` and
`cut()` again (in exactly the same way as we did on the `df` table in the
code chunk above!) to create a new column `age_bin`, this time on the whole
`metadata` table. The new column should carry the age category of each
individual, this time in steps of 10000 years.**
**Hint:** As we did above, first assign the `age_bin` column using the `cut()`
function, then modify the column accordingly with the `mutate()` snippet to
set "present-day" in the `age_bin` for all present-day individuals.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
You can match the step numbers to our example code applied on the smaller
table `df` above. They do the same thing!
```{r}
# step 0. we first bin samples according to their age
bin_step <- 10000
metadata$age_bin <- cut(metadata$age, breaks = seq(0, 50000, by = bin_step), dig.lab = 10)
# step 1. get the assigned bin labels
bin_levels <- levels(metadata$age_bin)
# then we modify the bins to include "present-day" labels:
metadata <- metadata %>%
mutate(
age_bin = as.character(age_bin), # step 2.
age_bin = if_else(is.na(age_bin), "present-day", age_bin), # step 3.
age_bin = factor(age_bin, levels = c("present-day", bin_levels)) # step 4.
)
```
We can check the binning result to see that we've been successful!
```{r}
table(metadata$age_bin)
```
The reason we converted the `age_bin` back to factor in the end is that
we want to maintain a temporal order of our labels (first "present-day",
then all other subsequent time bins), regardless of how mixed our rows
are.
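Here's a tiny standalone demonstration of why the order of levels matters (the labels are made up just for this illustration):

```{r}
# without explicit levels, factor() orders the labels alphabetically
table(factor(c("old", "young", "old", "young", "young")))
# with explicit levels, our chosen order is kept in printouts (and plots)
table(factor(c("old", "young", "old", "young", "young"),
             levels = c("young", "old")))
```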
:::
---
**How many individuals did we end up with in each `age_bin`? Use `table()`
to answer this question (and also to sanity check your final results), to make
sure everything worked correctly.**
## Recipe for merging metadata
After basic processing and filtering of the primary data (in our case, the
genomic coordinates of pairwise IBD segments), we often need to annotate this
data with some meta information... which is what we already processed too!
What does this mean? **Take a look at our IBD data in its current state again:**
```{r}
head(ibd_segments)
```
**When we analyze things down the line, we might want to look at IBD patterns
over space and time, look at some temporal changes, etc. However, our IBD
data has none of that information in it!** Besides the names of individuals
(from which population, at which time, at which geographical location?) we
have nothing, so even if we were to run our
friends `group_by()` and `summarize()` to compute some summary statistics and
get some insights into the history or biological relationships between our samples,
we would have no way to correlate them with other variables... precisely because those
variables live elsewhere, in a completely different table we called `metadata`.
**We now need to put both sources of information into a single joint format.
I call this "metadata merging", and it is something we always want to do
regardless of what kinds of statistics or measures we work with.**
**The technique below is again a very useful recipe for your future work.**
Again, the annotation that we need is in the metadata table --- it has
information about ages and geography:
```{r}
head(metadata)
```
**What we need to do is join our IBD data with a selection of the most
important variables in our metadata, which is what we're going to do now.**
**This bit would be a bit frustrating and repetitive if I made it into
an exercise for you, so just inspect the following code, run it yourself
(and put it into your own script), and try to make sense of it. Take this
as a recipe that you can modify and reuse in your own projects later!**
I'm happy to answer questions!
#### **Step 1.**
For each IBD segment (a row in our table `ibd_segments`) we
have two individuals, `sample1` and `sample2`. Therefore, we
will need to annotate each row of the `ibd_segments` table with metadata
information for both individuals. **Let's start by making two copies of the
metadata table, calling those copies `metadata1` and `metadata2`:**
```{r}
metadata1 <- metadata
metadata2 <- metadata
```
#### **Step 2.**
The columns of these metadata copies are the same, which you can verify by running this:
```{r}
#| collapse: true
colnames(metadata1)
head(metadata1, 1)
```
```{r}
#| collapse: true
colnames(metadata2)
head(metadata2, 1)
```
Obviously this won't work if we want to refer to the metadata of `sample1`
and `sample2` separately. Here's another useful trick for this. Observe
what the `paste0()` function does in this situation:
```{r}
# original columns
colnames(metadata1)
# columns with a number added after
paste0(colnames(metadata1), "1")
```
**So using `paste0()` we can append "1" or "2" to each column name of the two tables.
We can use the same approach to _assign_ new column names to both of our
metadata copies:**
```{r}
colnames(metadata1) <- paste0(colnames(metadata1), "1")
colnames(metadata2) <- paste0(colnames(metadata2), "2")
```
**Now we have both metadata tables with columns renamed. Let's check this to
make sure we did this correctly:**
```{r}
colnames(metadata1)
colnames(metadata2)
```
#### **Step 3.**
Now we are ready to join the metadata to the table of IBD segments.
The final piece of the puzzle is a
[JOIN operation](https://en.wikipedia.org/wiki/Join_(SQL)) from the realm
of relational databases. Joins can be a rather complex topic in general,
but in our situation they have a very simple interpretation, schematically
described by the following diagram of a so-called "inner join",
implemented in _dplyr_ by the function `inner_join()`:
**Note:** Even if you're already familiar with the _tidyverse_, joins might
still be a relatively new topic for you. If so, please take a bit of time to
read the [_dplyr_](https://dplyr.tidyverse.org/reference/mutate-joins.html)
documentation on join operations. Joins are a superpower of data science.
<center>

</center>
<br>
This operation creates a "merged" table consisting of the columns of both
the "left" and "right" tables, where each row of the "merged" table combines
a row from "left" with a row from "right" that share the same value in the
"key" column (here highlighted in green). **Take a moment
to trace the one row of the "merged" table at the bottom of the diagram
back to the row(s) of the "left" and "right" tables based on the value of
the green key column in both of them.**
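To make the diagram concrete, here is a minimal toy example (the tiny tables `left` and `right` are made up purely for illustration). Only keys present in both tables survive an `inner_join()`:

```{r}
library(dplyr)

left  <- tibble(key = c("a", "b", "c"), x = 1:3)
right <- tibble(key = c("b", "c", "d"), y = c(20, 30, 40))

# keys "b" and "c" appear in both tables, so the merged table
# has exactly those two rows, with columns from both sides
inner_join(left, right, by = "key")

# for comparison, left_join() keeps all rows of `left`,
# filling in NA where `right` has no matching key
left_join(left, right, by = "key")
```

Note that a row whose key appears in only one of the tables (like `"a"` or `"d"` here) is dropped by the inner join entirely.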
Now consider the columns of our three tables:
- `ibd_segments` columns:
```{r}
#| echo: FALSE
colnames(ibd_segments)
```
- `metadata1` and `metadata2` columns:
```{r}
#| echo: FALSE
colnames(metadata1)
```
```{r}
#| echo: FALSE
colnames(metadata2)
```
**Here's what happens when you perform an inner join like this (note that
we use `by` to specify which column to merge on, i.e. which column plays
the role of the green key column in the diagram above):**
```{r}
ibd_merged <- inner_join(ibd_segments, metadata1, by = "sample1")
```
**Take a look at the result, saved in a new variable `ibd_merged`, by typing
it into your R console. Our IBD
table now has added metadata information for the individuals in column
`sample1`, awesome! We can verify this by comparing the two tables:**
```{r}
# original IBD table:
head(ibd_segments, n = 3)
# IBD table with metadata1 merged in:
head(ibd_merged, n = 3)
```
**How do we add metadata for the second sample? We do it in the same way,
except now we join in the second table (and we save the results back to the new
variable `ibd_merged`, to be able to compare the differences):**
```{r}
ibd_merged <- inner_join(ibd_merged, metadata2, by = "sample2")
```
**In fact, I do this so often, in basically every computational biology
project, that I like to use this concise bit of code to do it all in one go
(this is what you should add to your script!):**
```{r}
ibd_merged <-
ibd_segments %>% # 1. primary data without metadata
inner_join(metadata1, by = "sample1") %>% # 2. add metadata for sample1
inner_join(metadata2, by = "sample2") # 3. add metadata for sample2
```
**Again, we can verify that our new table has metadata columns for both
`sample1` and `sample2`:**
```{r}
head(ibd_segments, n = 3)
head(ibd_merged, n = 3)
```
---
**And this concludes the merging of the primary IBD data with all the
necessary metadata!** _Every computational project has something like this
somewhere._ Of course, your own primary data might be observations
at particular ecological sampling sites, excavation sites, etc., and the
annotation might then involve metadata with more specific geographical or
environmental information. But the idea of using a join for the annotation
remains the same.
## Exercise 4. Final data processing
You've already seen the super useful function `paste()` (and also `paste0()`),
which comes in handy whenever we need to join the elements of two (or more)
character vectors together. The difference between them is that `paste()`
separates the individual values with a space by default (and its `sep =`
argument lets you customize the separator), while `paste0()` uses no
separator at all.
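For a quick refresher, here is a small self-contained sketch of the difference (the vectors are made up):

```{r}
a <- c("chr", "chr")
b <- c("1", "2")

paste0(a, b)            # no separator:     "chr1"  "chr2"
paste(a, b)             # space by default: "chr 1" "chr 2"
paste(a, b, sep = "_")  # custom separator: "chr_1" "chr_2"
```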
**Take a look at your (now almost finalized) table `ibd_merged`:**
```{r}
head(ibd_merged)
```
When we get to comparing IBD patterns between countries, time bins, etc.,
it will be useful to have metadata columns made specifically for that
purpose. What do I mean by this? For instance, consider these two vectors
and the result of pasting them together. Having this combined pair
information makes it very easy to cross-compare multiple sample categories:
rather than having to keep track of two variables `v1` and `v2`, we have
just the merged pairs:
```{r}
v1 <- c("Denmark", "Czech Republic", "Estonia")
v2 <- c("Finland", "Italy", "Estonia")
pairs_v <- paste(v1, v2, sep = ":")
pairs_v
```
---
Let's create this information for `country1/2`, `continent1/2`, and
`age_bin1/2` columns.
**Specifically, use the same `paste()` trick in a `mutate()` function to add a
new column to your
`ibd_merged` table named `country_pair`, which will contain a vector of
values created by joining of the columns `country1` and `country2`. Add
this column `.before` the `chrom` column for clearer visibility (check
out `?mutate` if you need a refresher on how its `.before =`
argument works). Then do the same to create a `region_pair` based on
`continent1` and `continent2`. Save the result back to the
variable named `ibd_merged`.**
**Hint:** Note that you can create multiple new columns with `mutate()`,
but you can also pipe `%>%` one `mutate()` into another `mutate()` into
another `mutate()`, etc.!
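As an illustration of this hint (on a made-up toy table, so as not to spoil the solution), both styles produce exactly the same result:

```{r}
library(dplyr)

d <- tibble(x = 1:3)

# one mutate() creating two columns at once...
mutate(d, double = x * 2, triple = x * 3)

# ...or two mutate() calls piped together
d %>%
  mutate(double = x * 2) %>%
  mutate(triple = x * 3)
```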
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
ibd_merged <- mutate(
ibd_merged,
country_pair = paste(country1, country2, sep = ":"),
region_pair = paste(continent1, continent2, sep = ":"),
.before = chrom
)
# let's verify the results
ibd_merged
# values in the new `region_pair` column
table(ibd_merged$region_pair)
```
:::