intro-datatable/material.qmd at main · paocorrales/intro-datatable · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
---
title: "Introduction to data.table"
editor: visual
editor_options:
  chunk_output_type: inline
engine: knitr
webr:
  packages: ['ggplot2', 'data.table'] # Install R packages on document open
  cell-options:
    autorun: true
filters:
  - webr
---

```{r}
#| label: setup
#| echo: false
#| message: false
library(printr)

knitr::opts_hooks$set(solution = function(options) {
  options[["code-fold"]] <- TRUE
  options[["code-summary"]] <- "Show one solution"
  options[["eval"]] <- FALSE
  options
})
```

```{webr-r}
#| context: setup
library(data.table)
library(ggplot2)
library(printr)
rolling_stone <- fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-07/rolling_stone.csv')
```

## What data.table?

At its core, data.table provides an enhanced version of data.frames that are faster, more memory efficient and can be manipulated using a more concise syntax.
It also provides a whole set of extra functions for reading from and writing to tabular files, reshaping data between long and wide formats, joining datasets and much more.

## Why data.table?

-   Fast and efficient
-   Does not have dependencies
-   Syntax is very concise
-   Ensures backwards compatibility

## Reference semantics

Most R functions and methods uses copy-on-modify.
This means that modifying an object almost always creates a new copy of the object, while the original one is kept unmodified.
For example this code

``` r
my_data |>
  mutate(new_column = old_column*2)
```

returns a new tibble that is a copy of `my_data` with a new column but it doesn't modify `my_data`.

data.table uses modify-in-place, which means that objects are **not** copied when modified.
This code

``` r
my_data[, new_column := old_column*2]
```

doesn't create a new copy of `my_data` but it rather modifies `my_data` directly.
This is similar to the base R code:

``` r
my_data$new_column <- data$new_column
```

Modify-in-place is one of the features that makes data.table so efficient, but it can also make code harder to think about and lead to surprising results (especially if a data.table is modified inside a function).

Functions that modify a data.table in place start with "set", like `setcolorder()` (reorders columns) or `setnames()` (renames columns).

## Reading data with data.table

The first step of most data analyses is to read data in memory.
We can use the `data.table::fread()` function (the f is for *fast*) to read regular delimited files such as csv files.
This function not only is fast, but it automatically detects the delimiter and guesses the class of each column and the number of rows in the file.

```{r}
library(data.table)
```

``` r
rolling_stone <- fread('data/rolling_stone.csv')
```

It is also possible to read data from a URL.

```{r}
rolling_stone <- fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-07/rolling_stone.csv')
```

`fread()` will return a data.table object.
If you read a dataset as a data.frame or tibble and want to create it as a But sometimes you'll have a data.frame or a tibble that you loaded into R using other methods.
You can convert them to data.table either with the `as.data.table()` or `setDT()` functions.
The difference is that `as.data.table()` returns a copy of the data as a data.table and `setDT()` changes the data in place (as denoted by the "set" prefix).

```{r}
setDT(rolling_stone)
```

### The Rolling Stone data

The data we'll use comes from the [TidyTuesday project](https://github.com/rfordatascience/tidytuesday), includes album rankings from Rolling Stone magazine.
It has [21 variables](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-07/readme.md#data-dictionary) and 691 observations, not a big task for data.table but it will help us to explore some of the functionalities.

```{r}
str(rolling_stone)
```

## The syntax

The general data.table syntax looks like this:

```{=html}
<div  style=" margin-left: auto; margin-right: auto;display: block; text-align:center; width:fit-content;">
<p style="font-size:250%;font-weight:bold" >
<code>DT[<span class="rows">i</span>, <span class="columns">j</span>, <span class="by">by</span>]</code>
</p>
</div>
```

Where <code>**DT**</code> is a data.table object, the <code class="rows">[**i**]{.rows}</code> argument is used for filtering and joining operations, the <code class="columns">[**j**]{.columns}</code> argument can summarise and transform, and the <code class="by">[**by**]{.by}</code> argument defines the groups to which to apply these operations.

You can read the syntax as "In these rows, do this, grouped by that".
It is very concise but easy to read (sometimes).

Let's start operating over `i`{.rows}, meaning over the rows of the table.

## Subset rows in [i]{.rows}

Which is the number one album in the 2003 ranking?
We need to filter our dataset to show only rows where `rank_2003` is 1:

```{r}
rolling_stone[rank_2003 == 1]
```

This is similar to how we'd use a regular data.frame:

```{r}
rolling_stone[rolling_stone$rank_2003 == 1]
```

::: {.alert .alert-info}
**data.tables are data.frames**

Notice that we can use data.frame syntax on a data.table and it will work just fine in most cases.
:::

But because all expressions in <code class="rows">[i]{.rows}</code> are evaluated in the context of the data.table, we don't need to type the name of the data.frame again.
This becomes even more convenient with longer expressions.
For example, which are the female bands with more than 3 members?

```{r}
rolling_stone[artist_gender == "Female" & artist_member_count > 3]
```

It is also possible to order de rows by any of the columns.

```{r}
rolling_stone[order(rank_2003)] |>
  head()
```

::: {.alert .alert-success}
**Filter some rows!**

1.  Is your favourite band/artist listed in the data (column `sort_name` or `clean_name`)?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[clean_name == "ABBA"]
    ```

2.  Who are the "30 under 30" in 2020?
    That is, bands with an average age under 30 years (column `ave_age_at_top_500`) that are in the top 30 in 2020 (`rank_2020`).

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[ave_age_at_top_500 < 30 & rank_2020 < 30]
    ```
:::

## Operating with column(s) in [j]{.columns}

Filtering and ordering is fun an all, but we need to operate on the columns.
Just for fun, what is the mean ranking in 2003?

```{r}
rolling_stone[, mean(rank_2003, na.rm = TRUE)]
```

::: {.alert .alert-info}
**Data.table is not type-stable**

Notice that the result of the last expression is a single numeric vector and *not* a data.table.
Operations in [j]{.columns} will return a data.table if the result of the expression is a list or a data.table, and other objects if the result is a single object of a different type.
:::

Now, the mean of the numbers 1 through 500 is `r mean(1:500)`, so this results tells us that something strange is going on.
Some ranking seems to be missing.
How many unique elements do we have?
The `uniqueN()` function is from data.table and is a more efficient version of doing `length(unique(rank_2003))`.

```{r}
rolling_stone[, uniqueN(rank_2003, na.rm = TRUE)]
```

Huh.
We seem to be missing two rankings?

::: {.alert .alert-success}
**Your turn**

1.  Are more popular albums on Spotify (column `spotify_popularity`) higher in the 2003 ranking (column `rank_2003`)?
    Compute the correlation between the two columns (hint: there are missing values, so you will need to use `use = "complete.obs"`).

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[, cor(spotify_popularity, rank_2003, use = "complete.obs")]
    ```

2.  Which rankings are missing in the database?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[, which(!(1:500 %in% rank_2003))]
    ```
:::

It's probably more interesting to calculate the mean rank by genre.
Let's first compute the mean of just one genre.

```{r}
rolling_stone[genre == "Electronic", mean(rank_2003, na.rm = TRUE)]
```

::: {.alert .alert-info}
data.table is very efficient at subsetting rows and doing calculations over columns.
For this expression:

``` r
rolling_stone[genre == "Electronic", mean(rank_2003, na.rm = TRUE)]
```

data.table:

1.  finds the *row indices* that match `genre == "Electronic"`. It does not subset over the entire table.
2.  as [j]{.columns} uses only one column to compute the mean, data.table subsets only that column and computes the mean.

data.table can see all three components of the query (i, j and by) and optimise it altogether before evaluation, not each separately.
:::

There are a lot of genres, so it wouldn't be too convenient to write one line for each one.
It's much better to group rows by genre using [`by`]{.by}:

```{r}
rolling_stone[, mean(rank_2003, na.rm = TRUE), by = genre]
```

When doing stuff with [by]{.by}, the result is always a data.table with the columns used to define the groups and columns to store the results.
If we don't give it any names, data.table assigns the defaults `V1`, `V2`, etc.
This is fine for quick one-liners, but for regular code it's much better to name the results.
We do this by returning a named list:

```{r}
rolling_stone[, list(mean_rank = mean(rank_2003, na.rm = TRUE)), by = genre]
```

This idiom is so common that data.table allows us to use `.` as an alias for `list`, so most of the time you'd see something like this:

```{r}
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE)), by = genre]
```

As long as [j]{.columns}-expression returns a list, each element of the list will be converted to a column in the resulting data.table.
This allows us to return more than one summary expression.
What is the mean rank and number of bands included in the rank per genre?

```{r}
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE),
                  N = sum(!is.na(rank_2003))),
              by = genre]
```

Some of the genres are relatively highly rated, but have very few examples included.

::: {.alert .alert-info}
You can select only some columns using the same syntax but without applying any summary function.
To return a data.table with just the columns `sort_name` and `rank_2003` you'd do:

``` r
rolling_stone[, .(sort_name, rank_2003)]
```
:::

You might've noticed that "Blues/Blues Rock" also appears as "Blues/Blues ROck".
Someone made a typo!
We need to modify the genre column to fix that mistake.
One way would be to turn all genres to all lower case and forget about cases altogether.

```{r, results='hide'}
rolling_stone[, genre := tolower(genre)]
```

This operator `:=` is called "walrus".
It can be used to update existing columns (like we just did) or to create new ones.
It is possible also to delete columns with `DT[, variable := NULL]`.

Notice that we didn't get any output and we didn't assigned the result.
But `rolling_stone` is modified anyway:

```{r}
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE)), by = genre]
```

The `:=` operator modifies columns by reference, so the `rolling_stone` variable is modified in place instead of making a copy.
This makes it very efficient, but potentially surprising!

::: {.alert .alert-success}
**Your turn**

1.  How may bands in the Latin genre (column `genre`) appeared in the raking of 2020?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[genre == "latin", sum(!is.na(rank_2020))]
    ```

2.  How many male, female or mixed-gender (column `artist_gender`) bands were included in the 2020 ranking (column `rank_2020`)?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[, sum(!is.na(rank_2020)), by = artist_gender]
    ```

3.  Add a column with the average of the birth year of each band (columns `artist_birth_year_sum` and `artist_member_count` are relevant)?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[, artist_birth_year_mean := artist_birth_year_sum/artist_member_count]
    ```

4.  Add a column with the average ranking of each album (columns `rank_2003`, `rank_2012` and `rank_2020` and `album_id`).

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[, mean_raking := mean(c(rank_2003, rank_2012, rank_2020), na.rm = TRUE), by = album_id]
    ```
:::

## Putting it all together

Up to now we made one operation at a time but it would be useful to chain operations (like you would do with dplyr for example).
Let's say we want to order genres by their mean ranking in 2003.

We can chain operations with data.table by continuing to use `[`:

```{r}
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE)), by = genre][order(mean_rank)]
```

Or like this for better readability:

``` r
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE)), by = genre] |>
    _[order(mean_rank)]
```

This works because since R 4.3.0, `_` is a placeholder for the left-hand side of the pipe.

There are [other possibilities](https://rdatatable-community.github.io/The-Raft/posts/2024-01-28-piping_data_tables-elio_campitelli/).
You could use the old and wise magrittr pipe `%>%` along with `.` as a placeholder.

``` r
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE)), by = genre] %>%
    .[order(mean_rank)]
```

::: {.alert .alert-info}
It is very important to distinguish between the `.` as alias for `list()` an the `.` as a placeholder within the magrittr pipe.
:::

Let's get serious about analysing these data.
It would be interesting to add the proportion as well as the number of albums in each genre:

```{r}
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE),
                  N = sum(!is.na(rank_2003))), by = genre] |>
  _[, prop := N/sum(N)] |>
  _[order(-prop)]
```

A full 25% of bands (125) in the 2003 ranking have no assigned genre.
Having a empty label is not ideal, let's modify the variable.

```{r}
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE),
                  N = sum(!is.na(rank_2003))), by = genre] |>
  _[, prop := N/sum(N)] |>
  _[, genre := ifelse(genre == "", "unknown genre", genre)] |>
  _[order(-prop)]
```

When doing lots of column modifications like in the previous lines, it's more convenient (and sometimes more efficient) to do all the modifications in a single step.
For that we can use the `let()` function, which is like a functional form of the walrus operator that can assign multiple columns at once:

```{r}
rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE),
                  N = sum(!is.na(rank_2003))), by = genre] |>
  _[, let(prop = N/sum(N),
          genre = fifelse(genre == "", "unknown genre", genre))] |>
  _[order(-prop)]
```

::: {.alert .alert-info}
You will probably find examples where a quoted walrus`":="` is used instead of the function `let()`.
This is a functional form of the walrus operator that allows you to operate over more that one column at the same time.
:::

Looking at numbers in a table is not always enough.
Let's plot the results.

```{r}
library(ggplot2) #Or any another package to make plots

rolling_stone[, .(mean_rank = mean(rank_2003, na.rm = TRUE),
                  N = sum(!is.na(rank_2003))), by = genre] |>
  _[, let(prop = N/sum(N),
          genre = ifelse(genre == "", "unknown genre", genre))] |>
  _[order(-prop)] |>
  _[, genre := forcats::fct_reorder(genre, prop)] |>
  ggplot(aes(prop, genre)) +
  geom_col()
```

That's right, you can pipe in a plot after doing the calculations.

::: {.alert .alert-success}
**Your turn**

1.  Are bands (artists with `artist_member_count` greater than 1) more successful than solo artists?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}

    rolling_stone[, is_band := artist_member_count > 1] |>
      _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
            mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
            mean_rank_2020 = mean(rank_2020, na.rm = TRUE)),
        by = is_band]

    # Notice that due to reference semantics, this operation adds the
    # is_band column to the data.table. You can avoid this by using
    # an expression in the by argument.
    rolling_stone |>
      _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
            mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
            mean_rank_2020 = mean(rank_2020, na.rm = TRUE)),
        by = .(is_band = artist_member_count > 1)]

    ```

2.  What is the proportion of albums recorded in a Studio and their mean position in 2020?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[, .(mean_rank_2020 = mean(rank_2020, na.rm = TRUE),
                      N = sum(!is.na(rank_2020))), by = type] |>
      _[, prop := N/sum(N)]

    ```

3.  What is the mean number of years between an artist debut album and the release of their first top 500 album (see the column `years_between`) for each genre?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone[, .(mean_years = mean(years_between, na.rm = TRUE), .N),
                    by = genre] |>
      _[order(mean_years)]
    ```
:::

## Special symbols

data.table has special symbols that provide extra functionality.
We are going to only mention `.N` , which counts the number of rows in the group and it is particularly useful when used along with `by`.How many albums per year were included in the 2003 ranking?
To answer this question we can first filter only albums that were included in the 2003 ranking and then count the number of rows per `release_year`?

```{r}
rolling_stone[!is.na(rank_2003)] |>
  _[, .N, by = release_year]
```

When using the `.N` symbol, the name of the result is automatically named "N".
Let's plot that:

```{r}
rolling_stone[!is.na(rank_2003)] |>
  _[, .N, by = release_year] |>
  ggplot(aes(release_year, N)) +
  geom_col()
```

Interestingly, the majority of the albums in the raking of 2003 where released between the mid 60s and late 70s.
Was music better at that time?
Or maybe this is a reflection of the people compiling the ranking.

## Reshaping data

Do all rankings have the same album make-up in terms of year of release?
We can count the number of non-NA values for each ranking by release year:

```{r}
rolling_stone[, .(N_2003 = sum(!is.na(rank_2003)),
                  N_2012 = sum(!is.na(rank_2012)),
                  N_2020 = sum(!is.na(rank_2020))), by = release_year] |>
  head()
```

This is very tedious, error-prone and makes plotting the result more difficult.
Every time you are applying the same operation to multiple columns it is very likely that you'd be better off reshaping your data.
In this case, we would like to have a single column with the ranking position and another column that identifies which year's ranking it is.
So, lets *melt* the table.

```{r}
rolling_stone |>
  melt(id.vars = c("release_year", "album_id"),
       measure.vars = c("rank_2003", "rank_2012", "rank_2020"),
       variable.name = "rank_year",
       value.name = "rank") |>
  head()
```

It is important to identify the key or id variables (`id.vars` argument) associated to the observation, the albums.
In this case we have `release_year` and `album_id`.
The measure variables (`measure.vars` argument) are the variables we want to reshape.
By default, the new variables are `variable` and `value` but we changed those names using the `variable.name` and `value.name` arguments.

::: {.alert .alert-info}
When you have a lot of columns that go into `measure.vars` a nd that are named consistently, you can use pattern-based column marching using regular expressions and the `patterns()` function like here.

``` r
rolling_stone |>
  melt(id.vars = c("release_year", "album_id"),
       measure.vars = patterns("^rank"),
       variable.name = "rank_year",
       value.name = "rank")
```
:::

We can compute the album proportion again.
But now we need to group by two variables: the year of release and the raking year.
The `by` argument can also accepts a list so we'll use the `.()` alias.

```{r}
rolling_stone |>
  melt(id.vars = c("release_year", "album_id"),
       measure.vars = c("rank_2003", "rank_2012", "rank_2020"),
       variable.name = "rank_year",
       value.name = "rank") |>
  _[!is.na(rank)] |>
  _[, .N, by = .(release_year, rank_year)] |>
  head()
```

Let's plot the results again.

```{r}
rolling_stone |>
  melt(id.vars = c("release_year", "album_id"),
       measure.vars = c("rank_2003", "rank_2012", "rank_2020"),
       variable.name = "rank_year",
       value.name = "rank") |>
  _[!is.na(rank)] |>
  _[, .N, by = .(release_year, rank_year)] |>
  ggplot(aes(release_year, N)) +
  geom_line(aes(color = rank_year))
```

It looks like the albums released around 1970 were and still are the most popular.

::: {.alert .alert-success}
**Your turn again!**

1.  What is the mean rank on each ranking year for gender?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone |>
      melt(id.vars = c("release_year", "album_id", "artist_gender"),
           measure.vars = c("rank_2003", "rank_2012", "rank_2020"),
           variable.name = "rank_year",
           value.name = "rank") |>
      _[, .(mean_rank = mean(rank, na.rm = TRUE)),
        by = .(rank_year, artist_gender)]
    ```
:::

The list has different type of albums, some are Studio albums, some are compilations, etc.

```{r}
rolling_stone[, unique(type)]
```

Has the distribution of types of album changed over the decades?
Let's focus on the proportion of Compilation albums compared with the sum of Compilation and Studio albums and the 2003 ranking.
First, we can compute the number of albums included in the ranking by type and by decade.

```{r}
(n_type_decade <- rolling_stone |>
  _[, .(N = sum(!is.na(rank_2003))),
    by = .(decade = floor(release_year / 10) * 10, type)])
```

To compute the number of Studio albums divided by the sum of Studio albums and compilations we need each type of album as its own column.
We need to reshape our data again!
But this time instead of making it longer, we need to make it wider.
We can use `dcast()` for this

```{r}
n_type_decade |>
  dcast(decade ~ type, value.var = "N")
```

There are a lot of `NA`s for the combinations that didn't appear in our long data.
We know that in this case, these mean zero albums for that category for that decade.
We can replace all `NA`s with zeroes with the `setnafill()` function (as the "set" prefix implies, this function modifies by reference).

```{r}
n_type_decade |>
  dcast(decade ~ type, value.var = "N") |>
  setnafill(fill = 0) |>
  head()
```

With that, we can now compute and plot the value we want

```{r}
n_type_decade |>
  dcast(decade ~ type, value.var = "N") |>
  setnafill(fill = 0) |>
  _[, ratio := Compilation/(Studio + Compilation)] |>
  ggplot(aes(decade, ratio)) +
  geom_line()
```

Compilation albums make up less than 10% of the picks, except in the 90s.

::: {.alert .alert-success}
**Your turn!**

1.  What is the proportion of Male to Female artists for each decade in the 2003 ranking?

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone |>
      _[, .(N = sum(!is.na(rank_2003))),
        by = .(decade = floor(release_year / 10) * 10, artist_gender)] |>
      dcast(decade ~ artist_gender, value.var = "N") |>
      setnafill(fill = 0) |>
      _[, ratio := Male / (Male + Female)] |>
      ggplot(aes(decade, ratio)) +
      geom_line()
    ```

2.  Has this changed between the different ranking years?
    (You need to first melt and then dcast)

    ```{webr-r}
    #| autorun: false
    # Write your code here!
    rolling_stone
    ```

    ```{r, solution=TRUE}
    rolling_stone |>
      melt(id.vars = c("release_year", "album_id", "artist_gender"),
           measure.vars = c("rank_2003", "rank_2012", "rank_2020"),
           variable.name = "rank_year",
           value.name = "rank") |>
      _[, .(N = sum(!is.na(rank))),
        by = .(decade = floor(release_year / 10) * 10, artist_gender, rank_year)] |>
      dcast(decade + rank_year ~ artist_gender, value.var = "N") |>
      _[, ratio := Male / (Male + Female)] |>
      ggplot(aes(decade, ratio)) +
      geom_line(aes(color = rank_year))

    ```
:::