episodes/how-r-thinks-about-data.Rmd
2 additions & 3 deletions
@@ -82,7 +82,7 @@ If you are ever unsure, it never hurts to explicitly name an argument.
82
82
83
83
To learn more about a function, you can type a `?` in front of the name of the function, which will bring up the official documentation for that function:
84
84
85
-
```{r}
85
+
```{r, head-help}
86
86
?head
87
87
```
88
88
@@ -604,8 +604,7 @@ You will be naming a of objects in R, and there are a few common naming rules an
604
604
- avoid dots `.` in names, as they have a special meaning in R, and may be confusing to others
605
605
- two common formats are `snake_case` and `camelCase`
606
606
- be consistent, at least within a script, ideally within a whole project
607
-
- you can use a style guide like [Google's](https://google.github.io/styleguide/Rguide.xml) or
608
-
[tidyverse's](https://style.tidyverse.org/)
607
+
- you can use a style guide like [Google's](https://google.github.io/styleguide/Rguide.xml) or [tidyverse's](https://style.tidyverse.org/)
<!-- - talk about file paths and tab completion -->
12
-
13
10
<!-- - should we teach the `here()` package? -->
14
-
15
11
<!-- - base vs. tidyverse -->
16
-
17
12
<!-- - pipes -->
18
-
19
13
<!-- - select -->
20
-
21
14
<!-- - filter -->
22
-
23
15
<!-- - idea of conditional subsetting -->
24
-
25
16
<!-- - ==, >, !, |, & -->
26
-
27
17
<!-- - show %in% -->
28
-
29
18
<!-- - mutate -->
30
-
31
19
<!-- - making a date column -->
32
-
33
20
<!-- - group_by -->
34
-
35
21
<!-- - summarize -->
36
-
37
22
<!-- - mutate -->
38
-
39
23
<!-- - ungroup -->
40
-
41
24
<!-- - pivot_wider -->
42
-
43
25
<!-- - exporting data -->
44
-
45
26
<!-- Challenge ideas: -->
46
-
47
27
<!-- - plotting a time series using the date -->
48
-
49
28
<!-- - predicting # of rows before pivoting -->
50
-
51
29
<!-- - important to plan out reshaping steps in advance -->
52
-
53
30
<!-- - filter operations -->
54
-
55
31
<!-- - weight between two values -->
56
-
57
32
<!-- - maybe throw in an | example -->
58
-
59
33
<!-- - combination filter and select where doing the order wrong yields an error -->
60
-
61
34
<!-- - only has columns record_id, species_id, sex, and hindfoot_length, but weight has to be NA -->
62
-
63
35
<!-- - a couple of simple group_by operations -->
64
-
65
36
<!-- - how many combinations of plot id and genus are there -->
66
-
67
37
<!-- - could show that distinct also works here -->
68
-
69
38
<!-- - what will happen if you group by weight? -->
70
-
71
39
<!-- - an operation that requires group_by and mutate -->
72
-
73
40
<!-- - an operation that requires multiple group_by steps -->
74
41
75
42
:::::::::::::::::::::::::::::::::::::: questions
@@ -160,10 +127,8 @@ class(surveys)
160
127
Whoa!
161
128
What is this thing?
162
129
It has multiple classes?
163
-
Well, it's called a `tibble`, and it is the `tidyverse` version of a data.
164
-
frame.
165
-
It *is* a data.
166
-
frame, but with some added perks.
130
+
Well, it's called a `tibble`, and it is the `tidyverse` version of a data.frame.
131
+
It *is* a data.frame, but with some added perks.
167
132
It prints out a little more nicely, it highlights `NA` values and negative values in red, and it will generally communicate with you more (in terms of warnings and errors, which is a good thing).
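To see the "multiple classes" idea on something small, here is a minimal sketch using a made-up two-row table rather than `surveys` (the object name `toy` and its values are assumptions for illustration only):

```r
library(tibble)

# A tiny made-up table standing in for `surveys` (assumed values).
toy <- tibble(species_id = c("DM", "PF"), weight = c(40, NA))

# A tibble carries several classes, and "data.frame" is one of them.
class(toy)

# So functions that expect a data.frame will still accept it.
is.data.frame(toy)
```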
@@ -192,8 +157,7 @@ Finally, the `tidyverse` has only continued to grow, and has strong support from
192
157
One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data.
193
158
The `dplyr` and `tidyr` packages in the `tidyverse` provide a series of powerful functions for many common data manipulation tasks.
194
159
195
-
We'll start off with two of the most commonly used `dplyr` functions: `select()`, which selects certain columns of a data.
196
-
frame, and `filter()`, which filters out rows according to certain criteria.
160
+
We'll start off with two of the most commonly used `dplyr` functions: `select()`, which selects certain columns of a data.frame, and `filter()`, which filters out rows according to certain criteria.
@@ -204,8 +168,7 @@ Between `select()` and `filter()`, it can be hard to remember which operates on
204
168
205
169
#### `select()`
206
170
207
-
To use the `select()` function, the first argument is the name of the data.
208
-
frame, and the rest of the arguments are *unquoted* names of the columns you want:
171
+
To use the `select()` function, the first argument is the name of the data.frame, and the rest of the arguments are *unquoted* names of the columns you want:
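As a quick sketch of that argument order, here is `select()` on a small made-up table (the object `toy` and its values are hypothetical, only the column names echo `surveys`):

```r
library(dplyr)

# Hypothetical stand-in for `surveys` with just a few columns.
toy <- tibble(
  record_id = 1:3,
  month     = c(7, 7, 8),
  day       = c(16, 16, 19),
  weight    = c(40, NA, 48)
)

# First argument: the data.frame. Remaining arguments: unquoted column names.
toy_cols <- select(toy, record_id, weight)
toy_cols
```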
First, `select()` will operate on the `surveys` data.
313
-
frame, removing the column `day`.
314
-
The resulting data.
315
-
frame is then used as the first argument for `filter()`, which selects rows with a month greater than or equal to 7.
274
+
First, `select()` will operate on the `surveys` data.frame, removing the column `day`.
275
+
The resulting data.frame is then used as the first argument for `filter()`, which selects rows with a month greater than or equal to 7.
316
276
317
277
Nested functions can be very difficult to read with only a few functions, and nearly impossible when many functions are done at once.
318
278
An alternative approach is to create **intermediate** objects:
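To compare the two styles side by side, here is a small sketch on a made-up table (the names `toy`, `toy_noday`, `nested`, and `stepwise` are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for `surveys`.
toy <- tibble(record_id = 1:4, month = c(6, 7, 8, 9), day = c(1, 2, 3, 4))

# Nested: read from the inside out (select() runs first, then filter()).
nested <- filter(select(toy, -day), month >= 7)

# Intermediate objects: the same two steps, one per line.
toy_noday <- select(toy, -day)
stepwise  <- filter(toy_noday, month >= 7)

# Check whether both approaches produce the same table.
identical(nested, stepwise)
```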
@@ -342,10 +302,13 @@ It then gets sent into the `filter()` function, where it is further modified, an
342
302
It can also be helpful to think of `%>%` as meaning "and then".
343
303
Since many `tidyverse` functions have verbs for names, a pipeline can be read like a sentence.
344
304
345
-
:::::::::::::::::::::::::::::::::::::::::::: instructor It's worth showing the learners that you can run a **pipeline** without highlighting the whole thing.
It's worth showing the learners that you can run a **pipeline** without highlighting the whole thing.
346
308
If your cursor is on any line of a pipeline, running that line will run the whole thing.
347
309
348
310
You can also show that by highlighting a section of a pipeline, you can run only the first X steps of it.
311
+
349
312
::::::::::::::::::::::::::::::::::::::::::::
350
313
351
314
If we want to store this final product as an object, we use an assignment arrow at the start:
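A minimal sketch of that pattern on a made-up table (the names `toy` and `toy_sub` are hypothetical, and the values are invented):

```r
library(dplyr)

# Hypothetical stand-in for `surveys`.
toy <- tibble(record_id = 1:4, month = c(6, 7, 8, 9), day = c(1, 2, 3, 4))

# Read as: take toy, and then drop `day`, and then keep months 7 and later.
# The assignment arrow at the start stores the end result of the pipeline.
toy_sub <- toy %>%
  select(-day) %>%
  filter(month >= 7)

toy_sub
```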
@@ -365,8 +328,7 @@ This approach is very interactive, allowing you to see the results of each step
365
328
366
329
## Challenge 2: Using pipes
367
330
368
-
Use the surveys data to make a data.
369
-
frame that has the columns `record_id`, `month`, and `species_id`, with data from the year 1988.
331
+
Use the surveys data to make a data.frame that has the columns `record_id`, `month`, and `species_id`, with data from the year 1988.
370
332
Use a pipe between the function calls.
371
333
372
334
:::::::::::::::::::::::: solution
@@ -492,10 +454,8 @@ This isn't necessarily the most useful plot, but we will learn some techniques t
492
454
Many data analysis tasks can be achieved using the split-apply-combine approach: you split the data into groups, apply some analysis to each group, and combine the results in some way.
493
455
`dplyr` has a few convenient functions to enable this approach, the main two being `group_by()` and `summarize()`.
494
456
495
-
`group_by()` takes a data.
496
-
frame and the name of one or more columns with categorical values that define the groups.
497
-
`summarize()` then collapses each group into a one-row summary of the group, giving you back a data.
498
-
frame with one row per group.
457
+
`group_by()` takes a data.frame and the name of one or more columns with categorical values that define the groups.
458
+
`summarize()` then collapses each group into a one-row summary of the group, giving you back a data.frame with one row per group.
499
459
The syntax for `summarize()` is similar to `mutate()`, where you define new columns based on values of other columns.
500
460
Let's try calculating the mean weight of all our animals by sex.
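Before running this on `surveys`, it may help to see the split-apply-combine idea on a table small enough to check by hand. This is only a sketch with invented values (`toy` and `toy_means` are hypothetical names):

```r
library(dplyr)

# Hypothetical four-row table: two females, two males, one missing weight.
toy <- tibble(
  sex    = c("F", "F", "M", "M"),
  weight = c(40, 44, 50, NA)
)

# Split by sex, apply mean() to each group, combine into one row per group.
toy_means <- toy %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))

toy_means
```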
501
461
@@ -528,16 +488,14 @@ surveys %>%
528
488
n = n())
529
489
```
530
490
531
-
Our resulting data.
532
-
frame is much larger, since we have a greater number of groups.
491
+
Our resulting data.frame is much larger, since we have a greater number of groups.
533
492
We also see a strange value showing up in our `mean_weight` column: `NaN`.
534
493
This stands for "Not a Number", and it often results from trying to do an operation on a vector with zero entries.
535
494
How can a vector have zero entries?
536
495
Well, if a particular group (like the AB species ID + `NA` sex group) has **only** `NA` values for weight, then the `na.rm = T` argument in `mean()` will remove **all** the values prior to calculating the mean.
537
496
The result will be a value of `NaN`.
538
497
Since we are not particularly interested in these values, let's add a step to our pipeline to remove rows where weight is `NA` **before** doing any other steps.
539
-
This means that any groups with only `NA` values will disappear from our data.
540
-
frame before we formally create the groups with `group_by()`.
498
+
This means that any groups with only `NA` values will disappear from our data.frame before we formally create the groups with `group_by()`.
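The `NaN` behavior described above can be reproduced in base R without any grouping at all. A minimal sketch, where `all_missing` is a made-up vector standing in for the weights of one such group:

```r
# Hypothetical group whose weights are all missing.
all_missing <- c(NA_real_, NA_real_)

# na.rm = TRUE drops every value, so mean() is computed on a zero-length vector.
mean(all_missing, na.rm = TRUE)  # NaN

# The same thing happens with an explicitly empty numeric vector.
mean(numeric(0))                 # NaN
```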
541
499
542
500
```{r filter-group-by}
543
501
surveys %>%
@@ -571,32 +529,26 @@ surveys %>%
571
529
arrange(desc(mean_weight))
572
530
```
573
531
574
-
You may have seen several messages saying `summarise() has grouped output by 'species_id'. You can override using the .groups argument.` These are warning you that your resulting data.
575
-
frame has retained some group structure, which means any subsequent operations on that data.
576
-
frame will happen at the group level.
577
-
If you look at the resulting data.
578
-
frame printed out in your console, you will see these lines:
532
+
You may have seen several messages saying `summarise() has grouped output by 'species_id'. You can override using the .groups argument.`
533
+
These are warning you that your resulting data.frame has retained some group structure, which means any subsequent operations on that data.frame will happen at the group level.
534
+
If you look at the resulting data.frame printed out in your console, you will see these lines:
579
535
580
536
```
581
537
# A tibble: 46 × 4
582
538
# Groups: species_id [18]
583
539
```
584
540
585
-
They tell us we have a data.
586
-
frame with 46 rows, 4 columns, and a group variable `species_id`, for which there are 18 groups.
541
+
They tell us we have a data.frame with 46 rows, 4 columns, and a group variable `species_id`, for which there are 18 groups.
587
542
We will see something similar if we use `group_by()` alone:
588
543
589
544
```{r group-by-alone}
590
545
surveys %>%
591
546
group_by(species_id, sex)
592
547
```
593
548
594
-
What we get back is the entire `surveys` data.
595
-
frame, but with the grouping variables added: 67 groups of `species_id` + `sex` combinations.
596
-
Groups are often maintained throughout a pipeline, and if you assign the resulting data.
597
-
frame to a new object, it will also have those groups.
598
-
This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.
599
-
frame, not by group.
549
+
What we get back is the entire `surveys` data.frame, but with the grouping variables added: 67 groups of `species_id` + `sex` combinations.
550
+
Groups are often maintained throughout a pipeline, and if you assign the resulting data.frame to a new object, it will also have those groups.
551
+
This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.frame, not by group.
600
552
Therefore, it is a good habit to remove the groups at the end of a pipeline containing `group_by()`:
601
553
602
554
```{r ungroup}
@@ -609,11 +561,9 @@ surveys %>%
609
561
ungroup()
610
562
```
611
563
612
-
Now our data.
613
-
frame just says `# A tibble: 46 × 4` at the top, with no groups.
564
+
Now our data.frame just says `# A tibble: 46 × 4` at the top, with no groups.
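If you would rather check for leftover groups in code than by reading the printed header, `dplyr::group_vars()` reports the grouping columns. A small sketch on invented data (`toy` and `grouped` are hypothetical names, and `.groups = "drop_last"` is written out here only to spell out the default behavior and silence the message):

```r
library(dplyr)

# Hypothetical table with two grouping columns.
toy <- tibble(
  species_id = c("DM", "DM", "PF", "PF"),
  sex        = c("F", "M", "F", "M"),
  weight     = c(40, 50, 8, 7)
)

grouped <- toy %>%
  group_by(species_id, sex) %>%
  summarize(mean_weight = mean(weight), .groups = "drop_last")

group_vars(grouped)           # "species_id": one level of grouping remains
group_vars(ungroup(grouped))  # character(0): no groups left
```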
614
565
615
-
While it is common that you will want to get the one-row-per-group summary that `summarise()` provides, there are times where you want to calculate a per-group value but keep all the rows in your data.
616
-
frame.
566
+
While it is common that you will want to get the one-row-per-group summary that `summarise()` provides, there are times where you want to calculate a per-group value but keep all the rows in your data.frame.
617
567
For example, we might want to know the mean weight for each species ID + sex combination, and then we might want to know how far from that mean value each observation in the group is.
618
568
For this, we can use `group_by()` and `mutate()` together:
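As a sketch of the idea on a table small enough to verify by hand (the object names and values are invented for illustration):

```r
library(dplyr)

# Hypothetical four-row table.
toy <- tibble(
  sex    = c("F", "F", "M", "M"),
  weight = c(40, 44, 50, 54)
)

# mutate() after group_by() computes the mean per group,
# but keeps every original row instead of collapsing to one row per group.
toy_diff <- toy %>%
  group_by(sex) %>%
  mutate(mean_weight = mean(weight, na.rm = TRUE),
         weight_diff = weight - mean_weight) %>%
  ungroup()

toy_diff
```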
619
569
@@ -695,14 +645,12 @@ sp_by_plot
695
645
```
696
646
697
647
That looks great, but it is a bit difficult to compare values across plots.
698
-
It would be nice if we could reshape this data.
699
-
frame to make those comparisons easier.
648
+
It would be nice if we could reshape this data.frame to make those comparisons easier.
700
649
Well, the `tidyr` package from the `tidyverse` has a pair of functions that allow you to reshape data by pivoting it: `pivot_wider()` and `pivot_longer()`.
701
650
`pivot_wider()` will make the data wider, which means increasing the number of columns and reducing the number of rows.
702
651
`pivot_longer()` will do the opposite, reducing the number of columns and increasing the number of rows.
703
652
704
-
In this case, it might be nice to create a data.
705
-
frame where each species has its own row, and each plot has its own column containing the mean weight for a given species.
653
+
In this case, it might be nice to create a data.frame where each species has its own row, and each plot has its own column containing the mean weight for a given species.
706
654
We will use `pivot_wider()` to reshape our data in this way.
707
655
It takes 3 arguments:
708
656
@@ -715,8 +663,7 @@ Any columns not used for `names_from` or `values_from` will not be pivoted.
715
663
{alt='Diagram depicting the behavior of `pivot_wider()` on a small tabular dataset.'}
716
664
717
665
In our case, we want the new columns to be named from our `plot_id` column, with the values coming from the `mean_weight` column.
718
-
We can pipe our data.
719
-
frame right into `pivot_wider()` and add those two arguments:
666
+
We can pipe our data.frame right into `pivot_wider()` and add those two arguments:
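Here is a minimal sketch of the same call on a three-row made-up table, so the reshaping is easy to follow (`toy_long` and `toy_wide` are hypothetical names, and the species `BA` is deliberately missing from plot `1`):

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: one row per species + plot combination.
toy_long <- tibble(
  species_id  = c("BA", "DM", "DM"),
  plot_id     = c(2, 1, 2),
  mean_weight = c(8, 43, 45)
)

# New column names come from plot_id, cell values from mean_weight.
toy_wide <- toy_long %>%
  pivot_wider(names_from = plot_id, values_from = mean_weight)

# One row per species; the missing BA + plot 1 combination becomes NA.
toy_wide
```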
First, we have a new column for each `plot_id` value.
733
-
There is one old column left in the data.
734
-
frame: `species_id`.
679
+
There is one old column left in the data.frame: `species_id`.
735
680
It wasn't used in `pivot_wider()`, so it stays, and now contains a single entry for each unique `species_id` value.
736
681
737
682
Finally, a lot of `NA`s have appeared.
738
-
Some species aren't found in every plot, but because a data.
739
-
frame has to have a value in every row and every column, an `NA` is inserted.
683
+
Some species aren't found in every plot, but because a data.frame has to have a value in every row and every column, an `NA` is inserted.
740
684
We can double-check this to verify what is going on.
741
685
742
-
Looking in our new pivoted data.
743
-
frame, we can see that there is an `NA` value for the species `BA` in plot `1`.
744
-
Let's take our `sp_by_plot` data.
745
-
frame and look for the `mean_weight` of that species + plot combination.
686
+
Looking in our new pivoted data.frame, we can see that there is an `NA` value for the species `BA` in plot `1`.
687
+
Let's take our `sp_by_plot` data.frame and look for the `mean_weight` of that species + plot combination.
746
688
747
689
```{r pivot-wider-check}
748
690
sp_by_plot %>%
@@ -752,16 +694,14 @@ sp_by_plot %>%
752
694
We get back 0 rows.
753
695
There is no `mean_weight` for the species `BA` in plot `1`.
754
696
This either happened because no `BA` were ever caught in plot `1`, or because every `BA` caught in plot `1` had an `NA` weight value and all the rows got removed when we used `filter(!is.na(weight))` in the process of making `sp_by_plot`.
755
-
Because there are no rows with that species + plot combination, in our pivoted data.
756
-
frame, the value gets filled with `NA`.
697
+
Because there are no rows with that species + plot combination, in our pivoted data.frame, the value gets filled with `NA`.
757
698
758
699
There is another `pivot_` function that does the opposite, moving data from a wide to long format, called `pivot_longer()`.
759
700
It takes 3 arguments: `cols` for the columns you want to pivot, `names_to` for the name of the new column which will contain the old column names, and `values_to` for the name of the new column which will contain the old values.
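A small sketch of those three arguments on an invented wide table (`toy_wide` and `toy_long` are hypothetical names):

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per species, one column per plot.
toy_wide <- tibble(
  species_id = c("BA", "DM"),
  `1`        = c(NA, 43),
  `2`        = c(8, 45)
)

# Pivot every column except species_id:
# old column names go into plot_id, old cell values into mean_weight.
toy_long <- toy_wide %>%
  pivot_longer(cols = -species_id,
               names_to = "plot_id",
               values_to = "mean_weight")

toy_long
```

Note that the new `plot_id` column holds the old column names, so it comes back as text rather than numbers.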
760
701
761
702
{alt='Diagram depicting the behavior of `pivot_longer()` on a small tabular dataset.'}
762
703
763
-
We can pivot our new wide data.
764
-
frame to a long format using `pivot_longer()`.
704
+
We can pivot our new wide data.frame to a long format using `pivot_longer()`.
765
705
We want to pivot all the columns except `species_id`, and we will use `PLOT` for the new column of plot IDs, and `MEAN_WT` for the new column of mean weight values.
766
706
767
707
```{r pivot-longer}
@@ -782,8 +722,7 @@ Data are often recorded in spreadsheets in a wider format, but lots of `tidyvers
782
722
783
723
## Exporting data
784
724
785
-
Let's say we want to send the wide version of our `sb_by_plot` data.
786
-
frame to a colleague who doesn't use R.
725
+
Let's say we want to send the wide version of our `sp_by_plot` data.frame to a colleague who doesn't use R.
787
726
In this case, we might want to save it as a CSV file.
788
727
789
728
First, we might want to modify the names of the columns, since right now they are bare numbers, which aren't very informative.
@@ -807,10 +746,8 @@ surveys_sp <- sp_by_plot %>%
807
746
surveys_sp
808
747
```
809
748
810
-
Now we can save this data.
811
-
frame to a CSV using the `write_csv()` function from the `readr` package.
812
-
The first argument is the name of the data.
813
-
frame, and the second is the path to the new file we want to create, including the file extension `.csv`.
749
+
Now we can save this data.frame to a CSV using the `write_csv()` function from the `readr` package.
750
+
The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension `.csv`.
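As a sketch of that argument order, the example below writes a made-up table to a temporary location so it does not touch your project folders (`toy` and `out_path` are hypothetical names; in the lesson you would use your own data.frame and a path inside your project):

```r
library(readr)
library(tibble)

# Hypothetical table to export.
toy <- tibble(species_id = c("BA", "DM"), plot_1 = c(NA, 43))

# Build a path ending in .csv inside R's temporary directory.
out_path <- file.path(tempdir(), "toy_sp.csv")

# First argument: the data.frame. Second argument: the path to the new file.
write_csv(toy, out_path)

# Confirm the file was created.
file.exists(out_path)
```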