Commit 2fbc3ad

Tweak aeolus output manually
1 parent 2254bfe commit 2fbc3ad

2 files changed: +40 -104 lines changed

episodes/how-r-thinks-about-data.Rmd

Lines changed: 2 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -82,7 +82,7 @@ If you are ever unsure, it never hurts to explicitly name an argument.
8282

8383
To learn more about a function, you can type a `?` in front of the name of the function, which will bring up the official documentation for that function:
8484

85-
```{r}
85+
```{r, head-help}
8686
?head
8787
```
8888

@@ -604,8 +604,7 @@ You will be naming a lot of objects in R, and there are a few common naming rules an
604604
- avoid dots `.` in names, as they have a special meaning in R, and may be confusing to others
605605
- two common formats are `snake_case` and `camelCase`
606606
- be consistent, at least within a script, ideally within a whole project
607-
- you can use a style guide like [Google's](https://google.github.io/styleguide/Rguide.xml) or
608-
[tidyverse's](https://style.tidyverse.org/)
607+
- you can use a style guide like [Google's](https://google.github.io/styleguide/Rguide.xml) or [tidyverse's](https://style.tidyverse.org/)
609608

610609
::::::::::::::::::::::::::::::::::::: keypoints
611610

episodes/working-with-data.Rmd

Lines changed: 38 additions & 101 deletions
Original file line number | Diff line number | Diff line change
@@ -5,71 +5,38 @@ exercises: 4
55
---
66

77
<!-- - importing complete_old CSV -->
8-
98
<!-- - touch on column parsing -->
10-
119
<!-- - talk about file paths and tab completion -->
12-
1310
<!-- - should we teach the `here()` package? -->
14-
1511
<!-- - base vs. tidyverse -->
16-
1712
<!-- - pipes -->
18-
1913
<!-- - select -->
20-
2114
<!-- - filter -->
22-
2315
<!-- - idea of conditional subsetting -->
24-
2516
<!-- - ==, >, !, |, & -->
26-
2717
<!-- - show %in% -->
28-
2918
<!-- - mutate -->
30-
3119
<!-- - making a date column -->
32-
3320
<!-- - group_by -->
34-
3521
<!-- - summarize -->
36-
3722
<!-- - mutate -->
38-
3923
<!-- - ungroup -->
40-
4124
<!-- - pivot_wider -->
42-
4325
<!-- - exporting data -->
44-
4526
<!-- Challenge ideas: -->
46-
4727
<!-- - plotting a time series using the date -->
48-
4928
<!-- - predicting # of rows before pivoting -->
50-
5129
<!-- - important to plan out reshaping steps in advance -->
52-
5330
<!-- - filter operations -->
54-
5531
<!-- - weight between two values -->
56-
5732
<!-- - maybe throw in an | example -->
58-
5933
<!-- - combination filter and select where doing the order wrong yields an error -->
60-
6134
<!-- - only has columns record_id, species_id, sex, and hindfoot_length, but weight has to be NA -->
62-
6335
<!-- - a couple of simple group_by operations -->
64-
6536
<!-- - how many combinations of plot id and genus are there -->
66-
6737
<!-- - could show that distinct also works here -->
68-
6938
<!-- - what will happen if you group by weight? -->
70-
7139
<!-- - an operation that requires group_by and mutate -->
72-
7340
<!-- - an operation that requires multiple group_by steps -->
7441

7542
:::::::::::::::::::::::::::::::::::::: questions
@@ -160,10 +127,8 @@ class(surveys)
160127
Whoa!
161128
What is this thing?
162129
It has multiple classes?
163-
Well, it's called a `tibble`, and it is the `tidyverse` version of a data.
164-
frame.
165-
It *is* a data.
166-
frame, but with some added perks.
130+
Well, it's called a `tibble`, and it is the `tidyverse` version of a data.frame.
131+
It *is* a data.frame, but with some added perks.
167132
It prints out a little more nicely, it highlights `NA` values and negative values in red, and it will generally communicate with you more (in terms of warnings and errors, which is a good thing).
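As a small illustration of the point that a tibble is still a data.frame, the sketch below builds a toy tibble and checks its classes. The object `toy` and its columns are made up for this example and are not part of the lesson data, and the sketch assumes the `tibble` package (installed with the `tidyverse`) is available.

```r
library(tibble)

# `toy` is a hypothetical two-row table, not the lesson's `surveys` data
toy <- tibble(species_id = c("DM", "PF"), weight = c(40, NA))

# A tibble carries several classes; "data.frame" is one of them
toy_classes <- class(toy)
toy_classes

# Functions that expect a data.frame therefore accept a tibble
toy_is_df <- is.data.frame(toy)
toy_is_df
```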
168133

169134
:::::::::::::::::::::::::::::::::::::::::: callout
@@ -192,8 +157,7 @@ Finally, the `tidyverse` has only continued to grow, and has strong support from
192157
One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data.
193158
The `dplyr` and `tidyr` packages in the `tidyverse` provide a series of powerful functions for many common data manipulation tasks.
194159

195-
We'll start off with two of the most commonly used `dplyr` functions: `select()`, which selects certain columns of a data.
196-
frame, and `filter()`, which filters out rows according to certain criteria.
160+
We'll start off with two of the most commonly used `dplyr` functions: `select()`, which selects certain columns of a data.frame, and `filter()`, which filters out rows according to certain criteria.
197161

198162
:::::::::::::::::::::::::::::::::::::::::: callout
199163

@@ -204,8 +168,7 @@ Between `select()` and `filter()`, it can be hard to remember which operates on
204168

205169
#### `select()`
206170

207-
To use the `select()` function, the first argument is the name of the data.
208-
frame, and the rest of the arguments are *unquoted* names of the columns you want:
171+
To use the `select()` function, the first argument is the name of the data.frame, and the rest of the arguments are *unquoted* names of the columns you want:
209172

210173
```{r select}
211174
select(surveys, plot_id, species_id, hindfoot_length)
@@ -227,8 +190,7 @@ select(surveys, c(3:5, 10))
227190
```
228191

229192
You should be careful when using this method, since you are being less explicit about which columns you want.
230-
However, it can be useful if you have a data.
231-
frame with many columns and you don't want to type out too many names.
193+
However, it can be useful if you have a data.frame with many columns and you don't want to type out too many names.
232194

233195
Finally, you can select columns based on whether they match certain criteria by using the `where()` function.
234196
If we want all numeric columns, we can ask to `select` all the columns `where` the class `is numeric`:
@@ -309,10 +271,8 @@ filter(select(surveys, -day), month >= 7)
309271
```
310272

311273
R will evaluate statements from the inside out.
312-
First, `select()` will operate on the `surveys` data.
313-
frame, removing the column `day`.
314-
The resulting data.
315-
frame is then used as the first argument for `filter()`, which selects rows with a month greater than or equal to 7.
274+
First, `select()` will operate on the `surveys` data.frame, removing the column `day`.
275+
The resulting data.frame is then used as the first argument for `filter()`, which selects rows with a month greater than or equal to 7.
316276

317277
Nested functions can be very difficult to read with only a few functions, and nearly impossible when many functions are done at once.
318278
An alternative approach is to create **intermediate** objects:
@@ -342,10 +302,13 @@ It then gets sent into the `filter()` function, where it is further modified, an
342302
It can also be helpful to think of `%>%` as meaning "and then".
343303
Since many `tidyverse` functions have verbs for names, a pipeline can be read like a sentence.
344304

345-
:::::::::::::::::::::::::::::::::::::::::::: instructor It's worth showing the learners that you can run a **pipeline** without highlighting the whole thing.
305+
:::::::::::::::::::::::::::::::::::::::::::: instructor
306+
307+
It's worth showing the learners that you can run a **pipeline** without highlighting the whole thing.
346308
If your cursor is on any line of a pipeline, running that line will run the whole thing.
347309

348310
You can also show that by highlighting a section of a pipeline, you can run only the first X steps of it.
311+
349312
::::::::::::::::::::::::::::::::::::::::::::
350313

351314
If we want to store this final product as an object, we use an assignment arrow at the start:
@@ -365,8 +328,7 @@ This approach is very interactive, allowing you to see the results of each step
365328

366329
## Challenge 2: Using pipes
367330

368-
Use the surveys data to make a data.
369-
frame that has the columns `record_id`, `month`, and `species_id`, with data from the year 1988.
331+
Use the surveys data to make a data.frame that has the columns `record_id`, `month`, and `species_id`, with data from the year 1988.
370332
Use a pipe between the function calls.
371333

372334
:::::::::::::::::::::::: solution
@@ -492,10 +454,8 @@ This isn't necessarily the most useful plot, but we will learn some techniques t
492454
Many data analysis tasks can be achieved using the split-apply-combine approach: you split the data into groups, apply some analysis to each group, and combine the results in some way.
493455
`dplyr` has a few convenient functions to enable this approach, the main two being `group_by()` and `summarize()`.
494456

495-
`group_by()` takes a data.
496-
frame and the name of one or more columns with categorical values that define the groups.
497-
`summarize()` then collapses each group into a one-row summary of the group, giving you back a data.
498-
frame with one row per group.
457+
`group_by()` takes a data.frame and the name of one or more columns with categorical values that define the groups.
458+
`summarize()` then collapses each group into a one-row summary of the group, giving you back a data.frame with one row per group.
499459
The syntax for `summarize()` is similar to `mutate()`, where you define new columns based on values of other columns.
500460
Let's try calculating the mean weight of all our animals by sex.
501461

@@ -528,16 +488,14 @@ surveys %>%
528488
n = n())
529489
```
530490

531-
Our resulting data.
532-
frame is much larger, since we have a greater number of groups.
491+
Our resulting data.frame is much larger, since we have a greater number of groups.
533492
We also see a strange value showing up in our `mean_weight` column: `NaN`.
534493
This stands for "Not a Number", and it often results from trying to do an operation on a vector with zero entries.
535494
How can a vector have zero entries?
536495
Well, if a particular group (like the AB species ID + `NA` sex group) has **only** `NA` values for weight, then the `na.rm = T` argument in `mean()` will remove **all** the values prior to calculating the mean.
537496
The result will be a value of `NaN`.
538497
Since we are not particularly interested in these values, let's add a step to our pipeline to remove rows where weight is `NA` **before** doing any other steps.
539-
This means that any groups with only `NA` values will disappear from our data.
540-
frame before we formally create the groups with `group_by()`.
498+
This means that any groups with only `NA` values will disappear from our data.frame before we formally create the groups with `group_by()`.
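One way to see where `NaN` comes from is to reproduce it outside the pipeline. This is a minimal base R sketch; the vector `all_missing` is a made-up stand-in for a group whose weight values are all `NA`.

```r
# Hypothetical group in which every weight is missing
all_missing <- c(NA_real_, NA_real_)

# Dropping the NA values leaves a vector with zero entries
remaining <- all_missing[!is.na(all_missing)]
n_remaining <- length(remaining)
n_remaining

# mean() with na.rm = TRUE has nothing left to average
group_mean <- mean(all_missing, na.rm = TRUE)
is.nan(group_mean)
```

Filtering out the `NA` weights before grouping means a group like this never reaches `mean()` at all.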
541499

542500
```{r filter-group-by}
543501
surveys %>%
@@ -571,32 +529,26 @@ surveys %>%
571529
arrange(desc(mean_weight))
572530
```
573531

574-
You may have seen several messages saying `summarise() has grouped output by 'species_id'. You can override using the .groups argument.` These are warning you that your resulting data.
575-
frame has retained some group structure, which means any subsequent operations on that data.
576-
frame will happen at the group level.
577-
If you look at the resulting data.
578-
frame printed out in your console, you will see these lines:
532+
You may have seen several messages saying `summarise() has grouped output by 'species_id'. You can override using the .groups argument.`
533+
These are warning you that your resulting data.frame has retained some group structure, which means any subsequent operations on that data.frame will happen at the group level.
534+
If you look at the resulting data.frame printed out in your console, you will see these lines:
579535

580536
```
581537
# A tibble: 46 × 4
582538
# Groups: species_id [18]
583539
```
584540

585-
They tell us we have a data.
586-
frame with 46 rows, 4 columns, and a group variable `species_id`, for which there are 18 groups.
541+
They tell us we have a data.frame with 46 rows, 4 columns, and a group variable `species_id`, for which there are 18 groups.
587542
We will see something similar if we use `group_by()` alone:
588543

589544
```{r group-by-alone}
590545
surveys %>%
591546
group_by(species_id, sex)
592547
```
593548

594-
What we get back is the entire `surveys` data.
595-
frame, but with the grouping variables added: 67 groups of `species_id` + `sex` combinations.
596-
Groups are often maintained throughout a pipeline, and if you assign the resulting data.
597-
frame to a new object, it will also have those groups.
598-
This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.
599-
frame, not by group.
549+
What we get back is the entire `surveys` data.frame, but with the grouping variables added: 67 groups of `species_id` + `sex` combinations.
550+
Groups are often maintained throughout a pipeline, and if you assign the resulting data.frame to a new object, it will also have those groups.
551+
This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.frame, not by group.
600552
Therefore, it is a good habit to remove the groups at the end of a pipeline containing `group_by()`:
601553

602554
```{r ungroup}
@@ -609,11 +561,9 @@ surveys %>%
609561
ungroup()
610562
```
611563

612-
Now our data.
613-
frame just says `# A tibble: 46 × 4` at the top, with no groups.
564+
Now our data.frame just says `# A tibble: 46 × 4` at the top, with no groups.
614565

615-
While it is common that you will want to get the one-row-per-group summary that `summarise()` provides, there are times where you want to calculate a per-group value but keep all the rows in your data.
616-
frame.
566+
While it is common that you will want to get the one-row-per-group summary that `summarise()` provides, there are times where you want to calculate a per-group value but keep all the rows in your data.frame.
617567
For example, we might want to know the mean weight for each species ID + sex combination, and then we might want to know how far from that mean value each observation in the group is.
618568
For this, we can use `group_by()` and `mutate()` together:
619569

@@ -695,14 +645,12 @@ sp_by_plot
695645
```
696646

697647
That looks great, but it is a bit difficult to compare values across plots.
698-
It would be nice if we could reshape this data.
699-
frame to make those comparisons easier.
648+
It would be nice if we could reshape this data.frame to make those comparisons easier.
700649
Well, the `tidyr` package from the `tidyverse` has a pair of functions that allow you to reshape data by pivoting it: `pivot_wider()` and `pivot_longer()`.
701650
`pivot_wider()` will make the data wider, which means increasing the number of columns and reducing the number of rows.
702651
`pivot_longer()` will do the opposite, reducing the number of columns and increasing the number of rows.
703652

704-
In this case, it might be nice to create a data.
705-
frame where each species has its own row, and each plot has its own column containing the mean weight for a given species.
653+
In this case, it might be nice to create a data.frame where each species has its own row, and each plot has its own column containing the mean weight for a given species.
706654
We will use `pivot_wider()` to reshape our data in this way.
707655
It takes 3 arguments:
708656

@@ -715,8 +663,7 @@ Any columns not used for `names_from` or `values_from` will not be pivoted.
715663
![](fig/pivot_wider.png){alt='Diagram depicting the behavior of `pivot_wider()` on a small tabular dataset.'}
716664

717665
In our case, we want the new columns to be named from our `plot_id` column, with the values coming from the `mean_weight` column.
718-
We can pipe our data.
719-
frame right into `pivot_wider()` and add those two arguments:
666+
We can pipe our data.frame right into `pivot_wider()` and add those two arguments:
720667

721668
```{r pivot-wider}
722669
sp_by_plot_wide <- sp_by_plot %>%
@@ -726,23 +673,18 @@ sp_by_plot_wide <- sp_by_plot %>%
726673
sp_by_plot_wide
727674
```
728675

729-
Now we've got our reshaped data.
730-
frame.
676+
Now we've got our reshaped data.frame.
731677
There are a few things to notice.
732678
First, we have a new column for each `plot_id` value.
733-
There is one old column left in the data.
734-
frame: `species_id`.
679+
There is one old column left in the data.frame: `species_id`.
735680
It wasn't used in `pivot_wider()`, so it stays, and now contains a single entry for each unique `species_id` value.
736681

737682
Finally, a lot of `NA`s have appeared.
738-
Some species aren't found in every plot, but because a data.
739-
frame has to have a value in every row and every column, an `NA` is inserted.
683+
Some species aren't found in every plot, but because a data.frame has to have a value in every row and every column, an `NA` is inserted.
740684
We can double-check this to verify what is going on.
741685

742-
Looking in our new pivoted data.
743-
frame, we can see that there is an `NA` value for the species `BA` in plot `1`.
744-
Let's take our `sp_by_plot` data.
745-
frame and look for the `mean_weight` of that species + plot combination.
686+
Looking in our new pivoted data.frame, we can see that there is an `NA` value for the species `BA` in plot `1`.
687+
Let's take our `sp_by_plot` data.frame and look for the `mean_weight` of that species + plot combination.
746688

747689
```{r pivot-wider-check}
748690
sp_by_plot %>%
@@ -752,16 +694,14 @@ sp_by_plot %>%
752694
We get back 0 rows.
753695
There is no `mean_weight` for the species `BA` in plot `1`.
754696
This either happened because no `BA` were ever caught in plot `1`, or because every `BA` caught in plot `1` had an `NA` weight value and all the rows got removed when we used `filter(!is.na(weight))` in the process of making `sp_by_plot`.
755-
Because there are no rows with that species + plot combination, in our pivoted data.
756-
frame, the value gets filled with `NA`.
697+
Because there are no rows with that species + plot combination, in our pivoted data.frame, the value gets filled with `NA`.
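The same behaviour can be reproduced on a tiny table. The sketch below is only an illustration: `toy_means` and its values are invented, and it assumes the `tidyr` package is installed. Species `BA` has no row for plot `1`, so the pivoted table should hold an `NA` in that cell.

```r
library(tidyr)

# Hypothetical summary table: species "BA" has no row for plot 1
toy_means <- data.frame(
  species_id = c("BA", "DM", "DM"),
  plot_id = c(2, 1, 2),
  mean_weight = c(8, 42, 43)
)

# One row per species, one column per plot
toy_wide <- pivot_wider(toy_means, names_from = plot_id, values_from = mean_weight)
toy_wide

# The missing species + plot combination is filled with NA
ba_plot1 <- toy_wide[["1"]][toy_wide$species_id == "BA"]
is.na(ba_plot1)
```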
757698

758699
There is another `pivot_` function that does the opposite, moving data from a wide to long format, called `pivot_longer()`.
759700
It takes 3 arguments: `cols` for the columns you want to pivot, `names_to` for the name of the new column which will contain the old column names, and `values_to` for the name of the new column which will contain the old values.
760701

761702
![](fig/pivot_longer.png){alt='Diagram depicting the behavior of `pivot_longer()` on a small tabular dataset.'}
762703

763-
We can pivot our new wide data.
764-
frame to a long format using `pivot_longer()`.
704+
We can pivot our new wide data.frame to a long format using `pivot_longer()`.
765705
We want to pivot all the columns except `species_id`, and we will use `PLOT` for the new column of plot IDs, and `MEAN_WT` for the new column of mean weight values.
766706

767707
```{r pivot-longer}
@@ -782,8 +722,7 @@ Data are often recorded in spreadsheets in a wider format, but lots of `tidyvers
782722

783723
## Exporting data
784724

785-
Let's say we want to send the wide version of our `sp_by_plot` data.
786-
frame to a colleague who doesn't use R.
725+
Let's say we want to send the wide version of our `sp_by_plot` data.frame to a colleague who doesn't use R.
787726
In this case, we might want to save it as a CSV file.
788727

789728
First, we might want to modify the names of the columns, since right now they are bare numbers, which aren't very informative.
@@ -807,10 +746,8 @@ surveys_sp <- sp_by_plot %>%
807746
surveys_sp
808747
```
809748

810-
Now we can save this data.
811-
frame to a CSV using the `write_csv()` function from the `readr` package.
812-
The first argument is the name of the data.
813-
frame, and the second is the path to the new file we want to create, including the file extension `.csv`.
749+
Now we can save this data.frame to a CSV using the `write_csv()` function from the `readr` package.
750+
The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension `.csv`.
814751

815752
```{r write-csv}
816753
write_csv(surveys_sp, "data/cleaned/surveys_meanweight_species_plot.csv")
