diff --git a/episodes/how-r-thinks-about-data.Rmd b/episodes/how-r-thinks-about-data.Rmd
index e4d9067c4..b29f39dd6 100644
--- a/episodes/how-r-thinks-about-data.Rmd
+++ b/episodes/how-r-thinks-about-data.Rmd
@@ -4,7 +4,7 @@ teaching: 60
exercises: 3
---
-:::::::::::::::::::::::::::::::::::::: questions
+:::::::::::::::::::::::::::::::::::::: questions
- How does R store and represent data?
@@ -20,7 +20,6 @@ exercises: 3
::::::::::::::::::::::::::::::::::::::::::::::::
-
## Setup
```{r setup, include=FALSE}
@@ -34,9 +33,11 @@ library(ratdat)
## The data.frame
-We just spent quite a bit of time learning how to create visualizations from the `complete_old` data, but we did not talk much about **what** this `complete_old` thing is. It's important to understand how R thinks about, represents, and stores data in order for us to have a productive working relationship with R.
+We just spent quite a bit of time learning how to create visualizations from the `complete_old` data, but we did not talk much about **what** this `complete_old` thing is.
+It's important to understand how R thinks about, represents, and stores data in order for us to have a productive working relationship with R.
-The `complete_old` data is stored in R as a **data.frame**, which is the most common way that R represents tabular data (data that can be stored in a table format, like a spreadsheet). We can check what `complete_old` is by using the `class()` function:
+The `complete_old` data is stored in R as a **data.frame**, which is the most common way that R represents tabular data (data that can be stored in a table format, like a spreadsheet).
+We can check what `complete_old` is by using the `class()` function:
```{r class}
class(complete_old)
@@ -49,9 +50,13 @@ head(complete_old)
tail(complete_old)
```
-We used these functions with just one argument, the object `complete_old`, and we didn't give the argument a name, like we often did with `ggplot2`. In R, a function's arguments come in a particular order, and if you put them in the correct order, you don't need to name them. In this case, the name of the argument is `x`, so we can name it if we want, but since we know it's the first argument, we don't need to.
+We used these functions with just one argument, the object `complete_old`, and we didn't give the argument a name, like we often did with `ggplot2`.
+In R, a function's arguments come in a particular order, and if you put them in the correct order, you don't need to name them.
+In this case, the name of the argument is `x`, so we can name it if we want, but since we know it's the first argument, we don't need to.
-Some arguments are optional. For example, the `n` argument in `head()` specifies the number of rows to print. It defaults to 6, but we can override that by specifying a different number:
+Some arguments are optional.
+For example, the `n` argument in `head()` specifies the number of rows to print.
+It defaults to 6, but we can override that by specifying a different number:
```{r head-n}
head(complete_old, n = 10)
@@ -69,7 +74,9 @@ Additionally, if we name them, we can put them in any order we want:
head(n = 10, x = complete_old)
```
-Generally, it's good practice to start with the required arguments, like the data.frame whose rows you want to see, and then to name the optional arguments. If you are ever unsure, it never hurts to explicitly name an argument.
+Generally, it's good practice to start with the required arguments, like the data.frame whose rows you want to see, and then to name the optional arguments.
+If you are ever unsure, it never hurts to explicitly name an argument.
### Aside: Getting Help
@@ -79,36 +86,56 @@ To learn more about a function, you can type a `?` in front of the name of the f
?head
```
-Function documentation is written by the authors of the functions, so they can vary pretty widely in their style and readability. The first section, **Description**, gives you a concise description of what the function does, but it may not always be enough. The **Arguments** section defines all the arguments for the function and is usually worth reading thoroughly. Finally, the **Examples** section at the end will often have some helpful examples that you can run to get a sense of what the function is doing.
+Function documentation is written by the authors of the functions, so it can vary pretty widely in style and readability.
+The first section, **Description**, gives you a concise description of what the function does, but it may not always be enough.
+The **Arguments** section defines all the arguments for the function and is usually worth reading thoroughly.
+Finally, the **Examples** section at the end will often have some helpful examples that you can run to get a sense of what the function is doing.
-Another great source of information is **package vignettes**. Many packages have vignettes, which are like tutorials that introduce the package, specific functions, or general methods. You can run `vignette(package = "package_name")` to see a list of vignettes in that package. Once you have a name, you can run `vignette("vignette_name", "package_name")` to view that vignette. You can also use a web browser to go to `https://cran.r-project.org/web/packages/package_name/vignettes/` where you will find a list of links to each vignette. Some packages will have their own websites, which often have nicely formatted vignettes and tutorials.
+Another great source of information is **package vignettes**.
+Many packages have vignettes, which are like tutorials that introduce the package, specific functions, or general methods.
+You can run `vignette(package = "package_name")` to see a list of vignettes in that package.
+Once you have a name, you can run `vignette("vignette_name", "package_name")` to view that vignette.
+You can also use a web browser to go to `https://cran.r-project.org/web/packages/package_name/vignettes/` where you will find a list of links to each vignette.
+Some packages will have their own websites, which often have nicely formatted vignettes and tutorials.
-Finally, learning to search for help is probably the most useful skill for any R user. The key skill is figuring out what you should actually search for. It's often a good idea to start your search with `R` or `R programming`. If you have the name of a package you want to use, start with `R package_name`.
+Finally, learning to search for help is probably the most useful skill for any R user.
+The key skill is figuring out what you should actually search for.
+It's often a good idea to start your search with `R` or `R programming`.
+If you have the name of a package you want to use, start with `R package_name`.
-Many of the answers you find will be from a website called Stack Overflow, where people ask programming questions and others provide answers. It is generally poor form to ask duplicate questions, so before you decide to post your own, do some thorough searching to see if it has been answered before (it likely has). If you do decide to post a question on Stack Overflow, or any other help forum, you will want to create a **reproducible example** or **reprex**. If you are asking a complicated question requiring your own data and a whole bunch of code, people probably won't be able or willing to help you. However, if you can hone in on the specific thing you want help with, and create a minimal example using smaller, fake data, it will be much easier for others to help you. If you search `how to make a reproducible example in R`, you will find some great resources to help you out.
+Many of the answers you find will be from a website called Stack Overflow, where people ask programming questions and others provide answers.
+It is generally poor form to ask duplicate questions, so before you decide to post your own, do some thorough searching to see if it has been answered before (it likely has).
+If you do decide to post a question on Stack Overflow, or any other help forum, you will want to create a **reproducible example** or **reprex**.
+If you are asking a complicated question requiring your own data and a whole bunch of code, people probably won't be able or willing to help you.
+However, if you can home in on the specific thing you want help with and create a minimal example using smaller, fake data, it will be much easier for others to help you.
+If you search `how to make a reproducible example in R`, you will find some great resources to help you out.
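As a rough sketch of what a minimal reproducible example might look like, the chunk below builds a tiny made-up data.frame and reproduces one specific surprise (a mean that comes back as `NA`).
The object name `toy` and its values are invented for illustration and have nothing to do with `complete_old`.

```{r reprex-sketch}
# Made-up data: just enough rows to reproduce the behaviour we are asking about
toy <- data.frame(
  species = c("A", "B", "B"),
  weight = c(10, NA, 14)
)

# The single thing we want help with: why is this result NA?
toy_mean <- mean(toy$weight)
toy_mean
```

Someone reading this can paste it into a fresh R session and see the same result without needing any of your files.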
#### Generative AI Help
::::::::::::::::::::::::::::: instructor
### Choose how to teach this section
+
The section on generative AI is intended to be concise but Instructors may choose to devote more time to the topic in a workshop.
Depending on your own level of experience and comfort with talking about and using these tools, you could choose to do any of the following:
-* Explain how large language models work and are trained, and/or the difference between generative AI, other forms of AI that currently exist, and the limits of what LLMs can do (e.g., they can't "reason").
-* Demonstrate how you recommend that learners use generative AI.
-* Discuss the ethical concerns listed below, as well as others that you are aware of, to help learners make an informed choice about whether or not to use generative AI tools.
+- Explain how large language models work and are trained, and/or the difference between generative AI, other forms of AI that currently exist, and the limits of what LLMs can do (e.g., they can't "reason").
+- Demonstrate how you recommend that learners use generative AI.
+- Discuss the ethical concerns listed below, as well as others that you are aware of, to help learners make an informed choice about whether or not to use generative AI tools.
-This is a fast-moving technology.
+This is a fast-moving technology.
If you are preparing to teach this section and you feel it has become outdated, please open an issue on the lesson repository to let the Maintainers know and/or a pull request to suggest updates and improvements.
::::::::::::::::::::::::::::::::::::::::
-In addition to the resources we've already mentioned for getting help with R, it's becoming increasingly common to turn to _generative AI_ chatbots such as ChatGPT to get help while coding. You will probably receive some useful guidance by presenting your error message to the chatbot and asking it what went wrong.
+In addition to the resources we've already mentioned for getting help with R, it's becoming increasingly common to turn to *generative AI* chatbots such as ChatGPT to get help while coding.
+You will probably receive some useful guidance by presenting your error message to the chatbot and asking it what went wrong.
-However, the way this help is provided by the chatbot is different. Answers on Stack Overflow have (probably) been given by a human as a direct response to the question asked. But generative AI chatbots, which are based on an advanced statistical model, respond by generating the _most likely_ sequence of text that would follow the prompt they are given.
+However, the way this help is provided by the chatbot is different.
+Answers on Stack Overflow have (probably) been given by a human as a direct response to the question asked.
+But generative AI chatbots, which are based on an advanced statistical model, respond by generating the *most likely* sequence of text that would follow the prompt they are given.
-While responses from generative AI tools can often be helpful, they are not always reliable.
+While responses from generative AI tools can often be helpful, they are not always reliable.
These tools sometimes generate plausible but incorrect or misleading information, so (just as with an answer found on the internet) it is essential to verify their accuracy.
You need the knowledge and skills to be able to understand these responses, to judge whether or not they are accurate, and to fix any errors in the code it offers you.
@@ -116,22 +143,24 @@ In addition to asking for help, programmers can use generative AI tools to gener
However, there are drawbacks that you should be aware of.
The models used by these tools have been "trained" on very large volumes of data, much of it taken from the internet, and the responses they produce reflect that training data, and may recapitulate its inaccuracies or biases.
-The environmental costs (energy and water use) of LLMs are a lot higher than other technologies, both during development (known as training) and when an individual user uses one (also called inference). For more information see the [AI Environmental Impact Primer](https://huggingface.co/blog/sasha/ai-environment-primer) developed by researchers at HuggingFace, an AI hosting platform.
+The environmental costs (energy and water use) of LLMs are a lot higher than those of other technologies, both during development (known as training) and when an individual user uses one (also called inference).
+For more information see the [AI Environmental Impact Primer](https://huggingface.co/blog/sasha/ai-environment-primer) developed by researchers at HuggingFace, an AI hosting platform.
Concerns also exist about the way the data for this training was obtained, with questions raised about whether the people developing the LLMs had permission to use it.
Other ethical concerns have also been raised, such as reports that workers were exploited during the training process.
**We recommend that you avoid getting help from generative AI during the workshop** for several reasons:
1. For most problems you will encounter at this stage, help and answers can be found among the first results returned by searching the internet.
-2. The foundational knowledge and skills you will learn in this lesson by writing and fixing your own programs are essential to be able to evaluate the correctness and safety of any code you receive from online help or a generative AI chatbot.
- If you choose to use these tools in the future, the expertise you gain from learning and practising these fundamentals on your own will help you use them more effectively.
-3. As you start out with programming, the mistakes you make will be the kinds that have also been made -- and overcome! -- by everybody else who learned to program before you.
+2. The foundational knowledge and skills you will learn in this lesson by writing and fixing your own programs are essential to be able to evaluate the correctness and safety of any code you receive from online help or a generative AI chatbot.
+ If you choose to use these tools in the future, the expertise you gain from learning and practising these fundamentals on your own will help you use them more effectively.
+3. As you start out with programming, the mistakes you make will be the kinds that have also been made -- and overcome! -- by everybody else who learned to program before you.
Since these mistakes and the questions you are likely to have at this stage are common, they are also better represented than other, more specialised problems and tasks in the data that was used to train generative AI tools.
- This means that a generative AI chatbot is _more likely to produce accurate responses_ to questions that novices ask, which could give you a false impression of how reliable they will be when you are ready to do things that are more advanced.
+ This means that a generative AI chatbot is *more likely to produce accurate responses* to questions that novices ask, which could give you a false impression of how reliable they will be when you are ready to do things that are more advanced.
### Knowing more about our data.frame
-Let's get back to investigating our `complete_old` data.frame. We can get some useful summaries of each variable using the `summary()` function:
+Let's get back to investigating our `complete_old` data.frame.
+We can get some useful summaries of each variable using the `summary()` function:
```{r}
summary(complete_old)
@@ -143,11 +172,17 @@ And, as we have already done, we can use `str()` to look at the structure of an
str(complete_old)
```
-We get quite a bit of useful information here. First, we are told that we have a data.frame of `r nrow(complete_old)` observations, or rows, and `r ncol(complete_old)` variables, or columns.
+We get quite a bit of useful information here.
+First, we are told that we have a data.frame of `r nrow(complete_old)` observations, or rows, and `r ncol(complete_old)` variables, or columns.
-Next, we get a bit of information on each variable, including its type (`int` or `chr`) and a quick peek at the first 10 values. You might ask why there is a `$` in front of each variable. This is because the `$` is an operator that allows us to select individual columns from a data.frame.
+Next, we get a bit of information on each variable, including its type (`int` or `chr`) and a quick peek at the first 10 values.
+You might ask why there is a `$` in front of each variable.
+This is because the `$` is an operator that allows us to select individual columns from a data.frame.
-The `$` operator also allows you to use tab-completion to quickly select which variable you want from a given data.frame. For example, to get the `year` variable, we can type `complete_old$` and then hit Tab. We get a list of the variables that we can move through with up and down arrow keys. Hit Enter when you reach `year`, which should finish this code:
+The `$` operator also allows you to use tab-completion to quickly select which variable you want from a given data.frame.
+For example, to get the `year` variable, we can type `complete_old$` and then hit Tab.
+We get a list of the variables that we can move through with up and down arrow keys.
+Hit Enter when you reach `year`, which should finish this code:
```{r dollar-subsetting}
complete_old$year
@@ -157,9 +192,14 @@ What we get back is a whole bunch of numbers, the entries in the `year` column p
## Vectors: the building block of data
-You might have noticed that our last result looked different from when we printed out the `complete_old` data.frame itself. That's because it is not a data.frame, it is a **vector**. A vector is a 1-dimensional series of values, in this case a vector of numbers representing years.
+You might have noticed that our last result looked different from when we printed out the `complete_old` data.frame itself.
+That's because it is not a data.frame; it is a **vector**.
+A vector is a 1-dimensional series of values, in this case a vector of numbers representing years.
-Data.frames are made up of vectors; each column in a data.frame is a vector. Vectors are the basic building blocks of all data in R. Basically, everything in R is a vector, a bunch of vectors stitched together in some way, or a function. Understanding how vectors work is crucial to understanding how R treats data, so we will spend some time learning about them.
+Data.frames are made up of vectors; each column in a data.frame is a vector.
+Vectors are the basic building blocks of all data in R.
+Basically, everything in R is a vector, a bunch of vectors stitched together in some way, or a function.
+Understanding how vectors work is crucial to understanding how R treats data, so we will spend some time learning about them.
There are 4 main types of vectors (also known as *atomic vectors*):
@@ -171,7 +211,10 @@ There are 4 main types of vectors (also known as *atomic vectors*):
4. `"logical"` for `TRUE` and `FALSE`, which can also be represented as `T` and `F`. In other contexts, these may be referred to as "Boolean" data.
-Vectors can only be of a **single type**. Since each column in a data.frame is a vector, this means an accidental character following a number, like `29,` can change the type of the whole vector. Mixing up vector types is one of the most common mistakes in R, and it can be tricky to figure out. It's often very useful to check the types of vectors.
+Vectors can only be of a **single type**.
+Since each column in a data.frame is a vector, this means an accidental character following a number, like `29,` can change the type of the whole vector.
+Mixing up vector types is one of the most common mistakes in R, and it can be tricky to figure out.
+It's often very useful to check the types of vectors.
To create a vector from scratch, we can use the `c()` function, putting values inside, separated by commas.
@@ -179,7 +222,8 @@ To create a vector from scratch, we can use the `c()` function, putting values i
c(1, 2, 5, 12, 4)
```
-As you can see, those values get printed out in the console, just like with `complete_old$year`. To store this vector so we can continue to work with it, we need to assign it to an object.
+As you can see, those values get printed out in the console, just like with `complete_old$year`.
+To store this vector so we can continue to work with it, we need to assign it to an object.
```{r assign-vector}
num <- c(1, 2, 5, 12, 4)
@@ -200,7 +244,8 @@ char <- c("apple", "pear", "grape")
class(char)
```
-Remember that each entry, like `"apple"`, needs to be surrounded by quotes, and entries are separated with commas. If you do something like `"apple, pear, grape"`, you will have only a single entry containing that whole string.
+Remember that each entry, like `"apple"`, needs to be surrounded by quotes, and entries are separated with commas.
+If you do something like `"apple, pear, grape"`, you will have only a single entry containing that whole string.
Finally, let's make a logical vector:
@@ -209,7 +254,7 @@ logi <- c(TRUE, FALSE, TRUE, TRUE)
class(logi)
```
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 1: Coercion
@@ -226,7 +271,7 @@ char_logi <- c("a", "b", TRUE)
tricky <- c("a", "b", "1", FALSE)
```
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r coercion-challenge-answer}
class(num_logi)
@@ -239,39 +284,48 @@ R will automatically convert values in a vector so that they are all the same ty
::::::::::::::::::::::::
-2. How many values in `combined_logical` are `"TRUE"` (as a character)?
+2. How many values in `combined_logical` are `"TRUE"` (as a character)?
```{r combined-logical-challenge}
combined_logical <- c(num_logi, char_logi)
```
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r combined-logical-challenge-answer}
combined_logical
class(combined_logical)
```
-
-Only one value is `"TRUE"`. Coercion happens when each vector is created, so the `TRUE` in `num_logi` becomes a `1`, while the `TRUE` in `char_logi` becomes `"TRUE"`. When these two vectors are combined, R doesn't remember that the `1` in `num_logi` used to be a `TRUE`, it will just coerce the `1` to `"1"`.
+
+Only one value is `"TRUE"`.
+Coercion happens when each vector is created, so the `TRUE` in `num_logi` becomes a `1`, while the `TRUE` in `char_logi` becomes `"TRUE"`.
+When these two vectors are combined, R doesn't remember that the `1` in `num_logi` used to be a `TRUE`; it just coerces the `1` to `"1"`.
::::::::::::::::::::::::
3. Now that you've seen a few examples of coercion, you might have started to see that there are some rules about how types get converted. There is a hierarchy to coercion. Can you draw a diagram that represents the hierarchy of what types get converted to other types?
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
logical → integer → numeric → character
-Logical vectors can only take on two values: `TRUE` or `FALSE`. Integer vectors can only contain integers, so `TRUE` and `FALSE` can be coerced to `1` and `0`. Numeric vectors can contain numbers with decimals, so integers can be coerced from, say, `6` to `6.0` (though R will still display a numeric `6` as `6`.). Finally, any string of characters can be represented as a character vector, so any of the other types can be coerced to a character vector.
+Logical vectors can only take on two values: `TRUE` or `FALSE`.
+Integer vectors can only contain integers, so `TRUE` and `FALSE` can be coerced to `1` and `0`.
+Numeric vectors can contain numbers with decimals, so integers can be coerced from, say, `6` to `6.0` (though R will still display a numeric `6` as `6`).
+Finally, any string of characters can be represented as a character vector, so any of the other types can be coerced to a character vector.
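
One way to check this hierarchy yourself is to combine two neighbouring types at a time and look at the class of the result.
This is only a sketch; the `L` suffix (as in `2L`) is R's way of writing an integer rather than a numeric value, and the object names are made up for illustration.

```{r coercion-hierarchy-sketch}
# logical + integer: TRUE is coerced to the integer 1
lgl_int <- class(c(TRUE, 2L))
# integer + numeric: 2L is coerced to the numeric 2
int_num <- class(c(2L, 3.5))
# numeric + character: 3.5 is coerced to the string "3.5"
num_chr <- class(c(3.5, "a"))

c(lgl_int, int_num, num_chr)
```

In each pair, the result takes the type that sits further to the right in the diagram.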
::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
-Coercion is not something you will often do intentionally; rather, when combining vectors or reading data into R, a stray character that you missed may change an entire numeric vector into a character vector. It is a good idea to check the `class()` of your results frequently, particularly if you are running into confusing error messages.
+Coercion is not something you will often do intentionally; rather, when combining vectors or reading data into R, a stray character that you missed may change an entire numeric vector into a character vector.
+It is a good idea to check the `class()` of your results frequently, particularly if you are running into confusing error messages.
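
As a small illustration, using made-up values rather than anything from `complete_old`, the sketch below shows how a single stray comma typed inside one entry changes the class of the whole vector.
The object names are hypothetical.

```{r stray-character-sketch}
# Intended vector: three numbers
clean_values <- c(25, 29, 31)
# Same values, but one entry was entered as text with a trailing comma
typo_values <- c(25, "29,", 31)

class(clean_values)
class(typo_values)

# Every entry is now text, including the ones that looked fine
typo_values
```
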
## Missing data
-One of the great things about R is how it handles missing data, which can be tricky in other programming languages. R represents missing data as `NA`, without quotes, in vectors of any type. Let's make a numeric vector with an `NA` value:
+One of the great things about R is how it handles missing data, which can be tricky in other programming languages.
+R represents missing data as `NA`, without quotes, in vectors of any type.
+Let's make a numeric vector with an `NA` value:
```{r na-vec}
weights <- c(25, 34, 12, NA, 42)
@@ -283,7 +337,8 @@ R doesn't make assumptions about how you want to handle missing data, so if we p
min(weights)
```
-This is a very good thing, since we won't accidentally forget to consider our missing data. If we decide to exclude our missing values, many basic math functions have an argument to **r**e**m**ove them:
+This is a very good thing, since we won't accidentally forget to consider our missing data.
+If we decide to exclude our missing values, many basic math functions have an argument to **r**e**m**ove them:
```{r func-na-rm}
min(weights, na.rm = TRUE)
@@ -291,21 +346,30 @@ min(weights, na.rm = TRUE)
## Vectors as arguments
-A common reason to create a vector from scratch is to use in a function argument. The `quantile()` function will calculate a quantile for a given vector of numeric values. We set the quantile using the `probs` argument. We also need to set `na.rm = TRUE`, since there are `NA` values in the `weight` column.
+A common reason to create a vector from scratch is to use it as a function argument.
+The `quantile()` function will calculate a quantile for a given vector of numeric values.
+We set the quantile using the `probs` argument.
+We also need to set `na.rm = TRUE`, since there are `NA` values in the `weight` column.
```{r single-quantile}
quantile(complete_old$weight, probs = 0.25, na.rm = TRUE)
```
-Now we get back the 25% quantile value for weights. However, we often want to know more than one quantile. Luckily, the `probs` argument is **vectorized**, meaning it can take a whole vector of values. Let's try getting the 25%, 50% (median), and 75% quantiles all at once.
+Now we get back the 25% quantile value for weights.
+However, we often want to know more than one quantile.
+Luckily, the `probs` argument is **vectorized**, meaning it can take a whole vector of values.
+Let's try getting the 25%, 50% (median), and 75% quantiles all at once.
```{r multi-quantile}
quantile(complete_old$weight, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```
-While the `c()` function is very flexible, it doesn't necessarily scale well. If you want to generate a long vector from scratch, you probably don't want to type everything out manually. There are a few functions that can help generate vectors.
+While the `c()` function is very flexible, it doesn't necessarily scale well.
+If you want to generate a long vector from scratch, you probably don't want to type everything out manually.
+There are a few functions that can help generate vectors.
-First, putting `:` between two numbers will generate a vector of integers starting with the first number and ending with the last. The `seq()` function allows you to generate similar sequences, but changing by any amount.
+First, putting `:` between two numbers will generate a vector of integers starting with the first number and ending with the last.
+The `seq()` function allows you to generate similar sequences, but changing by any amount.
```{r seq}
# generates a sequence of integers
@@ -336,7 +400,7 @@ rep(c("a", "b", "c"), times = 4)
rep(1:10, each = 4)
```
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 2: Creating sequences
@@ -346,7 +410,7 @@ rep(1:10, each = 4)
rep(-3:3, 3)
```
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r seq-challenge-answer}
rep(-3:3, 3)
@@ -364,7 +428,7 @@ rep(my_seq, 3)
2. Calculate the quantiles for the `complete_old` hindfoot lengths at every 5% level (0%, 5%, 10%, 15%, etc.)
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r quantiles-challenge-answer}
quantile(complete_old$hindfoot_length,
@@ -375,10 +439,11 @@ quantile(complete_old$hindfoot_length,
::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
-
## Building with vectors
-We have now seen vectors in a few different forms: as columns in a data.frame and as single vectors. However, they can be manipulated into lots of other shapes and forms. Some other common forms are:
+We have now seen vectors in a few different forms: as columns in a data.frame and as single vectors.
+However, they can be manipulated into lots of other shapes and forms.
+Some other common forms are:
- matrices
- 2-dimensional numeric representations
@@ -388,7 +453,7 @@ We have now seen vectors in a few different forms: as columns in a data.frame an
- lists are very flexible ways to store vectors
- a list can contain vectors of many different types and lengths
- an entry in a list can be another list, so lists can get deeply nested
- - a data.frame is a type of list where each column is an individual vector and each vector has to be the same length, since a data.frame has an entry in every column for each row
+ - a data.frame is a type of list where each column is an individual vector and each vector has to be the same length, since a data.frame has an entry in every column for each row
- factors
- a way to represent categorical data
- factors can be ordered or unordered
@@ -397,7 +462,8 @@ We have now seen vectors in a few different forms: as columns in a data.frame an
### Factors
-We will spend a bit more time talking about factors, since they are often a challenging type of data to work with. We can create a factor from scratch by putting a character vector made using `c()` into the `factor()` function:
+We will spend a bit more time talking about factors, since they are often a challenging type of data to work with.
+We can create a factor from scratch by putting a character vector made using `c()` into the `factor()` function:
```{r factors}
sex <- factor(c("male", "female", "female", "male", "female", NA))
@@ -411,7 +477,8 @@ We can inspect the levels of the factor using the `levels()` function:
levels(sex)
```
-The **`forcats`** package from the `tidyverse` has a lot of convenient functions for working with factors. We will show you a few common operations, but the `forcats` package has many more useful functions.
+The **`forcats`** package from the `tidyverse` has a lot of convenient functions for working with factors.
+We will show you a few common operations, but the `forcats` package has many more useful functions.
```{r forcats}
library(forcats)
@@ -427,12 +494,14 @@ fct_na_value_to_level(sex, "(Missing)")
```
-In general, it is a good practice to leave your categorical data as a **character** vector until you need to use a factor. Here are some reasons you might need a factor:
+In general, it is a good practice to leave your categorical data as a **character** vector until you need to use a factor.
+Here are some reasons you might need a factor:
1. Another function requires you to use a factor
2. You are plotting categorical data and want to control the ordering of categories in the plot
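As a minimal sketch of the second reason, the chunk below uses a made-up character vector called `size` (not part of the lesson data) to show how the `levels` argument of `factor()` sets the category order that plotting functions would then follow:

```{r factor-level-order-sketch}
# Hypothetical categorical data, stored as a character vector
size <- c("medium", "small", "large", "small")

# By default, factor() sorts the levels alphabetically
size_default <- factor(size)
levels(size_default)

# Supplying `levels` sets the order explicitly
size_ordered <- factor(size, levels = c("small", "medium", "large"))
levels(size_ordered)
```

The values are the same in both factors; only the order of the levels differs.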
-Since factors can behave differently from character vectors, it is always a good idea to check what type of data you're working with. You might use a new function for the first time and be confused by the results, only to realize later that it produced a factor as an output, when you thought it was a character vector.
+Since factors can behave differently from character vectors, it is always a good idea to check what type of data you're working with.
+You might use a new function for the first time and be confused by the results, only to realize later that it produced a factor as an output, when you thought it was a character vector.
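One quick way to check is with `class()`. The sketch below uses two made-up objects, `x_chr` and `x_fct`, to show that a factor can look like a character vector when printed while being stored as integer codes underneath:

```{r check-factor-type-sketch}
# Hypothetical example vectors
x_chr <- c("b", "a", "c")
x_fct <- factor(x_chr)

# class() tells us what kind of object we are working with
class(x_chr)
class(x_fct)

# A factor stores integer codes that point to its levels
levels(x_fct)
as.integer(x_fct)
```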
It is fairly straightforward to convert a factor to a character vector:
@@ -454,7 +523,8 @@ as.numeric(as.character(f_num))
## Assignment, objects, and values
-We've already created quite a few objects in R using the `<-` assignment arrow, but there are a few finer details worth talking about. First, let's start with a quick challenge.
+We've already created quite a few objects in R using the `<-` assignment arrow, but there are a few finer details worth talking about.
+First, let's start with a quick challenge.
::::::::::::::::::::::::::::::::::::: challenge
@@ -468,7 +538,6 @@ y <- x
x <- 10
```
-
:::::::::::::::::::::::: solution
```{r assignment-challenge-answer}
@@ -481,7 +550,11 @@ y
::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
-Understanding what's going on here will help you avoid a lot of confusion when working in R. When we assign something to an object, the first thing that happens is the righthand side gets *evaluated*. The same thing happens when you run something in the console: if you type `x` into the console and hit Enter, R returns the value of `x`. So when we first ran the line `y <- x`, `x` first gets evaluated to the value of `5`, and this gets assigned to `y`. The objects `x` and `y` are not actually linked to each other in any way, so when we change the value of `x` to `10`, `y` is unaffected.
+Understanding what's going on here will help you avoid a lot of confusion when working in R.
+When we assign something to an object, the first thing that happens is that the right-hand side gets *evaluated*.
+The same thing happens when you run something in the console: if you type `x` into the console and hit Enter, R returns the value of `x`.
+So when we first ran the line `y <- x`, `x` first gets evaluated to the value of `5`, and this gets assigned to `y`.
+The objects `x` and `y` are not actually linked to each other in any way, so when we change the value of `x` to `10`, `y` is unaffected.
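The same behavior holds for longer vectors. In this small sketch, `a` and `b` are hypothetical names chosen for illustration; changing one element of `a` after the assignment leaves `b` untouched:

```{r assignment-vector-sketch}
a <- c(1, 2, 3)
b <- a         # the value of `a` is evaluated and assigned to `b`
a[1] <- 100    # modify only `a`

a
b
```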
This also means you can run multiple nested operations, store intermediate values as separate objects, or overwrite values:
@@ -521,10 +594,9 @@ You will be naming a lot of objects in R, and there are a few common naming rules an
- avoid dots `.` in names, as they have a special meaning in R, and may be confusing to others
- two common formats are `snake_case` and `camelCase`
- be consistent, at least within a script, ideally within a whole project
-- you can use a style guide like [Google's](https://google.github.io/styleguide/Rguide.xml) or
-[tidyverse's](https://style.tidyverse.org/)
+- you can use a style guide like [Google's](https://google.github.io/styleguide/Rguide.xml) or [tidyverse's](https://style.tidyverse.org/)
-::::::::::::::::::::::::::::::::::::: keypoints
+::::::::::::::::::::::::::::::::::::: keypoints
- functions like `head()`, `str()`, and `summary()` are useful for exploring data.frames
- most things in R are vectors, vectors stitched together, or functions
diff --git a/episodes/introduction-r-rstudio.Rmd b/episodes/introduction-r-rstudio.Rmd
index 733a551a2..48e60827d 100644
--- a/episodes/introduction-r-rstudio.Rmd
+++ b/episodes/introduction-r-rstudio.Rmd
@@ -4,7 +4,7 @@ teaching: 45
exercises: 0
---
-:::::::::::::::::::::::::::::::::::::: questions
+:::::::::::::::::::::::::::::::::::::: questions
- Why should you use R and RStudio?
- How do you get started working in R and RStudio?
@@ -25,7 +25,8 @@ exercises: 0
R refers to a programming language as well as the software that runs R code.
-[RStudio](https://rstudio.com) is a software interface that can make it easier to write R scripts and interact with the R software. It's a very popular platform, and RStudio also maintains the [`tidyverse`](https://www.tidyverse.org/) series of packages we will use in this lesson.
+[RStudio](https://rstudio.com) is a software interface that can make it easier to write R scripts and interact with the R software.
+It's a very popular platform, and RStudio also maintains the [`tidyverse`](https://www.tidyverse.org/) series of packages we will use in this lesson.
## Why learn R?
@@ -41,71 +42,98 @@ You can walk through this analogy if you want, or skip over it if you don't find
## Your new pedantic collaborator...
-You're working on a project when your advisor suggests that you begin working with one of their long-time collaborators. According to your advisor, this collaborator is very talented, but only speaks a language that you don't know. Your advisor assures you that this is ok, the collaborator won't judge you for starting to learn the language, and will happily answer your questions. However, the collaborator is also quite pedantic. While they don't mind that you don't speak their language fluently yet, they are always going to answer you quite literally.
+You're working on a project when your advisor suggests that you begin working with one of their long-time collaborators.
+According to your advisor, this collaborator is very talented, but only speaks a language that you don't know.
+Your advisor assures you that this is ok, the collaborator won't judge you for starting to learn the language, and will happily answer your questions.
+However, the collaborator is also quite pedantic.
+While they don't mind that you don't speak their language fluently yet, they are always going to answer you quite literally.
-You decide to reach out to the collaborator. You find that they email you back very quickly, almost immediately most of the time. Since you're just learning their language, you often make mistakes. Sometimes, they tell you that you've made a grammatical error or warn you that what you asked for doesn't make a lot of sense. Sometimes these warnings are difficult to understand, because you don't really have a grasp of the underlying grammar. Sometimes you get an answer back, with no warnings, but you realize that it doesn't make sense, because what you asked for isn't quite what you *wanted*. Since this collaborator responds almost immediately, without tiring, you can quickly reformulate your question and send it again.
+You decide to reach out to the collaborator.
+You find that they email you back very quickly, almost immediately most of the time.
+Since you're just learning their language, you often make mistakes.
+Sometimes, they tell you that you've made a grammatical error or warn you that what you asked for doesn't make a lot of sense.
+Sometimes these warnings are difficult to understand, because you don't really have a grasp of the underlying grammar.
+Sometimes you get an answer back, with no warnings, but you realize that it doesn't make sense, because what you asked for isn't quite what you *wanted*.
+Since this collaborator responds almost immediately, without tiring, you can quickly reformulate your question and send it again.
-In this way, you begin to learn the language your collaborator speaks, as well as the particular way they think about your work. Eventually, the two of you develop a good working relationship, where you understand how to ask them questions effectively, and how to work through any issues in communication that might arise.
+In this way, you begin to learn the language your collaborator speaks, as well as the particular way they think about your work.
+Eventually, the two of you develop a good working relationship, where you understand how to ask them questions effectively, and how to work through any issues in communication that might arise.
This collaborator's name is R.
-When you send commands to R, you get a response back. Sometimes, when you make mistakes, you will get back a nice, informative error message or warning. However, sometimes the warnings seem to reference a much "deeper" level of R than you're familiar with. Or, even worse, you may get the wrong answer with no warning because the command you sent is perfectly valid, but isn't what you actually want. While you may first have some success working with R by memorizing certain commands or reusing other scripts, this is akin to using a collection of tourist phrases or pre-written statements when having a conversation. You might make a mistake (like getting directions to the library when you need a bathroom), and you are going to be limited in your flexibility (like furiously paging through a tourist guide looking for the term for "thrift store").
+When you send commands to R, you get a response back.
+Sometimes, when you make mistakes, you will get back a nice, informative error message or warning.
+However, sometimes the warnings seem to reference a much "deeper" level of R than you're familiar with.
+Or, even worse, you may get the wrong answer with no warning because the command you sent is perfectly valid, but isn't what you actually want.
+While you may first have some success working with R by memorizing certain commands or reusing other scripts, this is akin to using a collection of tourist phrases or pre-written statements when having a conversation.
+You might make a mistake (like getting directions to the library when you need a bathroom), and you are going to be limited in your flexibility (like furiously paging through a tourist guide looking for the term for "thrift store").
-This is all to say that we are going to spend a bit of time digging into some of the more fundamental aspects of the R language, and these concepts may not feel as immediately useful as, say, learning to make plots with `ggplot2`. However, learning these more fundamental concepts will help you develop an understanding of how R thinks about data and code, how to interpret error messages, and how to flexibly expand your skills to new situations.
+This is all to say that we are going to spend a bit of time digging into some of the more fundamental aspects of the R language, and these concepts may not feel as immediately useful as, say, learning to make plots with `ggplot2`.
+However, learning these more fundamental concepts will help you develop an understanding of how R thinks about data and code, how to interpret error messages, and how to flexibly expand your skills to new situations.
:::::::::::::::::::::::::::::
### R does not involve lots of pointing and clicking, and that's a good thing
-Since R is a programming language, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that's a good thing! So, if you want to redo your analysis because you collected more data, you don't have to remember which button you clicked in which order to obtain your results; you just have to run your script again.
+Since R is a programming language, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that's a good thing!
+So, if you want to redo your analysis because you collected more data, you don't have to remember which button you clicked in which order to obtain your results; you just have to run your script again.
-Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.
+Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.
-Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.
+Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.
### R code is great for reproducibility
-Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis.
+Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis.
-R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically.
+R integrates with other tools to generate manuscripts from your code.
+If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically.
-An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
+An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
### R is interdisciplinary and extensible
-With tens of thousands of packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
+With tens of thousands of packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data.
+For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
### R works on data of all shapes and sizes
-The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won't make much difference to you.
+The skills you learn with R scale easily with the size of your dataset.
+Whether your dataset has hundreds or millions of lines, it won't make much difference to you.
-R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.
+R is designed for data analysis.
+It comes with special data structures and data types that make handling of missing data and statistical factors convenient.
R can read data from many different file types, including geospatial data, and connect to local and remote databases.
### R produces high-quality graphics
-R has well-developed plotting capabilities, and the `ggplot2` package is one of, if not the most powerful pieces of plotting software available today. We will begin learning to use `ggplot2` in the next episode.
+R has well-developed plotting capabilities, and the `ggplot2` package is one of the most powerful pieces of plotting software available today, if not the most powerful.
+We will begin learning to use `ggplot2` in the next episode.
### R has a large and welcoming community
-Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as [Stack Overflow](https://stackoverflow.com/), or on the [RStudio community](https://community.rstudio.com/).
+Thousands of people use R daily.
+Many of them are willing to help you through mailing lists and websites such as [Stack Overflow](https://stackoverflow.com/), or on the [RStudio community](https://community.rstudio.com/).
+
+Since R is very popular among researchers, most of the help communities and learning materials are aimed towards other researchers.
+Python is a similar language to R, and can accomplish many of the same tasks, but is widely used by software developers and software engineers, so Python resources and communities are not as oriented towards researchers.
-Since R is very popular among researchers, most of the help communities and learning materials are aimed towards other researchers. Python is a similar language to R, and can accomplish many of the same tasks, but is widely used by software developers and software engineers, so Python resources and communities are not as oriented towards researchers.
-
### Not only is R free, but it is also open-source and cross-platform
-Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.
+Anyone can inspect the source code to see how R works.
+Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.
## Navigating RStudio
-We will use the RStudio integrated development environment (IDE) to write code into scripts, run code in R, navigate files on our computer, inspect objects we create in R, and look at the plots we make. RStudio has many other features that can help with things like version control, developing R packages, and writing Shiny apps, but we won't cover those in the workshop.
+We will use the RStudio integrated development environment (IDE) to write code into scripts, run code in R, navigate files on our computer, inspect objects we create in R, and look at the plots we make.
+RStudio has many other features that can help with things like version control, developing R packages, and writing Shiny apps, but we won't cover those in the workshop.
{alt='Screenshot of RStudio showing the 4 "panes".'}
In the above screenshot, we can see 4 "panes" in the default layout:
-- Top-Left: the **Source** pane that displays scripts and other files.
+- Top-Left: the **Source** pane that displays scripts and other files.
- If you only have 3 panes, and the Console pane is in the top left, press Shift+Cmd+N (Mac) or Shift+Ctrl+N (Windows) to open a blank R script, which should make the Source pane appear.
- Top-Right: the **Environment/History** pane, which shows all the objects in your current R session (Environment) and your command history (History)
- there are some other tabs here, including Connections, Build, Tutorial, and possibly Git
@@ -114,15 +142,19 @@ In the above screenshot, we can see 4 "panes" in the default layout:
- There are also tabs for Terminal and Jobs
- Bottom-Right: the **Files/Plots/Help/Viewer** pane to navigate files or view plots and help pages
-You can customize the layout of these panes, as well as many settings such as RStudio color scheme, font, and even keyboard shortcuts. You can access these settings by going to the menu bar, then clicking on Tools → Global Options.
+You can customize the layout of these panes, as well as many settings such as RStudio color scheme, font, and even keyboard shortcuts.
+You can access these settings by going to the menu bar, then clicking on Tools → Global Options.
RStudio puts most of the things you need to work in R into a single window, and also includes features like keyboard shortcuts, autocompletion of code, and syntax highlighting (different types of code are colored differently, making it easier to navigate your code).
## Getting set up in RStudio
-It is a good practice to organize your projects into self-contained folders right from the start, so we will start building that habit now. A well-organized project is easier to navigate, more reproducible, and easier to share with others. Your project should start with a top-level folder that contains everything necessary for the project, including data, scripts, and images, all organized into sub-folders.
+It is a good practice to organize your projects into self-contained folders right from the start, so we will start building that habit now.
+A well-organized project is easier to navigate, more reproducible, and easier to share with others.
+Your project should start with a top-level folder that contains everything necessary for the project, including data, scripts, and images, all organized into sub-folders.
-RStudio provides a "Projects" feature that can make it easier to work on individual projects in R. We will create a project that we will keep everything for this workshop.
+RStudio provides a "Projects" feature that can make it easier to work on individual projects in R.
+We will create a project in which we will keep everything for this workshop.
1. Start RStudio (you should see a view similar to the screenshot above).
2. In the top right, you will see a blue 3D cube and the words "Project: (None)". Click on this icon.
@@ -134,23 +166,36 @@ RStudio provides a "Projects" feature that can make it easier to work on individ
Next time you open RStudio, you can click that 3D cube icon, and you will see options to open existing projects, like the one you just made.
-One of the benefits to using RStudio Projects is that they automatically set the **working directory** to the top-level folder for the project. The working directory is the folder where R is working, so it views the location of all files (including data and scripts) as being relative to the working directory. You may come across scripts that include something like `setwd("/Users/YourUserName/MyCoolProject")`, which directly sets a working directory. This is usually much less portable, since that specific directory might not be found on someone else's computer (they probably don't have the same username as you). Using RStudio Projects means we don't have to deal with manually setting the working directory.
+One of the benefits to using RStudio Projects is that they automatically set the **working directory** to the top-level folder for the project.
+The working directory is the folder where R is working, so it views the location of all files (including data and scripts) as being relative to the working directory.
+You may come across scripts that include something like `setwd("/Users/YourUserName/MyCoolProject")`, which directly sets a working directory.
+This is usually much less portable, since that specific directory might not be found on someone else's computer (they probably don't have the same username as you).
+Using RStudio Projects means we don't have to deal with manually setting the working directory.
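As a rough illustration, `getwd()` reports the current working directory, and `file.path()` builds a path relative to it. The folder and file names below (`data`, `raw`, `surveys.csv`) are placeholders for this sketch, not files the lesson has created:

```{r working-directory-sketch}
# Print the current working directory
# (the project folder, when working inside an RStudio Project)
getwd()

# Build a relative path; "surveys.csv" is a hypothetical file name
raw_file <- file.path("data", "raw", "surveys.csv")
raw_file
```

Because the path is relative, the same line of code can work on another computer, as long as the project folder is organized the same way.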
-There are a few settings we will need to adjust to improve the reproducibility of our work. Go to your menu bar, then click Tools → Global Options to open up the Options window.
+There are a few settings we will need to adjust to improve the reproducibility of our work.
+Go to your menu bar, then click Tools → Global Options to open up the Options window.
{alt='Screenshot of the RStudio Global Options, with "Restore .RData into workspace at startup" unchecked, and "Save workspace to .RData on exit" set to "Never".'}
-Make sure your settings match those highlighted in yellow. We don't want RStudio to store the current status of our R session and reload it the next time we start R. This might sound convenient, but for the sake of reproducibility, we want to start with a clean, empty R session every time we work. That means that we have to record everything we do into scripts, save any data we need into files, and store outputs like images as files. We want to get used to everything we generate in a single R session being *disposable*. We want our scripts to be able to regenerate things we need, other than "raw materials" like data.
+Make sure your settings match those highlighted in yellow.
+We don't want RStudio to store the current status of our R session and reload it the next time we start R.
+This might sound convenient, but for the sake of reproducibility, we want to start with a clean, empty R session every time we work.
+That means that we have to record everything we do into scripts, save any data we need into files, and store outputs like images as files.
+We want to get used to everything we generate in a single R session being *disposable*.
+We want our scripts to be able to regenerate things we need, other than "raw materials" like data.
## Organizing your project directory
:::::::::::::::::::::::::::::::::::::::::::: instructor
-If you are teaching remotely and sharing only the RStudio window, the new windows that pop up while creating folders will not be shared via Zoom. You can switch to sharing your entire screen, which will allow learners to see the popup windows.
+If you are teaching remotely and sharing only the RStudio window, the new windows that pop up while creating folders will not be shared via Zoom.
+You can switch to sharing your entire screen, which will allow learners to see the popup windows.
::::::::::::::::::::::::::::::::::::::::::::
-Using a consistent folder structure across all your new projects will help keep a growing project organized, and make it easy to find files in the future. This is especially beneficial if you are working on multiple projects, since you will know where to look for particular kinds of files.
+Using a consistent folder structure across all your new projects will help keep a growing project organized, and make it easy to find files in the future.
+This is especially beneficial if you are working on multiple projects, since you will know where to look for particular kinds of files.
-We will use a basic structure for this workshop, which is often a good place to start, and can be extended to meet your specific needs. Here is a diagram describing the structure:
+We will use a basic structure for this workshop, which is often a good place to start, and can be extended to meet your specific needs.
+Here is a diagram describing the structure:
```
R-Ecology-Workshop
@@ -166,30 +211,52 @@ R-Ecology-Workshop
└─── documents
```
-Within our project folder (`R-Ecology-Workshop`), we first have a `scripts` folder to hold any scripts we write. We also have a `data` folder containing `cleaned` and `raw` subfolders. In general, you want to keep your `raw` data completely untouched, so once you put data into that folder, you do not modify it. Instead, you read it into R, and if you make any modifications, you write that modified file into the `cleaned` folder. We also have an `images` folder for plots we make, and a `documents` folder for any other documents you might produce.
+Within our project folder (`R-Ecology-Workshop`), we first have a `scripts` folder to hold any scripts we write.
+We also have a `data` folder containing `cleaned` and `raw` subfolders.
+In general, you want to keep your `raw` data completely untouched, so once you put data into that folder, you do not modify it.
+Instead, you read it into R, and if you make any modifications, you write that modified file into the `cleaned` folder.
+We also have an `images` folder for plots we make, and a `documents` folder for any other documents you might produce.
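The raw-to-cleaned workflow might look something like the sketch below. It runs inside a temporary folder so it cannot touch a real project, and the file names (`weights.csv`, `weights_clean.csv`) and the tiny data set are invented for illustration:

```{r raw-to-cleaned-sketch}
# Temporary stand-in for the project folder
project <- file.path(tempdir(), "raw-cleaned-sketch")
dir.create(file.path(project, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data", "cleaned"), recursive = TRUE, showWarnings = FALSE)

# A hypothetical raw data file with one missing value
raw_path <- file.path(project, "data", "raw", "weights.csv")
write.csv(data.frame(id = 1:3, weight = c(40, NA, 48)), raw_path, row.names = FALSE)

# Read the raw data into R and make modifications there
weights <- read.csv(raw_path)
weights_clean <- weights[!is.na(weights$weight), ]

# Write the modified version to the cleaned folder, leaving the raw file untouched
clean_path <- file.path(project, "data", "cleaned", "weights_clean.csv")
write.csv(weights_clean, clean_path, row.names = FALSE)
```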
-Let's start making our new folders. Go to the **Files** pane (bottom right), and check the current directory, highlighted in yellow below. You should be in the directory for the project you just made, in our case `R-Ecology-Workshop`. You shouldn't see any folders in here yet.
+Let's start making our new folders.
+Go to the **Files** pane (bottom right), and check the current directory, highlighted in yellow below.
+You should be in the directory for the project you just made, in our case `R-Ecology-Workshop`.
+You shouldn't see any folders in here yet.
{alt='RStudio Files pane with current directory path highlighted.'}
-Next, click the **New Folder** button, and type in `scripts` to generate your `scripts` folder. It should appear in the Files list now. Repeat the process to make your `data`, `images`, and `documents` folders. Then, click on the `data` folder in the Files pane. This will take you into the `data` folder, which will be empty. Use the **New Folder** button to create `raw` and `cleaned` folders. To return to the `R-Ecology-Workshop` folder, click on it in the file path, which is highlighted in yellow in the previous image. It's worth noting that the **Files** pane helps you create, find, and open files, but moving through your files won't change where the **working directory** of your project is.
+Next, click the **New Folder** button, and type in `scripts` to generate your `scripts` folder.
+It should appear in the Files list now.
+Repeat the process to make your `data`, `images`, and `documents` folders.
+Then, click on the `data` folder in the Files pane.
+This will take you into the `data` folder, which will be empty.
+Use the **New Folder** button to create `raw` and `cleaned` folders.
+To return to the `R-Ecology-Workshop` folder, click on it in the file path, which is highlighted in yellow in the previous image.
+It's worth noting that the **Files** pane helps you create, find, and open files, but moving through your files won't change where the **working directory** of your project is.
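The same folders could also be created from code rather than by clicking. This is an optional alternative, not a step in the workshop; the sketch uses base R's `dir.create()` inside a temporary folder (named `folder-sketch` here purely for illustration) so it is safe to run anywhere:

```{r create-folders-sketch}
# Temporary stand-in for the project folder;
# in a real project you would use paths relative to the working directory
project <- file.path(tempdir(), "folder-sketch")

folders <- c("scripts", "data/raw", "data/cleaned", "images", "documents")

for (folder in folders) {
  # recursive = TRUE also creates parent folders such as `data`
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# List everything that was created
list.dirs(project, full.names = FALSE)
```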
## Working in R and RStudio
-The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write these instructions in the form of *code*, which is a common language that is understood by the computer and humans (after some practice). We call these instructions *commands*, and we tell the computer to follow the instructions by *running* (also called *executing*) the commands.
+The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions.
+We write these instructions in the form of *code*, which is a common language that is understood by the computer and humans (after some practice).
+We call these instructions *commands*, and we tell the computer to follow the instructions by *running* (also called *executing*) the commands.
### Console vs. script
-You can run commands directly in the R console, or you can write them into an R script. It may help to think of working in the console vs. working in a script as something like cooking. The console is like making up a new recipe, but not writing anything down. You can carry out a series of steps and produce a nice, tasty dish at the end. However, because you didn't write anything down, it's harder to figure out exactly what you did, and in what order.
+You can run commands directly in the R console, or you can write them into an R script.
+It may help to think of working in the console vs. working in a script as something like cooking.
+The console is like making up a new recipe, but not writing anything down.
+You can carry out a series of steps and produce a nice, tasty dish at the end.
+However, because you didn't write anything down, it's harder to figure out exactly what you did, and in what order.
-Writing a script is like taking nice notes while cooking- you can tweak and edit the recipe all you want, you can come back in 6 months and try it again, and you don't have to try to remember what went well and what didn't. It's actually even easier than cooking, since you can hit one button and the computer "cooks" the whole recipe for you!
+Writing a script is like taking nice notes while cooking: you can tweak and edit the recipe all you want, you can come back in 6 months and try it again, and you don't have to try to remember what went well and what didn't.
+It's actually even easier than cooking, since you can hit one button and the computer "cooks" the whole recipe for you!
-An additional benefit of scripts is that you can leave **comments** for yourself or others to read. Lines that start with `#` are considered comments and will not be interpreted as R code.
+An additional benefit of scripts is that you can leave **comments** for yourself or others to read.
+Lines that start with `#` are considered comments and will not be interpreted as R code.
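As a small illustration of comments, here is a sketch of a script fragment; the variable name `total` is made up for this example. Besides whole-line comments, R also ignores anything that follows a `#` on a line that contains code.

```r
# This whole line is a comment, so R does not run it
total <- 1 + 1  # text after the # on a code line is also ignored
total
```

Running these lines from a script sends only the code to the console; the comments are there for human readers.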
#### Console
- The R console is where code is run/executed
-- The **prompt**, which is the `>` symbol, is where you can type commands
+- The **prompt**, which is the `>` symbol, is where you can type commands
- When you press Enter, R executes those commands and prints the result.
- You can work here, and your history is saved in the History pane, but you can't access it in the future
@@ -202,14 +269,19 @@ An additional benefit of scripts is that you can leave **comments** for yourself
- If you highlight multiple lines of code, you can run all of them by pressing Cmd+Enter (Mac) or Ctrl+Enter (Windows)
- By preserving commands in a script, you can edit and rerun them quickly, save them for later, and share them with others
- You can leave comments for yourself by starting a line with a `#`
-
+
#### Example
-Let's try running some code in the console and in a script. First, click down in the Console pane, and type out `1+1`. Hit Enter to run the code. You should see your code echoed, and then the value of `2` returned.
+Let's try running some code in the console and in a script.
+First, click down in the Console pane, and type out `1+1`.
+Hit Enter to run the code.
+You should see your code echoed, and then the value of `2` returned.
-Now click into your blank script, and type out `1+1`. With your cursor on that line, hit Cmd+Enter (Mac) or Ctrl+Enter (Windows) to run the code. You will see that your code was sent from the script to the console, where it returned a value of `2`, just like when you ran your code directly in the console.
+Now click into your blank script, and type out `1+1`.
+With your cursor on that line, hit Cmd+Enter (Mac) or Ctrl+Enter (Windows) to run the code.
+You will see that your code was sent from the script to the console, where it returned a value of `2`, just like when you ran your code directly in the console.
-::::::::::::::::::::::::::::::::::::: keypoints
+::::::::::::::::::::::::::::::::::::: keypoints
- R is a programming language and software used to run commands in that language
- RStudio is software to make it easier to write and run code in R
diff --git a/episodes/visualizing-ggplot.Rmd b/episodes/visualizing-ggplot.Rmd
index 153869c91..dc6299e62 100644
--- a/episodes/visualizing-ggplot.Rmd
+++ b/episodes/visualizing-ggplot.Rmd
@@ -4,7 +4,7 @@ teaching: 90
exercises: 4
---
-:::::::::::::::::::::::::::::::::::::: questions
+:::::::::::::::::::::::::::::::::::::: questions
- How do you make plots using R?
- How do you customize and modify plots?
@@ -29,17 +29,24 @@ exercises: 4
knitr::opts_chunk$set(dpi = 200, out.height = 600, out.width = 600, R.options = list(max.print = 100))
```
-We are going to be using **functions** from the **`ggplot2`** package to create visualizations of data. Functions are predefined bits of code that automate more complicated actions. R itself has many built-in functions, but we can access many more by loading other **packages** of functions and data into R.
+We are going to be using **functions** from the **`ggplot2`** package to create visualizations of data.
+Functions are predefined bits of code that automate more complicated actions.
+R itself has many built-in functions, but we can access many more by loading other **packages** of functions and data into R.
-If you don't have a blank, untitled script open yet, go ahead and open one with Shift+Cmd+N (Mac) or Shift+Ctrl+N (Windows). Then save the file to your `scripts/` folder, and title it `workshop_code.R`.
+If you don't have a blank, untitled script open yet, go ahead and open one with Shift+Cmd+N (Mac) or Shift+Ctrl+N (Windows).
+Then save the file to your `scripts/` folder, and title it `workshop_code.R`.
-Earlier, you had to **install** the `ggplot2` package by running `install.packages("ggplot2")`. That installed the package onto your computer so that R can access it. In order to use it in our current session, we have to **load** the package using the `library()` function.
+Earlier, you had to **install** the `ggplot2` package by running `install.packages("ggplot2")`.
+That installed the package onto your computer so that R can access it.
+In order to use it in our current session, we have to **load** the package using the `library()` function.
::::::::::::::::::::::::::::: callout
-If you do not have `ggplot2` installed, you can run `install.packages("ggplot2")` in the **console**.
+If you do not have `ggplot2` installed, you can run `install.packages("ggplot2")` in the **console**.
-It is a good practice not to put `install.packages()` into a script. This is because every time you run that whole script, the package will be reinstalled, which is typically unnecessary. You want to install the package to your computer once, and then load it with `library()` in each script where you need to use it.
+It is a good practice not to put `install.packages()` into a script.
+This is because every time you run that whole script, the package will be reinstalled, which is typically unnecessary.
+You want to install the package to your computer once, and then load it with `library()` in each script where you need to use it.
:::::::::::::::::::::::::::::::::::::
@@ -47,15 +54,18 @@ It is a good practice not to put `install.packages()` into a script. This is bec
library(ggplot2)
```
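The install-once, load-every-time advice from the callout above could look something like the sketch below. It uses `requireNamespace()` from base R to check whether a package is available without attaching it; the variable name `has_ggplot2` and the wording of the message are assumptions made for this illustration, not part of the lesson.

```r
# Check whether ggplot2 is installed without loading it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  # Loading is safe to repeat in every script that needs the package
  library(ggplot2)
} else {
  # Installing is left as a manual, one-time step in the console
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") in the console.")
}
```

The point of this design is that rerunning the whole script never triggers a reinstall; it only loads the package or reminds you to install it yourself.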
-Later we will learn how to read data from external files into R, but for now we are going to use a clean and ready-to-use dataset that is provided by the **`ratdat`** data package. To make our dataset available, we need to load this package too.
+Later we will learn how to read data from external files into R, but for now we are going to use a clean and ready-to-use dataset that is provided by the **`ratdat`** data package.
+To make our dataset available, we need to load this package too.
-```{r load-ratdat-package,message=FALSE}
+```{r load-ratdat-package, message=FALSE}
library(ratdat)
```
-The `ratdat` package contains data from the [Portal Project](https://github.com/weecology/PortalData), which is a long-term dataset from Portal, Arizona, in the Chihuahuan desert.
+The `ratdat` package contains data from the [Portal Project](https://github.com/weecology/PortalData), which is a long-term dataset from Portal, Arizona, in the Chihuahuan desert.
-We will be using a dataset called `complete_old`, which contains older years of survey data. Let's try to learn a little bit about the data. We can use a `?` in front of the name of the dataset, which will bring up the help page for the data.
+We will be using a dataset called `complete_old`, which contains older years of survey data.
+Let's try to learn a little bit about the data.
+We can use a `?` in front of the name of the dataset, which will bring up the help page for the data.
```{r data-help}
?complete_old
@@ -63,7 +73,8 @@ We will be using a dataset called `complete_old`, which contains older years of
Here we can read descriptions of each variable in our data.
-To actually take a look at the data, we can use the `View()` function to open an interactive viewer, which behaves like a simplified version of a spreadsheet program. It's a handy function, but somewhat limited when trying to view large datasets.
+To actually take a look at the data, we can use the `View()` function to open an interactive viewer, which behaves like a simplified version of a spreadsheet program.
+It's a handy function, but somewhat limited when trying to view large datasets.
```{r view-data, eval=FALSE}
View(complete_old)
@@ -77,7 +88,9 @@ We can find out more about the dataset by using the `str()` function to examine
str(complete_old)
```
-`str()` will tell us how many observations/rows (obs) and variables/columns we have, as well as some information about each of the variables. We see the name of a variable (such as `year`), followed by the kind of variable (**int** for integer, **chr** for character), and the first 10 entries in that variable. We will talk more about different data types and structures later on.
+`str()` will tell us how many observations/rows (obs) and variables/columns we have, as well as some information about each of the variables.
+We see the name of a variable (such as `year`), followed by the kind of variable (**int** for integer, **chr** for character), and the first few entries in that variable.
+We will talk more about different data types and structures later on.
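To see how `str()` reports types on something small enough to read at a glance, here is a sketch using a tiny made-up data frame. The object `toy`, its column names, and its values are assumptions for illustration and are not taken from `complete_old`.

```r
# A made-up table with one integer column and one character column
toy <- data.frame(
  year = c(1977L, 1978L, 1979L),
  species_id = c("DM", "PF", "DM"),
  stringsAsFactors = FALSE
)

# str() reports the number of rows and columns, then one line per column
# with its name, its type (int or chr here), and its first values
str(toy)
```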
## Plotting with **`ggplot2`**
@@ -87,15 +100,22 @@ Probably worth mentioning that people often just say **ggplot** when referring t
::::::::::::::::::::::::::::::::::::::::::::
-**`ggplot2`** is a powerful package that allows you to create complex plots from tabular data (data in a table format with rows and columns). The **`gg`** in **`ggplot2`** stands for "grammar of graphics", and the package uses consistent vocabulary to create plots of widely varying types. Therefore, we only need small changes to our code if the underlying data changes or we decide to make a box plot instead of a scatter plot. This approach helps you create publication-quality plots with minimal adjusting and tweaking.
+**`ggplot2`** is a powerful package that allows you to create complex plots from tabular data (data in a table format with rows and columns).
+The **`gg`** in **`ggplot2`** stands for "grammar of graphics", and the package uses consistent vocabulary to create plots of widely varying types.
+Therefore, we only need small changes to our code if the underlying data changes or we decide to make a box plot instead of a scatter plot.
+This approach helps you create publication-quality plots with minimal adjusting and tweaking.
-**`ggplot2`** is part of the **`tidyverse`** series of packages, which tend to like data in the "long" or "tidy" format, which means each column represents a single variable, and each row represents a single observation. Well-structured data will save you lots of time making figures with **`ggplot2`**. For now, we will use data that are already in this format. We start learning R by using **`ggplot2`** because it relies on concepts that we will need when we talk about data transformation in the next lessons.
+**`ggplot2`** is part of the **`tidyverse`** series of packages, which tend to like data in the "long" or "tidy" format: each column represents a single variable, and each row represents a single observation.
+Well-structured data will save you lots of time making figures with **`ggplot2`**.
+For now, we will use data that are already in this format.
+We start learning R by using **`ggplot2`** because it relies on concepts that we will need when we talk about data transformation in the next lessons.
**`ggplot`** plots are built step by step by adding new layers, which allows for extensive flexibility and customization of plots.
::::::::::::::::::::::::::::: callout
-Some languages, like Python, require certain spacing or indentation for code to run properly. This isn't the case in R, so if you see spaces or indentation in the code from this lesson, it is to improve readability.
+Some languages, like Python, require certain spacing or indentation for code to run properly.
+This isn't the case in R, so if you see spaces or indentation in the code from this lesson, it is to improve readability.
:::::::::::::::::::::::::::::
@@ -105,47 +125,65 @@ To build a plot, we will use a basic template that can be used for different typ
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
```
-We use the `ggplot()` function to create a plot. In order to tell it what data to use, we need to specify the `data` **argument**. An argument is an input that a function takes, and you set arguments using the `=` sign.
+We use the `ggplot()` function to create a plot.
+In order to tell it what data to use, we need to specify the `data` **argument**.
+An argument is an input that a function takes, and you set arguments using the `=` sign.
```{r bare-plot}
ggplot(data = complete_old)
```
-We get a blank plot because we haven't told `ggplot()` which variables we want to correspond to parts of the plot. We can specify the "mapping" of variables to plot elements, such as x/y coordinates, size, or shape, by using the `aes()` function. We'll also add a comment, which is any line starting with a `#`. It's a good idea to use comments to organize your code or clarify what you are doing.
+We get a blank plot because we haven't told `ggplot()` which variables we want to correspond to parts of the plot.
+We can specify the "mapping" of variables to plot elements, such as x/y coordinates, size, or shape, by using the `aes()` function.
+We'll also add a comment, which is any line starting with a `#`.
+It's a good idea to use comments to organize your code or clarify what you are doing.
```{r plot-with-axes}
# adding a mapping to x and y axes
ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length))
```
-Now we've got a plot with x and y axes corresponding to variables from `complete_old`. However, we haven't specified how we want the data to be displayed. We do this using `geom_` functions, which specify the type of `geom`etry we want, such as points, lines, or bars. We can add a `geom_point()` layer to our plot by using the `+` sign. We indent onto a new line to make it easier to read, and we have to **end** the first line with the `+` sign.
+Now we've got a plot with x and y axes corresponding to variables from `complete_old`.
+However, we haven't specified how we want the data to be displayed.
+We do this using `geom_` functions, which specify the type of `geom`etry we want, such as points, lines, or bars.
+We can add a `geom_point()` layer to our plot by using the `+` sign.
+We indent onto a new line to make it easier to read, and we have to **end** the first line with the `+` sign.
```{r scatter-plot}
ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()
```
-You may notice a warning that missing values were removed. If a variable necessary to make the plot is missing from a given row of data (in this case, `hindfoot_length` or `weight`), it can't be plotted. `ggplot2` just uses a warning message to let us know that some rows couldn't be plotted.
-
+You may notice a warning that missing values were removed.
+If a variable necessary to make the plot is missing from a given row of data (in this case, `hindfoot_length` or `weight`), it can't be plotted.
+`ggplot2` just uses a warning message to let us know that some rows couldn't be plotted.
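If you are curious how many rows such a warning refers to, one option is to count the rows that have a missing value in either plotted variable. The sketch below does this on a small made-up data frame; `toy` and its values are assumptions for illustration, and the same idea could be applied to `complete_old`.

```r
# Made-up data: NA marks a missing measurement
toy <- data.frame(
  weight = c(40, NA, 48, 35),
  hindfoot_length = c(35, 37, NA, 33)
)

# A row cannot be drawn as a point if either coordinate is missing
n_dropped <- sum(is.na(toy$weight) | is.na(toy$hindfoot_length))
n_dropped
```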
::::::::::::::::::::::::::::: callout
-**Warning** messages are one of a few ways R will communicate with you. Warnings can be thought of as a "heads up". Nothing necessarily went *wrong*, but the author of that function wanted to draw your attention to something. In the above case, it's worth knowing that some of the rows of your data were not plotted because they had missing data.
+**Warning** messages are one of a few ways R will communicate with you.
+Warnings can be thought of as a "heads up".
+Nothing necessarily went *wrong*, but the author of that function wanted to draw your attention to something.
+In the above case, it's worth knowing that some of the rows of your data were not plotted because they had missing data.
-A more serious type of message is an **error**. Here's an example:
+A more serious type of message is an **error**.
+Here's an example:
```{r geom-error}
ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length)) +
geom_poit()
```
-As you can see, we only get the error message, with no plot, because something has actually gone wrong. This particular error message is fairly common, and it happened because we misspelled `point` as `poit`. Because there is no function named `geom_poit()`, R tells us it can't find a function with that name.
+As you can see, we only get the error message, with no plot, because something has actually gone wrong.
+This particular error message is fairly common, and it happened because we misspelled `point` as `poit`.
+Because there is no function named `geom_poit()`, R tells us it can't find a function with that name.
:::::::::::::::::::::::::::::
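The difference between a warning and an error can also be seen outside of plotting. The following is a minimal sketch in base R; the object names `converted` and `error_text` are made up for this example, and `tryCatch()` is used only so that the error message can be captured without stopping the script.

```r
# A warning: the code still finishes and returns a value (NA here).
# suppressWarnings() hides the warning so only the value is kept.
converted <- suppressWarnings(as.numeric("forty"))
is.na(converted)

# An error: the call cannot finish because the function does not exist.
# tryCatch() captures the error message instead of halting.
error_text <- tryCatch(
  geom_poit(),
  error = function(e) conditionMessage(e)
)
error_text
```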
## Changing aesthetics
-Building **`ggplot`** plots is often an iterative process, so we'll continue developing the scatter plot we just made. You may have noticed that parts of our scatter plot have many overlapping points, making it difficult to see all the data. We can adjust the transparency of the points using the `alpha` argument, which takes a value between 0 and 1:
+Building **`ggplot`** plots is often an iterative process, so we'll continue developing the scatter plot we just made.
+You may have noticed that parts of our scatter plot have many overlapping points, making it difficult to see all the data.
+We can adjust the transparency of the points using the `alpha` argument, which takes a value between 0 and 1:
```{r change-alpha, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length)) +
@@ -161,7 +199,8 @@ ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length)) +
::::::::::::::::::::::::::::: callout
-Two common issues you might run into when working in R are forgetting a closing bracket or a closing quote. Let's take a look at what each one does.
+Two common issues you might run into when working in R are forgetting a closing bracket or a closing quote.
+Let's take a look at what each one does.
Try running the following code:
@@ -170,7 +209,10 @@ ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(color = "blue", alpha = 0.2
```
-You will see a `+` appear in your console. This is R telling you that it expects more input in order to finish running the code. It is missing a closing bracket to end the `geom_point` function call. You can hit Esc in the console to reset it.
+You will see a `+` appear in your console.
+This is R telling you that it expects more input in order to finish running the code.
+It is missing a closing bracket to end the `geom_point` function call.
+You can hit Esc in the console to reset it.
Something similar will happen if you run the following code:
@@ -188,20 +230,23 @@ ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(color = "blue", alpha = 0.2))
```
-This time we have an extra closing `)`, which R doesn't know what to do with. It tells you there is an unexpected `)`, but it doesn't pinpoint exactly where. With enough time working in R, you will get better at spotting mismatched brackets.
+This time we have an extra closing `)`, which R doesn't know what to do with.
+It tells you there is an unexpected `)`, but it doesn't pinpoint exactly where.
+With enough time working in R, you will get better at spotting mismatched brackets.
:::::::::::::::::::::::::::::
### Adding another variable
-Let's try coloring our points according to the sampling plot type (plot here refers to the physical area where rodents were sampled and has nothing to do with making graphs). Since we're now mapping a variable (`plot_type`) to a component of the ggplot2 plot (`color`), we need to put the argument inside `aes()`:
+Let's try coloring our points according to the sampling plot type (plot here refers to the physical area where rodents were sampled and has nothing to do with making graphs).
+Since we're now mapping a variable (`plot_type`) to a component of the ggplot2 plot (`color`), we need to put the argument inside `aes()`:
```{r color-plot-type, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length, color = plot_type)) +
geom_point(alpha = 0.2)
```
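The distinction between setting a fixed color and mapping color to a variable can be compared side by side. This is a sketch on a small made-up data frame (the `toy` object and its values are assumptions for illustration), storing each plot in a variable so the two can be printed and compared.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  weight = c(40, 48, 35, 117),
  hindfoot_length = c(35, 37, 33, 50),
  plot_type = c("Control", "Control", "Rodent Exclosure", "Rodent Exclosure")
)

# Setting: every point gets the same fixed color, so it goes outside aes()
p_set <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(color = "blue")

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length, color = plot_type)) +
  geom_point()
```

Printing `p_set` should show all points in blue with no legend, while `p_mapped` gets a legend because the color now carries information about `plot_type`.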
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 1: Modifying plots
@@ -209,7 +254,7 @@ ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length, color
Do you think this is a good way to represent `sex` with these data?
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r modify-points-challenge-answer, warning=FALSE}
ggplot(data = complete_old,
@@ -221,14 +266,14 @@ ggplot(data = complete_old,
2. Now try changing the plot so that the `color` of the points varies by `year`. Do you notice a difference in the color scale compared to changing color by plot type? Why do you think this happened?
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r modify-color-challenge-answer, warning=FALSE}
ggplot(data = complete_old,
mapping = aes(x = weight, y = hindfoot_length, color = year)) +
geom_point(alpha = 0.2)
```
-
+
- For Part 2, the color scale is different compared to using `color = plot_type` because `plot_type` and `year` are different variable types. `plot_type` is a categorical variable, so `ggplot2` defaults to use a **discrete** color scale, whereas `year` is a numeric variable, so `ggplot2` uses a **continuous** color scale.
::::::::::::::::::::::::
@@ -236,7 +281,9 @@ ggplot(data = complete_old,
## Changing scales
-The default discrete color scale isn't always ideal: it isn't friendly to viewers with colorblindness and it doesn't translate well to grayscale. However, **`ggplot2`** comes with quite a few other color scales, including the fantastic `viridis` scales, which are designed to be colorblind and grayscale friendly. We can change scales by adding `scale_` functions to our plots:
+The default discrete color scale isn't always ideal: it isn't friendly to viewers with colorblindness and it doesn't translate well to grayscale.
+However, **`ggplot2`** comes with quite a few other color scales, including the fantastic `viridis` scales, which are designed to be colorblind and grayscale friendly.
+We can change scales by adding `scale_` functions to our plots:
```{r scale-viridis, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length, color = plot_type)) +
@@ -244,7 +291,8 @@ ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length, color
scale_color_viridis_d()
```
-Scales don't just apply to colors- any plot component that you put inside `aes()` can be modified with `scale_` functions. Just as we modified the scale used to map `plot_type` to `color`, we can modify the way that `weight` is mapped to the `x` axis by using the `scale_x_log10()` function:
+Scales don't just apply to colors: any plot component that you put inside `aes()` can be modified with `scale_` functions.
+Just as we modified the scale used to map `plot_type` to `color`, we can modify the way that `weight` is mapped to the `x` axis by using the `scale_x_log10()` function:
```{r scale-log, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length, color = plot_type)) +
@@ -252,17 +300,20 @@ ggplot(data = complete_old, mapping = aes(x = weight, y = hindfoot_length, color
scale_x_log10()
```
-One nice thing about `ggplot` and the `tidyverse` in general is that groups of functions that do similar things are given similar names. Any function that modifies a `ggplot` scale starts with `scale_`, making it easier to search for the right function.
+One nice thing about `ggplot` and the `tidyverse` in general is that groups of functions that do similar things are given similar names.
+Any function that modifies a `ggplot` scale starts with `scale_`, making it easier to search for the right function.
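One way to explore this naming pattern is to list the available functions whose names start with `scale_`. The sketch below uses `apropos()` from base R, which searches the names of objects in attached packages; the variable name `scale_functions` is made up for this example.

```r
library(ggplot2)

# Find every attached function or object whose name begins with "scale_"
scale_functions <- apropos("^scale_")

# Peek at the first few names
head(scale_functions)
```

In RStudio, typing `scale_` and pressing Tab gives a similar list through autocompletion.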
## Boxplot
-Let's try making a different type of plot altogether. We'll start off with our same basic building blocks using `ggplot()` and `aes()`.
+Let's try making a different type of plot altogether.
+We'll start off with our same basic building blocks using `ggplot()` and `aes()`.
```{r blank-boxplot}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length))
```
-This time, let's try making a boxplot, which will have `plot_type` on the x axis and `hindfoot_length` on the y axis. We can do this by adding `geom_boxplot()` to our `ggplot()`:
+This time, let's try making a boxplot, which will have `plot_type` on the x axis and `hindfoot_length` on the y axis.
+We can do this by adding `geom_boxplot()` to our `ggplot()`:
```{r boxplot}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
@@ -276,7 +327,9 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length, co
geom_boxplot()
```
-It looks like `color` has only affected the outlines of the boxplot, not the rectangular portions. This is because the `color` only impacts 1-dimensional parts of a `ggplot`: points and lines. To change the color of 2-dimensional parts of a plot, we use `fill`:
+It looks like `color` has only affected the outlines of the boxplot, not the rectangular portions.
+This is because the `color` only impacts 1-dimensional parts of a `ggplot`: points and lines.
+To change the color of 2-dimensional parts of a plot, we use `fill`:
```{r boxplot-fill, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length, fill = plot_type)) +
@@ -285,9 +338,12 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length, fi
::::::::::::::::::::::::::::: callout
-One thing you may notice is that the axis labels are overlapping each other, depending on how wide your plot viewer is. One way to help make them more legible is to **wrap** the text. We can do that by modifying the **labels** for the `x` axis `scale`.
+One thing you may notice is that the axis labels are overlapping each other, depending on how wide your plot viewer is.
+One way to help make them more legible is to **wrap** the text.
+We can do that by modifying the **labels** for the `x` axis `scale`.
-We use the `scale_x_discrete()` function because we have a discrete axis, and we modify the `labels` argument. The function `label_wrap_gen()` will wrap the text of the labels to make them more legible.
+We use the `scale_x_discrete()` function because we have a discrete axis, and we modify the `labels` argument.
+The function `label_wrap_gen()` will wrap the text of the labels to make them more legible.
```{r boxplot-label-wrap, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length, fill = plot_type)) +
@@ -299,9 +355,12 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length, fi
## Adding geoms
-One of the most powerful aspects of **`ggplot`** is the way we can add components to a plot in successive layers. While boxplots can be very useful for summarizing data, it is often helpful to show the raw data as well. With **`ggplot`**, we can easily add another `geom_` to our plot to show the raw data.
+One of the most powerful aspects of **`ggplot`** is the way we can add components to a plot in successive layers.
+While boxplots can be very useful for summarizing data, it is often helpful to show the raw data as well.
+With **`ggplot`**, we can easily add another `geom_` to our plot to show the raw data.
-Let's add `geom_point()` to visualize the raw data. We will modify the `alpha` argument to help with overplotting.
+Let's add `geom_point()` to visualize the raw data.
+We will modify the `alpha` argument to help with overplotting.
```{r boxplot-points, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
@@ -309,7 +368,8 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
geom_point(alpha = 0.2)
```
-Uh oh... all our points for a given `x` axis category fall exactly on a line, which isn't very useful. We can shift to using `geom_jitter()`, which will add points with a bit of random noise added to the positions to prevent this from happening.
+Uh oh... all our points for a given `x` axis category fall exactly on a line, which isn't very useful.
+We can shift to using `geom_jitter()`, which will add points with a bit of random noise added to the positions to prevent this from happening.
```{r boxplot-jitter, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
@@ -317,9 +377,11 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
geom_jitter(alpha = 0.2)
```
-You may have noticed that some of our data points are now appearing on our plot twice: the outliers are plotted as black points from `geom_boxplot()`, but they are also plotted with `geom_jitter()`. Since we don't want to represent these data multiple times in the same form (points), we can stop `geom_boxplot()` from plotting them. We do this by setting the `outlier.shape` argument to `NA`, which means the outliers don't have a shape to be plotted.
+You may have noticed that some of our data points are now appearing on our plot twice: the outliers are plotted as black points from `geom_boxplot()`, but they are also plotted with `geom_jitter()`.
+Since we don't want to represent these data multiple times in the same form (points), we can stop `geom_boxplot()` from plotting them.
+We do this by setting the `outlier.shape` argument to `NA`, which means the outliers don't have a shape to be plotted.
-```{r boxplot-outliers, warning = F}
+```{r boxplot-outliers, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(alpha = 0.2)
@@ -333,7 +395,8 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length, co
geom_jitter(alpha = 0.2)
```
-Notice that both the color of the points and the color of the boxplot lines changed. Any time we specify an `aes()` mapping inside our initial `ggplot()` function, that mapping will apply to all our `geom`s.
+Notice that both the color of the points and the color of the boxplot lines changed.
+Any time we specify an `aes()` mapping inside our initial `ggplot()` function, that mapping will apply to all our `geom`s.
If we want to limit the mapping to a single `geom`, we can put the mapping into the specific `geom_` function, like this:
@@ -343,7 +406,10 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
geom_jitter(aes(color = plot_type), alpha = 0.2)
```
-Now our points are colored according to `plot_type`, but the boxplots are all the same color. One thing you might notice is that even with `alpha = 0.2`, the points obscure parts of the boxplot. This is because the `geom_point()` layer comes after the `geom_boxplot()` layer, which means the points are plotted on top of the boxes. To put the boxplots on top, we switch the order of the layers:
+Now our points are colored according to `plot_type`, but the boxplots are all the same color.
+One thing you might notice is that even with `alpha = 0.2`, the points obscure parts of the boxplot.
+This is because the `geom_jitter()` layer comes after the `geom_boxplot()` layer, which means the points are plotted on top of the boxes.
+To put the boxplots on top, we switch the order of the layers:
```{r reverse-layers, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
@@ -351,7 +417,10 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
geom_boxplot(outlier.shape = NA)
```
-Now we have the opposite problem! The white `fill` of the boxplots completely obscures some of the points. To address this problem, we can remove the `fill` from the boxplots altogether, leaving only the black lines. To do this, we set `fill` to `NA`:
+Now we have the opposite problem!
+The white `fill` of the boxplots completely obscures some of the points.
+To address this problem, we can remove the `fill` from the boxplots altogether, leaving only the black lines.
+To do this, we set `fill` to `NA`:
```{r fill-na, warning=FALSE}
ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
@@ -361,15 +430,18 @@ ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
Now we can see all the raw data and our boxplots on top.
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 2: Change `geom`s
-Violin plots are similar to boxplots- try making one using `plot_type` and `hindfoot_length` as the x and y variables. Remember that all geom functions start with `geom_`, followed by the type of geom.
+Violin plots are similar to boxplots. Try making one using `plot_type` and `hindfoot_length` as the x and y variables.
+Remember that all geom functions start with `geom_`, followed by the type of geom.
-This might also be a place to test your search engine skills. It is often useful to search for `R package_name stuff you want to search`. So for this example we might search for `R ggplot2 violin plot`.
+This might also be a place to test your search engine skills.
+It is often useful to search for `R package_name stuff you want to search`.
+So for this example we might search for `R ggplot2 violin plot`.
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r violin-challenge-answer, warning=FALSE}
ggplot(data = complete_old,
@@ -382,9 +454,10 @@ ggplot(data = complete_old,
::::::::::::::::::::::::
-For an *extra challenge*, , make the color of the points and outlines of the violins vary by `plot_type`, and set the fill of the violins to white. Try playing with the order of the layers to see what looks best.
+For an *extra challenge*, make the color of the points and outlines of the violins vary by `plot_type`, and set the fill of the violins to white.
+Try playing with the order of the layers to see what looks best.
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r violin-challenge-answer-2, warning=FALSE}
ggplot(data = complete_old,
@@ -400,11 +473,14 @@ ggplot(data = complete_old,
## Changing themes
-So far we've been changing the appearance of parts of our plot related to our data and the `geom_` functions, but we can also change many of the non-data components of our plot.
+So far we've been changing the appearance of parts of our plot related to our data and the `geom_` functions, but we can also change many of the non-data components of our plot.
-At this point, we are pretty happy with the basic layout of our plot, so we can **assign** it to a plot to a named **object**. We do this using the **assignment arrow** `<-`. What we are doing here is taking the result of the code on the right side of the arrow, and assigning it to an object whose name is on the left side of the arrow.
+At this point, we are pretty happy with the basic layout of our plot, so we can **assign** the plot to a named **object**.
+We do this using the **assignment arrow** `<-`.
+What we are doing here is taking the result of the code on the right side of the arrow, and assigning it to an object whose name is on the left side of the arrow.
-We will create an object called `myplot`. If you run the name of the `ggplot2` object, it will show the plot, just like if you ran the code itself.
+We will create an object called `myplot`.
+If you run the name of the `ggplot2` object, it will show the plot, just like if you ran the code itself.
```{r}
myplot <- ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_length)) +
@@ -414,15 +490,22 @@ myplot <- ggplot(data = complete_old, mapping = aes(x = plot_type, y = hindfoot_
myplot
```
-This process of assigning something to an **object** is not specific to `ggplot2`, but rather a general feature of R. We will be using it a lot in the rest of this lesson. We can now work with the `myplot` object as if it was a block of `ggplot2` code, which means we can use `+` to add new components to it.
+This process of assigning something to an **object** is not specific to `ggplot2`, but rather a general feature of R.
+We will be using it a lot in the rest of this lesson.
+We can now work with the `myplot` object as if it were a block of `ggplot2` code, which means we can use `+` to add new components to it.
-We can change the overall appearance using `theme_` functions. Let's try a black-and-white theme by adding `theme_bw()` to our plot:
+We can change the overall appearance using `theme_` functions.
+Let's try a black-and-white theme by adding `theme_bw()` to our plot:
```{r theme-bw, warning=FALSE}
myplot + theme_bw()
```
-As you can see, a number of parts of the plot have changed. `theme_` functions usually control many aspects of a plot's appearance all at once, for the sake of convenience. To individually change parts of a plot, we can use the `theme()` function, which can take many different arguments to change things about the text, grid lines, background color, and more. Let's try changing the size of the text on our axis titles. We can do this by specifying that the `axis.title` should be an `element_text()` with `size` set to 14.
+As you can see, a number of parts of the plot have changed.
+`theme_` functions usually control many aspects of a plot's appearance all at once, for the sake of convenience.
+To individually change parts of a plot, we can use the `theme()` function, which can take many different arguments to change things about the text, grid lines, background color, and more.
+Let's try changing the size of the text on our axis titles.
+We can do this by specifying that the `axis.title` should be an `element_text()` with `size` set to 14.
```{r text-size, warning=FALSE}
myplot +
@@ -430,7 +513,9 @@ myplot +
theme(axis.title = element_text(size = 14))
```
-Another change we might want to make is to remove the vertical grid lines. Since our x axis is categorical, those grid lines aren't useful. To do this, inside `theme()`, we will change the `panel.grid.major.x` to an `element_blank()`.
+Another change we might want to make is to remove the vertical grid lines.
+Since our x axis is categorical, those grid lines aren't useful.
+To do this, inside `theme()`, we will change the `panel.grid.major.x` to an `element_blank()`.
```{r element-blank, warning=FALSE}
myplot +
@@ -439,7 +524,8 @@ myplot +
panel.grid.major.x = element_blank())
```
-Another useful change might be to remove the color legend, since that information is already on our x axis. For this one, we will set `legend.position` to "none".
+Another useful change might be to remove the color legend, since that information is already on our x axis.
+For this one, we will set `legend.position` to "none".
```{r legend-remove, warning=FALSE}
myplot +
@@ -452,28 +538,31 @@ myplot +
::::::::::::::::::::::::::::: callout
-Because there are so many possible arguments to the `theme()` function, it can sometimes be hard to find the right one. Here are some tips for figuring out how to modify a plot element:
+Because there are so many possible arguments to the `theme()` function, it can sometimes be hard to find the right one.
+Here are some tips for figuring out how to modify a plot element:
- type out `theme()`, put your cursor between the parentheses, and hit Tab to bring up a list of arguments
- - you can scroll through the arguments, or start typing, which will shorten the list of potential matches
+ - you can scroll through the arguments, or start typing, which will shorten the list of potential matches
- like many things in the `tidyverse`, similar arguments start with similar names
- - there are `axis`, `legend`, `panel`, `plot`, and `strip` arguments
+ - there are `axis`, `legend`, `panel`, `plot`, and `strip` arguments
- arguments have hierarchy
- - `text` controls all text in the whole plot
- - `axis.title` controls the text for the axis titles
- - `axis.title.x` controls the text for the x axis title
+ - `text` controls all text in the whole plot
+ - `axis.title` controls the text for the axis titles
+ - `axis.title.x` controls the text for the x axis title
:::::::::::::::::::::::::::::
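
The argument hierarchy described above can be sketched with a small example. The `toy` data frame and the object name `p` are made up for illustration and are not part of the lesson data; the sketch only assumes that `ggplot2` is installed. A general argument (`text`) sets a default, and a more specific one (`axis.title.x`) adjusts a single element on top of it:

```r
library(ggplot2)

# A tiny made-up data frame, used only for illustration
toy <- data.frame(group = c("a", "b", "c"), value = c(1, 3, 2))

p <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_point() +
  theme(
    text = element_text(size = 14),            # applies to all text in the plot
    axis.title.x = element_text(face = "bold") # applies only to the x axis title
  )

p
```

The x axis title should pick up the larger size from `text` and additionally be drawn in bold, while the other text elements only change size.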
::::::::::::::::::::::::::::: callout
-You may have noticed that we have used 3 different approaches to getting rid of something in `ggplot`:
+You may have noticed that we have used 3 different approaches to getting rid of something in `ggplot`:
- `outlier.shape = NA` to remove the outliers from our boxplot
- `panel.grid.major.x = element_blank()` to remove the x grid lines
- `legend.position = "none"` to remove our legend
-Why are there so many ways to do what seems like the same thing?? This is a common frustration when working with R, or with any programming language. There are a couple reasons for it:
+Why are there so many ways to do what seems like the same thing??
+This is a common frustration when working with R, or with any programming language.
+There are a couple reasons for it:
1. Different people contribute to different packages and functions, and they may choose to do things differently.
2. Code may *appear* to be doing the same thing, when the details are actually quite different. The inner workings of `ggplot2` are actually quite complex, since it turns out making plots is a very complicated process! Because of this, things that seem the same (removing parts of a plot), may actually be operating on very different components or stages of the final plot.
@@ -483,7 +572,9 @@ Why are there so many ways to do what seems like the same thing?? This is a comm
## Changing labels
-Our plot is really shaping up now. However, we probably want to make our axis titles nicer, and perhaps add a main title to the plot. We can do this using the `labs()` function:
+Our plot is really shaping up now.
+However, we probably want to make our axis titles nicer, and perhaps add a main title to the plot.
+We can do this using the `labs()` function:
```{r labels, warning=FALSE}
myplot +
@@ -495,17 +586,19 @@ myplot +
y = "Hindfoot length (mm)")
```
-We removed our legend from this plot, but you can also change the titles of various legends using `labs()`. For example, `labs(color = "Plot type")` would change the title of a color scale legend to "Plot type".
+We removed our legend from this plot, but you can also change the titles of various legends using `labs()`.
+For example, `labs(color = "Plot type")` would change the title of a color scale legend to "Plot type".
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 3: Customizing a plot
-Modify the previous plot by adding a descriptive subtitle. Increase the font size of the plot title and make it bold.
+Modify the previous plot by adding a descriptive subtitle.
+Increase the font size of the plot title and make it bold.
**Hint**: "bold" is referred to as a font "face"
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r customizing-challenge-answer, warning=FALSE}
myplot +
@@ -523,11 +616,13 @@ myplot +
## Faceting
-One of the most powerful features of **`ggplot`** is the ability to quickly split a plot into multiple smaller plots based on a categorical variable, which is called **faceting**.
+One of the most powerful features of **`ggplot`** is the ability to quickly split a plot into multiple smaller plots based on a categorical variable, which is called **faceting**.
-So far we've mapped variables to the x axis, the y axis, and color, but trying to add a 4th variable becomes difficult. Changing the shape of a point might work, but only for very few categories, and even then, it can be hard to tell the differences between the shapes of small points.
+So far we've mapped variables to the x axis, the y axis, and color, but trying to add a 4th variable becomes difficult.
+Changing the shape of a point might work, but only for very few categories, and even then, it can be hard to tell the differences between the shapes of small points.
-Instead of cramming one more variable into a single plot, we will use the `facet_wrap()` function to generate a series of smaller plots, split out by `sex`. We also use `ncol` to specify that we want them arranged in a single column:
+Instead of cramming one more variable into a single plot, we will use the `facet_wrap()` function to generate a series of smaller plots, split out by `sex`.
+We also use `ncol` to specify that we want them arranged in a single column:
```{r facet-wrap, warning=FALSE}
myplot +
@@ -544,7 +639,8 @@ myplot +
::::::::::::::::::::::::::::: callout
-Faceting comes in handy in many scenarios. It can be useful when:
+Faceting comes in handy in many scenarios.
+It can be useful when:
- a categorical variable has too many levels to differentiate by color (such as a dataset with 20 countries)
- your data overlap heavily, obscuring categories
@@ -570,29 +666,35 @@ finalplot <- myplot +
facet_wrap(vars(sex), ncol = 1)
```
-After this, we can run `ggsave()` to save our plot. The first argument we give is the path to the file we want to save, including the correct file extension. This code will make an image called `rodent_size_plots.jpg` in the `images/` folder of our current project. We are making a `.jpg`, but you can save `.pdf`, `.tiff`, and other file formats. Next, we tell it the name of the plot object we want to save. We can also specify things like the width and height of the plot in inches.
+After this, we can run `ggsave()` to save our plot.
+The first argument we give is the path to the file we want to save, including the correct file extension.
+This code will make an image called `rodent_size_plots.jpg` in the `images/` folder of our current project.
+We are making a `.jpg`, but you can save `.pdf`, `.tiff`, and other file formats.
+Next, we tell it the name of the plot object we want to save.
+We can also specify things like the width and height of the plot in inches.
```{r save-plot, eval=FALSE}
ggsave(filename = "images/rodent_size_plots.jpg", plot = finalplot,
height = 6, width = 8)
```
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 4: Make your own plot
-Try making your own plot! You can run `str(complete_old)` or `?complete_old` to explore variables you might use in your new plot. Feel free to use variables we have already seen, or some we haven't explored yet.
+Try making your own plot!
+You can run `str(complete_old)` or `?complete_old` to explore variables you might use in your new plot.
+Feel free to use variables we have already seen, or some we haven't explored yet.
Here are a couple ideas to get you started:
- - make a histogram of one of the numeric variables
- - try using a different color `scale_`
- - try changing the size of points or thickness of lines in a `geom`
+- make a histogram of one of the numeric variables
+- try using a different color `scale_`
+- try changing the size of points or thickness of lines in a `geom`
::::::::::::::::::::::::::::::::::::::::::::::::
-
-::::::::::::::::::::::::::::::::::::: keypoints
+::::::::::::::::::::::::::::::::::::: keypoints
- the `ggplot()` function initiates a plot, and `geom_` functions add representations of your data
- use `aes()` when mapping a variable from the data to a part of the plot
diff --git a/episodes/working-with-data.Rmd b/episodes/working-with-data.Rmd
index 853feacde..517e47973 100644
--- a/episodes/working-with-data.Rmd
+++ b/episodes/working-with-data.Rmd
@@ -23,9 +23,7 @@ exercises: 4
-
-
@@ -40,8 +38,8 @@ exercises: 4
-
-:::::::::::::::::::::::::::::::::::::: questions
+
+:::::::::::::::::::::::::::::::::::::: questions
- How do you manipulate tabular data in R?
@@ -59,7 +57,6 @@ exercises: 4
- Export data to a CSV file.
::::::::::::::::::::::::::::::::::::::::::::::::
-
```{r setup, include=FALSE}
knitr::opts_chunk$set(dpi = 200, out.height = 600, out.width = 600, R.options = list(max.print = 100))
@@ -71,19 +68,29 @@ library(tidyverse)
## Importing data
-Up until this point, we have been working with the `complete_old` dataframe contained in the `ratdat` package. However, you typically won't access data from an R package; it is much more common to access data files stored somewhere on your computer. We are going to download a CSV file containing the surveys data to our computer, which we will then read into R.
+Up until this point, we have been working with the `complete_old` dataframe contained in the `ratdat` package.
+However, you typically won't access data from an R package; it is much more common to access data files stored somewhere on your computer.
+We are going to download a CSV file containing the surveys data to our computer, which we will then read into R.
Click this link to download the file: .
-You will be prompted to save the file on your computer somewhere. Save it inside the `cleaned` data folder, which is in the `data` folder in your `R-Ecology-Workshop` folder. Once it's inside our project, we will be able to point R towards it.
+You will be prompted to save the file on your computer somewhere.
+Save it inside the `cleaned` data folder, which is in the `data` folder in your `R-Ecology-Workshop` folder.
+Once it's inside our project, we will be able to point R towards it.
#### File paths
-When we reference other files from an R script, we need to give R precise instructions on where those files are. We do that using something called a **file path**. It looks something like this: `"Documents/Manuscripts/Chapter_2.txt"`. This path would tell your computer how to get from whatever folder contains the `Documents` folder all the way to the `.txt` file.
+When we reference other files from an R script, we need to give R precise instructions on where those files are.
+We do that using something called a **file path**.
+It looks something like this: `"Documents/Manuscripts/Chapter_2.txt"`.
+This path would tell your computer how to get from whatever folder contains the `Documents` folder all the way to the `.txt` file.
-There are two kinds of paths: **absolute** and **relative**. Absolute paths are specific to a particular computer, whereas relative paths are relative to a certain folder. Because we are keeping all of our work in the `R-Ecology-Workshop` folder, all of our paths can be relative to this folder.
+There are two kinds of paths: **absolute** and **relative**.
+Absolute paths are specific to a particular computer, whereas relative paths are relative to a certain folder.
+Because we are keeping all of our work in the `R-Ecology-Workshop` folder, all of our paths can be relative to this folder.
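
As a quick check of how relative paths behave, the sketch below prints the folder that R treats as the starting point and then asks whether a file exists at a relative path. It assumes the working directory is the `R-Ecology-Workshop` project folder and that the CSV file has already been saved in `data/cleaned/`; if either assumption does not hold, the second call will simply return `FALSE`.

```r
# Show the working directory, which is where relative paths start from
getwd()

# Ask whether the file can be found at this path, relative to the working directory
csv_path <- "data/cleaned/surveys_complete_77_89.csv"
file.exists(csv_path)
```
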
-Now, let's read our CSV file into R and store it in an object named `surveys`. We will use the `read_csv` function from the `tidyverse`'s `readr` package, and the argument we give will be the **relative path** to the CSV file.
+Now, let's read our CSV file into R and store it in an object named `surveys`.
+We will use the `read_csv` function from the `tidyverse`'s `readr` package, and the argument we give will be the **relative path** to the CSV file.
```{r read-csv}
surveys <- read_csv("data/cleaned/surveys_complete_77_89.csv")
@@ -91,48 +98,71 @@ surveys <- read_csv("data/cleaned/surveys_complete_77_89.csv")
:::::::::::::::::::::::::::::::::::::::::: callout
-Typing out paths can be error prone, so we can utilize a keyboard shortcut. Inside the parentheses of `read_csv()`, type out a pair of quotes and put your cursor between them. Then hit Tab. A small menu showing your folders and files should show up. You can use the ↑ and ↓ keys to move through the options, or start typing to narrow them down. You can hit Enter to select a file or folder, and hit Tab again to continue building the file path. This might take a bit of getting used to, but once you get the hang of it, it will speed up writing file paths and reduce the number of mistakes you make.
+Typing out paths can be error prone, so we can utilize a keyboard shortcut.
+Inside the parentheses of `read_csv()`, type out a pair of quotes and put your cursor between them.
+Then hit Tab.
+A small menu showing your folders and files should show up.
+You can use the ↑ and ↓ keys to move through the options, or start typing to narrow them down.
+You can hit Enter to select a file or folder, and hit Tab again to continue building the file path.
+This might take a bit of getting used to, but once you get the hang of it, it will speed up writing file paths and reduce the number of mistakes you make.
::::::::::::::::::::::::::::::::::::::::::
-You may have noticed a bit of feedback from R when you ran the last line of code. We got some useful information about the CSV file we read in. We can see:
+You may have noticed a bit of feedback from R when you ran the last line of code.
+We got some useful information about the CSV file we read in.
+We can see:
- the number of rows and columns
- the **delimiter** of the file, which is how values are separated, a comma `","`
- a set of columns that were **parsed** as various vector types
- the file has `r surveys %>% select(where(is.character)) %>% ncol()` character columns and `r surveys %>% select(where(is.numeric)) %>% ncol()` numeric columns
- we can see the names of the columns for each type
-
-
+
When working with the output of a new function, it's often a good idea to check the `class()`:
```{r class-tibble}
class(surveys)
```
-Whoa! What is this thing? It has multiple classes? Well, it's called a `tibble`, and it is the `tidyverse` version of a data.frame. It *is* a data.frame, but with some added perks. It prints out a little more nicely, it highlights `NA` values and negative values in red, and it will generally communicate with you more (in terms of warnings and errors, which is a good thing).
+Whoa!
+What is this thing?
+It has multiple classes?
+Well, it's called a `tibble`, and it is the `tidyverse` version of a data.frame.
+It *is* a data.frame, but with some added perks.
+It prints out a little more nicely, it highlights `NA` values and negative values in red, and it will generally communicate with you more (in terms of warnings and errors, which is a good thing).
:::::::::::::::::::::::::::::::::::::::::: callout
**`tidyverse` vs. base R**
-As we begin to delve more deeply into the `tidyverse`, we should briefly pause to mention some of the reasons for focusing on the `tidyverse` set of tools. In R, there are often many ways to get a job done, and there are other approaches that can accomplish tasks similar to the `tidyverse`.
+As we begin to delve more deeply into the `tidyverse`, we should briefly pause to mention some of the reasons for focusing on the `tidyverse` set of tools.
+In R, there are often many ways to get a job done, and there are other approaches that can accomplish tasks similar to the `tidyverse`.
-The phrase **base R** is used to refer to approaches that utilize functions contained in R's default packages. We have already used some base R functions, such as `str()`, `head()`, and `mean()`, and we will be using more scattered throughout this lesson. However, there are some key base R approaches we will not be teaching. These include square bracket subsetting and base plotting. You may come across code written by other people that looks like `surveys[1:10, 2]` or `plot(surveys$weight, surveys$hindfoot_length)`, which are base R commands. If you're interested in learning more about these approaches, you can check out other Carpentries lessons like the [Software Carpentry Programming with R](https://swcarpentry.github.io/r-novice-inflammation/) lesson.
+The phrase **base R** is used to refer to approaches that utilize functions contained in R's default packages.
+We have already used some base R functions, such as `str()`, `head()`, and `mean()`, and we will be using more scattered throughout this lesson.
+However, there are some key base R approaches we will not be teaching.
+These include square bracket subsetting and base plotting.
+You may come across code written by other people that looks like `surveys[1:10, 2]` or `plot(surveys$weight, surveys$hindfoot_length)`, which are base R commands.
+If you're interested in learning more about these approaches, you can check out other Carpentries lessons like the [Software Carpentry Programming with R](https://swcarpentry.github.io/r-novice-inflammation/) lesson.
-We choose to teach the `tidyverse` set of packages because they share a similar syntax and philosophy, making them consistent and producing highly readable code. They are also very flexible and powerful, with a growing number of packages designed according to similar principles and to work well with the rest of the packages. The `tidyverse` packages tend to have very clear documentation and wide array of learning materials that tend to be written with novice users in mind. Finally, the `tidyverse` has only continued to grow, and has strong support from RStudio, which implies that these approaches will be relevant into the future.
+We choose to teach the `tidyverse` set of packages because they share a similar syntax and philosophy, making them consistent and producing highly readable code.
+They are also very flexible and powerful, with a growing number of packages designed according to similar principles and to work well with the rest of the packages.
+The `tidyverse` packages tend to have very clear documentation and a wide array of learning materials, often written with novice users in mind.
+Finally, the `tidyverse` has only continued to grow, and has strong support from RStudio, which implies that these approaches will be relevant into the future.
::::::::::::::::::::::::::::::::::::::::::
## Manipulating data
-One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data. The `dplyr` and `tidyr` packages in the `tidyverse` provide a series of powerful functions for many common data manipulation tasks.
+One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data.
+The `dplyr` and `tidyr` packages in the `tidyverse` provide a series of powerful functions for many common data manipulation tasks.
We'll start off with two of the most commonly used `dplyr` functions: `select()`, which selects certain columns of a data.frame, and `filter()`, which keeps only the rows that meet certain criteria.
:::::::::::::::::::::::::::::::::::::::::: callout
-Between `select()` and `filter()`, it can be hard to remember which operates on columns and which operates on rows. `sele`**`c`**`t()` has a **c** for **c**olumns and `filte`**`r`**`()` has an **r** for **r**ows.
+Between `select()` and `filter()`, it can be hard to remember which operates on columns and which operates on rows.
+`sele`**`c`**`t()` has a **c** for **c**olumns and `filte`**`r`**`()` has an **r** for **r**ows.
::::::::::::::::::::::::::::::::::::::::::
@@ -152,15 +182,18 @@ To select all columns except specific columns, put a `-` in front of the column
select(surveys, -record_id, -year)
```
-`select()` also works with numeric vectors for the order of the columns. To select the 3rd, 4th, 5th, and 10th columns, we could run the following code:
+`select()` also works with numeric vectors for the order of the columns.
+To select the 3rd, 4th, 5th, and 10th columns, we could run the following code:
```{r select-vector}
select(surveys, c(3:5, 10))
```
-You should be careful when using this method, since you are being less explicit about which columns you want. However, it can be useful if you have a data.frame with many columns and you don't want to type out too many names.
+You should be careful when using this method, since you are being less explicit about which columns you want.
+However, it can be useful if you have a data.frame with many columns and you don't want to type out too many names.
-Finally, you can select columns based on whether they match a certain criteria by using the `where()` function. If we want all numeric columns, we can ask to `select` all the columns `where` the class `is numeric`:
+Finally, you can select columns based on whether they meet certain criteria by using the `where()` function.
+If we want all numeric columns, we can ask to `select` all the columns `where` the class `is numeric`:
```{r select-where}
select(surveys, where(is.numeric))
@@ -176,25 +209,30 @@ select(surveys, where(anyNA))
#### `filter()`
-The `filter()` function is used to select rows that meet certain criteria. To get all the rows where the value of `year` is equal to 1985, we would run the following:
+The `filter()` function is used to select rows that meet certain criteria.
+To get all the rows where the value of `year` is equal to 1985, we would run the following:
```{r filter}
filter(surveys, year == 1985)
```
-The `==` sign means "is equal to". There are several other operators we can use: >, >=, <, <=, and != (not equal to). Another useful operator is `%in%`, which asks if the value on the lefthand side is found anywhere in the vector on the righthand side. For example, to get rows with specific `species_id` values, we could run:
+The `==` sign means "is equal to".
+There are several other operators we can use: >, >=, \<, \<=, and !\= (not equal to).
+Another useful operator is `%in%`, which asks if the value on the lefthand side is found anywhere in the vector on the righthand side.
+For example, to get rows with specific `species_id` values, we could run:
```{r filter-in}
filter(surveys, species_id %in% c("RM", "DO"))
```
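
To see what `%in%` produces on its own, outside of `filter()`, here is a minimal sketch using a made-up vector of species codes (the `ids` vector is hypothetical and not taken from `surveys`). The operator returns one `TRUE` or `FALSE` per element on the lefthand side, and `filter()` keeps the rows where that result is `TRUE`.

```r
# A hypothetical vector of species codes, made up for illustration
ids <- c("RM", "DO", "PB", "RM")

# Check each element of `ids` against the vector on the righthand side
matches <- ids %in% c("RM", "DO")
matches
```
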
-We can also use multiple conditions in one `filter()` statement. Here we will get rows with a year less than or equal to 1988 and whose hindfoot length values are not `NA`. The `!` before the `is.na()` function means "not".
+We can also use multiple conditions in one `filter()` statement.
+Here we will get rows with a year less than or equal to 1988 and whose hindfoot length values are not `NA`.
+The `!` before the `is.na()` function means "not".
```{r filter-multiple}
filter(surveys, year <= 1988 & !is.na(hindfoot_length))
```
-
::::::::::::::::::::::::::::::::::::: challenge
## Challenge 1: Filtering and selecting
@@ -223,15 +261,20 @@ surveys_selected <- select(surveys, year, month, species_id, plot_id)
## The pipe: `%>%`
-What happens if we want to both `select()` and `filter()` our data? We have a couple options. First, we could use **nested** functions:
+What happens if we want to both `select()` and `filter()` our data?
+We have a couple options.
+First, we could use **nested** functions:
```{r filter-select-nested}
filter(select(surveys, -day), month >= 7)
```
-R will evaluate statements from the inside out. First, `select()` will operate on the `surveys` data.frame, removing the column `day`. The resulting data.frame is then used as the first argument for `filter()`, which selects rows with a month greater than or equal to 7.
+R will evaluate statements from the inside out.
+First, `select()` will operate on the `surveys` data.frame, removing the column `day`.
+The resulting data.frame is then used as the first argument for `filter()`, which selects rows with a month greater than or equal to 7.
-Nested functions can be very difficult to read with only a few functions, and nearly impossible when many functions are done at once. An alternative approach is to create **intermediate** objects:
+Nested functions can be very difficult to read with only a few functions, and nearly impossible when many functions are done at once.
+An alternative approach is to create **intermediate** objects:
```{r filter-select-intermediate}
surveys_noday <- select(surveys, -day)
@@ -240,7 +283,9 @@ filter(surveys_noday, month >= 7)
This approach is easier to read, since we can see the steps in order, but after enough steps, we are left with a cluttered mess of intermediate objects, often with confusing names.
-An elegant solution to this problem is an operator called the **pipe**, which looks like `%>%`. You can insert it by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows). Here's how you could use a pipe to select and filter in one step:
+An elegant solution to this problem is an operator called the **pipe**, which looks like `%>%`.
+You can insert it by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows).
+Here's how you could use a pipe to select and filter in one step:
```{r filter-select-pipe}
surveys %>%
@@ -248,13 +293,21 @@ surveys %>%
filter(month >= 7)
```
-What it does is take the thing on the lefthand side and insert it as the first argument of the function on the righthand side. By putting each of our functions onto a new line, we can build a nice, readable **pipeline**. It can be useful to think of this as a little assembly line for our data. It starts at the top and gets piped into a `select()` function, and it comes out modified somewhat. It then gets sent into the `filter()` function, where it is further modified, and then the final product gets printed out to our console. It can also be helpful to think of `%>%` as meaning "and then". Since many `tidyverse` functions have verbs for names, a pipeline can be read like a sentence.
-
+The pipe takes the object on its left-hand side and inserts it as the first argument of the function on its right-hand side.
+By putting each of our functions onto a new line, we can build a nice, readable **pipeline**.
+It can be useful to think of this as a little assembly line for our data.
+It starts at the top and gets piped into a `select()` function, and it comes out modified somewhat.
+It then gets sent into the `filter()` function, where it is further modified, and then the final product gets printed out to our console.
+It can also be helpful to think of `%>%` as meaning "and then".
+Since many `tidyverse` functions have verbs for names, a pipeline can be read like a sentence.
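
To make the "first argument" rule concrete, here is a minimal sketch that compares the nested form with the piped form. The `toy` data.frame is a made-up stand-in for `surveys`, not part of the lesson data.

```r
library(dplyr)

# Hypothetical toy data standing in for `surveys`
toy <- data.frame(
  month = c(3, 7, 9, 12),
  day = c(1, 15, 20, 31),
  weight = c(40, 48, 29, 46)
)

# Nested call: read from the inside out
nested <- filter(select(toy, -day), month >= 7)

# Pipeline: read from top to bottom, "and then" style
piped <- toy %>%
  select(-day) %>%
  filter(month >= 7)

# Both forms should produce the same data.frame
identical(nested, piped)
```

The two versions do the same work; the pipeline only changes the order in which we read the steps.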
:::::::::::::::::::::::::::::::::::::::::::: instructor
-It's worth showing the learners that you can run a **pipeline** without highlighting the whole thing. If your cursor is on any line of a pipeline, running that line will run the whole thing.
+
+It's worth showing the learners that you can run a **pipeline** without highlighting the whole thing.
+If your cursor is on any line of a pipeline, running that line will run the whole thing.
You can also show that by highlighting a section of a pipeline, you can run only the first X steps of it.
+
::::::::::::::::::::::::::::::::::::::::::::
If we want to store this final product as an object, we use an assignment arrow at the start:
@@ -265,13 +318,17 @@ surveys_sub <- surveys %>%
filter(month >= 7)
```
-A good approach is to build a pipeline step by step prior to assignment. You add functions to the pipeline as you go, with the results printing in the console for you to view. Once you're satisfied with your final result, go back and add the assignment arrow statement at the start. This approach is very interactive, allowing you to see the results of each step as you build the pipeline, and produces nicely readable code.
+A good approach is to build a pipeline step by step prior to assignment.
+You add functions to the pipeline as you go, with the results printing in the console for you to view.
+Once you're satisfied with your final result, go back and add the assignment arrow statement at the start.
+This approach is very interactive, allowing you to see the results of each step as you build the pipeline, and produces nicely readable code.
::::::::::::::::::::::::::::::::::::: challenge
## Challenge 2: Using pipes
-Use the surveys data to make a data.frame that has the columns `record_id`, `month`, and `species_id`, with data from the year 1988. Use a pipe between the function calls.
+Use the surveys data to make a data.frame that has the columns `record_id`, `month`, and `species_id`, with data from the year 1988.
+Use a pipe between the function calls.
:::::::::::::::::::::::: solution
@@ -281,22 +338,25 @@ surveys_1988 <- surveys %>%
select(record_id, month, species_id)
```
-Make sure to `filter()` before you `select()`. You need to use the `year` column for filtering rows, but it is discarded in the `select()` step. You also need to make sure to use `==` instead of `=` when you are filtering rows where `year` is equal to 1988.
+Make sure to `filter()` before you `select()`.
+You need to use the `year` column for filtering rows, but it is discarded in the `select()` step.
+You also need to make sure to use `==` instead of `=` when you are filtering rows where `year` is equal to 1988.
::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
-
## Making new columns with `mutate()`
-Another common task is creating a new column based on values in existing columns. For example, we could add a new column that has the weight in kilograms instead of grams:
+Another common task is creating a new column based on values in existing columns.
+For example, we could add a new column that has the weight in kilograms instead of grams:
```{r mutate}
surveys %>%
mutate(weight_kg = weight / 1000)
```
-You can create multiple columns in one `mutate()` call, and they will get created in the order you write them. This means you can even reference the first new column in the second new column:
+You can create multiple columns in one `mutate()` call, and they will get created in the order you write them.
+This means you can even reference the first new column in the second new column:
```{r mutate-multiple}
surveys %>%
@@ -306,7 +366,10 @@ surveys %>%
-We can also use multiple columns to create a single column. For example, it's often good practice to keep the components of a date in separate columns until necessary, as we've done here. This is because programs like Excel can do automatic things with dates in a way that is not reproducible and sometimes hard to notice. However, now that we are working in R, we can safely put together a date column.
+We can also use multiple columns to create a single column.
+For example, it's often good practice to keep the components of a date in separate columns until necessary, as we've done here.
+This is because programs like Excel can do automatic things with dates in a way that is not reproducible and sometimes hard to notice.
+However, now that we are working in R, we can safely put together a date column.
To put together the columns into something that looks like a date, we can use the `paste()` function, which takes arguments of the items to paste together, as well as the argument `sep`, which is the character used to separate the items.
@@ -315,7 +378,8 @@ surveys %>%
mutate(date = paste(year, month, day, sep = "-"))
```
-Since our new column gets moved all the way to the end, it doesn't end up printing out. We can use the `relocate()` function to put it after our `year` column:
+Since our new column gets moved all the way to the end, it doesn't end up printing out.
+We can use the `relocate()` function to put it after our `year` column:
```{r relocate}
surveys %>%
@@ -323,7 +387,12 @@ surveys %>%
relocate(date, .after = year)
```
-Now we can see that we have a character column that contains our date string. However, it's not truly a date column. Dates are a type of numeric variable with a defined, ordered scale. To turn this column into a proper date, we will use a function from the `tidyverse`'s `lubridate` package, which has lots of useful functions for working with dates. The function `ymd()` will parse a date string that has the order year-month-day. Let's load the package and use `ymd()`.
+Now we can see that we have a character column that contains our date string.
+However, it's not truly a date column.
+Dates are a type of numeric variable with a defined, ordered scale.
+To turn this column into a proper date, we will use a function from the `tidyverse`'s `lubridate` package, which has lots of useful functions for working with dates.
+The function `ymd()` will parse a date string that has the order year-month-day.
+Let's load the package and use `ymd()`.
```{r lubridate}
library(lubridate)
@@ -341,7 +410,8 @@ surveys %>%
-Now we can see that our `date` column has the type `date` as well. In this example, we created our column with two separate lines in `mutate()`, but we can combine them into one:
+Now we can see that our `date` column has the type `date` as well.
+In this example, we created our column with two separate lines in `mutate()`, but we can combine them into one:
```{r making-date}
# using nested functions
@@ -356,13 +426,14 @@ surveys %>%
relocate(date, .after = year)
```
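
As a small side sketch of why the conversion matters, the code below uses made-up date components (not values from `surveys`) to show the difference between the pasted string and the parsed date. Once parsed, a date can be compared and subtracted like a number.

```r
library(lubridate)

# Made-up date components, not taken from `surveys`
date_string <- paste(1988, 7, 16, sep = "-")
class(date_string)   # a character string

date_parsed <- ymd(date_string)
class(date_parsed)   # a Date

# Dates are stored as a number of days since 1970-01-01,
# so they can be compared and subtracted
days_since_epoch <- as.numeric(ymd("1970-01-02"))
is_after_new_year <- date_parsed > ymd("1988-01-01")
days_into_july <- date_parsed - ymd("1988-07-01")

days_since_epoch
is_after_new_year
days_into_july
```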
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 3: Plotting date
-Because the `ggplot()` function takes the data as its first argument, you can actually pipe data straight into `ggplot()`. Try building a pipeline that creates the date column and plots weight across date.
+Because the `ggplot()` function takes the data as its first argument, you can actually pipe data straight into `ggplot()`.
+Try building a pipeline that creates the date column and plots weight across date.
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r date-plot-challenge-answer}
surveys %>%
@@ -379,9 +450,13 @@ This isn't necessarily the most useful plot, but we will learn some techniques t
## The split-apply-combine approach
-Many data analysis tasks can be achieved using the split-apply-combine approach: you split the data into groups, apply some analysis to each group, and combine the results in some way. `dplyr` has a few convenient functions to enable this approach, the main two being `group_by()` and `summarize()`.
+Many data analysis tasks can be achieved using the split-apply-combine approach: you split the data into groups, apply some analysis to each group, and combine the results in some way.
+`dplyr` has a few convenient functions to enable this approach, the main two being `group_by()` and `summarize()`.
-`group_by()` takes a data.frame and the name of one or more columns with categorical values that define the groups. `summarize()` then collapses each group into a one-row summary of the group, giving you back a data.frame with one row per group. The syntax for `summarize()` is similar to `mutate()`, where you define new columns based on values of other columns. Let's try calculating the mean weight of all our animals by sex.
+`group_by()` takes a data.frame and the name of one or more columns with categorical values that define the groups.
+`summarize()` then collapses each group into a one-row summary of the group, giving you back a data.frame with one row per group.
+The syntax for `summarize()` is similar to `mutate()`, where you define new columns based on values of other columns.
+Let's try calculating the mean weight of all our animals by sex.
```{r group-by-summarize}
surveys %>%
@@ -389,7 +464,10 @@ surveys %>%
summarize(mean_weight = mean(weight, na.rm = T))
```
-You can see that the mean weight for males is slightly higher than for females, but that animals whose sex is unknown have much higher weights. This is probably due to small sample size, but we should check to be sure. Like `mutate()`, we can define multiple columns in one `summarize()` call. The function `n()` will count the number of rows in each group.
+You can see that the mean weight for males is slightly higher than for females, but that animals whose sex is unknown have much higher weights.
+This is probably due to small sample size, but we should check to be sure.
+Like `mutate()`, we can define multiple columns in one `summarize()` call.
+The function `n()` will count the number of rows in each group.
```{r summarize-multiple}
surveys %>%
@@ -398,7 +476,9 @@ surveys %>%
n = n())
```
-You will often want to create groups based on multiple columns. For example, we might be interested in the mean weight of every species + sex combination. All we have to do is add another column to our `group_by()` call.
+You will often want to create groups based on multiple columns.
+For example, we might be interested in the mean weight of every species + sex combination.
+All we have to do is add another column to our `group_by()` call.
```{r group-by-multiple}
surveys %>%
@@ -407,7 +487,14 @@ surveys %>%
n = n())
```
-Our resulting data.frame is much larger, since we have a greater number of groups. We also see a strange value showing up in our `mean_weight` column: `NaN`. This stands for "Not a Number", and it often results from trying to do an operation a vector with zero entries. How can a vector have zero entries? Well, if a particular group (like the AB species ID + `NA` sex group) has **only** `NA` values for weight, then the `na.rm = T` argument in `mean()` will remove **all** the values prior to calculating the mean. The result will be a value of `NaN`. Since we are not particularly interested in these values, let's add a step to our pipeline to remove rows where weight is `NA` **before** doing any other steps. This means that any groups with only `NA` values will disappear from our data.frame before we formally create the groups with `group_by()`.
+Our resulting data.frame is much larger, since we have a greater number of groups.
+We also see a strange value showing up in our `mean_weight` column: `NaN`.
+This stands for "Not a Number", and it often results from trying to do an operation on a vector with zero entries.
+How can a vector have zero entries?
+Well, if a particular group (like the AB species ID + `NA` sex group) has **only** `NA` values for weight, then the `na.rm = T` argument in `mean()` will remove **all** the values prior to calculating the mean.
+The result will be a value of `NaN`.
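
The zero-entry case is easy to reproduce on its own. This is a minimal sketch with a made-up vector, independent of the `surveys` data:

```r
# A made-up weight vector in which every value is missing
weights <- c(NA_real_, NA_real_)

# With na.rm = TRUE, both values are dropped first,
# so mean() is left with a zero-length vector
all_missing_mean <- mean(weights, na.rm = TRUE)

# The same thing happens with an explicitly empty vector
empty_mean <- mean(numeric(0))

all_missing_mean
empty_mean
```
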
+Since we are not particularly interested in these values, let's add a step to our pipeline to remove rows where weight is `NA` **before** doing any other steps.
+This means that any groups with only `NA` values will disappear from our data.frame before we formally create the groups with `group_by()`.
```{r filter-group-by}
surveys %>%
@@ -417,7 +504,9 @@ surveys %>%
n = n())
```
-That looks better! It's often useful to take a look at the results in some order, like the lowest mean weight to highest. We can use the `arrange()` function for that:
+That looks better!
+It's often useful to take a look at the results in some order, like the lowest mean weight to highest.
+We can use the `arrange()` function for that:
```{r arrange}
surveys %>%
@@ -439,21 +528,27 @@ surveys %>%
arrange(desc(mean_weight))
```
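
For a quick look at the two sort directions side by side, here is a sketch on a hypothetical three-row summary table (the values are invented for illustration):

```r
library(dplyr)

# Hypothetical summary table, not real lesson output
toy <- data.frame(
  species_id = c("DM", "PF", "NL"),
  mean_weight = c(43, 8, 160)
)

# Default: lowest to highest
ascending <- toy %>%
  arrange(mean_weight)

# Wrapping the column in desc() reverses the order
descending <- toy %>%
  arrange(desc(mean_weight))

ascending
descending
```
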
-You may have seen several messages saying `summarise() has grouped output by 'species_id'. You can override using the .groups argument.` These are warning you that your resulting data.frame has retained some group structure, which means any subsequent operations on that data.frame will happen at the group level. If you look at the resulting data.frame printed out in your console, you will see these lines:
+You may have seen several messages saying `summarise() has grouped output by 'species_id'. You can override using the .groups argument.`
+These are warning you that your resulting data.frame has retained some group structure, which means any subsequent operations on that data.frame will happen at the group level.
+If you look at the resulting data.frame printed out in your console, you will see these lines:
```
# A tibble: 46 × 4
# Groups: species_id [18]
```
-They tell us we have a data.frame with 46 rows, 4 columns, and a group variable `species_id`, for which there are 18 groups. We will see something similar if we use `group_by()` alone:
+They tell us we have a data.frame with 46 rows, 4 columns, and a group variable `species_id`, for which there are 18 groups.
+We will see something similar if we use `group_by()` alone:
```{r group-by-alone}
surveys %>%
group_by(species_id, sex)
```
-What we get back is the entire `surveys` data.frame, but with the grouping variables added: 67 groups of `species_id` + `sex` combinations. Groups are often maintained throughout a pipeline, and if you assign the resulting data.frame to a new object, it will also have those groups. This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.frame, not by group. Therefore, it is a good habit to remove the groups at the end of a pipeline containing `group_by()`:
+What we get back is the entire `surveys` data.frame, but with the grouping variables added: 67 groups of `species_id` + `sex` combinations.
+Groups are often maintained throughout a pipeline, and if you assign the resulting data.frame to a new object, it will also have those groups.
+This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.frame, not by group.
+Therefore, it is a good habit to remove the groups at the end of a pipeline containing `group_by()`:
```{r ungroup}
surveys %>%
@@ -467,7 +562,9 @@ surveys %>%
Now our data.frame just says `# A tibble: 46 × 4` at the top, with no groups.
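
If you want to check the grouping of a result without reading the printed header, `group_vars()` from `dplyr` reports it directly. The sketch below uses a made-up `toy` data.frame and also shows the `.groups` argument mentioned in the message as an alternative to a separate `ungroup()` step.

```r
library(dplyr)

# Hypothetical toy data standing in for `surveys`
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "PF"),
  sex = c("F", "M", "M", "F", "F"),
  weight = c(40, 48, 46, 7, 9)
)

grouped_summary <- toy %>%
  group_by(species_id, sex) %>%
  summarize(mean_weight = mean(weight))

# summarize() peels off only the last grouping variable,
# so the result is still grouped by species_id
group_vars(grouped_summary)

# ungroup() removes whatever grouping is left
group_vars(ungroup(grouped_summary))

# Alternatively, drop all groups inside summarize() itself
dropped <- toy %>%
  group_by(species_id, sex) %>%
  summarize(mean_weight = mean(weight), .groups = "drop")

group_vars(dropped)
```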
-While it is common that you will want to get the one-row-per-group summary that `summarise()` provides, there are times where you want to calculate a per-group value but keep all the rows in your data.frame. For example, we might want to know the mean weight for each species ID + sex combination, and then we might want to know how far from that mean value each observation in the group is. For this, we can use `group_by()` and `mutate()` together:
+While it is common that you will want to get the one-row-per-group summary that `summarise()` provides, there are times where you want to calculate a per-group value but keep all the rows in your data.frame.
+For example, we might want to know the mean weight for each species ID + sex combination, and then we might want to know how far from that mean value each observation in the group is.
+For this, we can use `group_by()` and `mutate()` together:
```{r group-by-mutate}
surveys %>%
@@ -477,7 +574,9 @@ surveys %>%
weight_diff = weight - mean_weight)
```
-Since we get all our columns back, the new columns are at the very end and don't print out in the console. Let's use `select()` to just look at the columns of interest. Inside `select()` we can use the `contains()` function to get any column containing the word "weight" in the name:
+Since we get all our columns back, the new columns are at the very end and don't print out in the console.
+Let's use `select()` to just look at the columns of interest.
+Inside `select()` we can use the `contains()` function to get any column containing the word "weight" in the name:
```{r select-contains}
surveys %>%
@@ -488,15 +587,17 @@ surveys %>%
select(species_id, sex, contains("weight"))
```
-What happens with the `group_by()` + `mutate()` combination is similar to using `summarize()`: for each group, the mean weight is calculated. However, instead of reporting only one row per group, the mean weight for each group is added to each row in that group. For each row in a group (like DM species ID + M sex), you will see the same value in `mean_weight`.
+What happens with the `group_by()` + `mutate()` combination is similar to using `summarize()`: for each group, the mean weight is calculated.
+However, instead of reporting only one row per group, the mean weight for each group is added to each row in that group.
+For each row in a group (like DM species ID + M sex), you will see the same value in `mean_weight`.
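
A tiny example makes the repeated group mean easy to see. The `toy` data.frame here is invented for illustration and groups by species only, to keep the output short:

```r
library(dplyr)

# Hypothetical toy data standing in for `surveys`
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "PF"),
  weight = c(40, 48, 44, 7, 9)
)

toy_diff <- toy %>%
  group_by(species_id) %>%
  mutate(mean_weight = mean(weight),
         weight_diff = weight - mean_weight) %>%
  ungroup()

# All five rows are kept; mean_weight repeats within each species
nrow(toy_diff)
toy_diff
```
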
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge 4: Making a time series
1. Use the split-apply-combine approach to make a `data.frame` that counts the total number of animals of each sex caught on each day in the `surveys` data.
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r sex-counts-challenge-answer}
@@ -515,7 +616,7 @@ surveys_daily_counts <- surveys %>%
2. Now use the data.frame you just made to plot the daily number of animals of each sex caught over time. It's up to you what `geom` to use, but a `line` plot might be a good choice. You should also think about how to differentiate which data corresponds to which sex.
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
```{r time-series-challenge-answer}
surveys_daily_counts %>%
@@ -529,7 +630,8 @@ surveys_daily_counts %>%
## Reshaping data with `tidyr`
-Let's say we are interested in comparing the mean weights of each species across our different plots. We can begin this process using the `group_by()` + `summarize()` approach:
+Let's say we are interested in comparing the mean weights of each species across our different plots.
+We can begin this process using the `group_by()` + `summarize()` approach:
```{r mean-weight-by-plot}
sp_by_plot <- surveys %>%
@@ -541,9 +643,15 @@ sp_by_plot <- surveys %>%
sp_by_plot
```
-That looks great, but it is a bit difficult to compare values across plots. It would be nice if we could reshape this data.frame to make those comparisons easier. Well, the `tidyr` package from the `tidyverse` has a pair of functions that allow you to reshape data by pivoting it: `pivot_wider()` and `pivot_longer()`. `pivot_wider()` will make the data wider, which means increasing the number of columns and reducing the number of rows. `pivot_longer()` will do the opposite, reducing the number of columns and increasing the number of rows.
+That looks great, but it is a bit difficult to compare values across plots.
+It would be nice if we could reshape this data.frame to make those comparisons easier.
+Well, the `tidyr` package from the `tidyverse` has a pair of functions that allow you to reshape data by pivoting it: `pivot_wider()` and `pivot_longer()`.
+`pivot_wider()` will make the data wider, which means increasing the number of columns and reducing the number of rows.
+`pivot_longer()` will do the opposite, reducing the number of columns and increasing the number of rows.
-In this case, it might be nice to create a data.frame where each species has its own row, and each plot has its own column containing the mean weight for a given species. We will use `pivot_wider()` to reshape our data in this way. It takes 3 arguments:
+In this case, it might be nice to create a data.frame where each species has its own row, and each plot has its own column containing the mean weight for a given species.
+We will use `pivot_wider()` to reshape our data in this way.
+It takes 3 arguments:
1. the name of the data.frame
2. `names_from`: which column should be used to generate the names of the new columns?
@@ -553,7 +661,8 @@ Any columns not used for `names_from` or `values_from` will not be pivoted.
{alt='Diagram depicting the behavior of `pivot_wider()` on a small tabular dataset.'}
-In our case, we want the new columns to be named from our `plot_id` column, with the values coming from the `mean_weight` column. We can pipe our data.frame right into `pivot_wider()` and add those two arguments:
+In our case, we want the new columns to be named from our `plot_id` column, with the values coming from the `mean_weight` column.
+We can pipe our data.frame right into `pivot_wider()` and add those two arguments:
```{r pivot-wider}
sp_by_plot_wide <- sp_by_plot %>%
@@ -563,31 +672,44 @@ sp_by_plot_wide <- sp_by_plot %>%
sp_by_plot_wide
```
-Now we've got our reshaped data.frame. There are a few things to notice. First, we have a new column for each `plot_id` value. There is one old column left in the data.frame: `species_id`. It wasn't used in `pivot_wider()`, so it stays, and now contains a single entry for each unique `species_id` value.
+Now we've got our reshaped data.frame.
+There are a few things to notice.
+First, we have a new column for each `plot_id` value.
+There is one old column left in the data.frame: `species_id`.
+It wasn't used in `pivot_wider()`, so it stays, and now contains a single entry for each unique `species_id` value.
-Finally, a lot of `NA`s have appeared. Some species aren't found in every plot, but because a data.frame has to have a value in every row and every column, an `NA` is inserted. We can double-check this to verify what is going on.
+Finally, a lot of `NA`s have appeared.
+Some species aren't found in every plot, but because a data.frame has to have a value in every row and every column, an `NA` is inserted.
+We can double-check this to verify what is going on.
-Looking in our new pivoted data.frame, we can see that there is an `NA` value for the species `BA` in plot `1`. Let's take our `sp_by_plot` data.frame and look for the `mean_weight` of that species + plot combination.
+Looking in our new pivoted data.frame, we can see that there is an `NA` value for the species `BA` in plot `1`.
+Let's take our `sp_by_plot` data.frame and look for the `mean_weight` of that species + plot combination.
```{r pivot-wider-check}
sp_by_plot %>%
filter(species_id == "BA" & plot_id == 1)
```
-We get back 0 rows. There is no `mean_weight` for the species `BA` in plot `1`. This either happened because no `BA` were ever caught in plot `1`, or because every `BA` caught in plot `1` had an `NA` weight value and all the rows got removed when we used `filter(!is.na(weight))` in the process of making `sp_by_plot`. Because there are no rows with that species + plot combination, in our pivoted data.frame, the value gets filled with `NA`.
+We get back 0 rows.
+There is no `mean_weight` for the species `BA` in plot `1`.
+This either happened because no `BA` were ever caught in plot `1`, or because every `BA` caught in plot `1` had an `NA` weight value and all the rows got removed when we used `filter(!is.na(weight))` in the process of making `sp_by_plot`.
+Because there are no rows with that species + plot combination, in our pivoted data.frame, the value gets filled with `NA`.
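
The same behavior can be reproduced on a table small enough to read at a glance. In this sketch, the hypothetical `toy_long` table has no row for species `BA` in plot `1`, so that cell should come out as `NA` after pivoting:

```r
library(dplyr)
library(tidyr)

# Hypothetical summary table: species "BA" has no row for plot 1
toy_long <- data.frame(
  species_id = c("DM", "DM", "BA"),
  plot_id = c(1, 2, 2),
  mean_weight = c(43, 45, 8)
)

toy_wide <- toy_long %>%
  pivot_wider(names_from = plot_id, values_from = mean_weight)

# One row per species, one column per plot, NA where no row existed
toy_wide
```
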
-There is another `pivot_` function that does the opposite, moving data from a wide to long format, called `pivot_longer()`. It takes 3 arguments: `cols` for the columns you want to pivot, `names_to` for the name of the new column which will contain the old column names, and `values_to` for the name of the new column which will contain the old values.
+There is another `pivot_` function that does the opposite, moving data from a wide to long format, called `pivot_longer()`.
+It takes 3 arguments: `cols` for the columns you want to pivot, `names_to` for the name of the new column which will contain the old column names, and `values_to` for the name of the new column which will contain the old values.
{alt='Diagram depicting the behavior of `pivot_longer()` on a small tabular dataset.'}
-We can pivot our new wide data.frame to a long format using `pivot_longer()`. We want to pivot all the columns except `species_id`, and we will use `PLOT` for the new column of plot IDs, and `MEAN_WT` for the new column of mean weight values.
+We can pivot our new wide data.frame to a long format using `pivot_longer()`.
+We want to pivot all the columns except `species_id`, and we will use `PLOT` for the new column of plot IDs, and `MEAN_WT` for the new column of mean weight values.
```{r pivot-longer}
sp_by_plot_wide %>%
pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")
```
-One thing you will notice is that all those `NA` values that got generated when we pivoted wider. However, we can filter those out, which gets us back to the same data as `sp_by_plot`, before we pivoted it wider.
+One thing you will notice is that all those `NA` values that got generated when we pivoted wider are still there.
+However, we can filter those out, which gets us back to the same data as `sp_by_plot`, before we pivoted it wider.
```{r pivot-longer-filter}
sp_by_plot_wide %>%
@@ -599,9 +721,11 @@ Data are often recorded in spreadsheets in a wider format, but lots of `tidyvers
## Exporting data
-Let's say we want to send the wide version of our `sb_by_plot` data.frame to a colleague who doesn't use R. In this case, we might want to save it as a CSV file.
+Let's say we want to send the wide version of our `sp_by_plot` data.frame to a colleague who doesn't use R.
+In this case, we might want to save it as a CSV file.
-First, we might want to modify the names of the columns, since right now they are bare numbers, which aren't very informative. Luckily, `pivot_wider()` has an argument `names_prefix` which will allow us to add "plot_" to the start of each column.
+First, we might want to modify the names of the columns, since right now they are bare numbers, which aren't very informative.
+Luckily, `pivot_wider()` has an argument `names_prefix` which will allow us to add "plot\_" to the start of each column.
```{r pivot-wider-prefix}
sp_by_plot %>%
@@ -609,7 +733,8 @@ sp_by_plot %>%
names_prefix = "plot_")
```
-That looks better! Let's save this data.frame as a new object.
+That looks better!
+Let's save this data.frame as a new object.
```{r assign-pivot-wider}
surveys_sp <- sp_by_plot %>%
@@ -619,7 +744,8 @@ surveys_sp <- sp_by_plot %>%
surveys_sp
```
-Now we can save this data.frame to a CSV using the `write_csv()` function from the `readr` package. The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension `.csv`.
+Now we can save this data.frame to a CSV using the `write_csv()` function from the `readr` package.
+The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension `.csv`.
```{r write-csv}
write_csv(surveys_sp, "data/cleaned/surveys_meanweight_species_plot.csv")
@@ -627,7 +753,7 @@ write_csv(surveys_sp, "data/cleaned/surveys_meanweight_species_plot.csv")
If we go look into our `data/cleaned` folder, we will see this new CSV file.
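
`write_csv()` writes into an existing folder and will fail if the folder in the path is missing. The sketch below shows one way to guard against that. It uses a temporary folder and a made-up table so that nothing in your project is touched, and the file name is hypothetical.

```r
library(readr)

# Hypothetical table and a temporary folder standing in for data/cleaned
toy <- data.frame(species_id = c("DM", "BA"), plot_1 = c(43, NA))
out_dir <- file.path(tempdir(), "cleaned")

# Make sure the folder exists before writing into it
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

out_file <- file.path(out_dir, "toy_meanweight.csv")
write_csv(toy, out_file)

# Confirm the file was created and read it back in
file.exists(out_file)
read_csv(out_file)
```
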
-::::::::::::::::::::::::::::::::::::: keypoints
+::::::::::::::::::::::::::::::::::::: keypoints
- use `filter()` to subset rows and `select()` to subset columns
- build up pipelines one step at a time before assigning the result
diff --git a/instructors/instructor-notes.md b/instructors/instructor-notes.md
index 4a9ce1452..5f6e8544e 100644
--- a/instructors/instructor-notes.md
+++ b/instructors/instructor-notes.md
@@ -19,70 +19,52 @@ and may not be relevant to this redesigned version.**
## Dataset
The data used for this lesson are in the figshare repository at:
-[https://doi.org/10.6084/m9.figshare.1314459](https://doi.org/10.6084/m9.figshare.1314459)
-
-This lesson uses mostly `combined.csv`. The 3 other csv files: `plots.csv`,
-`species.csv` and `surveys.csv` are only needed for the lesson on databases.
-
-`combined.csv` is downloaded directly in the episode "Starting with Data" and
-does not need to be downloaded before hand. It however requires that there is a
-decent internet connection in the room where the workshop is being taught. To
-facilitate the download process, the chunk of code that includes the URL where
-the csv file lives, and where the file should go and be named is included in the
-code handout (see next paragraph). Using this approach ensures that the file
-will be where the lesson expects it to be, and teaches good/reproducible
-practice of automating the download. If the learners haven't created the `data/`
-directory and/or are not in the correct working directory, the `download.file`
-command will produce an error. Therefore, it is important to use the stickies at
-this point.
+<https://doi.org/10.6084/m9.figshare.1314459>
+
+This lesson uses mostly `combined.csv`.
+The 3 other csv files: `plots.csv`, `species.csv` and `surveys.csv` are only needed for the lesson on databases.
+
+`combined.csv` is downloaded directly in the episode "Starting with Data" and does not need to be downloaded beforehand.
+However, this requires a decent internet connection in the room where the workshop is being taught.
+To facilitate the download, the code handout (see next paragraph) includes the chunk of code with the URL of the csv file and the path and name under which the file should be saved.
+Using this approach ensures that the file will be where the lesson expects it to be, and teaches good/reproducible practice of automating the download.
+If the learners haven't created the `data/` directory and/or are not in the correct working directory, the `download.file` command will produce an error.
+Therefore, it is important to use the stickies at this point.
## The handout
-The [code handout](files/code-handout.R)
-(a link to download it is also available on the top bar of the lesson website)
-is useful for Data Carpentry workshops. It includes an outline of the lesson
-content, the text for the challenges, the links for the files that need to be
-downloaded for the lesson, and pieces of code that may be difficult to type for
-learners with no programming experience/who are unfamiliar with R's syntax. We
-encourage you to distribute it to the learners at the beginning of the
-lesson. As an instructor, we encourage you to do the live coding directly in
-this file, so the participants can follow along.
+The [code handout](files/code-handout.R) (a link to download it is also available on the top bar of the lesson website) is useful for Data Carpentry workshops.
+It includes an outline of the lesson content, the text for the challenges, the links for the files that need to be downloaded for the lesson, and pieces of code that may be difficult to type for learners with no programming experience/who are unfamiliar with R's syntax.
+We encourage you to distribute it to the learners at the beginning of the lesson.
+As an instructor, we encourage you to do the live coding directly in this file, so the participants can follow along.
## R Version
-With the release of R 4.0.0 in early 2020, an important change has been made
-to R: The default for `stringsAsFactors` is now `FALSE` instead of `TRUE`.
-As a result, the `read.csv()` and `data.frame()` functions do not automatically
-convert character columns to factors anymore (you can read more about it
-[in this post on the R developer blog](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html)).
+With the release of R 4.0.0 in early 2020, an important change has been made to R: The default for `stringsAsFactors` is now `FALSE` instead of `TRUE`.
+As a result, the `read.csv()` and `data.frame()` functions do not automatically convert character columns to factors anymore (you can read more about it [in this post on the R developer blog](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html)).
-This change should not cause any problems with this lesson, independent of
-whether R >4.0 is used or not, because it uses
-`read_csv()` from the **`tidyverse`** package throughout. Other than
-`read.csv()` from base R, `read_csv()` never converts character columns to
-factors, regardless of the R version.
+This change should not cause any problems with this lesson, independent of whether R >4.0 is used or not, because it uses `read_csv()` from the **`tidyverse`** package throughout.
+Unlike `read.csv()` from base R, `read_csv()` never converts character columns to factors, regardless of the R version.
-Nevertheless, it is recommended that learners install a version of R ≥4.0.0,
-and instructors and helpers should be aware of this potential source of error.
+Nevertheless, it is recommended that learners install a version of R ≥4.0.0, and instructors and helpers should be aware of this potential source of error.
## RStudio and Multiple R Installs
-Some learners may have previous R installations. On Mac, if a new install
-is performed, the learner's system will create a symbolic link, pointing to the
-new install as 'Current.' Sometimes this process does not occur, and, even
-though a new R is installed and can be accessed via the R console, RStudio does
-not find it. The net result of this is that the learner's RStudio will be
-running an older R install. This will cause package installations to fail. This
-can be fixed at the terminal. First, check for the appropriate R installation in
-the library;
+Some learners may have previous R installations.
+On Mac, if a new install is performed, the learner's system will create a symbolic link, pointing to the new install as 'Current.'
+Sometimes this process does not occur, and, even though a new R is installed and can be accessed via the R console, RStudio does not find it.
+The net result of this is that the learner's RStudio will be running an older R install.
+This will cause package installations to fail.
+This can be fixed at the terminal.
+First, check for the appropriate R installation in the library:
```
ls -l /Library/Frameworks/R.framework/Versions/
```
-We are currently using R 4.0.x. If it isn't there, they will need to install it.
-If it is present, you will need to set the symbolic link to Current to point to
-the 4.0.x directory:
+We are currently using R 4.0.x.
+If it isn't there, they will need to install it.
+If it is present, you will need to set the symbolic link to Current to point to the 4.0.x directory:
```
ln -s /Library/Frameworks/R.framework/Versions/4.0.x /Library/Frameworks/R.framework/Versions/Current
@@ -92,26 +74,22 @@ Then restart RStudio.
## Issues with Fonts on MacOS
-On older versions of MacOS, it may happen that axis labels do not show up when calling `plot()`
-(section "renaming factors" in "Starting with Data"). This issue might be due to the default font
-Arial being deactivated, so that R cannot find it. To resolve this issue, go to Finder,
-Search for Font Book and open it. Look for the Arial font and, if it is greyed out, turn it on.
+On older versions of MacOS, it may happen that axis labels do not show up when calling `plot()` (section "renaming factors" in "Starting with Data").
+This issue might be due to the default font Arial being deactivated, so that R cannot find it.
+To resolve this issue, go to Finder, search for Font Book, and open it.
+Look for the Arial font and, if it is greyed out, turn it on.
-If the problem occurs with `ggplot2` plots, an alternative workaround is to change the default
-theme for the R session, so that ggplot uses a *serif* font. Since Arial is a *sans-serif*
-font, R will try to load a different font. This can be done with
-`theme_update(text = element_text(family = "serif"))`.
+If the problem occurs with `ggplot2` plots, an alternative workaround is to change the default theme for the R session, so that ggplot uses a *serif* font.
+Since Arial is a *sans-serif* font, R will try to load a different font.
+This can be done with `theme_update(text = element_text(family = "serif"))`.
## Required packages
-Save yourself some aggrevation, and have everyone check and see if they can
-install all these packages before you start the first day.
-See the "Install required R packages" section on the homepage of the course
-website for package installation instructions.
+Save yourself some aggravation, and have everyone check that they can install all these packages before you start the first day.
+See the "Install required R packages" section on the homepage of the course website for package installation instructions.
Sometimes learners are unable to install the **`tidyverse`** package.
-In that case, they can try to install the individual packages that are actually
-needed:
+In that case, they can try to install the individual packages that are actually needed:
```
install.packages(c("readr", "lubridate", "dplyr", "tidyr", "ggplot2", "dbplyr"))
@@ -121,80 +99,51 @@ install.packages(c("readr", "lubridate", "dplyr", "tidyr", "ggplot2", "dbplyr"))
### Before we start
-- The main goal here is to help the learners be comfortable with the RStudio
- interface. We use RStudio because it helps make using R more organized and
- user friendly.
-- The "Why learning R?" section contains suggestions of what you could tell your
- learners about the benefits of learning R. However, it's best if you can talk
- here about what has worked for you personally.
-- Go very slowly in the "Getting setup section". Make sure everyone is following
- along (remind learners to use the stickies). Plan with the helpers at this
- point to go around the room, and be available to help. It's important to make
- sure that learners are in the correct working directory, and that they create
- a `data_raw` (all lowercase) subfolder.
-- The seeking help section is relatively long, and while it's useful to
- demonstrate a couple of ways to get help from within R, you may want to mostly
- point the workshop participants to this useful reference so that they can
- refer to it after the workshop.
-- In the "where to ask for help section?", you may want to emphasize the first
- point about how workshops are a great way to create community of learners that
- can help each others during and after the workshop.
+- The main goal here is to help the learners be comfortable with the RStudio interface.
+ We use RStudio because it helps make using R more organized and user friendly.
+- The "Why learning R?" section contains suggestions of what you could tell your learners about the benefits of learning R.
+ However, it's best if you can talk here about what has worked for you personally.
+- Go very slowly in the "Getting setup" section. Make sure everyone is following along (remind learners to use the stickies).
+ Plan with the helpers at this point to go around the room, and be available to help.
+ It's important to make sure that learners are in the correct working directory, and that they create a `data_raw` (all lowercase) subfolder.
+- The seeking help section is relatively long, and while it's useful to demonstrate a couple of ways to get help from within R, you may want to mostly point the workshop participants to this useful reference so that they can refer to it after the workshop.
+- In the "where to ask for help?" section, you may want to emphasize the first point about how workshops are a great way to create a community of learners who can help each other during and after the workshop.
### Intro to R
-- When going over the section on assignments, make
- sure to pause for at least 30 seconds when asking "What do you think is the
- current content of the object weight\_lb? 126.5 or 220?". For learners with no
- programming experience, this is a new and important concept.
-- Given that the concept of missing data is an important feature of the R
- language, it is worth spending enough time on it.
+- When going over the section on assignments, make sure to pause for at least 30 seconds when asking "What do you think is the current content of the object weight\_lb? 126.5 or 220?".
+ For learners with no programming experience, this is a new and important concept.
+- Given that the concept of missing data is an important feature of the R language, it is worth spending enough time on it.
### Starting with data
The two main goals for this lessons are:
-- To make sure that learners are comfortable with working with data frames, and
- can use the bracket notation to select slices/columns
-- To expose learners to factors. Their behavior is not necessarily intuitive,
- and so it is important that they are guided through it the first time they are
- exposed to it. The content of the lesson should be enough for learners to
- avoid common mistakes with them.
-- If the learners are not familiar with the ecology terminology used in the data
- set, it might be a good idea to briefly review it here. Especially the terms
- *genus* and *plot* have caused some confusion to learners in the past.
- It might help to point out that the plural of genus is *genera*, and that
- `plot_id` and `plot_type` in the data set refer to the ID and type of a plot
- of land that was surveyed by the researchers in the study.
+- To make sure that learners are comfortable with working with data frames, and can use the bracket notation to select slices/columns
+- To expose learners to factors.
+ Their behavior is not necessarily intuitive, and so it is important that they are guided through it the first time they are exposed to it.
+ The content of the lesson should be enough for learners to avoid common mistakes with them.
+- If the learners are not familiar with the ecology terminology used in the data set, it might be a good idea to briefly review it here.
+ Especially the terms *genus* and *plot* have caused some confusion to learners in the past.
+ It might help to point out that the plural of genus is *genera*, and that `plot_id` and `plot_type` in the data set refer to the ID and type of a plot of land that was surveyed by the researchers in the study.
### Manipulating data
- For this lesson make sure that learners are comfortable using pipes.
-- There is also sometimes some confusion on what the arguments of `group_by`
- should be.
+- There is also sometimes some confusion on what the arguments of `group_by` should be.
- This lesson uses the tidyr package to reshape data for plotting
-- After this lesson students should be familiar with the spread() and gather()
- functions available in tidyr
-- While working with the example for mutate(), it is difficult to see the
- "weight" columns on a zoomed in RStudio screen. Including a select()
- command to select the columns "weight\_kg" and "weight\_lb" makes it easier
- to view how the "weight" columns are changed.
-- It is crucial that learners use the function `read_csv()` from tidyverse,
- not `read.csv()` from base R. Using the wrong function will cause unexpected
- results further down the line, especially in the section on working with
- factors.
-- Note: If students end up with 30521 rows for `surveys_complete` instead of
- the expected 30463 rows at the end of the chapter, then they have likely used
- `read.csv()` and not `read_csv()` to import the data.
-- When explaining `view()`, consider mentioning that is a function of the
- **`tibble`** package, and that the base function `View()` can also be used to
- view a data frame.
+- After this lesson students should be familiar with the spread() and gather() functions available in tidyr
+- While working with the example for mutate(), it is difficult to see the "weight" columns on a zoomed in RStudio screen.
+ Including a select() command to select the columns "weight\_kg" and "weight\_lb" makes it easier to view how the "weight" columns are changed.
+- It is crucial that learners use the function `read_csv()` from tidyverse, not `read.csv()` from base R.
+ Using the wrong function will cause unexpected results further down the line, especially in the section on working with factors.
+- Note: If students end up with 30521 rows for `surveys_complete` instead of the expected 30463 rows at the end of the chapter, then they have likely used `read.csv()` and not `read_csv()` to import the data.
+- When explaining `view()`, consider mentioning that it is a function of the **`tibble`** package, and that the base function `View()` can also be used to view a data frame.
### Visualizing data
-- This lesson is a broad overview of ggplot2 and focuses on (1) getting familiar
- with the layering system of ggplot2, (2) using the argument `group` in the
- `aes()` function, (3) basic customization of the plots.
-- It maybe worthwhile to mention that we can also specify colors by color HEX code ([http://colorbrewer2.org](https://colorbrewer2.org))
+- This lesson is a broad overview of ggplot2 and focuses on (1) getting familiar with the layering system of ggplot2, (2) using the argument `group` in the `aes()` function, (3) basic customization of the plots.
+- It may be worthwhile to mention that we can also specify colors by color HEX code ([http://colorbrewer2.org](https://colorbrewer2.org))
```
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, color = "#FF0000")
@@ -202,56 +151,40 @@ The two main goals for this lessons are:
### R and SQL
-- Ideally this lesson is best taught at the end of the workshop (as a capstone
- example) to illustrate how the tools covered can integrate with each
- others. Depending on the audience, and the pace of the workshop, it can be
- shown as a demonstration rather than a typically lesson.
-- The explanation of how dplyr's verb syntax is translated into SQL statements,
- and the section on laziness are optional and don't need to be taught in detail
- during a workshop. They can be useful after a workshop for learners interested
- in learning more about the topics or for instructors to answer questions from
- the workshop participants.
+- Ideally this lesson is taught at the end of the workshop (as a capstone example) to illustrate how the tools covered can integrate with each other.
+ Depending on the audience, and the pace of the workshop, it can be shown as a demonstration rather than a typical lesson.
+- The explanation of how dplyr's verb syntax is translated into SQL statements, and the section on laziness are optional and don't need to be taught in detail during a workshop.
+ They can be useful after a workshop for learners interested in learning more about the topics or for instructors to answer questions from the workshop participants.
## Potential issues \& solutions
-As it stands, the solutions to all the challenges are commented out in the Rmd
-files. If you want to double check your answer, you can look at the source code
-of the Rmd files on GitHub.
+As it stands, the solutions to all the challenges are commented out in the Rmd files.
+If you want to double check your answer, you can look at the source code of the Rmd files on GitHub.
## Technical Tips and Tricks
-Show how to use the 'zoom' button to blow up graphs without constantly resizing
-windows
+Show how to use the 'zoom' button to blow up graphs without constantly resizing windows
Sometimes a package will not install, try a different CRAN mirror
- Tools > Global Options > Packages > CRAN Mirror
-Alternatively you can go to CRAN and download the package and install from ZIP
-file
+Alternatively you can go to CRAN and download the package and install from ZIP file
- Tools > Install Packages > set to 'from Zip/TAR'
-It is important that R, and the R packages be installed locally, not on a
-network drive. If a learner is using a machine with multiple users where their
-account is not based locally this can create a variety of issues (This often
-happens on university computers). Hopefully the learner will realize these
-issues before hand, but depending on the machine and how the IT folks that
-service the computer have things set up, it may be very difficult to impossible
-to make R work without their help.
+It is important that R, and the R packages be installed locally, not on a network drive.
+If a learner is using a machine with multiple users where their account is not based locally this can create a variety of issues (This often happens on university computers).
+Hopefully the learner will realize these issues beforehand, but depending on the machine and how the IT folks that service the computer have things set up, it may be very difficult or even impossible to make R work without their help.
-If learners are having issues with one package, they may have issues with
-another. It is often easier to [make sure they have all the necessary packages installed](#required-packages)
-at one time, rather then deal with these issues over and over.
+If learners are having issues with one package, they may have issues with another.
+It is often easier to [make sure they have all the necessary packages installed](#required-packages) at one time, rather than deal with these issues over and over.
-In lesson 2 starting with data, one might not have the appropriate folder "data\_raw" in their working directory causing an error. This is a good time to go over reading an error, and a brief introduction of how to identify your working directory `getwd()` as well as setting your working directory `setwd("/somedirectory")` and if needed creating a directory within your script `dir.create("/some_new_directory")`, or simply creating it within a file explorer works if short on time.
+In lesson 2, "Starting with data", learners might not have the appropriate folder "data\_raw" in their working directory, causing an error.
+This is a good time to go over reading an error, and to briefly introduce how to identify your working directory with `getwd()`, set it with `setwd("/somedirectory")`, and, if needed, create a directory within your script with `dir.create("/some_new_directory")`; creating it in a file explorer also works if short on time.
## Other Resources
-If you encounter a problem during a workshop, feel free to contact the
-maintainers by email or
-[open an issue](https://github.com/fishtree-attempt/R-ecology-lesson/issues/new).
+If you encounter a problem during a workshop, feel free to contact the maintainers by email or [open an issue](https://github.com/fishtree-attempt/R-ecology-lesson/issues/new).
-For a more in-depth coverage of topics of the workshops, you may want to read
-"[R for Data Science](https://r4ds.had.co.nz/)" by Hadley Wickham and Garrett
-Grolemund.
+For more in-depth coverage of the workshop topics, you may want to read "[R for Data Science](https://r4ds.had.co.nz/)" by Hadley Wickham and Garrett Grolemund.
diff --git a/learners/extra-challenges.Rmd b/learners/extra-challenges.Rmd
index c24709f50..d60cb9a67 100644
--- a/learners/extra-challenges.Rmd
+++ b/learners/extra-challenges.Rmd
@@ -4,7 +4,6 @@ teaching: 45
exercises: 3
---
-
```{r setup, include=FALSE}
knitr::opts_chunk$set(dpi = 200, out.height = 600, out.width = 600, R.options = list(max.print = 100))
```
@@ -14,11 +13,12 @@ library(tidyverse)
surveys <- read_csv("data/cleaned/surveys_complete_77_89.csv")
```
-::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::: challenge
## Challenge: `ggplot2` syntax
-There are some issues with these `ggplot2` examples. Can you figure out what is wrong with each one?
+There are some issues with these `ggplot2` examples.
+Can you figure out what is wrong with each one?
```{r, eval=FALSE}
ggplot(data = surveys,
@@ -26,10 +26,11 @@ ggplot(data = surveys,
geom_point()
```
+:::::::::::::::::::::::: solution
-:::::::::::::::::::::::: solution
-
-Our points don't actually turn out blue, because we defined the color inside of `aes()`. `aes()` is used for translating variables from the data into plot elements, like color. There is no variable in the data called "blue".
+Our points don't actually turn out blue, because we defined the color inside of `aes()`.
+`aes()` is used for translating variables from the data into plot elements, like color.
+There is no variable in the data called "blue".
::::::::::::::::::::::::
@@ -39,7 +40,7 @@ ggplot(data = surveys,
geom_point()
```
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
Variable names inside `aes()` should not be wrapped in quotes.
@@ -51,7 +52,7 @@ ggplot(data = surveys,
+ geom_point()
```
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
When adding things like `geom_` or `scale_` functions to a `ggplot()`, you have to end a line with `+`, not begin a line with it.
@@ -62,7 +63,7 @@ ggplot(data = surveys, x = weight, y = hindfoot_length) +
geom_point()
```
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
When translating variables from the data, like `weight` and `hindfoot_length`, to elements of the plot, like `x` and `y`, you must put them inside `aes()`.
@@ -75,9 +76,10 @@ ggplot(data = surveys,
scale_color_continuous(type = "viridis")
```
-:::::::::::::::::::::::: solution
+:::::::::::::::::::::::: solution
-`species_id` is a categorical variable, but `scale_color_continuous()` supplies a continuous color scale. `scale_color_discrete()` would give a discrete/categorical scale.
+`species_id` is a categorical variable, but `scale_color_continuous()` supplies a continuous color scale.
+`scale_color_discrete()` would give a discrete/categorical scale.
::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/learners/reference.md b/learners/reference.md
index bf858cba6..b432f6a3e 100644
--- a/learners/reference.md
+++ b/learners/reference.md
@@ -100,5 +100,3 @@ Cheat sheet of functions used in the lessons
- `inner_join()` # perform an inner join between two tables
- `src_sqlite()` # connect dplyr to a SQLite database file
- `copy_to()` # copy a data frame as a table into a database
-
-
diff --git a/learners/setup.md b/learners/setup.md
index 13d367134..29fabf8db 100644
--- a/learners/setup.md
+++ b/learners/setup.md
@@ -4,51 +4,47 @@ title: Setup
## Preparations
-Data Carpentry's teaching is hands-on, and to follow this lesson
-learners must have R and RStudio installed on their computers. They also need
-to be able to install a number of R packages, create directories, and download
-files.
+Data Carpentry's teaching is hands-on, and to follow this lesson learners must have R and RStudio installed on their computers.
+They also need to be able to install a number of R packages, create directories, and download files.
-To avoid troubleshooting during the lesson, learners should follow the
-instructions below to download and install everything beforehand.
-If the computer is managed by their organization's IT department
-they might need help from an IT administrator.
+To avoid troubleshooting during the lesson, learners should follow the instructions below to download and install everything beforehand.
+If the computer is managed by their organization's IT department they might need help from an IT administrator.
### Install R and RStudio
-R and RStudio are two separate pieces of software:
+R and RStudio are two separate pieces of software:
+
+- **R** is a programming language and software used to run code written in R.
+- **RStudio** is an integrated development environment (IDE) that makes using R easier. In this course we use RStudio to interact with R.
-* **R** is a programming language and software used to run code written in R.
-* **RStudio** is an integrated development environment (IDE) that makes using R easier. In this course we use RStudio to interact with R.
-
If you don't already have R and RStudio installed, follow the instructions for your operating system below.
-You have to install R before you install RStudio.
+You have to install R before you install RStudio.
::::::: spoiler
## For Windows
-* Download R from the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm).
-* Run the `.exe` file that was just downloaded
-* Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download)
-* Under *Installers* select **Windows Vista 10/11 - RSTUDIO-xxxx.yy.z-zzz.exe** (where x = year, y = month, and z represent version numbers)
-* Double click the file to install it
-* Once it's installed, open RStudio to make sure it works and you don't get any error messages.
-
+- Download R from the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm).
+- Run the `.exe` file that was just downloaded
+- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download)
+- Under *Installers* select **Windows Vista 10/11 - RSTUDIO-xxxx.yy.z-zzz.exe** (where x = year, y = month, and z represent version numbers)
+- Double click the file to install it
+- Once it's installed, open RStudio to make sure it works and you don't get any error messages.
+
:::::::::::::::::::::::::
:::::::::::::::: spoiler
## For MacOS
-* Download R from the [CRAN website](https://cran.r-project.org/bin/macosx/).
-* Select the `.pkg` file for the latest R version
-* Double click on the downloaded file to install R
-* It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed by some packages)
-* Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download)
-* Under *Installers* select **Mac OS 13+ - RSTUDIO-xxxx.yy.z-zzz.dmg** (where x = year, y = month, and z represent version numbers)
-* Double click the file to install RStudio
-* Once it's installed, open RStudio to make sure it works and you don't get any error messages.
+- Download R from the [CRAN website](https://cran.r-project.org/bin/macosx/).
+- Select the `.pkg` file for the latest R version
+- Double click on the downloaded file to install R
+- It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed by some packages)
+- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download)
+- Under *Installers* select **Mac OS 13+ - RSTUDIO-xxxx.yy.z-zzz.dmg** (where x = year, y = month, and z represent version numbers)
+- Double click the file to install RStudio
+- Once it's installed, open RStudio to make sure it works and you don't get any error messages.
::::::::::::::::
@@ -56,12 +52,12 @@ You have to install R before you install RStudio.
## For Linux
-* Click on your distribution in the [Linux folder of the CRAN website](https://cran.r-project.org/bin/linux/). Linux Mint users should follow instructions for Ubuntu.
-* Go through the instructions for your distribution to install R.
-* Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download)
-* Select the relevant installer for your Linux system (Ubuntu/Debian or Fedora)
-* Double click the file to install RStudio
-* Once it's installed, open RStudio to make sure it works and you don't get any error messages.
+- Click on your distribution in the [Linux folder of the CRAN website](https://cran.r-project.org/bin/linux/). Linux Mint users should follow instructions for Ubuntu.
+- Go through the instructions for your distribution to install R.
+- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download)
+- Select the relevant installer for your Linux system (Ubuntu/Debian or Fedora)
+- Double click the file to install RStudio
+- Once it's installed, open RStudio to make sure it works and you don't get any error messages.
::::::::::::::::
@@ -69,25 +65,35 @@ You have to install R before you install RStudio.
If you already have R and RStudio installed, first check if your R version is up to date:
-* When you open RStudio your R version will be printed in the console on the bottom left. Alternatively, you can type `sessionInfo()` into the console. If your R version is 4.0.0 or later, you don't need to update R for this lesson. If your version of R is older than that, download and install the latest version of R from the R project website [for Windows](https://cran.r-project.org/bin/windows/base/), [for MacOS](https://cran.r-project.org/bin/macosx/), or [for Linux](https://cran.r-project.org/bin/linux/)
-* It is not necessary to remove old versions of R from your system, but if you wish to do so you can check [How do I uninstall R?](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f)
-* After installing a new version of R, you will have to reinstall all your packages with the new version. For Windows, there is a package called `installr` that can help you with upgrading your R version and migrate your package library. A similar package called `pacman` can help with updating R packages across
-To update RStudio to the latest version, open RStudio and click on
-`Help > Check for Updates`. If a new version is available follow the
-instruction on screen. By default, RStudio will also automatically notify you
-of new versions every once in a while.
+- When you open RStudio your R version will be printed in the console on the bottom left.
+ Alternatively, you can type `sessionInfo()` into the console.
+ If your R version is 4.0.0 or later, you don't need to update R for this lesson.
+ If your version of R is older than that, download and install the latest version of R from the R project website [for Windows](https://cran.r-project.org/bin/windows/base/), [for MacOS](https://cran.r-project.org/bin/macosx/), or [for Linux](https://cran.r-project.org/bin/linux/)
+- It is not necessary to remove old versions of R from your system, but if you wish to do so you can check [How do I uninstall R?](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f)
+- After installing a new version of R, you will have to reinstall all your packages with the new version.
+  For Windows, there is a package called `installr` that can help you upgrade your R version and migrate your package library.
+ A similar package called `pacman` can help with updating R packages across
+ To update RStudio to the latest version, open RStudio and click on `Help > Check for Updates`.
+  If a new version is available, follow the
+  instructions on screen.
+ By default, RStudio will also automatically notify you of new versions every once in a while.
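The version check described in the list above can also be done with two base R calls. This is only a minimal sketch: the `4.0.0` threshold is the one stated for this lesson, and nothing here goes beyond what `sessionInfo()` already reports.

```r
# Print the full version string of the running R session
R.version.string

# TRUE if the running version meets this lesson's 4.0.0 minimum
getRversion() >= "4.0.0"
```

If the second line returns `FALSE`, follow the download links above to install a newer version of R.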
::::::::::::::::::::::::::::: callout
-The changes introduced by new R versions are usually backwards-compatible. That is, your old code should still work after updating your R version. However, if breaking changes happen, it is useful to know that you can have multiple versions of R installed in parallel and that you can switch between them in RStudio by going to `Tools > Global Options > General > Basic`.
+The changes introduced by new R versions are usually backwards-compatible.
+That is, your old code should still work after updating your R version.
+However, if breaking changes happen, it is useful to know that you can have multiple versions of R installed in parallel and that you can switch between them in RStudio by going to `Tools > Global Options > General > Basic`.
-While this may sound scary, it is **far more common** to run into issues due to using out-of-date versions of R or R packages. Keeping up with the latest versions of R, RStudio, and any packages you regularly use is a good practice.
+While this may sound scary, it is **far more common** to run into issues due to using out-of-date versions of R or R packages.
+Keeping up with the latest versions of R, RStudio, and any packages you regularly use is a good practice.
:::::::::::::::::::::::::::::
### Install required R packages
-During the course we will need a number of R packages. Packages contain useful R code written by other people. We will use the packages `tidyverse`, and `ratdat`.
+During the course we will need a number of R packages.
+Packages contain useful R code written by other people.
+We will use the packages `tidyverse` and `ratdat`.
To try to install these packages, open RStudio and copy and paste the following command into the console window (look for a blinking cursor on the bottom left), then press the Enter (Windows and Linux) or Return (MacOS) to execute the command.
@@ -97,7 +103,7 @@ install.packages(c("tidyverse", "ratdat"))
Alternatively, you can install the packages using RStudio's graphical user interface by going to `Tools > Install Packages` and typing the names of the packages separated by a comma.
-R tries to download and install the packages on your machine.
+R tries to download and install the packages on your machine.
When the installation has finished, you can try to load the packages by pasting the following code into the console:
@@ -106,22 +112,26 @@ library(tidyverse)
library(ratdat)
```
-If you do not see an error like `there is no package called ‘...’` you are good to go!
+If you do not see an error like `there is no package called '...'` you are good to go!
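If you would rather get an explicit answer than watch for an error message, the same check can be sketched with base R's `requireNamespace()`. The variable names `needed` and `installed` are illustrative only and are not used elsewhere in the lesson.

```r
# Packages this lesson needs (the same two as in the install command above)
needed <- c("tidyverse", "ratdat")

# requireNamespace() returns FALSE instead of stopping when a package is missing
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
needed[!installed]
```

An empty result (`character(0)`) means both packages were found.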
### Updating R packages
-Generally, it is recommended to keep your R version and all packages up to date, because new versions bring improvements and important bugfixes. To update the packages that you have installed, click `Update` in the `Packages` tab in the bottom right panel of RStudio, or go to `Tools > Check for Package Updates...`
+Generally, it is recommended to keep your R version and all packages up to date, because new versions bring improvements and important bugfixes.
+To update the packages that you have installed, click `Update` in the `Packages` tab in the bottom right panel of RStudio, or go to `Tools > Check for Package Updates...`
You should update **all of the packages** required for the lesson, even if you installed them relatively recently.
-Sometimes, package updates introduce changes that break your old code, which can be very frustrating. To avoid this problem, you can use a package called `renv`. It locks the package versions you have used for a given project and makes it straightforward to reinstall those exact package version in a new environment, for example after updating your R version or on another computer. However, the details are outside of the scope of this lesson.
+Sometimes, package updates introduce changes that break your old code, which can be very frustrating.
+To avoid this problem, you can use a package called `renv`.
+It locks the package versions you have used for a given project and makes it straightforward to reinstall those exact package versions in a new environment, for example after updating your R version or on another computer.
+However, the details are outside of the scope of this lesson.
### Download the data
-We will download the data directly from R during the lessons. However, if you are expecting problems with the network, it may be better to download the data beforehand and store it on your machine.
+We will download the data directly from R during the lessons.
+However, if you are expecting problems with the network, it may be better to download the data beforehand and store it on your machine.
The data files for the lesson can be downloaded manually:
- - [cleaned data](../episodes/data/cleaned/surveys_complete_77_89.csv) and
- - [zip file of raw data](../episodes/data/new_data.zip).
-
+- [cleaned data](../episodes/data/cleaned/surveys_complete_77_89.csv) and
+- [zip file of raw data](../episodes/data/new_data.zip).