| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: "A Primer on R" |
| 4 | +nav_order: 2 |
| 5 | +format: docusaurus-md |
| 6 | +--- |
| 7 | + |
| 8 | +# Introduction |
| 9 | +# The `tidyverse` |
| 10 | +# The pipe (`%>%`) |
| 11 | +# `haven::read_dta()` and the `labelled` package |
| 12 | +# Functions for Data Munging |
| 13 | +## select/rename, mutate/summarise, filter, group_by() |
| 14 | +# tidyselect, stringr and Regular Expressions, pick() |
| 15 | +# `glue()` |
| 16 | +# Repeating Yourself: Anonymous functions, across, map, and rename_with |
| 17 | +# Reshaping |
| 18 | +# Mutating and Filtering Joins |
| 19 | + |
| 20 | + |
| 21 | +In this tutorial, we will learn how to reshape data from long to wide (and vice versa) using the `tidyverse` package in `R`. We will use data on cohort members' height and weight collected in Sweeps 2-7 to demonstrate the process. |
| 22 | + |
| 23 | +```{r} |
| 24 | +#| warning: false |
| 25 | +# Load Packages |
| 26 | +library(tidyverse) # For data manipulation |
| 27 | +library(haven) # For importing .dta files |
| 28 | +library(glue) # For creating strings |
| 29 | +``` |
| 30 | + |
| 31 | +```{r} |
| 32 | +#| include: false |
| 33 | +# setwd(Sys.getenv("mcs_fld")) |
| 34 | +``` |
| 35 | + |
| 36 | +# Reshaping from Wide to Long |
| 37 | + |
| 38 | +We begin by loading the data from each sweep and merging these together into a single wide format data frame; see [Combining Data Across Sweeps](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html) for more details. Note that the names of the height and weight variables in Sweep 5 (`ECHTCMA0` and `ECWTCMA0`) diverge slightly from the rubric used for other sweeps (`[A-G]CHTCM00` and `[A-G]CWTCM00`, where `[A-G]` denotes the sweep), hence the need for the more complex regular expression in the `read_dta(col_select = ...)` call. To simplify the column names in the wide dataset, we rename the Sweep 5 variables so that they follow the rubric used for Sweeps 2-4 and 6-7. |
| 39 | + |
| 40 | +```{r} |
| 41 | +fups <- c(0, 3, 5, 7, 11, 14, 17) |
| 42 | + |
| 43 | +load_height_wide <- function(sweep){ |
| 44 | + fup <- fups[sweep] |
| 45 | + prefix <- LETTERS[sweep] |
| 46 | + |
| 47 | + glue("{fup}y/mcs{sweep}_cm_interview.dta") %>% |
| 47 | + read_dta(col_select = c("MCSID", matches("^.(CNUM00|C(H|W)TCM(A|0)0)"))) %>% |
| 49 | + rename(cnum = matches("CNUM00")) |
| 50 | +} |
| 51 | + |
| 52 | +df_wide <- map(2:7, load_height_wide) %>% |
| 53 | + reduce(~ full_join(.x, .y, by = c("MCSID", "cnum"))) %>% |
| 54 | + rename(ECHTCM00 = ECHTCMA0, ECWTCM00 = ECWTCMA0) |
| 55 | + |
| 56 | +str(df_wide) |
| 57 | +``` |
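To see why the selection pattern needs the `(A|0)` alternation, it can help to try it on a handful of column names first. The sketch below uses base R's `grepl()` on a made-up vector of names that follow the rubric described above (`BCHTRZ00` is a hypothetical non-matching name added for contrast). Note that `matches()` ignores case by default whereas `grepl()` does not, so this is an approximation of what `read_dta(col_select = ...)` would pick up, not an exact reproduction.

```r
# Hypothetical column names, written to mimic the naming rubric described above
cols <- c("MCSID", "BCNUM00", "BCHTCM00", "BCWTCM00",
          "ECNUM00", "ECHTCMA0", "ECWTCMA0", "BCHTRZ00")

# Any single leading character (the sweep letter), followed by either
# CNUM00 or a height/weight stem ending in "00" or "A0"
pattern <- "^.(CNUM00|C(H|W)TCM(A|0)0)"

# Keep only the names the pattern matches
matched <- cols[grepl(pattern, cols)]
matched
```

`MCSID` is not matched by the pattern, which is why it is listed explicitly in `col_select`.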
| 58 | + |
| 59 | +`df_wide` has 14 columns. Besides the identifiers, `MCSID` and `cnum`, there are 12 columns holding the height and weight measurements from each sweep. Each of these 12 columns is prefixed by a single letter indicating the sweep. We can reshape the dataset into long format (one row per person x sweep combination) using the `pivot_longer()` function, so that the resulting data frame has five columns: two person identifiers, a variable for sweep, and variables for height and weight. We specify the columns to be reshaped using the `cols` argument, provide the new variable names in the `names_to` argument, and describe the pattern the existing column names follow using the `names_pattern` argument. For `names_pattern` we specify `"(.)(.*)"`, which breaks each column name into two pieces: the first character (`"(.)"`) and the rest of the name (`"(.*)"`). As noted, the first character holds information on sweep. In `names_to`, `.value` is a placeholder telling `pivot_longer()` to use the second piece of the column name as the name of a new value column. |
| 60 | + |
| 61 | +```{r} |
| 62 | +df_long <- df_wide %>% |
| 63 | + pivot_longer(cols = matches("C(H|W)TCM00"), |
| 64 | + names_to = c("sweep", ".value"), |
| 65 | + names_pattern = "(.)(.*)") |
| 66 | + |
| 67 | +df_long |
| 68 | +``` |
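The same call can be tried without access to the study files. The sketch below builds a small made-up wide data frame (two hypothetical cohort members, two sweeps, invented measurements) and applies the same `names_to` and `names_pattern` settings, so the effect of `.value` is easy to inspect.

```r
library(tidyverse)

# Made-up wide data: two cohort members, sweeps B and C (values are invented)
toy_wide <- tibble(
  MCSID    = c("M1", "M2"),
  cnum     = c(1, 1),
  BCHTCM00 = c(95.1, 97.3),
  BCWTCM00 = c(14.2, 15.0),
  CCHTCM00 = c(110.4, 112.8),
  CCWTCM00 = c(19.5, 20.1)
)

# First character of each name -> "sweep"; remainder -> name of a value column
toy_long <- toy_wide %>%
  pivot_longer(cols = matches("C(H|W)TCM00"),
               names_to = c("sweep", ".value"),
               names_pattern = "(.)(.*)")

toy_long
```

Each person now contributes one row per sweep, with `CHTCM00` and `CWTCM00` as ordinary columns.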
| 69 | + |
| 70 | +# Reshaping from Long to Wide |
| 71 | +We can also reshape the data from long to wide format using the `pivot_wider()` function. In this case, we want to create two new columns for each sweep: one for height and one for weight. We specify the columns holding the values to be spread out using the `values_from` argument, the column identifying which new column each value belongs in using the `names_from` argument, and the format of the new column names using the `names_glue` argument. `names_glue` uses curly braces (`{}`) to reference the `names_from` column and the special `.value` placeholder. As we are specifying multiple columns in `values_from`, `.value` stands for the name of the column the values came from. |
| 72 | + |
| 73 | +```{r} |
| 74 | +df_long %>% |
| 75 | + pivot_wider(names_from = sweep, |
| 76 | + values_from = matches("C(W|H)T"), |
| 77 | + names_glue = "{sweep}{.value}") |
| 78 | +``` |
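As a quick check of how `names_glue` assembles the new names, the following sketch runs the same call on a small made-up long data frame (hypothetical identifiers and invented measurements).

```r
library(tidyverse)

# Made-up long data: one row per person x sweep (values are invented)
toy_long <- tibble(
  MCSID   = c("M1", "M1", "M2", "M2"),
  cnum    = c(1, 1, 1, 1),
  sweep   = c("B", "C", "B", "C"),
  CHTCM00 = c(95.1, 110.4, 97.3, 112.8),
  CWTCM00 = c(14.2, 19.5, 15.0, 20.1)
)

# Sweep letter is pasted in front of the source column name, e.g. "B" + "CHTCM00"
toy_wide <- toy_long %>%
  pivot_wider(names_from = sweep,
              values_from = matches("C(W|H)T"),
              names_glue = "{sweep}{.value}")

names(toy_wide)
```

The result has one row per person and four measurement columns, recovering the original sweep-prefixed names.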
| 79 | + |
| 80 | +# Reshape a Cleaned Dataset from Long to Wide |
| 81 | +It is likely that you will need to reshape not just raw data, but cleaned data too. In the next two sections we offer advice on naming variables so that they are easy to select and reshape in long or wide formats. First, let's clean the long dataset by converting the `cnum` and `sweep` columns to integers, creating a new column for follow-up time, and creating new `height` and `weight` variables that replace negative values in the raw height and weight data with `NA` (as well as giving these variables easier-to-understand names). |
| 82 | + |
| 83 | + |
| 84 | +```{r} |
| 85 | +df_long_clean <- df_long %>% |
| 86 | + mutate(cnum = as.integer(cnum), |
| 87 | + sweep = match(sweep, LETTERS), |
| 88 | + fup = fups[sweep], |
| 89 | + height = ifelse(CHTCM00 > 0, CHTCM00, NA), |
| 90 | + weight = ifelse(CWTCM00 > 0, CWTCM00, NA)) %>% |
| 91 | + select(MCSID, cnum, fup, height, weight) |
| 92 | +``` |
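The individual cleaning steps can be checked in isolation with base R. The sketch below uses a few made-up sweep letters and raw heights (the `-1` is an invented negative value) to show how `match()` turns a sweep letter into its position in `LETTERS`, how that position indexes `fups`, and how `ifelse()` blanks out non-positive values.

```r
fups <- c(0, 3, 5, 7, 11, 14, 17)

# Made-up inputs: sweep letters and raw heights, one of them negative
sweep_letter <- c("B", "E", "G")
raw_height   <- c(95.1, -1, 170.2)

# Position of each letter in the alphabet gives the sweep number
sweep <- match(sweep_letter, LETTERS)

# Sweep number indexes the vector of follow-up ages
fup <- fups[sweep]

# Keep positive values, replace everything else with NA
height <- ifelse(raw_height > 0, raw_height, NA)

data.frame(sweep_letter, sweep, fup, height)
```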
| 93 | + |
| 94 | +To reshape the clean data from long to wide format, we can use the `pivot_wider()` function as before. This time, `names_from` is the follow-up time (`fup`) and `values_from` lists the cleaned `height` and `weight` columns. We again use the `names_glue` argument to set the format of the new column names, with curly braces (`{}`) referencing the `names_from` column and the special `.value` placeholder. As we are specifying multiple columns in `values_from`, `.value` stands for the name of the column the values came from, giving names such as `height_3y` and `weight_3y`. |
| 95 | + |
| 96 | + |
| 97 | +```{r} |
| 98 | +df_wide_clean <- df_long_clean %>% |
| 99 | + pivot_wider(names_from = fup, |
| 100 | + values_from = c(height, weight), |
| 101 | + names_glue = "{.value}_{fup}y") |
| 102 | + |
| 103 | +df_wide_clean |
| 104 | +``` |
| 105 | + |
| 106 | +# Reshape a Cleaned Dataset from Wide to Long |
| 107 | +Finally, we can reshape the clean wide dataset back to long format using the `pivot_longer()` function. We specify the columns to be reshaped using the `cols` argument, provide the new variable names in the `names_to` argument, and describe the pattern the existing column names follow using the `names_pattern` argument. For `names_pattern` we specify `"(.*)_(\\d+)y$"`, which breaks each column name into two pieces: the variable name (`"(.*)"`) and the follow-up time (`"(\\d+)"`); the underscore and the trailing `y` are discarded. We also use the `names_transform` argument to convert the follow-up time to an integer. |
| 108 | + |
| 109 | +```{r} |
| 110 | +df_wide_clean %>% |
| 111 | + pivot_longer(cols = matches("_.*y$"), |
| 112 | + names_to = c(".value", "fup"), |
| 113 | + names_pattern = "(.*)_(\\d+)y$", |
| 114 | + names_transform = list(fup = as.integer)) |
| 115 | +``` |
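To confirm that `names_transform` really produces an integer `fup` column, the sketch below applies the same settings to a small made-up wide data frame (hypothetical identifiers, invented measurements, two follow-up ages). The `cols` selector here is tightened to `"_\\d+y$"` purely for illustration.

```r
library(tidyverse)

# Made-up clean wide data: names follow the {variable}_{fup}y convention
toy_wide_clean <- tibble(
  MCSID      = c("M1", "M2"),
  cnum       = c(1L, 1L),
  height_3y  = c(95.1, 97.3),
  weight_3y  = c(14.2, 15.0),
  height_11y = c(145.0, 148.2),
  weight_11y = c(38.5, 41.0)
)

# Variable name -> value column; digits before the trailing "y" -> integer fup
toy_long_clean <- toy_wide_clean %>%
  pivot_longer(cols = matches("_\\d+y$"),
               names_to = c(".value", "fup"),
               names_pattern = "(.*)_(\\d+)y$",
               names_transform = list(fup = as.integer))

toy_long_clean
```

Without `names_transform`, `fup` would be a character column, because it is derived from column names.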