IntRo/07_transform.qmd at master · andreamazzella/IntRo · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
---
title: "07: Transforming data"
subtitle: "Introduction to R for health data"
author: Andrea Mazzella [(GitHub)](https://github.com/andreamazzella)
---

------------------------------------------------------------------------

## Content

-   Data transformation
    -   calculate new variables
    -   categorise continuous variables
    -   combine levels of a categorical variable
    -   create a dummy variable
    -   combine values of two categorical variables

------------------------------------------------------------------------

## Recap from topic 06_subset

From default dataset `Theoph`, identify at which time point participant 11 had a plasma theophylline concentration higher than 7.5 mg/L (without scrolling through the dataset manually).

------------------------------------------------------------------------

Load packages

```{r}
library(tidyverse)
```

```{r}
dm <- read_rds("data/processing/diabetes.rds")
```

------------------------------------------------------------------------

## Data transformation

During data analysis, it's very common to need to derive new information from the raw data. For example:

-   The body mass index (BMI) is a continuous variable, but we might want to analyse it divided in clinical categories.
-   We have data on disability with three levels: able-bodied, mild disability, severe disability, but we want to change it to a binary variable (disability yes/no).

------------------------------------------------------------------------

### Calculate new variables

In `dplyr`, function `mutate()` is used to add new variables (or change existing ones).

The first argument is the dataset name, then you have the new variable name, a single `=` sign, and then the formula to use to calculate the values in the new variable. Formulae can incorporate values from other columns.

As an example, let's convert height from its current unit (cm) into meters.

The following code reads: take dataset "dm", change it by adding a new column called "height_m", and populate this new column values with the corresponding "height" value on the same row, divided by 100.

```{r}
dm |>
  mutate(height_m = height / 100)
```

By default, new variables are added on the right-hand side of the data frame. You can change this with the `.before =` or `.after =` argument.

Like with all `dplyr` functions, this change is temporary; if you want this column to be permanently added to the dataset, you need to assign it. In this case, I generally recommend creating a new object with a meaningful name, but occasionally you might want to overwrite the existing data frame.

```{r}
# Creating a new object
dm_metric <- dm |>
  mutate(height_m = height / 100)

# Overwriting
dm <- dm |>
  mutate(height_m = height / 100)
```

Remember that when you assign something, you can't see any output. If you want to check whether it worked, you can have a look at the dataset.

Whenever you calculate a new variable, I recommend having a quick sanity check to make sure that it did what you intended.

```{r}
# Sanity check
dm |>
  count(height, height_m)
```

#### Exercise 1

1.  Convert weight into pounds (1 kg = 2.20462 lbs)
2.  Create a new data frame that contains this column.

(Note: unlike Stata, if you make a mistake and you need to recalculate the variable, you don't need to drop the old one first. When you assign it again, R will overwrite the old variable.)

```{r}

```

------------------------------------------------------------------------

### Categorise continuous variables

We might also want to transform a continuous variable into a categorical one, for example to reflect clinically meaningful groups.

Imagine we want to categorise BMI into underweight (BMI lower than 18.5), normal weight (18.5 - 25), overweight (25-30) and obese (above 30). Let's see what the minimum and maximum BMI are:

```{r}
summary(dm$bmi)
```

We can then use `mutate()`, this time with helper function `cut()` - we need to specify the `breaks`, i.e., which values delimit groups.

```{r}
dm_bmi <- dm |>
    mutate(bmi_group = cut(bmi, breaks = c(0, 18.5, 25, 30, Inf), right = FALSE))

# Check that the new variable was created correctly
dm_bmi |>
  summarise(
    n(),
    min(bmi),
    max(bmi),
    .by = bmi_group
    )
```

By default `cut()`:

-   creates a left-open and right-closed interval, i.e., `(25,30]` excludes 25 and includes 30.
-   creates labels in mathematical notation.

We can add custom labels to each group by adding the argument `labels =`. NB: labels should have 1 less `labels` than `breaks`.

```{r}
dm_bmi <- dm |>
  mutate(
    bmi_group_label = cut(
      bmi,
      right = FALSE,
      breaks = c(0, 18.5, 25, 30, 40),
      labels = c("underweight", "regular weight", "overweight", "obese")
      )
    )

dm_bmi |>
  summarise(
    n(),
    min(bmi),
    max(bmi),
    .by = bmi_group_label
    )
```

NB: categorising continuous variables, and especially dichotomising them, is potentially highly problematic - see this paper: Turner E, Dobson J & Pocock S. (2010). [Categorisation of continuous risk factors in epidemiological publications: A survey of current practice](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2972292/).

#### Exercise 2

Categorise weekly exercise into groups of your choosing.

```{r}


```

------------------------------------------------------------------------

### Combine levels of a categorical variable

Sometimes we might need to combine two or more levels of a categorical variable.

For example, the `continent` variable contains both South and North America; imagine we want to combine them into a single level, Americas.

```{r}
# Explore levels
dm |>
  count(continent)
```

To do this, we can use `mutate()` to create the new variable and `fct_collapse()` to combine the levels of the old one. This function, from package `forcats`, takes these arguments:

-   the old variable
-   the first manually defined value, followed by `=`, followed by a vector of the old values.
-   (any other new manually defined names)

```{r}
# Regroup
dm_macro <- dm |>
    mutate(
      macrocontinent = fct_collapse(
        continent,
        america = c("n_america", "s_america"),
        eurasia = c("europe", "asia")
        )
      )

# Check it worked
dm_macro |>
  count(continent, macrocontinent)
```

You might also want to lump together the levels with few observations - for example, there are only 16 people in this dataset doing swimming, yoga, or gymnastics.

```{r}
# Check before
dm |>
  count(exercise_type, sort = TRUE)
```

Let's lump them into an "other" category. For this, we can use the `fct_lump_` functions from `forcats`.

```{r}
# Lump
dm <- dm |>
  mutate(exercise_type_2 = fct_lump_min(exercise_type, 10, other_level = "OTHER"))

# Sanity check
dm |>
  count(exercise_type, exercise_type_2, sort = TRUE)
```

------------------------------------------------------------------------

### Create a dummy variable

If you want to reduce the levels to two (dummy variable), I recommend creating a logical variable (i.e., that contains only TRUE or FALSE), rather than a "yes"/"no" or 1/0 variable.

For example, the "disability" variable has three levels: "able-bodied", "mild" and "severe".

```{r}
# Data exploration
dm |>
  count(disability)
```

Let's turn this into a `disabled` variable that can be either TRUE (= yes) or FALSE (= no).

```{r}
# Regroup
dm <- dm |>
  mutate(disabled = disability == "mild" | disability == "severe")

# Check it worked
dm |>
  count(disability, disabled)
```

NB: do *not* use the `%in%` shortcut in these circumstances, otherwise missing values will be incorrectly recoded as FALSE!

#### Exercise 3

Pick one:

-   A: Create a new variable with three levels: eats meat, eats fish, doesn't eat either, using `diet` as reference.
-   B: Create a new binary variable that indicates whether a patient is inactive, using `exercise` as reference.

```{r}

```

------------------------------------------------------------------------

### Combine values of two categorical variables

We can also create a new categorical variables that summarises two other categorical variables.

For example, we can create a new variable, `frailty`, that reflects the combination of disability and lack of exercise.

```{r}
# Cross-tabulation
dm |> count(disability, exercise)
```

Suppose we want to specify that someone with a mild disability is moderately frail if they don't exercise, and only mildly frail if they do. To do this, we may use `case_when()`.

This allows us to check for many logical statements (on the left-hand side of `~` and give specific values when these are met (on the right-hand side).

```{r}
# Create the new variable
dm <- dm |>
    mutate(frailty = case_when(
      disability == "severe"                    ~ "severely frail",
      disability == "mild" & exercise == "none" ~ "moderately frail",
      disability == "mild" & exercise != "none" ~ "mildly frail",
      disability == "able-bodied"               ~ "not frail"
    ))

# Check it worked
dm |> count(disability, exercise, frailty)
```

You can also add a catch-all at the end, for all rows that haven't matched any of the listed conditions. This is done by using argument `.default = ` at the end.

------------------------------------------------------------------------

#### Exercise 4

1.  Import the `bristol.rds` dataset (from the `processing` folder).
2.  Explore the data.
3.  Categorise the type of stool in the Bristol scale into three levels: "constipation" (type 1-2), "normal" (type 3-5), "diarrhoea" (type 6-7).
4.  Save a copy as `bristol_v2.rds`.

```{r}
# Import


# Explore


# Categorise


# Export


```

------------------------------------------------------------------------

## Learning more

-   R for Data Science (2ed) chapter 3: [Data transformation](https://r4ds.hadley.nz/data-transform.html)

------------------------------------------------------------------------

## Solutions

```{r}
#| label: solution_1
dm_imperial <- dm |>
  mutate(pounds = weight * 2.20462)
```

```{r}
#| label: solution_2
# Summarise age
dm |>
  ggplot(aes(exercise_weekly)) +
  geom_histogram()

# Categorise
dm <- dm |>
  mutate(exercise_grp = cut(
    exercise_weekly,
    breaks = c(0, 1, seq(30, 120, by = 30), 240),
    labels = c("0 min", "20-29 min", "30-59 min", "1-1.4 h", "1.5-1.9 h", "2 h or more"),
    right = FALSE)
    )

# Sanity check
dm |>
  summarise(min(exercise_weekly),
            max(exercise_weekly),
            n(),
            .by = exercise_grp) |>
  arrange(exercise_grp)
```

```{r}
#| label: solution_3

# Check the levels of the existing variables
dm |>
  count(diet)
dm |>
  count(exercise)

dm <- dm |>
  mutate(
    # Option A
    diet_new = fct_collapse(diet, no_meat = c("vegan", "vegetar")),
    # Option B
    inactive = exercise == "none")

# Check the recoding worked
dm |>
  count(diet, diet_new)

dm |>
  count(exercise, inactive)
```

```{r}
#| label: solution_4
# Import
bristol <- read_rds("data/processing/bristol.rds")

# Explore
View(bristol)
glimpse(bristol)
bristol |>
  count(bristol_type)

# Categorise
bristol <- bristol |>
    mutate(
      stool_type = case_when(bristol_type <= 2 ~ "constipation",
                             between(bristol_type, 3, 5) ~ "normal",
                             bristol_type >= 6 ~ "diarrhoea"
                             )
      )

# Check
bristol |>
  count(bristol_type, stool_type)

# Export
# bristol |>
#   write_rds("data/processed/bristol_v2.rds")
```

------------------------------------------------------------------------