# Data Preparation
Importing a dataset properly is just the first of several milestones on the way to an analysis-ready dataset. In some cases, cleaning the raw data is a necessary step to enable proper parsing of the dataset in the first place. Most of the cleaning/preparation ('wrangling') of the data, however, follows after the structured data has been parsed. Many aspects of data wrangling are specific to certain datasets, and an entire curriculum could be filled with different approaches and tools to address specific problems. Moreover, proficiency in data wrangling is generally a matter of experience gained over many years of working with data. Here, we focus on two general and broadly applicable techniques that are central to cleaning and preparing a dataset for analysis: simple string operations (finding/replacing parts of text strings) and reshaping rectangular data (wide to long/long to wide). The former focuses on individual variables, while the latter typically happens at the level of the entire dataset.
## Cleaning data with basic string operations
Recall that most of the data we read into R for analytic purposes is a collection of raw text (structured with special characters). When parsing the data to read it into R with high-level functions such as the ones provided in the `readr` package, both the structure and the data types are considered. The resulting `data.frame`/`tibble` might thus contain variables (different columns) of type `character`, `factor`, `integer`, etc. At this stage, it often happens that the raw data is not clean enough for the parser to recognize the data types in each column correctly, and it resorts to just parsing it as `character`. Indeed, if we have to deal with a very messy dataset, it can make a lot of sense to constrain the parser such that it reads each column as `character`.
We first load this package as we rely on functions provided in the `tidyverse`.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
```
Let's create a sample dataset to illustrate some of the typical issues regarding unclean data that we might encounter in empirical economic research (and many similar domains of data analysis).^[The option `stringsAsFactors = FALSE` ensures that all of the columns in this data frame are of type `character`.]
```{r}
messy_df <- data.frame(last_name = c("Wayne", "Trump", "Karl Marx"),
first_name = c("John", "Melania", ""),
gender = c("male", "female", "Man"),
date = c("2018-11-15", "2018.11.01", "2018/11/02"),
income = c("150,000", "250000", "10000"),
stringsAsFactors = FALSE)
```
Assuming we have managed to read this dataset from a local file (with all columns as type `character`), the next step is to clean each of the columns such that the dataset is ready for analysis. Thereby we want to make sure that each variable (column) is set to a meaningful data type, once it is cleaned. The *cleaning* of the parsed data is often easier to do when the data is of type `character`. Once it is cleaned, however, we can set it to a type that is more useful for the analysis part. For example, a column containing numeric values in the final dataset should be stored as `numeric` or `integer`, so we can perform math operations on it later on (compute sums, means, etc.).
### Find/replace character strings, recode factor levels
Our dataset contains a typical categorical variable: `gender`. In R, storing such variables as type `factor` is good practice. Without really looking at the data values, we might thus be inclined to do the following:
```{r}
messy_df$gender <- as.factor(messy_df$gender)
messy_df$gender
```
The column is now of type `factor`. And we see that R defined the factor variable such that an observation can be one of three categories ('levels'): `female`, `male`, or `Man`. In terms of content, that probably does not make too much sense. If we were to analyze the data later and compute the sample's share of males, we would only count one instead of two. Hence, we better *recode* the gender variable of male subjects as `male` and not `Man`. How can this be done programmatically?
One approach is to select all entries in `messy_df$gender` that are equal to `"Man"` and replace these entries with `"male"`.
```{r}
messy_df$gender[messy_df$gender == "Man"] <- "male"
messy_df$gender
```
Note, however, that this approach is not perfect because R still considers `Man` a valid possible category ('level') in this column. This can have consequences for certain analyses we might want to run on this dataset later on.^[If we perform the same operation on this variable *before* coercing it to a `factor`, this problem does not occur.] Alternatively, we can use the function `fct_recode()` (provided via the `forcats` package, part of the `tidyverse`), made specifically for such operations on factors.
```{r}
messy_df$gender <- fct_recode(messy_df$gender, "male" = "Man")
messy_df$gender
```
The latter can be very useful when several factor levels need to be recoded at once. Note that in both cases, the underlying logic is that we search for strings that are identical to `"Man"` and replace those values with `"male"`. Now, the gender variable is ready for analysis.
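As a short, self-contained sketch of recoding several levels at once (using a hypothetical vector with assumed codings `"M"` and `"F"`, not a column of our dataset), one call to `fct_recode()` can map many old levels to new ones:

```{r}
library(forcats)  # part of the tidyverse
# hypothetical, inconsistently coded gender values (for illustration only)
g <- factor(c("male", "M", "Man", "female", "F"))
# each pair reads: "new level" = "old level"
fct_recode(g, "male" = "M", "male" = "Man", "female" = "F")
```

After the call, only the two levels `female` and `male` remain.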
### Removing individual characters from a string
The `income` column contains numbers, so let's try to set this column to type `integer`.
```{r}
as.integer(messy_df$income)
```
R warns us that something did not go well when executing this code. We see that the first value of the original column has been replaced with `NA` ('Not Available'/'Not Applicable'/'No Answer'). The reason is that the original value contained a comma (`,`), a special character. The function `as.integer()` does not know how to translate such a symbol to a number. Hence, the original value cannot be translated into a number (integer). In order to resolve this issue, we have to remove the comma (`,`) from this string. Or, more precisely, we will locate this specific character *within* the string and replace it with an empty string (`""`). To do so, we use the function `str_replace()` (for 'string replace').
```{r}
messy_df$income <- str_replace(messy_df$income, pattern = ",", replacement = "")
```
Now we can successfully set the column as type integer.
```{r}
messy_df$income <- as.integer(messy_df$income)
```
### Splitting strings
From looking at the `last_name` and `first_name` columns of our messy dataset, it becomes clear that the last row is not accurately coded. `Karl` should show up in the `first_name` column. In order to correct this, we have to extract a part of one string and store this sub-string in another variable. There are several ways to do this. Here, it probably makes sense to split the original string into two parts, as the white space between `Karl` and `Marx` indicates the separation of first and last names. For this, we can use the function `str_split()`.
First, we split the strings at every occurrence of white space (`" "`). Setting the option `simplify=TRUE`, we get a matrix containing the individual sub-strings after the splitting.
```{r}
splitnames <- str_split(messy_df$last_name, pattern = " ", simplify = TRUE)
splitnames
```
As the first two observations did not contain any white space, there was nothing to split there, and the second column of the resulting matrix simply contains empty strings (`""`) for these rows. In a second step, we replace the empty observations in the `first_name` column with the corresponding values in `splitnames`.
```{r}
problem_cases <- messy_df$first_name == ""
messy_df$first_name[problem_cases] <- splitnames[problem_cases, 1]
```
Finally, we must correct the `last_name` column by replacing the respective values.
```{r}
messy_df$last_name[problem_cases] <- splitnames[problem_cases, 2]
messy_df
```
### Parsing dates
Finally, we take a look at the `date`-column of our dataset. For many data preparation steps as well as visualization and analysis, it is advantageous to have times and dates properly parsed as type `Date`. In practice, dates and times are often particularly messy because no unique standard has been used to define the format in the data collection phase. This also seems to be the case in our dataset. In order to work with dates, we load the `lubridate` package.
```{r message=FALSE}
library(lubridate)
```
This package provides several functions to parse and manipulate date and time data. From the `date` column, we see that the format is year, month, day. Thus, we can use the `ymd()` function provided in the `lubridate` package to parse the column as type `Date`.
```{r}
messy_df$date <- ymd(messy_df$date)
```
Note how this function automatically recognizes how different special characters have been used in different observations to separate years from months/days.
Now, our dataset is cleaned up and ready to go.
```{r}
messy_df
```
```{r}
str(messy_df)
```
## Reshaping datasets
Besides cleaning and standardizing individual data columns, preparing a dataset for analysis often involves bringing the entire dataset into the right 'shape'. In a table-like (two-dimensional) format such as `data.frame`s and `tibble`s, data with repeated observations for the same unit can be displayed/stored in either *long* or *wide* format. It is often seen as good practice to prepare data for analysis in *long* ('tidy') format. This way we ensure that we follow the ('tidy') paradigm of using the rows for individual observations and the columns to describe these observations.^[Depending on the dataset, however, an argument can be made that storing the data in wide format might be more memory-efficient than long format.] Tidying/reshaping a dataset in this way thus involves transforming columns into rows (i.e., *melting* the dataset). In the following, we first have a close look at what this means conceptually and then apply this technique in two examples.
### Tidying messy datasets
Consider the following stylized example [@wickham_2014].
```{r echo=FALSE, warning=FALSE, message=FALSE, purl=FALSE}
rawdata <- read_csv("data/treatments.csv")
```
```{r echo=FALSE, purl=FALSE}
kable(rawdata)
```
The table shows observations of three individuals participating in an experiment. In this experiment, the subjects might have been exposed to treatment a and/or treatment b. Their reaction to either treatment is measured in numeric values (the results of the experiment). From looking at the raw data in its current shape, this is not really clear, though. While we see which numeric value corresponds to which person and treatment, it is unclear what this value represents. One might, for example, wrongly assume that the numeric values refer to the treatment intensity of a and b. Such an interpretation would align with the idea of columns containing variables and rows containing observations. But, considering what the numeric values stand for, we realize that the column headers are not *names of variables* but *values* of a variable (the categorical variable `treatment`, with levels `a` and `b`).
Now consider the same data in 'tidy' format (variables in columns and observations in rows).
```{r echo=FALSE, warning=FALSE, message=FALSE}
tidydata <- pivot_longer(data = rawdata, c(treatmenta, treatmentb), names_to = "treatment", values_to = "result" )
tidydata$treatment <- gsub("treatment", "", tidydata$treatment)
```
```{r echo=FALSE, purl=FALSE}
kable(tidydata)
```
This *long*/*tidy* shape of the dataset has several advantages. First, it is now clear what the numeric values refer to. Second, it is much easier to filter/select the observations in this format.
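To illustrate the second point with a minimal, self-contained sketch (using made-up values, not the experiment data shown above), selecting all observations for one treatment is a single `filter()` call in long format:

```{r}
library(tidyverse)
# hypothetical long-format data (for illustration only)
tidy_example <- tibble(person = c("A", "A", "B", "B"),
                       treatment = c("a", "b", "a", "b"),
                       result = c(1, 2, 3, 4))
# select all observations belonging to treatment "a"
filter(tidy_example, treatment == "a")
```

In wide format, the same selection would require picking out one of several value columns instead of simply subsetting rows.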
### Pivoting from 'wide to long'
In the `tidyverse` context, we call the transformation of columns to rows 'pivoting from wide to long'. That is, we pivot columns into keys (or names) and values. A typical situation where this has to be done in applied data analysis is when a dataset contains several observations over time for the same subjects. The following figure illustrates the basic concept. On the left, you see a wide data frame, which is not in line with the tidy data concept. Reshaping it to long format yields the new data frame on the right.
```{r widetolong, echo=FALSE, out.width = "40%", fig.align='center', fig.cap= "(ref:widetolong)", purl=FALSE}
include_graphics("img/reshape.png")
```
(ref:widetolong) Reshaping of a data frame from wide to long format.
To illustrate how *pivoting from wide to long* works in practice, consider the following example dataset (extending on the example above).
```{r}
wide_df <- data.frame(last_name = c("Wayne", "Trump", "Marx"),
first_name = c("John", "Melania", "Karl"),
gender = c("male", "female", "male"),
income.2018 = c("150000", "250000", "10000"),
income.2017 = c( "140000", "230000", "15000"),
stringsAsFactors = FALSE)
wide_df
```
The last two columns contain information on the same variable (`income`), but for different years. We thus want to pivot these two columns into a new `year` and `income` column, ensuring that columns correspond to variables and rows correspond to observations. For this, we call the `pivot_longer()`-function as follows:
```{r}
long_df <- pivot_longer(wide_df, c(income.2018, income.2017), names_to = "year", values_to = "income")
long_df
```
We can further clean the `year` column to only contain the respective numeric values.
```{r}
long_df$year <- str_replace(long_df$year, "income.", "")
long_df
```
### Pivoting from 'long to wide' ("spreading")
As we want to adhere to the 'tidy' paradigm of keeping our data in long format, transforming 'long to wide' is less common. However, it might be necessary if the dataset at hand is particularly messy. The following example illustrates such a situation.
```{r}
weird_df <- data.frame(last_name = c("Wayne", "Trump", "Marx",
"Wayne", "Trump", "Marx",
"Wayne", "Trump", "Marx"),
first_name = c("John", "Melania", "Karl",
"John", "Melania", "Karl",
"John", "Melania", "Karl"),
gender = c("male", "female", "male",
"male", "female", "male",
"male", "female", "male"),
value = c("150000", "250000", "10000",
"2000000", "5000000", "NA",
"50", "25", "NA"),
variable = c("income", "income", "income",
"assets", "assets", "assets",
"age", "age", "age"),
stringsAsFactors = FALSE)
weird_df
```
While the data is somehow in a long format, the rule that each column should correspond to a variable (and vice versa) is violated: data on income, assets, and the age of the individuals are all put in the same column. We can correct this by calling the function `pivot_wider()` with the two arguments `names_from` and `values_from`.
```{r}
tidy_df <- pivot_wider(weird_df, names_from = "variable", values_from = "value")
tidy_df
```
## Tutorial: Hotel Bookings Time Series
This tutorial guides you step-by-step through the cleaning script (with a few adaptations) of [tidytuesday's Hotel Bookings repo](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11), dealing with the preparation and analysis of two datasets with *hotel demand data*. Along the way, you also get in touch with the [janitor package](https://github.com/sfirke/janitor). For details about the two datasets, see the [paper](https://www.sciencedirect.com/science/article/pii/S2352340918315191#f0010) by @nuno_etal2019, and for the original research contribution related to these datasets see the [paper](https://ieeexplore.ieee.org/document/8260781) by @nunes_etal2017.
@nuno_etal2019 summarizes the content of the datasets as follows: "One of the hotels (H1) is a resort hotel, and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is real data, all data elements pertaining to hotel or customer identification were deleted. Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields."
The aim of the tutorial is to get the data in the form needed for the following plot.
```{r echo=FALSE, purl=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(feasts)
# fix variables
url_h1 <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/H1.csv"
url_h2 <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/H2.csv"
# resort hotel
h1 <- read_csv(url_h1) %>%
janitor::clean_names() %>%
mutate(hotel = "Resort Hotel") %>%
select(hotel, everything())
# city hotel
h2 <- read_csv(url_h2) %>%
janitor::clean_names() %>%
mutate(hotel = "City Hotel") %>%
select(hotel, everything())
hotel_df <- bind_rows(h1, h2)
hotel_plot <- hotel_df %>%
filter(hotel == "City Hotel") %>%
mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
date = parse_date(date, format = "%Y-%B-%d")) %>%
select(date, everything()) %>%
arrange(date) %>%
count(date) %>%
rename(daily_bookings = n) %>%
tsibble::as_tsibble() %>%
model(STL(daily_bookings ~ season(window = Inf))) %>%
components() %>% autoplot()
hotel_plot
```
The final dataset combines the two source datasets; its first few rows and columns should look as follows:
```{r}
head(hotel_df)
```
### Set-up and import
All the tools we need for this tutorial are provided in `tidyverse` and `janitor`, and the data is directly available from the [tidytuesday GitHub repository](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11). The original data is provided in CSV format.
```{r message=FALSE, warning=FALSE}
# SET UP --------------
# load packages
library(tidyverse)
library(janitor) # install.packages("janitor") (if not yet installed)
# fix variables
url_h1 <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/H1.csv"
url_h2 <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/H2.csv"
## DATA IMPORT -----------------
h1 <- read_csv(url_h1)
h2 <- read_csv(url_h2)
```
In the next step, we clean the column names and add columns to clarify which of the two hotels the corresponding observations belong to (see dataset description above). Finally, we stack the observations (rows) together in one tibble/data.frame.
```{r}
## CLEAN DATA -------------------------
# use the janitor-package clean_names function. see ?clean_names for details
h1 <- clean_names(h1)
h2 <- clean_names(h2)
# add a column to clarify the origin of observation
h1 <- mutate(h1, hotel="Resort Hotel")
h2 <- mutate(h2, hotel="City Hotel")
# stack observations
hotel_df <- bind_rows(h1,h2)
# inspect the first observations
head(hotel_df)
```