-
Notifications
You must be signed in to change notification settings - Fork 11
Expand file tree
/
Copy path05-Additional-content---Reading-Data-In.Rmd
More file actions
373 lines (298 loc) · 12.2 KB
/
05-Additional-content---Reading-Data-In.Rmd
File metadata and controls
373 lines (298 loc) · 12.2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
---
title: Reading Data In
teaching: 35
exercises: 10
questions:
- "How can I get data into R?"
objectives:
- "Read data in from plain text and Excel files"
- "Control import parameters to cope with unusual data formats"
keypoints:
- "Use `read_csv()` or `read_tsv()` to read in plain text data"
- "Use `read_excel()` from the `readxl` package to read in Excel files"
source: Rmd
---
```{r setup, include=FALSE}
source("../bin/chunk-options.R")
knitr_fig_path("05-")
```
So far, we have been using datasets that are already available within R. Chances are though that the
data you want to work with exists separately, perhaps as a spreadsheet or CSV file. In order to
start working with this data, we need to learn how to read it in to R.
## Common file types
There are many different ways that data could be stored, but for now we will focus on
reading in tabular data. This is data that is already in a
[data frame]({{ page.root }}{% link _episodes/07-Dataframes.md %})-like structure and so we will
want to read it in to R as a data frame.
Two of the most common ways you might find data like this are:
#### Plain text files
These are text files where the columns of data are separated by some delimiting character. Examples
of these might include
* CSV (**c**omma **s**eparated **v**alue) files that use a comma to delimit the columns
~~~
carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
~~~
* TSV (**t**ab **s**eparated **v**alue) files, which use a tab instead
~~~
carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
~~~
#### Spreadsheet files
All spreadsheet programs (Excel, Calc, Numbers, etc.) will have a way to export data into one
of the plain text formats above. This will usually be the best way to get data into R. For excel
files (.xls or .xlsx) however, it is possible to read them in directly which we will demonstrate later
## Gapminder data {#gapminder}
For this section, we will be reading in data from the [Gapminder](https://www.gapminder.org/) organisation,
which records various statistics for 142 countries betwen 1952 and 2007. This
data is available as an [R package](https://cran.r-project.org/web/packages/gapminder/index.html),
but we have prepared [csv]({{ page.root }}{% link data/gapminder.csv %}),
[tsv]({{ page.root }}{% link data/gapminder.tsv %}), and
[excel]({{ page.root }}{% link data/gapminder.xlsx %}) versions for you to practice with.
> ## Challenge 1
>
> Download the three versions of the Gapminder above and save them to your project folder.
>
> Open one of the files and describe what statistics are recorded
> > ## Solution to Challenge 1
> > Using the ideas discussed previously about [project structure]({{ page.root }}{% link _episodes/01-R-and-RStudio.md %}),
> > we will save the files into a `data` directory within our project. We can then access them with
> > a relative path `data/gapminder.csv` (for the csv example)
> >
> > Opening a file we can see that there are six columns of data: a country name and continent, the
> > year that the data was recorded, and the life expectancy, population and GDP per capita.
> {: .solution}
{: .challenge}
To load this data into R, we will use the `read_csv` function from the [`readr`](http://readr.tidyverse.org) package (which will
be loaded automatically if you have preciously run `library(tidyverse)`, but here we will
load it separately).
```{r read-01}
library(readr)
gapminder_csv <- read_csv("data/gapminder.csv")
gapminder_csv
```
Here, we can see that `read_csv` provides us with some information on what it thinks the data types
are as it reads the file in (two columns of character data and four columns of double in this case).
We can also see that it has imported the data in the
[tibble format]({{ page.root }}{% link _episodes/07-Dataframes.md %}) discussed previously.
Similarly, for the tsv file there is a `read_tsv` function
> ## Challenge 2
> Read in the `gapminder.tsv` file using `read_tsv`
>
> Confirm that reading from csv or tsv files produce the same output (you might find the `all.equal`
> function useful)
>
> > ## Solution to Challenge 2
> >
> > ```{r ch2-sol, message = FALSE}
> > gapminder_tsv <- read_tsv("data/gapminder.tsv")
> >
> > gapminder_tsv
> >
> > all.equal(gapminder_csv, gapminder_tsv)
> > ```
> {: .solution}
{: .challenge}
## Control of import parameters
Both `read_csv` and `read_tsv` have a number of parameters that may be needed to help import a file.
You can read more about these in the help file for the functions, but some more commonly encountered
situations are:
#### Ignoring comments or metadata from the top of a file
Sometimes files will have metadata information included along with the data. Use `skip` to ignore a
set number of lines from the top of the file or `comment` if metadata lines are indicated by a
specific character such as `#`.
Eg. a file such as:
~~~~
#Metadata information detailing data collection process
#Not part of the data, but included to explain it
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453145
Afghanistan Asia 1957 30.332 9240934 820.8530296
Afghanistan Asia 1962 31.997 10267083 853.10071
~~~~
Can be read in using:
~~~~
read_tsv("data/commented_file.tsv", comment = "#")
read_tsv("data/commented_file.tsv", skip = 2)
~~~~
{: .language-r}
Both lines above produce the same output
#### Providing column names
By default, `readr` assumes that the first row of data is a "header" - that is defines the names of the columnes for the data frame. If this is not
the case you can either specify the names manually with a character vector, or have the column names generated
automatically
Eg.
~~~~
Afghanistan Asia 1952 28.801 8425333 779.4453145
Afghanistan Asia 1957 30.332 9240934 820.8530296
Afghanistan Asia 1962 31.997 10267083 853.10071
~~~~
Can be read in using:
~~~~
read_tsv("data/no_names.tsv", col_names = F)
~~~~
{: .language-r}
~~~~
# A tibble: 3 x 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
~~~~
{: .output}
or
~~~~
read_tsv("data/no_names.tsv", col_names = c("country", "continent","year","lifeExp","pop","gdpPercap"))
~~~~
{: .language-r}
~~~~
# A tibble: 3 x 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
~~~~
{: .output}
#### Dealing with missing data
By default, blank columns or ones containing "NA" are considered missing. If your data uses a different
value for missing data you will need to specify it.
Eg.
~~~~
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453145
Afghanistan Asia 1957 30.332 - 820.8530296
Afghanistan Asia 1962 31.997 10267083 853.10071
~~~~
~~~~
read_tsv("data/missing.tsv")
~~~~
{: .language-r}
~~~~
# A tibble: 3 x 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 - 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
~~~~
{: .output}
Using the default parameters, `pop` is incorrectly read in as a character column because
missing data is indicated by `-`
~~~~
read_tsv("data/missing.tsv", na = "-")
~~~~
{: .language-r}
~~~~
# A tibble: 3 x 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 NA 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
~~~~
Now the `pop` data is correctly recognised as a number, and the missing data is correctly recorded
as `NA`.
#### Setting explicit column types
Sometimes, the default column types might not be what you intend. In that case you can set them
explicitly by either providing a string representing the column types
> From the `read_tsv` help file:
> c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date,
> T = date time, t = time, ? = guess, or _/- to skip the column
{: .quotation}
Eg. setting `country` to be a factor, `pop` as an integer, and not reading in the `lifeExp` column.
The remaining columns are left as 'guess', in which case `readr` tries to determine their type automatically.
```{r setting-col-types-1}
read_tsv("data/gapminder.tsv", col_types = "f??-i?")
```
> ## Note:
> If you choose to use a type string like this, you must provide a type for each colum in the data
> or R will throw a warning
{: .callout}
Or by providing a column specification, like how `read_csv` did to explain it's default parsing
when we [first ran it](#gapminder)
Importing the same types as above in this format would look like this:
```{r setting-col-types-2}
# Default is col_guess()
# So we just need to specify the
# columns that are different
gapminder_spec <- cols(
country = col_factor(),
pop = col_integer(),
gdpPercap = col_skip()
)
read_tsv("data/gapminder.tsv", col_types = gapminder_spec)
```
Knowing how these few parameters work will enable you to read most csv/tsv files you come across into R.
> ## Challenge 4
>
> Read in the gapminder tsv again, but make the `continent` column a factor and the `year` column a
> date. (Hint: Use the `cols()` specification and "%Y" is the date format you will need)
>
> Check whether the newly imported data is still the same as the csv data imported previously
>
> > ## Solution to Challenge 4
> >
> > ```r
> > spec <- cols(
> > continent = col_factor(),
> > year = col_date(format = "%Y")
> > )
> >
> > gapminder_tsv <- read_tsv("data/gapminder.tsv", col_types = spec)
> >
> > all.equal(gapminder_csv, gapminder_tsv)
> > ```
> > ~~~~
> > [1] "Incompatible type for column `country`: x character, y factor" "Incompatible type for column `year`: x numeric, y Date"
> > ~~~~
> > {: .output}
> {: .solution}
{: .challenge}
## Reading excel files
For reading in excel files, we will use the [`readxl`](https://readxl.tidyverse.org) package. This
package is part of the tidyverse, but is not loaded with `library(tidyverse)`, so we will have to do
it ourselves.
```{r load-readxl}
library(readxl)
```
We can now use the `read_excel` function from `readxl` to read in our data.
```{r read-excel}
gapminder_excel <- read_excel("data/gapminder.xlsx")
```
Since `read_excel` uses most of the same options as `read_csv`/`read_tsv` covered above, you already
know how to use it. Some excel specific options you may need to use include
#### Specifying the worksheet to extract
The gapminder data is all in the first sheet in the file (called 'gapminder'). This means that
~~~~
read_excel("data/gapminder.xlsx")
read_excel("data/gapminder.xlsx", sheet = 1)
read_excel("data/gapminder.xlsx", sheet = "gapminder")
~~~~
{: .language-r}
will all produce the same result.
#### Extracting data from a specific range
Perhaps you are only interested in data from a particular region of the worksheet. This can be
imported using an excel range pattern.
```{r excel-import-range}
read_excel("data/gapminder.xlsx", range = "A1:E4")
```
> ## `read_csv` and `read.csv`
>
> You may have noticed that there is also a `read.csv` function. This default behaviour of this
> function is that all strings are read in as factors, which can be a common source of mistakes for
> newer R users. For this reason, we will stick with using the tidyverse equivalents where possible.
>
> Regardless of the method chosen. After importing your data, it's always a good choice to check
> it afterwards using the [data frame]({{ page.root }}{% link _episodes/07-Dataframes.md %})
> exploration methods
{: .callout}