Skip to content

Getting data into R

ngageorange edited this page Mar 30, 2016 · 12 revisions

Getting data into R

Using in-built function

We can import Comma Separated Values (CSV) files into R very easily. These files can be generated by Microsoft Excel, Apple Numbers and Google Sheets usually through a File -> Export process.

Once a sheet has been exported, it can be imported into R:

 data <- read.csv(LOCATION_OF_FILE, header=TRUE)

This will create a variable called data in the global namespace with your sheet. Columns can then be referenced by:

 data$column_name

Using Google Sheets

We will need to teach people to install the googlesheets library for importing data from Google Sheets.

install.packages("dplyr")
install.packages("googlesheets")

Once the package has been installed, load it into your R script:

library("dplyr"
library("googlesheets")

Now list all the sheets available (local and ones shared to your account as shown https://docs.google.com/spreadsheets/) using

gs_ls()

On the first run, it will ask you to authorise the package 'googlespreadsheet' to view your account by copying a link from the console and pasting it into a browser. Once you have authorised it, you will receive a code from your browser which you can then paste into the console. A list of all your sheets will then be printed in the console.

To access the sheet:

sheet <- gs_title(YOUR_TITLE)

Where YOUR_TITLE is the name of the sheet as shown in the gs_ls command. gs_title will authorise the use of that sheet.

To get access to the worksheet:

data <- sheet %>% gs_read(ws = NAME_OF_WORKSHEET)

data variable now contains a dataframe much like importing from CSV. You can access columns by:

data$column_name

V basic cleaning on the way in

  • to lower
  • ??data types

Testing the data is what was expected

  • head(), tail()
  • stem()
  • summary()

https://ramnathv.github.io/pycon2014-r/explore/README.html

Beginners

Types of data:

  • Numeric
  • Integers
  • Strings
  • Date/Time objects

Improvers

I'm going to use the above link to focus a 20 minute tutorial on 3 common mistakes:

  1. Column headers are values, not variable names
  2. Multiple variables are stored in one column
  3. Variables are stored in both rows and columns

There are 2 other instances of untidy data, but I won't delve too much into it.

I've already written a little powerpoint presentation that I have presented to the SHOs regarding tidy data so it'll be an extension of that.

Column headers are values, not variable names

  • Example with Income and Religion table

Multiple variables are stored in one column

Variables are stored in both rows and columns

Clone this wiki locally