
r for newbies

Steve Harris (drstevok-test1) edited this page Jan 4, 2016 · 5 revisions

R for newbies

This page shows you how to get started in R. It is best to have an objective, because then you'll know when you've achieved something. Let's plot a graph. You'll be done in 20 minutes[^1].

Here's what we're going to cover

Install R and RStudio
Find your way around RStudio
Three keys to understanding R
Write your R script


Download and install R and RStudio

  1. Download and install R from here

  2. Download and install RStudio. This is a nice shiny interface for R, and the easiest way to use it. Download it here. There should be an 'installer' for your operating system.

Start RStudio and have a look around

  • TODO(2016-01-03): insert screen shot of vanilla install of RStudio here

The screen should be divided into quadrants, or panes. The two most important are labelled Source and Console.

Console

The console is R! Type anything here, and it will be interpreted by R.

Try typing 2+2

> 2+2
[1] 4

There are 4 things to explain in the little code snippet above.

  1. The command prompt > (or greater than sign to you and me) is simply R prompting you to enter some text
  2. The expression 2+2 is the sum that we asked R to perform.
  3. We'll come back to the [1] on the next line in a second.
  4. R prints the answer 4

The number in square brackets is actually R 'numbering' your answer for you. There's only one answer so it is 'numbered' 1.[^3]

Reassuring as it is that R knows that 2+2=4, you were probably hoping for a little more. Typing directly into R is great, but we want to teach you reproducible research. The scientific method requires that we document our work, and we can't reproduce your typing unless we record it somewhere.

Source

The solution is to create a file, write your commands in that file, and then tell R to work through the commands in that file. Switch to the pane labelled Source, and this time type 2-2. When you get to the end of the line, hit Cmd+Enter (on Windows, Ctrl+Enter). This sends the line you just wrote from the Source document to the console. You should now see that R can add and subtract.

> 2+2
[1] 4
> 2-2
[1] 0

Now save the file you have written as labbook_YYMMDD.R (replace YYMMDD with today's date e.g. labbook_160103.R). You must use the .R extension to indicate that this is an R script, but you can, of course, choose any name you wish.[^2]

Three keys to understanding R

  • comments
  • functions
  • data frames

Comments

We're good to go! Well almost. There is one other crucial skill you need, and that is 'commenting' or annotating your scripts. Writing 2+2 is fine for R, but you need to remind yourself and others of why you are doing things.

If you saw these two lines of code, it would be surprising if you knew what R was doing.

7/5
7 %% 5

It is obviously much better to write this:

# This is division
7/5    
# This is the 'modulus' function (divides and returns the remainder)
7 %% 5

The # sign tells R to ignore everything after it on that line (it's for a future 'you' who might read this, and wonder what the 'past' you was up to!). Because only the text after the # is ignored, you can even put a # after some code to add a specific comment for that line. For example,

# This is division
7/5     # Answer should be 1.4
# This is the 'modulus' function (divides and returns the remainder)
7 %% 5  # Answer should be 2

Functions

Functions are the workhorses of R. You use R because it can 'do' things such as plot graphs, average numbers, print tables, or build statistical models. These are all functions. You want to know the square root of 9?

> sqrt(9)
[1] 3

That was (surprise, surprise) the square root function. It has a name (sqrt), and we pass it an 'argument' (in this case 9). An argument is something that the function needs to do its job. The function then 'returns' a result. Rewriting the above using this vocabulary:

> function_name(argument)
[1] return value
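To make the vocabulary concrete, here is a small sketch using two functions from base R. The particular numbers are arbitrary choices for illustration; the point is that some functions take more than one argument, and that arguments can be given by name.

```r
# One argument: the number whose square root we want
sqrt(16)                      # 4

# Two arguments, separated by a comma: the number to round,
# then how many decimal places to keep
round(3.14159, 2)             # 3.14

# Arguments can also be named, which makes the call easier to read
round(3.14159, digits = 2)    # 3.14
```

Naming an argument, as in the last line, is optional here, but it saves a future reader from having to remember what the 2 means.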

Functions such as mean() will return the mean of a set of numbers. Don't try this yet because you'll be annoyed that simply typing mean(1,2,3,4,5,6) doesn't work.

R has a bunch of functions at its core (known as base R) that allow you to do maths and stats; load, manipulate, and save data; and make many graphs. In addition, there is a wide community of academics and enthusiasts who contribute functions. These are typically wrapped together in something called a package. A contributed package has to be installed once, by typing install.packages("package_name_here"). After that, whenever you want to use these additional functions, you just tell R to load the package into its memory by typing library(package_name_here). For example, ggplot2 is a package of great plotting and graphing functions, so you would type library(ggplot2) to make these functions available.

  • packages are bundles of pre-written functions, and library() loads them
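As a sketch of what loading a package looks like in practice, the example below uses tools, a package that is installed alongside R itself, so it should run without downloading anything. The choice of tools and of its file_ext() function is mine, purely for illustration; the same pattern applies to a contributed package such as ggplot2 once it has been installed.

```r
# 'tools' is installed alongside R, so there is nothing to download.
# For a contributed package you would first run, once per computer:
#   install.packages("ggplot2")

library(tools)                  # load the package for this session

# file_ext() is one of the functions that 'tools' provides:
# it returns the extension of a file name
file_ext("labbook_160103.R")    # "R"
```

Before the library(tools) line has been run, typing file_ext(...) would fail because R would not know where to find that function.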

Data frames

I am guessing that you would be pretty comfortable with a table of data that looked like this.

initials  height  weight
dj        190     88
hs        183     80
gj        182     110
sm        175     95

A data frame is just the R version of that table. It has 3 columns, labelled 'initials', 'height', and 'weight'.

Now, please don't be confused when I tell you that it has 4 rows, not 5. We don't count the first row because it holds the labels. We can specify any cell in the table by giving its row and column number, in that order. We always write these references in square brackets, so [1,2] refers to the first row and the second column of our table: that is the height of person 'dj', which we can see is 190.

It is preferable to think of any table of data as just 'columns' of data that are aligned and bound together. We have three columns here.

  1. Column 1 contains the initials of the people measured: dj,hs,gj,sm
  2. Column 2 contains their heights (in cm): 190,183,182,175
  3. Column 3 contains their weights (in kg): 88,80,110,95

The formal term for each column is a vector. To create a column (or vector) of data we use the c() function.

height <- c(190,183,182,175)

This simply tells R that these 4 numbers are the same 'type of thing', and that the order we have provided is important too. The funny backwards arrow <- tells R to give these 4 numbers a name. If we type that name, we will see that R has 'remembered' these numbers.

> height <- c(190,183,182,175)
> height
[1] 190 183 182 175

If you think of this vector as a column, then you shouldn't be surprised to learn that typing height[2] will return the second 'row' of the column.

> height[2]
[1] 183
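Here is a short sketch of a few more ways to look inside a vector. It uses the same height vector; length() and the trick of putting c() inside the square brackets are standard base R, although they are not covered elsewhere on this page.

```r
height <- c(190, 183, 182, 175)

length(height)            # 4: how many values the vector holds
height[1]                 # 190: the first value
height[c(1, 4)]           # 190 175: the first and fourth values
height[length(height)]    # 175: the last value
```

The third line works because the thing inside the square brackets can itself be a vector of positions.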

Now if we were to create 'columns' of initials and weights, we could then bind the columns together into our table.

weight <- c(88,80,110,95)
initials <- c("dj", "hs", "gj", "sm")

Note that R needs a way of distinguishing names of things (e.g. height) from bits of data that could be names, so we must put quotes around the items in the initials column. If you forgot and typed c(dj, hs, gj, sm), then R would go off looking for things named dj, hs, etc., which is not what we want.
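One way to see the difference between numbers and quoted text is to ask R what kind of thing each vector holds. The sketch below uses the base R function class() for this; it is an extra tool that is not needed for the rest of this page.

```r
height <- c(190, 183, 182, 175)
initials <- c("dj", "hs", "gj", "sm")

class(height)      # "numeric": a column of numbers
class(initials)    # "character": a column of text

# With quotes, "height" is just a piece of text; without them,
# it is the name of the vector created above
"height"
height
```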

To bind these data together we use the data.frame function, and assign a name ddata[^4] so that we can access these data again.

> ddata <- data.frame(initials, height, weight)

Typing ddata will print our table

> ddata
  initials height weight
1       dj    190     88
2       hs    183     80
3       gj    182    110
4       sm    175     95

And typing ddata[1,2] will print the contents of the cell in row 1 and column 2.

> ddata[1,2]
[1] 190
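The square-bracket notation can pick out more than a single cell. The sketch below rebuilds the same table so that it can be run on its own, then shows whole rows and columns. The $ shorthand and the nrow() and ncol() functions are base R features that this page does not otherwise introduce.

```r
initials <- c("dj", "hs", "gj", "sm")
height <- c(190, 183, 182, 175)
weight <- c(88, 80, 110, 95)
ddata <- data.frame(initials, height, weight)

nrow(ddata)           # 4 rows (the labels are not counted)
ncol(ddata)           # 3 columns

ddata[3, ]            # leave the column blank to get all of row 3
ddata[, 2]            # leave the row blank to get all of column 2
ddata$height          # the same column, picked out by its label
ddata[1, "height"]    # 190: row 1 of the column labelled 'height'
```

Referring to a column by its label is usually safer than using its number, because the code keeps working if the columns are later reordered.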

Putting this all together

Do you remember my saying not to be surprised that typing mean(1,2,3,4,5,6) didn't work? Now I can explain why. The mean() function takes a vector of numbers as an argument. So you can either make the vector (aka column) on the fly using the c() function inside the mean() function:

> mean(c(190,183,182,175))
[1] 182.5

Or since we have already named this vector 'height', then it is easier (and clearer) to write:

> mean(height)
[1] 182.5

If you do try without first bunching your numbers together into a single argument ...

> mean(190,183,182,175)
[1] 190

... then R treats only the first number as data (remember that mean() expects all of its numbers in a single argument), and gives you the mean of that. The remaining three numbers are not averaged; R reads them as optional settings for mean() instead.
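The same idea carries over to the columns of a data frame, since each column is a vector. The following sketch rebuilds the table and averages two of its columns; it assumes the $ shorthand for picking a column by its label, which is base R but not explained above.

```r
initials <- c("dj", "hs", "gj", "sm")
height <- c(190, 183, 182, 175)
weight <- c(88, 80, 110, 95)
ddata <- data.frame(initials, height, weight)

# Each column is a single vector, so it can be passed as one argument
mean(ddata$height)    # 182.5
mean(ddata$weight)    # 93.25
```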

Your script

Here is the script. To save typing, you can just copy and paste this into your own file. I have used lots of comments to explain what each line does, but there is also a more detailed explanation of each part of the script below.

Get some data

Write your R script

library(curl)
ddata <- curl_fetch_memory("http:/url_to_your_data_here")

The data we need is stored on the web at http:/url_to_your_data_here. The function curl_fetch_memory will load data from a website; it comes from the curl package, which is why the script loads that package first with library(curl). We tell curl_fetch_memory which website we want by 'passing' it the information as an 'argument'.
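What curl_fetch_memory hands back is not yet a table: the downloaded file arrives as raw bytes in the content element of its result. The sketch below shows one way those bytes might be turned into a data frame, assuming the file is in CSV format. Because the real URL is not given here, the example fakes the download with a few lines of text; the names csv_text, raw_bytes, and downloaded_text are made up for illustration.

```r
# Stand-in for the downloaded data: the same bytes a web server might send
# for a small CSV file. With a real download these would come from the
# 'content' element of the curl_fetch_memory() result.
csv_text <- "initials,height,weight\ndj,190,88\nhs,183,80\ngj,182,110\nsm,175,95"
raw_bytes <- charToRaw(csv_text)

# Step 1: turn the raw bytes back into text
downloaded_text <- rawToChar(raw_bytes)

# Step 2: read the text as a table, using the first line as column labels
ddata <- read.csv(text = downloaded_text)

ddata
mean(ddata$height)    # 182.5
```

Once the data are in a data frame, a base R call such as plot(ddata$height, ddata$weight) would draw a simple scatter plot of weight against height.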


[^1]: This assumes you don't run into problems downloading and installing R and RStudio. That shouldn't be a problem, but there are computers that like to say 'no'.

[^2]: I recommend this naming scheme because it's not a bad idea to start each day's work in a clean 'labbook' so that you record your progress. Good stuff can then be extracted from the labbook, named more specifically, and saved with other related files in a specific project folder.

[^3]: I could have just said that R always works with vectors, and that the numbering simply refers to the position in the vector. This probably wouldn't have confused someone who reads the footnotes, but it might have confused others. If you want, try typing 1:100 in the console at the command prompt. This is shorthand for 1,2,3 ... 98,99,100 (i.e. all the numbers from 1 to 100). R prints this and, to help you read the output, prints where it is up to (as a number in square brackets) every time it starts a new line.

[^4]: ddata is not a typo. Naming things is famously hard, and good names like 'data' are often used by the R programming language itself. I therefore have the habit of doubling the first letter of things that I create, to help remind me when something is 'mine'. For example, R has a function table(), but I might name my table ttable. Easy!
