Anything created in R is an object. You can assign values to objects using the assignment operator <-:
x <- "hello world" # assigns the words "hello world" to the object x
# this is a comment
Note that comments may be included in the code after a #. The text after # is not evaluated when the code is run; comments can be written directly after the code or on a separate line.
To see the value of an object, simply type its name into the console and hit enter:
x # print the value of x to the console
## [1] "hello world"
You can also explicitly tell R to print the value of an object:
print(x) #print the value of x to the console
## [1] "hello world"
Note that because we assign characters in this case (as opposed to, e.g., numeric values), we need to wrap the words in quotation marks, which must always come in pairs. Although RStudio automatically adds the closing quotation mark when you type the opening one, you may still end up with a mismatch by accident (e.g., x <- "hello). In this case, R will show you the continuation character "+". The same happens if you accidentally execute only part of a command. The "+" means that R is expecting more input. If this happens, either add the missing closing quotation mark or press ESCAPE to abort the expression and try again.
To change the value of an object, you can simply overwrite the previous value. For example, you could also assign a numeric value to "x" to perform some basic operations:
x <- 2 # assigns the value of 2 to the object x
print(x)
## [1] 2
x == 2 # checks whether the value of x is equal to 2
## [1] TRUE
x != 3 # checks whether the value of x is NOT equal to 3
## [1] TRUE
x < 3 # checks whether the value of x is less than 3
## [1] TRUE
x > 3 # checks whether the value of x is greater than 3
## [1] FALSE
Note that the name of the object is completely arbitrary. We could also define a second object "y", assign it a different value and use it to perform some basic mathematical operations:
y <- 5 # assigns the value of 5 to the object y
x == y # checks whether the value of x is equal to the value of y
## [1] FALSE
x * y # multiplication of x and y
## [1] 10
x + y # adds the values of x and y together
## [1] 7
y^2 + 3 * x # adds the value of y squared and 3 times the value of x together
## [1] 31
Object names
Please note that object names must start with a letter (or with a . that is not followed by a number) and can only contain letters, numbers, and the separators . and _. It is important to give your objects descriptive names and to be as consistent as possible with the naming structure. In this tutorial we will be using lower case words separated by underscores (e.g., object_name). There are other naming conventions, such as using a . as a separator (e.g., object.name), or using upper case letters to mark the start of each new word (e.g., objectName). It doesn't really matter which one you choose, as long as you are consistent.
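To make the naming rules above concrete, here is a minimal sketch that checks a few candidate names. The helper is_valid_name() is a hypothetical name introduced only for this illustration; it relies on base R's make.names(), which leaves a name unchanged only if it is already syntactically valid.

```r
# Hypothetical helper (not part of the tutorial): a name counts as valid
# if make.names() does not have to alter it
is_valid_name <- function(name) {
  make.names(name) == name
}

# Three valid naming conventions, followed by two invalid names
candidate_names <- c("object_name", "object.name", "objectName",
                     "2nd_object", "object name")

name_check <- is_valid_name(candidate_names)
names(name_check) <- candidate_names
print(name_check) # TRUE for the first three names, FALSE for the last two
```

The last two candidates fail because one starts with a number and the other contains a space.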
Data types
The most important types of data are:
| Data type | Description |
|-----------|-------------|
| Numeric | Approximations of the real numbers, $\normalsize\mathbb{R}$ (e.g., mileage a car gets: 23.6, 20.9, etc.) |
| Integer | Whole numbers, $\normalsize\mathbb{Z}$ (e.g., number of sales: 7, 0, 120, 63, etc.) |
| Character | Text data (strings, e.g., product names) |
| Factor | Categorical data for classification (e.g., product groups) |
| Logical | TRUE, FALSE |
| Date | Date variables (e.g., sales dates: 21-06-2015, 06-21-15, 21-Jun-2015, etc.) |
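As a quick illustration of the data types listed above, the following sketch creates one object of each type and checks it with class(). The object names and values (e.g., the product name) are made up for this example and are not taken from the tutorial's data.

```r
# One illustrative object per data type (names and values are hypothetical)
mileage       <- 23.6                    # numeric
n_sales       <- 7L                      # integer (the L suffix requests an integer)
product_name  <- "espresso machine"      # character
product_group <- factor("kitchen")       # factor
in_stock      <- TRUE                    # logical
sale_date     <- as.Date("2015-06-21")   # Date (ISO format: year-month-day)

# Collect the class of each object in one named character vector
type_overview <- sapply(
  list(mileage = mileage, n_sales = n_sales, product_name = product_name,
       product_group = product_group, in_stock = in_stock, sale_date = sale_date),
  class
)
print(type_overview)
```

Note that a whole number typed without the L suffix (e.g., 7) is stored as numeric, not integer.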
Variables can be converted from one type to another using the appropriate functions (e.g., as.numeric(), as.integer(), as.character(), as.factor(), as.logical(), as.Date()). For example, we could convert the object y to character as follows:
y <- as.character(y)
print(y)
## [1] "5"
Notice how the value is in quotation marks since it is now of type character.
Entering a vector of data into R can be done with the c(x1, x2, ..., x_n) ("concatenate") command. In order to be able to use our vector (or any other variable) later on, we assign it a name using the assignment operator <-. You can choose names arbitrarily (but the first character of a name cannot be a number). Just make sure they are descriptive and unique. Assigning a name that is already in use to a new variable (e.g., a vector) overwrites the first one.

Instead of converting a variable, we can also create a new one and use an existing one as input. In this case we omit the as. and simply use the name of the type (e.g., factor()). There is a subtle difference between the two: when converting a variable with, e.g., as.factor(), we can only pass the variable we want to convert without additional arguments, and R determines the factor levels from the existing unique values in the variable, or just returns the variable itself if it is already a factor. When we specifically create a variable (with factor(), matrix(), etc.), we can and should set the options of this type explicitly. For a factor variable these could be the labels and levels, for a matrix the number of rows and columns, and so on.
# Numeric:
top10_track_streams <- c(163608, 126687, 120480, 110022, 108630, 95639, 94690, 89011, 87869, 85599)
# Character:
top10_artist_names <- c("Axwell /\\ Ingrosso", "Imagine Dragons", "J. Balvin", "Robin Schulz", "Jonas Blue", "David Guetta", "French Montana", "Calvin Harris", "Liam Payne", "Lauv") # Characters have to be put in ""
# Factor variable with two categories:
top10_track_explicit <- c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0)
top10_track_explicit <- factor(top10_track_explicit,
                               levels = 0:1,
                               labels = c("not explicit", "explicit"))
# Factor variable with more than two categories:
top10_artist_genre <- c("Dance", "Alternative", "Latino", "Dance", "Dance", "Dance", "Hip-Hop/Rap", "Dance", "Pop", "Pop")
top10_artist_genre <- as.factor(top10_artist_genre)
# Date:
top_10_track_release_date <- as.Date(c("2017-05-24", "2017-06-23", "2017-07-03", "2017-06-30", "2017-05-05", "2017-06-09", "2017-07-14", "2017-06-16", "2017-05-18", "2017-05-19"))
# Logical:
top10_track_explicit_1 <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
In order to "call" a vector we can now simply enter its name:
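For instance, calling the streams vector might look like the following sketch. The vector is defined again here so that the block runs on its own; the call to length() is an addition for illustration.

```r
# Same values as the numeric vector defined in the text
top10_track_streams <- c(163608, 126687, 120480, 110022, 108630,
                         95639, 94690, 89011, 87869, 85599)

top10_track_streams          # typing the name prints all values to the console
length(top10_track_streams)  # number of elements stored in the vector
```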
In order to check the type of a variable the class() function is used.
class(top_10_track_release_date)
## [1] "Date"
Data structures
Now let's create a table that contains the variables in columns and each observation in a row (like in SPSS or Excel). There are different data structures in R (e.g., Matrix, Vector, List, Array). In this course, we will mainly use data frames.
Data frames are similar to matrices but are more flexible in the sense that they may contain different data types (e.g., numeric, character, etc.), whereas all values of a vector or matrix have to be of the same type (e.g., character). It is often more convenient to use characters instead of numbers (e.g., when indicating a person's sex: "F", "M" instead of 1 for female, 2 for male). Thus we would like to combine both numeric and character values while retaining the respective desired features. This is where data frames come into play. Data frames can have different types of data in each column. For example, we can combine the vectors created above in one data frame using data.frame(). This creates a separate column for each vector, which is usually what we want (similar to SPSS or Excel).
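A minimal sketch of this step is shown below. The vectors are repeated from the text so that the block runs on its own; the exact set and order of columns passed to data.frame() is an assumption, and the name music_data is chosen to match the object used in the rest of this section.

```r
# Vectors as defined earlier in the text (repeated for a self-contained example)
top10_track_streams <- c(163608, 126687, 120480, 110022, 108630,
                         95639, 94690, 89011, 87869, 85599)
top10_artist_names <- c("Axwell /\\ Ingrosso", "Imagine Dragons", "J. Balvin",
                        "Robin Schulz", "Jonas Blue", "David Guetta",
                        "French Montana", "Calvin Harris", "Liam Payne", "Lauv")
top10_track_explicit <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0),
                               levels = 0:1,
                               labels = c("not explicit", "explicit"))
top10_artist_genre <- as.factor(c("Dance", "Alternative", "Latino", "Dance", "Dance",
                                  "Dance", "Hip-Hop/Rap", "Dance", "Pop", "Pop"))
top_10_track_release_date <- as.Date(c("2017-05-24", "2017-06-23", "2017-07-03",
                                       "2017-06-30", "2017-05-05", "2017-06-09",
                                       "2017-07-14", "2017-06-16", "2017-05-18",
                                       "2017-05-19"))
top10_track_explicit_1 <- c(FALSE, FALSE, FALSE, FALSE, FALSE,
                            FALSE, TRUE, TRUE, FALSE, FALSE)

# Combine the vectors; each vector becomes one column named after the vector
music_data <- data.frame(top10_track_streams,
                         top10_artist_names,
                         top10_track_explicit,
                         top10_artist_genre,
                         top_10_track_release_date,
                         top10_track_explicit_1)

head(music_data) # shows the first six rows
```

Each column keeps the type of the vector it came from, so the factor and date columns remain a factor and a Date inside the data frame.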
Hint: You may also use the View()-function to view the data in a table format (like in SPSS or Excel), i.e. enter the command View(data). Note that you can achieve the same by clicking on the small table icon next to the data frame in the "Environment"-window on the right in RStudio.
Sometimes it is convenient to return only specific values instead of the entire data frame. There are a variety of ways to identify the elements of a data frame. One easy way is to explicitly state which rows and columns you wish to view. The general form of the command is data.frame[rows,columns]. By leaving one of the arguments of data.frame[rows,columns] blank (e.g., data.frame[rows,]), we tell R that we want to access all columns or all rows, respectively. Here are some examples:
# creates a new data frame that only contains tracks from genre "Dance"
music_data_dance <- subset(music_data, top10_artist_genre == "Dance")
music_data_dance
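The general form data.frame[rows,columns] described above can be sketched as follows. To keep the block short and self-contained, it uses a small stand-in data frame, music_demo, built from the first five tracks only; that name and the reduced size are assumptions made for this illustration.

```r
# Small stand-in for music_data (first five tracks only, hypothetical name)
music_demo <- data.frame(
  top10_track_streams = c(163608, 126687, 120480, 110022, 108630),
  top10_artist_names  = c("Axwell /\\ Ingrosso", "Imagine Dragons", "J. Balvin",
                          "Robin Schulz", "Jonas Blue"),
  top10_artist_genre  = factor(c("Dance", "Alternative", "Latino", "Dance", "Dance"))
)

artist_column <- music_demo[, 2]            # all rows, second column
first_three   <- music_demo[1:3, ]          # rows 1 to 3, all columns
first_cell    <- music_demo[1, 1]           # first row, first column
two_columns   <- music_demo[1:3, c(1, 2)]   # rows 1 to 3, columns 1 and 2
by_name       <- music_demo[, "top10_artist_names"] # column selected by its name

print(first_three)
```

Selecting a single column this way returns a plain vector, while selecting several rows or columns returns a data frame.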
str() displays the internal structure of an R object. In the case of a data frame, it returns the class (e.g., numeric, factor, etc.) of each variable, as well as the number of observations and the number of variables.
str(music_data) # returns the structure of the data frame
To call a certain column in a data frame, we may also use the $ notation. For example, this returns all values associated with the variable "top10_track_streams":
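A minimal sketch of the $ notation, again using a small stand-in data frame (the name music_demo and the reduced set of five tracks are assumptions made so that the block runs on its own):

```r
# Small stand-in for music_data (first five tracks only, hypothetical name)
music_demo <- data.frame(
  top10_track_streams = c(163608, 126687, 120480, 110022, 108630),
  top10_artist_names  = c("Axwell /\\ Ingrosso", "Imagine Dragons", "J. Balvin",
                          "Robin Schulz", "Jonas Blue")
)

# The $ notation returns the column as a plain vector
streams <- music_demo$top10_track_streams
print(streams)
```

With the full data frame from the text, the equivalent call would be music_data$top10_track_streams.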
Assume that you wanted to add an additional variable to the data frame. You may use the $ notation to achieve this:
# Create new variable as the log of the number of streams
music_data$log_streams <- log(music_data$top10_track_streams)
# Create an ascending count variable which might serve as an ID
music_data$obs_number <- 1:nrow(music_data)
head(music_data)
You can also rename variables in a data frame, e.g., using the rename()-function from the plyr package. In the following code "::" signifies that the function "rename" should be taken from the package "plyr". This can be useful if multiple packages have a function with the same name. Calling a function this way also means that you can access a function without loading the entire package via library().
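A hedged sketch of such a rename is shown below. The new column names ("genre", "release_date") and the stand-in data frame music_demo are assumptions chosen for illustration. The block calls plyr::rename() only if the plyr package is installed and otherwise falls back to base R, so that it runs either way.

```r
# Small stand-in data frame (hypothetical name, first three tracks only)
music_demo <- data.frame(
  top10_artist_genre        = c("Dance", "Alternative", "Latino"),
  top_10_track_release_date = as.Date(c("2017-05-24", "2017-06-23", "2017-07-03"))
)

# Named character vector: old names on the left, new names on the right
new_names <- c(top10_artist_genre = "genre",
               top_10_track_release_date = "release_date")

if (requireNamespace("plyr", quietly = TRUE)) {
  # "::" takes rename() from plyr without loading the package via library()
  music_demo <- plyr::rename(music_demo, new_names)
} else {
  # Base R fallback in case plyr is not installed
  old_positions <- match(names(new_names), names(music_demo))
  names(music_demo)[old_positions] <- unname(new_names)
}

names(music_demo)
```

Both branches leave the data unchanged and only replace the two column names.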
::: {.infobox_orange .hint data-latex="{hint}"}
Note that the data handling approach explained in this chapter uses the so-called 'base R' dialect. There are other dialects in R, which are basically different ways of achieving the same thing. Two other popular dialects in R are 'data.table' and the 'tidyverse' (see, e.g., here and here). Once you become more advanced, you may want to look into the other dialects to achieve certain tasks more efficiently. For now, it is sufficient to be aware that there are other approaches to data handling and that each dialect has its strengths and weaknesses. We will mostly be using 'base R' for the tutorial on this website.
:::