01-reading.Rmd: 20 additions & 8 deletions
@@ -232,10 +232,10 @@ As with plain text files, you should always explore the data file before importi
## Reading data from a database
Another very common form of data storage to be read into R for data analysis is the relational database. There are many relational database management systems, such as
- SQLite, MySQL, PosgreSQL, Oracle, and many more. Almost all employ SQL (*structured query language*) to pull data from the database. Thankfully, you don't need to know SQL
+ [SQLite](https://www.sqlite.org/index.html), [MySQL](https://www.mysql.com/), [PostgreSQL](https://www.postgresql.org/), [Oracle](https://www.oracle.com/ca-en/index.html), and many more. These different relational database management systems each have their own advantages and limitations. Almost all employ SQL (*structured query language*) to pull data from the database. Thankfully, you don't need to know SQL
to analyze data from a database;
several packages have been written
- that allows R to connect to relational databases and use the R programming language as the front end (what the user types in) to pull data from them. In this book, we will
+ that allow R to connect to relational databases and use the R programming language as the front end (what the user types in) to pull data from them. In this book, we will
give examples of how to do this using R with SQLite and PostgreSQL databases.
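The connection step described above can be sketched as follows. Since the book's `.db` file isn't available here, this hypothetical snippet first creates a small in-memory SQLite database (the table name `lang` and its contents are invented for illustration) and then lists its tables with `dbListTables`:

```r
# A hedged sketch of connecting to an SQLite database with DBI + RSQLite.
# The book's .db file isn't available here, so we create a small in-memory
# database first (the table "lang" and its contents are made up).
library(DBI)
library(RSQLite)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang", data.frame(
  category = "Aboriginal languages",
  language = "Inuktitut"
))

dbListTables(conn)  # returns the names of all tables: here, just "lang"
```

To open an on-disk database instead, replace `":memory:"` with the path to the `.db` file.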
- We only get one table name returned form calling `dbListTables`, and this tells us that there is only one table in this database. To reference a table in the database so we can do things like select columns and filter rows, we use the `tbl` function from the `dbplyr` package:
+ We only get one table name returned from calling `dbListTables`, which tells us that there is only one table in this database. To reference a table in the database to do things like select columns and filter rows, we use the `tbl` function from the `dbplyr` package. The `dbplyr` package allows us to work with data stored in databases as if they were local data frames, which is useful because we can do a lot with big data sets without actually having to bring them onto our own computers!
```{r}
library(dbplyr)
@@ -267,7 +267,7 @@ lang_db
Although it looks like we just got a data frame from the database, we didn't! It's a *reference*, showing us data that is still in the SQLite database (note the first two lines of the output).
It does this because databases are often more efficient at selecting, filtering and joining large data sets than R. And typically, the database will not even be
stored on your computer, but rather a more powerful machine somewhere on the web. So R is lazy and waits to bring this data into memory until you explicitly tell
it to do so using the `collect` function from the `dbplyr` library.
Here we will filter for only rows in the Aboriginal languages category according to the 2016 Canada Census, and then use `collect` to finally bring this data into R as a data frame.
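The filter-then-collect workflow just described can be sketched end to end. This is a self-contained toy version, with an invented in-memory table standing in for the book's census data:

```r
# A self-contained sketch of the lazy filter / collect workflow.
# The in-memory table "lang" and its rows are invented for illustration.
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang", data.frame(
  category = c("Aboriginal languages", "Official languages"),
  language = c("Inuktitut", "French")
))

lang_db <- tbl(conn, "lang")                   # a lazy reference, not a data frame
aboriginal_lang_db <- lang_db %>%
  filter(category == "Aboriginal languages")   # still lazy: runs as SQL in the database
aboriginal_lang <- collect(aboriginal_lang_db) # now the rows actually come into R
aboriginal_lang
```

Everything before `collect` is translated to SQL and executed by the database; only the final result crosses into R's memory.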
@@ -298,10 +298,8 @@ tail(aboriginal_lang_db)
```
## Error: tail() is not supported by sql sources
```
- Additionally, some operations will
- not work to extract columns or single values from the reference given by the `tbl` function. Thus, once you have finished your data wrangling of the `tbl` database
- reference object, it is advisable to then bring it into your local machine's memory using `collect` as a data frame.
+ Additionally, some operations will not work to extract columns or single values from the reference given by the `tbl` function. Thus, once you have finished your data wrangling of the `tbl` database reference object, it is advisable to bring it into your local machine's memory using `collect` as a data frame. Warning: databases are often very large! Reading the object into your local machine's memory may produce an error or take a long time to run, so be careful if you plan to do this!
### Reading data from a PostgreSQL database
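Connecting to a PostgreSQL database looks almost identical to SQLite; only the driver and the connection details change. In this hedged sketch, every value (`dbname`, `host`, `port`, `user`, `password`) is a placeholder, not a real server or credential:

```r
# Sketch of a PostgreSQL connection with DBI + RPostgres.
# All connection values below are placeholders for illustration only.
library(DBI)
library(RPostgres)

conn_pg <- dbConnect(RPostgres::Postgres(),
                     dbname = "can_mov_db",   # placeholder database name
                     host = "localhost",      # placeholder host
                     port = 5432,             # default PostgreSQL port
                     user = "user",           # placeholder credentials
                     password = "password")

dbListTables(conn_pg)
```

Once connected, the same `tbl`, `filter`, and `collect` workflow applies unchanged.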
@@ -412,6 +410,20 @@ min(avg_rating_data)
We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie...
+ **Why should we bother with databases at all?**
+
+ Opening a database stored in a `.db` file involved a lot more effort than just opening a `.csv`, `.tsv`, or any of the other plain text or Excel formats. It was a bit of a pain to use a database in that setting, since we had to use `dbplyr` to translate `tidyverse`-like commands (`filter`, `select`, `head`, etc.) into SQL commands that the database understands. Not all `tidyverse` commands can currently be translated with SQLite databases. For example, we can compute a mean with an SQLite database, but can't easily compute a median. So you might be wondering: why should we use databases at all?
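To make that translation step concrete, `show_query` prints the SQL that `dbplyr` generates from tidyverse-style code. This toy example (with an invented in-memory table) computes a mean, one of the aggregations SQLite can translate:

```r
# Peek at the SQL that dbplyr generates for a tidyverse-style pipeline.
# The in-memory "ratings" table is invented for illustration.
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "ratings", data.frame(rating = c(1, 5, 9)))

avg_db <- tbl(conn, "ratings") %>%
  summarize(avg_rating = mean(rating, na.rm = TRUE))  # lazy: not computed yet

show_query(avg_db)      # prints the translated SQL (an AVG(...) query)
avg <- collect(avg_db)  # run it and bring the single-row result into R
avg
```

Trying the same pipeline with `median` instead of `mean` would fail on SQLite, because no SQL translation exists for it there.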
+ Databases are beneficial in a large-scale setting:
+
+ - they enable storing large data sets across multiple computers with automatic redundancy and backups
+ - they allow multiple users to access them simultaneously and remotely without conflicts and errors
+ - they provide mechanisms for ensuring data integrity and validating input
+ - they provide security to keep data safe
+
+ For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/). Can you imagine if Google stored all of the data from those queries in a single `.csv` file!? Chaos would ensue!
## Writing data from R to a `.csv` file
At the middle and end of a data analysis, we often want to write a data frame that has changed (either through filtering, selecting, mutating or summarizing) to a file
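A minimal sketch of this step, using `readr`'s `write_csv`; the data frame contents and file name here are invented for illustration:

```r
# Write a small (made-up) data frame to a .csv file with readr::write_csv.
library(readr)

aboriginal_lang <- data.frame(
  category = "Aboriginal languages",          # recycled across both rows
  language = c("Inuktitut", "Cree"),
  mother_tongue = c(100, 200)                 # invented counts, illustration only
)

write_csv(aboriginal_lang, "aboriginal_lang.csv")  # comma-separated, no row names
```

Unlike base R's `write.csv`, `write_csv` never writes row names, so the file round-trips cleanly through `read_csv`.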