Commit 9505df9

Melissa Lee authored and committed
addressing Trevor's comments in issue #24
1 parent 8f554a6 commit 9505df9

File tree

1 file changed

+20
-8
lines changed


01-reading.Rmd

Lines changed: 20 additions & 8 deletions
@@ -232,10 +232,10 @@ As with plain text files, you should always explore the data file before importi
 ## Reading data from a database
 
 Another very common form of data storage to be read into R for data analysis is the relational database. There are many relational database management systems, such as
-SQLite, MySQL, PosgreSQL, Oracle, and many more. Almost all employ SQL (*structured query language*) to pull data from the database. Thankfully, you don't need to know SQL
+[SQLite](https://www.sqlite.org/index.html), [MySQL](https://www.mysql.com/), [PostgreSQL](https://www.postgresql.org/), [Oracle](https://www.oracle.com/ca-en/index.html), and many more. These relational database management systems each have their own advantages and limitations. Almost all employ SQL (*structured query language*) to pull data from the database. Thankfully, you don't need to know SQL
 to analyze data from a database;
 several packages have been written
-that allows R to connect to relational databases and use the R programming language as the front end (what the user types in) to pull data from them. In this book, we will
+that allow R to connect to relational databases and use the R programming language as the front end (what the user types in) to pull data from them. In this book, we will
 give examples of how to do this using R with SQLite and PostgreSQL databases.
 
 ### Reading data from a SQLite database
@@ -256,7 +256,7 @@ tables <- dbListTables(con_lang_data)
 tables
 ```
 
-We only get one table name returned form calling `dbListTables`, and this tells us that there is only one table in this database. To reference a table in the database so we can do things like select columns and filter rows, we use the `tbl` function from the `dbplyr` package:
+We only get one table name returned from calling `dbListTables`, which tells us that there is only one table in this database. To reference a table in the database so we can do things like select columns and filter rows, we use the `tbl` function from the `dbplyr` package. The `dbplyr` package allows us to work with data stored in databases as if they were local data frames, which is useful because we can do a lot with big datasets without having to bring those vast amounts of data onto our own computers!
 
 ```{r}
 library(dbplyr)
@@ -267,7 +267,7 @@ lang_db
 Although it looks like we just got a data frame from the database, we didn't! It's a *reference*, showing us data that is still in the SQLite database (note the first two lines of the output).
 It does this because databases are often more efficient at selecting, filtering and joining large data sets than R. And typically, the database will not even be
 stored on your computer, but rather a more powerful machine somewhere on the web. So R is lazy and waits to bring this data into memory until you explicitly tell
-it to do so using the `collect` function from the `dbplyr` library.
+it to do so using the `collect` function from the `dbplyr` library.
 
 Here we will filter for only rows in the Aboriginal languages category according to the 2016 Canada Census, and then use `collect` to finally bring this data into R as a data frame.
 
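To make the lazy-reference behaviour described in this hunk concrete, here is a minimal sketch using a throwaway in-memory SQLite database. The `lang` table, its columns, and its values are made up for illustration; they are not the chapter's actual census data file.

```r
library(DBI)       # database connection interface
library(RSQLite)   # SQLite driver
library(dplyr)     # filter(), collect()
library(dbplyr)    # lets dplyr verbs work on database tables

# Toy in-memory database standing in for the chapter's .db file
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "lang",
             data.frame(category      = c("Aboriginal languages", "Official languages"),
                        mother_tongue = c(590, 19460)))

lang_db <- tbl(con, "lang")   # a lazy *reference*, not a data frame

# Nothing is pulled into R until collect() is called
aboriginal_lang <- lang_db |>
  filter(category == "Aboriginal languages") |>
  collect()

class(aboriginal_lang)        # now an ordinary local data frame
dbDisconnect(con)
```

The pipeline before `collect()` is translated to SQL and run inside the database; only the final filtered result crosses over into R's memory.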
@@ -298,10 +298,8 @@ tail(aboriginal_lang_db)
 ```
 ## Error: tail() is not supported by sql sources
 ```
-
-Additionally, some operations will
-not work to extract columns or single values from the reference given by the `tbl` function. Thus, once you have finished your data wrangling of the `tbl` database
-reference object, it is advisable to then bring it into your local machine's memory using `collect` as a data frame.
+Additionally, some operations will not work to extract columns or single values from the reference given by the `tbl` function. Thus, once you have finished wrangling the `tbl` database reference object, it is advisable to bring it into your local machine's memory as a data frame using `collect`. Warning: databases are often very big! Collecting a large object onto your local machine may produce an error or take a long time to run, so be careful if you plan to do this!
+
 
 ### Reading data from a PostgreSQL database
 
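The `tail()` failure shown in this hunk can be reproduced with a small self-contained sketch (toy in-memory table, not the chapter's data):

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "lang", data.frame(category = c("a", "b", "c")))
lang_db <- tbl(con, "lang")

# tail() cannot be translated to SQL (a lazy reference does not know its
# total row count up front), so calling it on the reference errors...
on_reference <- tryCatch(tail(lang_db), error = function(e) "not supported")

# ...but it works as usual once collect() brings the data into R
local_tail <- tail(collect(lang_db))

dbDisconnect(con)
```

This is why the chapter recommends finishing the wrangling on the reference first and calling `collect` only at the end.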
@@ -412,6 +410,20 @@ min(avg_rating_data)
 
 We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie...
 
+
+**Why should we bother with databases at all?**
+
+Opening a database stored in a .db file involved a lot more effort than just opening a .csv, .tsv, or any of the other plain text or Excel formats. It was a bit of a pain to use a database in that setting, since we had to use `dbplyr` to translate `tidyverse`-like commands (`filter`, `select`, `head`, etc.) into SQL commands that the database understands. And not all `tidyverse` commands can currently be translated for SQLite databases: for example, we can compute a mean with an SQLite database, but we can't easily compute a median. So you might be wondering: why should we use databases at all?
+
+Databases are beneficial in a large-scale setting:
+
+- they enable storing large data sets across multiple computers with automatic redundancy and backups
+- they allow multiple users to access them simultaneously and remotely without conflicts and errors
+- they provide mechanisms for ensuring data integrity and validating input
+- they provide security to keep data safe
+
+For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/). Can you imagine if Google stored all of the data from those queries in a single .csv file!? Chaos would ensue!
+
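The mean-versus-median translation gap mentioned above can be sketched as follows (toy `ratings` table invented for illustration; it is not the chapter's movie database):

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "ratings", data.frame(rating = c(1, 3, 5)))
ratings_db <- tbl(con, "ratings")

# dbplyr translates the pipeline into SQL; show_query() displays the
# generated query (mean() becomes SQL's AVG())
mean_query <- ratings_db |> summarize(avg = mean(rating, na.rm = TRUE))
show_query(mean_query)
avg_local <- collect(mean_query)   # mean works fine

# median() has no translation for SQLite, so evaluating it fails
med <- tryCatch(
  ratings_db |> summarize(med = median(rating, na.rm = TRUE)) |> collect(),
  error = function(e) "no SQLite translation for median()"
)

dbDisconnect(con)
```

So some summaries have to wait until after `collect`, when the data is an ordinary R data frame and every `tidyverse` verb is available.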
 ## Writing data from R to a `.csv` file
 
 At the middle and end of a data analysis, we often want to write a data frame that has changed (either through filtering, selecting, mutating or summarizing) to a file
