01-reading.Rmd: 20 additions & 8 deletions
@@ -232,10 +232,10 @@ As with plain text files, you should always explore the data file before importi
## Reading data from a database
Another very common form of data storage to be read into R for data analysis is the relational database. There are many relational database management systems, such as
- SQLite, MySQL, PosgreSQL, Oracle, and many more. Almost all employ SQL (*structured query language*) to pull data from the database. Thankfully, you don't need to know SQL
+ [SQLite](https://www.sqlite.org/index.html), [MySQL](https://www.mysql.com/), [PostgreSQL](https://www.postgresql.org/), [Oracle](https://www.oracle.com/ca-en/index.html), and many more. These different relational database management systems each have their own advantages and limitations. Almost all employ SQL (*structured query language*) to pull data from the database. Thankfully, you don't need to know SQL
to analyze data from a database;
several packages have been written
- that allows R to connect to relational databases and use the R programming language as the front end (what the user types in) to pull data from them. In this book, we will
+ that allow R to connect to relational databases and use the R programming language as the front end (what the user types in) to pull data from them. In this book, we will
give examples of how to do this using R with SQLite and PostgreSQL databases.
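The connection step described above can be sketched as follows. Since the book's `.db` file isn't available here, this hypothetical snippet first creates a small in-memory SQLite database (the table name `lang` and its contents are invented for illustration) and then lists its tables with `dbListTables`:

```r
# A hedged sketch of connecting to an SQLite database with DBI + RSQLite.
# The book's .db file isn't available here, so we create a small in-memory
# database first (the table "lang" and its contents are made up).
library(DBI)
library(RSQLite)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang", data.frame(
  category = "Aboriginal languages",
  language = "Inuktitut"
))

dbListTables(conn)  # returns the names of all tables: here, just "lang"
```

To open an on-disk database instead, replace `":memory:"` with the path to the `.db` file.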
- We only get one table name returned form calling `dbListTables`, and this tells us that there is only one table in this database. To reference a table in the database so we can do things like select columns and filter rows, we use the `tbl` function from the `dbplyr` package:
+ We only get one table name returned from calling `dbListTables`, which tells us that there is only one table in this database. To reference a table in the database to do things like select columns and filter rows, we use the `tbl` function from the `dbplyr` package. The `dbplyr` package allows us to work with data stored in databases as if they were local data frames, which is useful because we can do a lot with big data sets without actually having to bring them onto our own computers!
```{r}
library(dbplyr)
@@ -267,7 +267,7 @@ lang_db
Although it looks like we just got a data frame from the database, we didn't! It's a *reference*, showing us data that is still in the SQLite database (note the first two lines of the output).
It does this because databases are often more efficient at selecting, filtering and joining large data sets than R. And typically, the database will not even be
stored on your computer, but rather a more powerful machine somewhere on the web. So R is lazy and waits to bring this data into memory until you explicitly tell
it to do so using the `collect` function from the `dbplyr` library.
Here we will filter for only rows in the Aboriginal languages category according to the 2016 Canada Census, and then use `collect` to finally bring this data into R as a data frame.
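The filter-then-collect workflow just described can be sketched end to end. This is a self-contained toy version, with an invented in-memory table standing in for the book's census data:

```r
# A self-contained sketch of the lazy filter / collect workflow.
# The in-memory table "lang" and its rows are invented for illustration.
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang", data.frame(
  category = c("Aboriginal languages", "Official languages"),
  language = c("Inuktitut", "French")
))

lang_db <- tbl(conn, "lang")                   # a lazy reference, not a data frame
aboriginal_lang_db <- lang_db %>%
  filter(category == "Aboriginal languages")   # still lazy: runs as SQL in the database
aboriginal_lang <- collect(aboriginal_lang_db) # now the rows actually come into R
aboriginal_lang
```

Everything before `collect` is translated to SQL and executed by the database; only the final result crosses into R's memory.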
@@ -298,10 +298,8 @@ tail(aboriginal_lang_db)
```
## Error: tail() is not supported by sql sources
```
- Additionally, some operations will
- not work to extract columns or single values from the reference given by the `tbl` function. Thus, once you have finished your data wrangling of the `tbl` database
- reference object, it is advisable to then bring it into your local machine's memory using `collect` as a data frame.
+ Additionally, some operations will not work to extract columns or single values from the reference given by the `tbl` function. Thus, once you have finished your data wrangling of the `tbl` database reference object, it is advisable to bring it into your local machine's memory using `collect` as a data frame. Warning: databases are often very large! Reading the object into your local machine's memory may produce an error or take a long time to run, so be careful if you plan to do this!
### Reading data from a PostgreSQL database
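Connecting to a PostgreSQL database looks almost identical to SQLite; only the driver and the connection details change. In this hedged sketch, every value (`dbname`, `host`, `port`, `user`, `password`) is a placeholder, not a real server or credential:

```r
# Sketch of a PostgreSQL connection with DBI + RPostgres.
# All connection values below are placeholders for illustration only.
library(DBI)
library(RPostgres)

conn_pg <- dbConnect(RPostgres::Postgres(),
                     dbname = "can_mov_db",   # placeholder database name
                     host = "localhost",      # placeholder host
                     port = 5432,             # default PostgreSQL port
                     user = "user",           # placeholder credentials
                     password = "password")

dbListTables(conn_pg)
```

Once connected, the same `tbl`, `filter`, and `collect` workflow applies unchanged.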
@@ -412,6 +410,20 @@ min(avg_rating_data)
We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie...
+ **Why should we bother with databases at all?**
+
+ Opening a database stored in a `.db` file involved a lot more effort than just opening a `.csv`, `.tsv`, or any of the other plain text or Excel formats. It was a bit of a pain to use a database in that setting, since we had to use `dbplyr` to translate `tidyverse`-like commands (`filter`, `select`, `head`, etc.) into SQL commands that the database understands. Not all `tidyverse` commands can currently be translated with SQLite databases. For example, we can compute a mean with an SQLite database, but can't easily compute a median. So you might be wondering: why should we use databases at all?
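To make that translation step concrete, `show_query` prints the SQL that `dbplyr` generates from tidyverse-style code. This toy example (with an invented in-memory table) computes a mean, one of the aggregations SQLite can translate:

```r
# Peek at the SQL that dbplyr generates for a tidyverse-style pipeline.
# The in-memory "ratings" table is invented for illustration.
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "ratings", data.frame(rating = c(1, 5, 9)))

avg_db <- tbl(conn, "ratings") %>%
  summarize(avg_rating = mean(rating, na.rm = TRUE))  # lazy: not computed yet

show_query(avg_db)      # prints the translated SQL (an AVG(...) query)
avg <- collect(avg_db)  # run it and bring the single-row result into R
avg
```

Trying the same pipeline with `median` instead of `mean` would fail on SQLite, because no SQL translation exists for it there.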
+ Databases are beneficial in a large-scale setting:
+
+ - they enable storing large data sets across multiple computers with automatic redundancy and backups
+ - they allow multiple users to access them simultaneously and remotely without conflicts and errors
+ - they provide mechanisms for ensuring data integrity and validating input
+ - they provide security to keep data safe
+
+ For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/). Can you imagine if Google stored all of the data from those queries in a single `.csv` file!? Chaos would ensue!
## Writing data from R to a `.csv` file
At the middle and end of a data analysis, we often want to write a data frame that has changed (either through filtering, selecting, mutating or summarizing) to a file
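A minimal sketch of this step, using `readr`'s `write_csv`; the data frame contents and file name here are invented for illustration:

```r
# Write a small (made-up) data frame to a .csv file with readr::write_csv.
library(readr)

aboriginal_lang <- data.frame(
  category = "Aboriginal languages",          # recycled across both rows
  language = c("Inuktitut", "Cree"),
  mother_tongue = c(100, 200)                 # invented counts, illustration only
)

write_csv(aboriginal_lang, "aboriginal_lang.csv")  # comma-separated, no row names
```

Unlike base R's `write.csv`, `write_csv` never writes row names, so the file round-trips cleanly through `read_csv`.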