@@ -481,11 +481,9 @@ relational databases and use the R programming language
to obtain data. In this book, we will give examples of how to do this
using R with SQLite and PostgreSQL databases.

- ### Connecting to a database
+ ### Reading data from a SQLite database

- #### Reading data from a SQLite database
-
- SQLite \index{database!SQLite} is probably the simplest relational database
+ SQLite \index{database!SQLite} is probably the simplest relational database system
that one can use in combination with R. SQLite databases are self-contained and
usually stored and accessed locally on one computer. Data is usually stored in
a file with a `.db` extension. Similar to Excel files, these are not plain text
@@ -495,50 +493,50 @@ The first thing you need to do to read data into R from a database is to
connect to the database. We do that using the `dbConnect` function from the
`DBI` (database interface) package. \index{database!connect} This does not read
in the data, but simply tells R where the database is and opens up a
- communication channel.
+ communication channel that R can use to send SQL commands to the database.

```{r}
library(DBI)

- con_lang_data <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
+ conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
```

- Often relational databases have many tables; thus, anytime
- you want to access data from a
- relational database, you need to know the table names. You can get the names of
+ Often relational databases have many tables; thus, in order to retrieve
+ data from a database, you need to know the name of the table
+ in which the data is stored. You can get the names of
all the tables in the database using the `dbListTables` \index{database!tables}
function:

```{r}
- tables <- dbListTables(con_lang_data)
+ tables <- dbListTables(conn_lang_data)
tables
```
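
To see this connect-and-list workflow end to end without the book's `data/can_lang.db` file, here is a minimal, self-contained sketch using a throwaway in-memory SQLite database (the `lang` table and its single row are invented for illustration):

```r
library(DBI)
library(RSQLite)

# connect to a temporary in-memory SQLite database (no .db file needed)
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# create a tiny illustrative table so there is something to list
dbWriteTable(conn, "lang", data.frame(category = "Official languages",
                                      language = "English"))

tables <- dbListTables(conn)  # character vector of table names
tables
dbDisconnect(conn)
```

Swapping `":memory:"` for a file path such as `"data/can_lang.db"` is the only change needed to work with an on-disk database instead.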

- We only get one table name returned from calling `dbListTables`, which tells us
+ The `dbListTables` function returned only one name, which tells us
that there is only one table in this database. To reference a table in the
- database to do things like select columns and filter rows, we use the `tbl`
- function \index{database!tbl} from the `dbplyr` package. The package `dbplyr`
- \index{dbplyr|see{database}}\index{database!dbplyr} allows us to work with data
- stored in databases as if they were local data frames, which is useful because
- we can do a lot with big data sets without actually having to bring these vast
- amounts of data into your computer!
+ database (so that we can perform operations like selecting columns and filtering rows), we
+ use the `tbl` function \index{database!tbl} from the `dbplyr` package. The object returned
+ by the `tbl` function \index{dbplyr|see{database}}\index{database!dbplyr} allows us to work with data
+ stored in databases as if they were just regular data frames; but secretly, behind
+ the scenes, `dbplyr` is turning your function calls (e.g., `select` and `filter`)
+ into SQL queries!

```{r}
library(dbplyr)

- lang_db <- tbl(con_lang_data, "lang")
+ lang_db <- tbl(conn_lang_data, "lang")
lang_db
```

Although it looks like we just got a data frame from the database, we didn't!
- It's a *reference*, showing us data that is still in the SQLite database. It
- does this because databases are often more efficient at selecting, filtering
- and joining large data sets than R. And typically, the database will not even
- be stored on your computer but rather a more powerful machine somewhere on the
+ It's a *reference*; the data is still stored only in the SQLite database. The
+ `dbplyr` package works this way because databases are often more efficient at selecting, filtering
+ and joining large data sets than R. And typically the database will not even
+ be stored on your computer, but rather a more powerful machine somewhere on the
web. So R is lazy and waits to bring this data into memory until you explicitly
- tell it to using the `collect` \index{database!collect} function from the
- `dbplyr` package. Figure \@ref(fig:01-ref-vs-tibble) highlights the difference
- between a `tibble` object in R and the output we just got. Notice in the table
+ tell it to using the `collect` \index{database!collect} function.
+ Figure \@ref(fig:01-ref-vs-tibble) highlights the difference
+ between a `tibble` object in R and the output we just created. Notice in the table
on the right, the first two lines of the output indicate the source is SQL. The
last line doesn't show how many rows there are (R is trying to avoid performing
expensive query operations), whereas the output for the `tibble` object does.
@@ -548,40 +546,50 @@ knitr::include_graphics("img/ref_vs_tibble.jpeg")
```

We can look at the SQL commands that are sent to the database when we write
- `tbl(con_lang_data, "lang")` in R with the `show_query` function from the
+ `tbl(conn_lang_data, "lang")` in R with the `show_query` function from the
`dbplyr` package. \index{database!show\_query}

```{r}
- show_query(tbl(con_lang_data, "lang"))
+ show_query(tbl(conn_lang_data, "lang"))
```
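
As a self-contained illustration of that translation (using an invented in-memory table rather than the book's database), we can watch `dbplyr` turn a `filter` call into a SQL `WHERE` clause:

```r
library(DBI)
library(dplyr)
library(dbplyr)

# build a small illustrative table in an in-memory SQLite database
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang",
             data.frame(category = c("Aboriginal languages", "Official languages"),
                        mother_tongue = c(590L, 19460850L)))

lang_db <- tbl(conn, "lang")
query <- filter(lang_db, category == "Aboriginal languages")

show_query(query)  # prints the generated SELECT ... WHERE statement
sql_text <- as.character(remote_query(query))  # the same SQL as a string
dbDisconnect(conn)
```

Nothing is computed until the query runs; `filter` here only builds up SQL behind the scenes.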

- From the output above, we can see the SQL code sent to the database. When we
- write `tbl(con_lang_data, "lang")` in R, in the background, the function is
- translating the R code into SQL, asking the database, and then translating the
- response back to us. So instead of us needing to know the SQL code ourselves
- and switching back and forth between R and SQL code, the `dbplyr` package does
- that for us.
+ The output above shows the SQL code that is sent to the database. When we
+ write `tbl(conn_lang_data, "lang")` in R, in the background, the function is
+ translating the R code into SQL, sending that SQL to the database, and then translating the
+ response for us. So `dbplyr` does all the hard work of translating from R to SQL and back for us;
+ we can just stick with R!

- Now we will filter for only rows in the Aboriginal languages category according
- to the 2016 Canada Census, and then use `collect` to finally bring this data
- into R as a data frame. \index{filter}
+ With our `lang_db` table reference for the 2016 Canadian Census data in hand, we
+ can mostly continue onward as if it were a regular data frame. For example,
+ we can use the `filter` function
+ to obtain only certain rows. Below we filter the data to include only Aboriginal languages.

```{r}
aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages")
aboriginal_lang_db
```

+ Above you can again see the hints that this data is not actually stored in R yet:
+ the source is a `lazy query [?? x 6]` and the output says `... with more rows` at the end
+ (both indicating that R does not know how many rows there are in total!),
+ and a database type `sqlite 3.36.0` is listed.
+ In order to actually retrieve this data in R as a data frame,
+ we use the `collect` function. \index{filter}
+ Below you will see that after running `collect`, R knows that the retrieved
+ data has 67 rows, and there is no database listed any more.
+

```{r}
aboriginal_lang_data <- collect(aboriginal_lang_db)
aboriginal_lang_data
```
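
To make the lazy-reference-versus-collected-data-frame distinction concrete, here is a self-contained sketch built on an invented in-memory table (the column names and values are illustrative, not the real 2016 census data):

```r
library(DBI)
library(dplyr)
library(dbplyr)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang",
             data.frame(category = c("Aboriginal languages", "Official languages",
                                     "Aboriginal languages"),
                        language = c("Plains Cree", "English", "Inuktitut")))

lang_db <- tbl(conn, "lang")                              # lazy database reference
ab_db   <- filter(lang_db, category == "Aboriginal languages")

n_lazy <- nrow(ab_db)        # NA: R has not counted the rows still in the database
ab_df  <- collect(ab_db)     # now the matching rows are pulled into R
n_real <- nrow(ab_df)        # 2: a real data frame knows its size
dbDisconnect(conn)
```

Only after `collect` does R hold the rows in memory and report an actual row count.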

- Why bother to use the `collect` function? The data looks pretty similar in both
- outputs shown above. And `dbplyr` provides lots of functions similar to
- `filter` that you can use to directly feed the database reference (what `tbl`
- gives you) into downstream analysis functions (e.g., `ggplot2` for data
- visualization and `lm` for linear regression modeling). However, this does not
- work in *every* case; look what happens when we try to use `nrow` to count rows
+ Aside from knowing the number of rows, the data looks pretty similar in both
+ outputs shown above. And `dbplyr` provides many more functions (not just `filter`)
+ that you can use to directly feed the database reference (`lang_db`) into
+ downstream analysis functions (e.g., `ggplot2` for data visualization).
+ But `dbplyr` does not provide *every* function that we need for analysis;
+ we do eventually need to call `collect`.
+ For example, look what happens when we try to use `nrow` to count rows
in a data frame: \index{nrow}

```{r}
@@ -600,16 +608,17 @@ tail(aboriginal_lang_db)
Additionally, some operations will not work to extract columns or single values
from the reference given by the `tbl` function. Thus, once you have finished
your data wrangling of the `tbl` database reference object, it is advisable to
- bring it into your local machine's memory using `collect` as a data frame.
- Usually, databases are very big! Reading the object into your local machine may
- give an error or take a lot of time to run so be careful if you plan to do
- this. This is one reason we may want to filter rows or select columns first
- before reading it in!
+ bring it into R as a data frame using `collect`.
+ But be very careful using `collect`: databases are often *very* big,
+ and reading an entire table into R might take a long time to run or even possibly
+ crash your machine. So make sure you use `filter` and `select` on the database table
+ to reduce the data to a reasonable size before using `collect` to read it into R!

- #### Reading data from a PostgreSQL database
+ ### Reading data from a PostgreSQL database

PostgreSQL (also called Postgres) \index{database!PostgreSQL} is a very popular
- and open-source option for relational database software. Unlike SQLite,
+ and open-source option for relational database software.
+ Unlike SQLite,
PostgreSQL uses a client–server database engine, as it was designed to be used
and accessed on a network. This means that you have to provide more information
to R when connecting to Postgres databases. The additional information that you
@@ -630,19 +639,17 @@ be able to connect to a database using this information.

```{r, eval = FALSE}
library(RPostgres)
- can_mov_db_con <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
+ conn_mov_data <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
                   host = "fakeserver.stat.ubc.ca", port = 5432,
                   user = "user0001", password = "abc123")
```

- ### Interacting with a database
-
After opening the connection, everything looks and behaves almost identically
to when we were using an SQLite database in R. For example, we can again use
`dbListTables` to find out what tables are in the `can_mov_db` database:

```{r, eval = FALSE}
- dbListTables(can_mov_db_con)
+ dbListTables(conn_mov_data)
```

```
@@ -655,13 +662,13 @@ We see that there are 10 tables in this database. Let's first look at the
database:

```{r, eval = FALSE}
- ratings_db <- tbl(can_mov_db_con, "ratings")
+ ratings_db <- tbl(conn_mov_data, "ratings")
ratings_db
```

```
# Source:   table<ratings> [?? x 3]
- # Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db]
+ # Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db]
  title                 average_rating num_votes
  <chr>                          <dbl>     <int>
1 The Grand Seduction              6.6       150
@@ -688,7 +695,7 @@ avg_rating_db

```
# Source:   lazy query [?? x 1]
- # Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db]
+ # Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db]
  average_rating
           <dbl>
1            6.6
@@ -740,16 +747,18 @@ that setting since we had to use `dbplyr` to translate `tidyverse`-like
commands (`filter`, `select`, `head`, etc.) into SQL commands that the database
understands. Not all `tidyverse` commands can currently be translated with
SQLite databases. For example, we can compute a mean with an SQLite database
- but can't easily compute a median. So you might be wondering why should we use
+ but can't easily compute a median. So you might be wondering: why should we use
databases at all?

Databases are beneficial in a large-scale setting:

- - they enable storing large data sets across multiple computers with automatic redundancy and backups
- - they allow multiple users to access them simultaneously and remotely without conflicts and errors
+ - they enable storing large data sets across multiple computers with backups
- they provide mechanisms for ensuring data integrity and validating input
- - they provide security to keep data safe
- For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/). Can you imagine if Google stored all of the data from those queries in a single `.csv file`!? Chaos would ensue!
+ - they provide security and data access control
+ - they allow multiple users to access data simultaneously and remotely without conflicts and errors.
+ For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/).
+ Can you imagine if Google stored all of the data from those searches in a single `.csv
+ file`!? Chaos would ensue!

## Writing data from R to a `.csv` file