
Commit 1e0d83b

database reorganization
1 parent 86a8f38 commit 1e0d83b

1 file changed: +69, -60 lines (reading.Rmd)


reading.Rmd

@@ -481,11 +481,9 @@ relational databases and use the R programming language
 to obtain data. In this book, we will give examples of how to do this
 using R with SQLite and PostgreSQL databases.
 
-### Connecting to a database
+### Reading data from a SQLite database
 
-#### Reading data from a SQLite database
-
-SQLite \index{database!SQLite} is probably the simplest relational database
+SQLite \index{database!SQLite} is probably the simplest relational database system
 that one can use in combination with R. SQLite databases are self-contained and
 usually stored and accessed locally on one computer. Data is usually stored in
 a file with a `.db` extension. Similar to Excel files, these are not plain text
@@ -495,50 +493,50 @@ The first thing you need to do to read data into R from a database is to
 connect to the database. We do that using the `dbConnect` function from the
 `DBI` (database interface) package. \index{database!connect} This does not read
 in the data, but simply tells R where the database is and opens up a
-communication channel.
+communication channel that R can use to send SQL commands to the database.
 
 ```{r}
 library(DBI)
 
-con_lang_data <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
+conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
 ```
 
-Often relational databases have many tables; thus, anytime
-you want to access data from a
-relational database, you need to know the table names. You can get the names of
+Often relational databases have many tables; thus, in order to retrieve
+data from a database, you need to know the name of the table
+in which the data is stored. You can get the names of
 all the tables in the database using the `dbListTables` \index{database!tables}
 function:
 
 ```{r}
-tables <- dbListTables(con_lang_data)
+tables <- dbListTables(conn_lang_data)
 tables
 ```
 
-We only get one table name returned from calling `dbListTables`, which tells us
+The `dbListTables` function returned only one name, which tells us
 that there is only one table in this database. To reference a table in the
-database to do things like select columns and filter rows, we use the `tbl`
-function \index{database!tbl} from the `dbplyr` package. The package `dbplyr`
-\index{dbplyr|see{database}}\index{database!dbplyr} allows us to work with data
-stored in databases as if they were local data frames, which is useful because
-we can do a lot with big data sets without actually having to bring these vast
-amounts of data into your computer!
+database (so that we can perform operations like selecting columns and filtering rows), we
+use the `tbl` function \index{database!tbl} from the `dbplyr` package. The object returned
+by the `tbl` function \index{dbplyr|see{database}}\index{database!dbplyr} allows us to work with data
+stored in databases as if they were just regular data frames; but secretly, behind
+the scenes, `dbplyr` is turning your function calls (e.g., `select` and `filter`)
+into SQL queries!
 
 ```{r}
 library(dbplyr)
 
-lang_db <- tbl(con_lang_data, "lang")
+lang_db <- tbl(conn_lang_data, "lang")
 lang_db
 ```
 
 Although it looks like we just got a data frame from the database, we didn't!
-It's a *reference*, showing us data that is still in the SQLite database. It
-does this because databases are often more efficient at selecting, filtering
-and joining large data sets than R. And typically, the database will not even
-be stored on your computer but rather a more powerful machine somewhere on the
+It's a *reference*; the data is still stored only in the SQLite database. The
+`dbplyr` package works this way because databases are often more efficient at selecting, filtering
+and joining large data sets than R. And typically the database will not even
+be stored on your computer, but rather a more powerful machine somewhere on the
 web. So R is lazy and waits to bring this data into memory until you explicitly
-tell it to using the `collect` \index{database!collect} function from the
-`dbplyr` package. Figure \@ref(fig:01-ref-vs-tibble) highlights the difference
-between a `tibble` object in R and the output we just got. Notice in the table
+tell it to using the `collect` \index{database!collect} function.
+Figure \@ref(fig:01-ref-vs-tibble) highlights the difference
+between a `tibble` object in R and the output we just created. Notice in the table
 on the right, the first two lines of the output indicate the source is SQL. The
 last line doesn't show how many rows there are (R is trying to avoid performing
 expensive query operations), whereas the output for the `tibble` object does.
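The lazy-reference behavior described in the changed text above is easy to try without the book's `data/can_lang.db` file. The chunk below is an editor's sketch (not part of the commit): it assumes the `DBI`, `RSQLite`, and `dbplyr` packages are installed, and the `conn_demo` connection and `demo` table are hypothetical names, used only to build a throwaway in-memory SQLite database.

```{r}
library(DBI)
library(dbplyr)

# A throwaway in-memory SQLite database with a small hypothetical table.
conn_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn_demo, "demo", data.frame(x = 1:3, y = c("a", "b", "c")))

# `tbl` returns a lazy reference to the table, not a data frame ...
demo_db <- tbl(conn_demo, "demo")
demo_db

# ... so the rows only arrive in R's memory when we `collect` them.
demo_local <- collect(demo_db)

dbDisconnect(conn_demo)
```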
@@ -548,40 +546,50 @@ knitr::include_graphics("img/ref_vs_tibble.jpeg")
 ```
 
 We can look at the SQL commands that are sent to the database when we write
-`tbl(con_lang_data, "lang")` in R with the `show_query` function from the
+`tbl(conn_lang_data, "lang")` in R with the `show_query` function from the
 `dbplyr` package. \index{database!show\_query}
 
 ```{r}
-show_query(tbl(con_lang_data, "lang"))
+show_query(tbl(conn_lang_data, "lang"))
 ```
 
-From the output above, we can see the SQL code sent to the database. When we
-write `tbl(con_lang_data, "lang")` in R, in the background, the function is
-translating the R code into SQL, asking the database, and then translating the
-response back to us. So instead of us needing to know the SQL code ourselves
-and switching back and forth between R and SQL code, the `dbplyr` package does
-that for us.
+The output above shows the SQL code that is sent to the database. When we
+write `tbl(conn_lang_data, "lang")` in R, in the background, the function is
+translating the R code into SQL, sending that SQL to the database, and then translating the
+response for us. So `dbplyr` does all the hard work of translating from R to SQL and back for us;
+we can just stick with R!
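To make the translation concrete (an editor's sketch; the exact query text `dbplyr` emits varies between versions), a plain table reference corresponds to a `SELECT * FROM` query, and a `filter` call on the reference layers a `WHERE` clause on top, roughly:

```
SELECT *
FROM `lang`
WHERE (`category` = 'Aboriginal languages')
```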
 
-Now we will filter for only rows in the Aboriginal languages category according
-to the 2016 Canada Census, and then use `collect` to finally bring this data
-into R as a data frame. \index{filter}
+With our `lang_db` table reference for the 2016 Canadian Census data in hand, we
+can mostly continue onward as if it were a regular data frame. For example,
+we can use the `filter` function
+to obtain only certain rows. Below we filter the data to include only Aboriginal languages.
 
 ```{r}
 aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages")
 aboriginal_lang_db
 ```
 
+Above you can again see the hints that this data is not actually stored in R yet:
+the source is a `lazy query [?? x 6]` and the output says `... with more rows` at the end
+(both indicating that R does not know how many rows there are in total!),
+and a database type `sqlite 3.36.0` is listed.
+In order to actually retrieve this data in R as a data frame,
+we use the `collect` function. \index{filter}
+Below you will see that after running `collect`, R knows that the retrieved
+data has 67 rows, and there is no database listed any more.
+
 ```{r}
 aboriginal_lang_data <- collect(aboriginal_lang_db)
 aboriginal_lang_data
 ```
 
-Why bother to use the `collect` function? The data looks pretty similar in both
-outputs shown above. And `dbplyr` provides lots of functions similar to
-`filter` that you can use to directly feed the database reference (what `tbl`
-gives you) into downstream analysis functions (e.g., `ggplot2` for data
-visualization and `lm` for linear regression modeling). However, this does not
-work in *every* case; look what happens when we try to use `nrow` to count rows
+Aside from knowing the number of rows, the data looks pretty similar in both
+outputs shown above. And `dbplyr` provides many more functions (not just `filter`)
+that you can use to directly feed the database reference (`lang_db`) into
+downstream analysis functions (e.g., `ggplot2` for data visualization).
+But `dbplyr` does not provide *every* function that we need for analysis;
+we do eventually need to call `collect`.
+For example, look what happens when we try to use `nrow` to count rows
 in a data frame: \index{nrow}
 
 ```{r}
@@ -600,16 +608,17 @@ tail(aboriginal_lang_db)
 Additionally, some operations will not work to extract columns or single values
 from the reference given by the `tbl` function. Thus, once you have finished
 your data wrangling of the `tbl` database reference object, it is advisable to
-bring it into your local machine's memory using `collect` as a data frame.
-Usually, databases are very big! Reading the object into your local machine may
-give an error or take a lot of time to run so be careful if you plan to do
-this. This is one reason we may want to filter rows or select columns first
-before reading it in!
+bring it into R as a data frame using `collect`.
+But be very careful using `collect`: databases are often *very* big,
+and reading an entire table into R might take a long time to run or even possibly
+crash your machine. So make sure you use `filter` and `select` on the database table
+to reduce the data to a reasonable size before using `collect` to read it into R!
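One housekeeping step this excerpt does not show (an editor's note, not part of the commit): when you are completely finished with a database, it is good practice to close the connection. A minimal sketch using the standard `DBI` function `dbDisconnect`, assuming the `conn_lang_data` connection opened earlier is still in scope:

```{r, eval = FALSE}
# Close the SQLite connection opened earlier with dbConnect.
# dbDisconnect is part of the DBI package loaded above.
dbDisconnect(conn_lang_data)
```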
 
-#### Reading data from a PostgreSQL database
+### Reading data from a PostgreSQL database
 
 PostgreSQL (also called Postgres) \index{database!PostgreSQL} is a very popular
-and open-source option for relational database software. Unlike SQLite,
+and open-source option for relational database software.
+Unlike SQLite,
 PostgreSQL uses a client–server database engine, as it was designed to be used
 and accessed on a network. This means that you have to provide more information
 to R when connecting to Postgres databases. The additional information that you
@@ -630,19 +639,17 @@ be able to connect to a database using this information.
 
 ```{r, eval = FALSE}
 library(RPostgres)
-can_mov_db_con <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
+conn_mov_data <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
                             host = "fakeserver.stat.ubc.ca", port = 5432,
                             user = "user0001", password = "abc123")
 ```
 
-### Interacting with a database
-
 After opening the connection, everything looks and behaves almost identically
 to when we were using an SQLite database in R. For example, we can again use
 `dbListTables` to find out what tables are in the `can_mov_db` database:
 
 ```{r, eval = FALSE}
-dbListTables(can_mov_db_con)
+dbListTables(conn_mov_data)
 ```
 
 ```
@@ -655,13 +662,13 @@ We see that there are 10 tables in this database. Let's first look at the
 database:
 
 ```{r, eval = FALSE}
-ratings_db <- tbl(can_mov_db_con, "ratings")
+ratings_db <- tbl(conn_mov_data, "ratings")
 ratings_db
 ```
 
 ```
 # Source:   table<ratings> [?? x 3]
-# Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db]
+# Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db]
   title               average_rating num_votes
   <chr>                        <dbl>     <int>
 1 The Grand Seduction            6.6       150
@@ -688,7 +695,7 @@ avg_rating_db
 
 ```
 # Source:   lazy query [?? x 1]
-# Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db]
+# Database: postgres [user0001@fakeserver.stat.ubc.ca:5432/can_mov_db]
   average_rating
            <dbl>
 1            6.6
@@ -740,16 +747,18 @@ that setting since we had to use `dbplyr` to translate `tidyverse`-like
 commands (`filter`, `select`, `head`, etc.) into SQL commands that the database
 understands. Not all `tidyverse` commands can currently be translated with
 SQLite databases. For example, we can compute a mean with an SQLite database
-but can't easily compute a median. So you might be wondering why should we use
+but can't easily compute a median. So you might be wondering: why should we use
 databases at all?
 
 Databases are beneficial in a large-scale setting:
 
 - they enable storing large data sets across multiple computers with automatic redundancy and backups
-- they allow multiple users to access them simultaneously and remotely without conflicts and errors
 - they provide mechanisms for ensuring data integrity and validating input
-- they provide security to keep data safe
-For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/). Can you imagine if Google stored all of the data from those queries in a single `.csv file`!? Chaos would ensue!
+- they provide security mechanisms to control access to data
+- they allow multiple users to access data simultaneously and remotely without conflicts and errors.
+For example, [there are billions of Google searches conducted daily](https://www.internetlivestats.com/google-search-statistics/).
+Can you imagine if Google stored all of the data from those searches in a single `.csv
+file`!? Chaos would ensue!
 
 ## Writing data from R to a `.csv` file
755764

0 commit comments

Comments
 (0)