Commit 3846665

Merge pull request #412 from UBC-DSCI/feat-foreward
foreword added
2 parents a37adca + fc8a005 commit 3846665

3 files changed: +69 -0 lines changed

foreword-text.Rmd

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Foreword {-}
+
+*Roger D. Peng*
+
+*Johns Hopkins Bloomberg School of Public Health*
+
+*2022-01-04*
+
+The field of data science has expanded and grown significantly in recent years,
+attracting excitement and interest from many different directions. The demand for introductory
+educational materials has grown concurrently with the growth of the field itself, leading to
+a proliferation of textbooks, courses, blog posts, and tutorials. This book is an important
+contribution to this fast-growing literature, but given the wide availability of materials, a
+reader should be inclined to ask, "What is the unique contribution of *this* book?" In order
+to answer that question it is useful to step back for a moment and consider the development
+of the field of data science over the past few years.
+
+When thinking about data science, it is important to consider two questions: "What is
+data science?" and "How should one do data science?" The former question is under active
+discussion amongst a broad community of researchers and practitioners and there does
+not appear to be much consensus to date. However, there seems a general understanding
+that data science focuses on the more "active" elements—data wrangling, cleaning, and
+analysis—of answering questions with data. These elements are often highly
+problem-specific and may seem difficult to generalize across applications. Nevertheless, over time we
+have seen some core elements emerge that appear to repeat themselves as useful concepts
+across different problems. Given the lack of clear agreement over the definition of data
+science, there is a strong need for a book like this one to propose a vision for what the field
+is and what the implications are for the activities in which members of the field engage.
+
+The first important concept addressed by this book is tidy data, which is a format for
+tabular data formally introduced to the statistical community in a 2014 paper by Hadley
+Wickham. The tidy data organization strategy has proven a powerful abstract concept for
+conducting data analysis, in large part because of the vast toolchain implemented in the
+Tidyverse collection of R packages. The second key concept is the development of workflows
+for reproducible and auditable data analyses. Modern data analyses have only grown in
+complexity due to the availability of data and the ease with which we can implement complex
+data analysis procedures. Furthermore, these data analyses are often part of
+decision-making processes that may have significant impacts on people and communities. Therefore,
+there is a critical need to build reproducible analyses that can be studied and repeated by
+others in a reliable manner. Statistical methods clearly represent an important element
+of data science for building prediction and classification models and for making inferences
+about unobserved populations. Finally, because a field can succeed only if it fosters an
+active and collaborative community, it has become clear that being fluent in the tools of
+collaboration is a core element of data science.
+
+This book takes these core concepts and focuses on how one can apply them to *do* data
+science in a rigorous manner. Students who learn from this book will be well-versed in
+the techniques and principles behind producing reliable evidence from data. This book is
+centered around the use of the R programming language within the tidy data framework,
+and as such employs the most recent advances in data analysis coding. The use of Jupyter
+notebooks for exercises immediately places the student in an environment that encourages
+auditability and reproducibility of analyses. The integration of git and GitHub into the
+course is a key tool for teaching about collaboration and community, key concepts that are
+critical to data science.
+
+The demand for training in data science continues to increase. The availability of large
+quantities of data to answer a variety of questions, the computational power available to
+many more people than ever before, and the public awareness of the importance of data for
+decision-making have all contributed to the need for high-quality data science work. This
+book provides a sophisticated first introduction to the field of data science and provides
+a balanced mix of practical skills along with generalizable principles. As we continue to
+introduce students to data science and train them to confront an expanding array of data
+science problems, they will be well-served by the ideas presented here.

index.Rmd

Lines changed: 3 additions & 0 deletions
@@ -34,6 +34,9 @@ output:
 
 ---
 
+```{r preface, child="foreword-text.Rmd"}
+```
+
 ```{r preface, child="preface-text.Rmd"}
 ```
 

pdf/index.Rmd

Lines changed: 3 additions & 0 deletions
@@ -28,6 +28,9 @@ knitr::opts_chunk$set(fig.pos = "H",
 
 ```
 
+```{r preface, child="foreword-text.Rmd"}
+```
+
 ```{r preface, child="preface-text.Rmd"}
 ```
 
